Introduction

The rapid growth of the Internet has facilitated an explosion of information, exacerbating the issue of information overload. Recommender systems are critical in helping users discover items of interest across various platforms, such as E-commerce, online entertainment, online education, and social networks1. Knowledge graphs (KGs) have demonstrated significant potential in enhancing both the accuracy and interpretability of recommendations. The rich entity and relation information in KGs not only uncover diverse relationships among items but also explain user preferences. Recently, knowledge-aware recommendation has garnered substantial research interest, with graph neural networks (GNNs) emerging as the dominant models in this domain2,3,4,5,6.

GNN-based recommendation models employ an informative aggregation paradigm to integrate multi-hop neighbors into node representations, offering a robust mechanism for generating permutation-invariant aggregation on the neighbors of a node3,4,7. However, these models commonly struggle with sparse supervision signals and redundant entity relations, which can limit the beneficial effects of KGs on recommendation performance.

Like traditional collaborative filtering, GNN-based models rely on abundant user-item interaction data to capture user preferences. Severe sparsity in interactions can lead to degeneration issues, such as collapsing node embedding distributions into narrow cones, which results in indistinct node representations6,8,9,10. To address sparse supervision signals, several works adopt meta-learning strategies to learn shared prior knowledge across users and enhance model generalization11,12,13. However, these approaches often fail to incorporate auxiliary data and are solely dependent on limited interaction data, increasing susceptibility to noise and reducing robustness in representing user preferences14. Other approaches incorporate ancillary data, such as user attributes15,16 and social relationships17,18. Yet, such data are often challenging to obtain owing to privacy concerns and streamlined registration processes.

Contrastive learning, a recent self-supervised learning technique, addresses the issue of sparse supervision signals by learning discriminative embeddings from unlabelled data, maximizing the distance between negative samples while minimizing it for positive ones6,8,10. Another promising approach is intent learning, which enhances user-item connections by inserting intermediate nodes into interactions to uncover underlying intents7,19.

In contrast, the problem of redundant entity relations has received significantly less attention than the issue of sparse supervision signals. Redundant entity relations refer to those within KGs that contribute little to feature extraction for users or items, and may even introduce noise during GNN aggregation. These relations are typically identified based on long-tail distributions and subsequently removed or merged via clustering techniques5.

Despite their successes, existing approaches for tackling sparse supervision signals and redundant entity relations exhibit several limitations: (1) Most approaches address only one of the two problems, limiting overall improvements in recommendation accuracy. (2) Many employ multi-view contrastive learning, resulting in substantial computational overhead. To this end, we propose a Dual-Intent-View Contrastive Learning (DIVCL) framework for knowledge-aware recommender systems. DIVCL addresses both challenges simultaneously by combining contrastive and intent learning while reducing computational complexity. The main contributions of this study are as follows:

(1) Entity Relation Filtering Based on Intent Relevance: We propose an entity relation filtering strategy grounded in intent relevance. Intents, which are shared across user-item interactions, are employed as metrics to remove redundant relations and provide fine-grained representations of interactions. This approach jointly alleviates the issues of sparse supervision signals and redundant entity relations within the KG structure.

(2) Dual-View Representation Learning: DIVCL captures user preferences using dual-view GNN-based representation learning. The local view focuses on the user-item interaction graph (IG), while the global view considers the user-item-entity graph (IG + KG). This dual-view approach enhances user and item representations through both intent and contrastive learning. Furthermore, DIVCL is computationally lighter compared to existing multi-view contrastive learning models.

(3) Comprehensive Experimental Evaluation: We conduct extensive experiments on three benchmark datasets. Notably, we introduce a new dataset, the Fabric Mall dataset. Experimental results demonstrate that DIVCL achieves superior performance compared to state-of-the-art approaches, validating the effectiveness of our model.

The remainder of this paper is organized as follows: Sect. 2 discusses related work. Section 3 formulates the problem. The methodology is presented in Sect. 4. Section 5 provides an interpretation of experimental results, and conclusions are drawn in Sect. 6.

Related work

Knowledge-aware recommendation approaches

Knowledge-aware recommendation has evolved from embedding-based and path-based approaches to GNN-based ones20. Embedding-based approaches utilize graph embedding models such as TransE21, TransH22, TransD23 and TransR24 to preprocess KGs, incorporating learned entity and relation embeddings into recommendation tasks6,25,26. However, these approaches typically achieve low recommendation accuracy, as they emphasize semantic relatedness over user preference. COMET27 improves recommendation accuracy by simultaneously modelling the high-order interaction patterns among historical interactions and embedding dimensions. Path-based approaches, on the other hand, explore various connection patterns among items within KGs to provide additional guidance for recommendations28,29. These approaches rely heavily on manually designed meta-paths, which demand extensive domain knowledge and laborious effort.

GNN-based approaches, built upon graph convolutional networks (GCNs)30, aggregate information from neighboring nodes into the representation of the target node and incorporate high-order neighbors by stacking multiple GNN layers5. KGCN31 serves as an early GNN-based recommendation model that utilizes GCN to aggregate neighborhood information when computing the representation of a given entity in the KG, subsequently predicting user engagement with items. Numerous extensions have since been proposed, such as KGAT3 and CKAN2, which introduce attention mechanisms to differentiate the importance of neighbors. KGNN-LS32 and KNI33 combine label smoothness regularization with neighborhood interaction to enhance the information aggregation process, while MKM-SR34 integrates user micro-behaviors and item knowledge into multi-task learning. GNN-based approaches have demonstrated powerful capabilities in effectively generating local, permutation-invariant aggregation over the neighbors of a node, which positions them as the foundation for addressing sparse supervision signals and redundant entity relations in this study.

Intent-oriented recommendation approaches

The concept of user intent has become increasingly prevalent in E-commerce applications such as Taobao and Amazon. User intent is often manifested as automatically generated search suggestions, based on users’ historical behavior. Traditionally, intent generation relies heavily on domain knowledge and the extraction of handcrafted features. To automate intent generation, MEIRec35 treats queries as user intents and defines users, queries, and items in a semantic order, using meta-path-guided neighbors to generate intents from rich interaction data.

Subsequently, user intents have been treated as latent, unobservable variables and used to enrich user-item interaction features rather than to directly assist item search. DGCF19 models user intents as fine-grained representations of user-item interactions and generates disentangled representations using a GNN model. RAISE36 models user intents as user-item pairs, which are mined from text reviews by differentiating the importance of review information with a co-attention network. KGIN7 treats each intent as an attentive combination of entity relations, promoting independence among different intents to enhance model capability and interpretability. It also employs an informative aggregation scheme for GNNs, recursively integrating relation sequences through long-range connectivity (i.e., relational paths). Furthermore, user intents have been extended to session-based recommendations, where they are treated as sequentially related37,38,39,40,41.

The concept of user intent has inspired new approaches in recommendation system design, achieving notable success in mitigating sparse supervision signals. However, intent has yet to be applied to tackle redundant entity relations, and its combination with contrastive learning remains rare. This paper aims to bridge these gaps in the research.

Contrastive learning recommendation approaches

Contrastive learning approaches derive node representations by contrasting positive pairs against negative pairs6. Typically, positive pairs are different augmentations of the same sample, while negative pairs are derived from distinct samples. The learning objective is to maximize the mutual information between positive data transformations while enhancing the discrimination of negative samples. Since contrastive learning does not rely on class labels, it effectively addresses the issue of sparse supervision signals through self-supervised learning.

In recent years, various data augmentation techniques have been introduced in recommendation models. For example, SGL8 performs contrastive learning between the original graph and a corrupted graph based on user-item interactions. Other models, such as MBCLRec42, CML43 and KMCLR44, conduct contrastive learning across multi-behavioral relationships, including page views, favorites, carts, and purchases. MSICL45, CDGCL46 and DMM-Rec47 build contrastive views using different modalities of item-side information, minimizing the dissimilarity between modalities of the same sample while maximizing it between negative samples within each modality. BUCL10 creates contrastive views from graph embedding and relation-aware attention, aiming to capture various knowledge and item representation. DGI48, HeCo49, GMI50, MVGRL51, MCCLK6, AMMCN52, KGCL53 and KRCL54 create contrastive views from multi-level user-item-entity graphs, aiming to extract comprehensive graph features and structure information in a self-supervised manner.

In terms of learning methodology, many of these models are GNN-based. For instance, MVGRL51, MCCLK6 and AMMCN52 utilize multi-view contrastive learning, while SGL8 incorporates supervised recommendation tasks within a multi-task framework. ICL55, in particular, proposes an intent contrastive learning approach for sequential recommendation, which inserts latent intent variables into each user-item interaction. ICL learns intent distributions from unlabeled user behavior sequences through contrastive learning, maximizing the agreement between a sequence and its corresponding intent. However, ICL is not a knowledge-aware recommendation approach and relies solely on interaction sequences.

In real-world scenarios, multi-behavior and multi-modal data are often unavailable, and multi-view or multi-task contrastive learning approaches typically suffer from excessive resource consumption and low training efficiency. This study aims to address these challenges by adopting a dual-view approach with KG integration and combining intent and contrastive learning to tackle both sparse supervision signals and redundant entity relations.

The comparison of DIVCL with recent approaches

A comparison of DIVCL with recent models is provided in Table 1. The main distinctions between DIVCL and existing models lie in the multi-purpose use of intents, the dual-view contrastive learning, and the intent-relevance-based KG filtering strategy.

Problem formulation

To ensure the availability of data, this study assumes that the recommendation model is based on user-item interactions and item knowledge graphs (KGs). User-item interactions represent the supervised signals, which may include actions such as page views, favorites, cart additions, and purchases. The item KGs consist of entities and entity relations associated with the items.

Table 1 The comparison of DIVCL with recent models.

Let U be a set of users and I a set of items. The interactions are represented as a matrix \({\mathbf{Y}} \in {{\mathbb{R}}^{\left| {\mathbf{U}} \right| \times \left| {\mathbf{I}} \right|}}\), where |U| is the size of U, and |I| is the size of I. Let yui∈Y denote the interaction between a user u∈U and an item i∈I. If there is an interaction between u and i, then yui=1; otherwise, yui=0. Y represents the interaction graph (IG), which consists of user and item nodes.

Let V (⊃I) be a set of entities including the items, item attributes, taxonomy, and external commonsense knowledge, and let Re be a set of entity relations. A KG is a graph-structured data model that contains V and Re, represented as a set of triples \({\mathbf{G}}=\{(h,r,t)|h,t \in {\mathbf{V}}, r \in {\mathbf{Re}}\}\).

Given Y and G, the task of the recommendation system is to learn a function that predicts how likely a user is to adopt an item.
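For concreteness, the following is a minimal Python sketch of how these two inputs can be represented, assuming a SciPy sparse matrix for Y and a plain list of (h, r, t) index triples for G; the sizes and triples are illustrative, not taken from the paper's datasets.

```python
# Toy instantiation of the inputs Y (interaction graph) and G (item KG).
import numpy as np
from scipy.sparse import csr_matrix

num_users, num_items = 4, 5
rows = [0, 0, 1, 2, 3]   # user indices u with y_ui = 1
cols = [1, 3, 0, 4, 2]   # item indices i with y_ui = 1
Y = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(num_users, num_items))   # |U| x |I| matrix of y_ui

# KG triples (h, r, t); items share ids 0..num_items-1 with the entity set V.
G = [(1, 0, 7), (3, 1, 6), (0, 0, 5)]   # (head entity, relation, tail entity)
```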

Methodology

It is assumed that both the interaction graph (IG) and the knowledge graph (KG) are prepared and serve as the initial inputs to the model. The IG represents the local graph, while the combination of the IG and KG (IG + KG) forms the global graph. These graphs are treated as two initial views for GNN-based information aggregation. The workflow of the proposed DIVCL model is depicted in Fig. 1, comprising four key components: (1) Global intent view encoder, which injects intents and generates the embedding representations for the IG + KG. (2) Local intent view encoder, which injects intents and produces embedding representations for the IG. (3) Contrastive learning, which is performed between the local and global intent views to obtain discriminative node embeddings. (4) Model prediction, which estimates the likelihood of a user adopting an item based on the generated user and item embeddings.

Initially, the global and local view encoders run in parallel, and their outputs are connected through contrastive learning. Finally, the user and item embeddings from both views are concatenated and serve as inputs to the model’s prediction component. The details of each component are elaborated in the following subsections.

Global intent view encoder

The global intent view encoder begins with the interaction graph (IG) and knowledge graph (KG) (denoted as Y and G, respectively). It proceeds through three main stages: intent injection, KG filtering, and GNN-based information aggregation, ultimately generating the user and item embeddings. The KG filtering step introduces a novel strategy for removing redundant entity relations before aggregating information, thus improving the quality of the embeddings. The remaining components of the encoder are modeled following the approach of KGIN7, an approach whose effectiveness has been recently validated.

Fig. 1
figure 1

The workflow of the proposed DIVCL model. The user, item, entity and intent nodes are represented as yellow, blue, green and brown circles, respectively. A line between two nodes means an edge of the IG or KG. The intent nodes relate to all user nodes. A small solid line box represents a graph. A line with the label x means the line is removed from the previous graph. An arrow line represents the transformation relationship between graphs, and the text above and below the arrow line indicates the transformation name.

Intent injection

Let P be the set of intents shared by all users. An interaction pair (u, i) is decomposed into a set of triples {(u, p, i)|p∈P} by inserting P. Each intent p∈P is assigned a distribution over KG relations, and its embedding is created by an attention mechanism as follows:

$${{\mathbf{e}}_p}=\sum\nolimits_{{r \in {\mathbf{Re}}}} {\alpha (r,p){{\mathbf{e}}_r}}$$
(1)

where ep is the embedding representation of intent p, er is the embedding representation of relation r, and \(\alpha (r,p)\) represents the importance of r as follows:

$$\alpha (r,p)=\frac{{\exp ({w_{rp}})}}{{\sum\nolimits_{{r^\prime \in {\mathbf{Re}}}} {\exp ({w_{r^\prime p}})} }}$$
(2)

where \(w_{rp}\) is a trainable weight specific to relation r and intent p.
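As a concrete illustration, the following PyTorch sketch builds the intent embedding matrix from Eqs. (1)-(2); the tensor names and sizes are assumptions for illustration, not taken from the released implementation.

```python
import torch
import torch.nn.functional as F

num_relations, num_intents, dim = 8, 4, 64                       # illustrative sizes
e_r = torch.nn.Parameter(torch.randn(num_relations, dim))        # relation embeddings e_r
w = torch.nn.Parameter(torch.randn(num_relations, num_intents))  # trainable weights w_rp

alpha = F.softmax(w, dim=0)   # Eq. (2): normalize over relations r for each intent p
E_P = alpha.t() @ e_r         # Eq. (1): each row of E_P is one intent embedding e_p
```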

KG filtering

In the KG, certain entities and relations may be irrelevant to the recommendation context, thereby introducing noise into the learning process. To address this, we propose the KG filtering strategy, i.e., the entity relation filtering based on intent relevance described in the contributions. In this strategy, intents serve as the assessment criteria for filtering: the projected distance of a relation within the latent space of intents is used to evaluate the relation’s influence on the intent space, which in turn determines whether the relation is retained or discarded.

The process begins by projecting the entity embeddings into the latent space of intents, as follows:

$${\mathbf{e}}_{v}^{{\text{p}}}={{\mathbf{e}}_v} \times {\mathbf{E}}_{{\text{P}}}^{{\text{T}}}$$
(3)

where v∈V denotes an entity in the KG, ev denotes the embedding of v, \({\mathbf{E}}_{{\text{P}}}^{{\text{T}}}\) denotes the transpose of the intent embedding matrix in the global view, of which each row is an intent embedding vector \({{\mathbf{e}}_p}\), and \({\mathbf{e}}_{v}^{{\text{p}}}\) denotes the projected embedding of v.

For a triple \((h,r,t) \in {\mathbf{G}}\), the influence of the relation r over the intent space is evaluated by the structural similarity between h and t in the latent space of the intents, as follows:

$${\text{sim}}(h,t)=\frac{{{\mathbf{e}}_{h}^{{\text{p}}} \cdot {\mathbf{e}}_{t}^{{\text{p}}}}}{{\left\| {{\mathbf{e}}_{h}^{{\text{p}}}} \right\| \times \left\| {{\mathbf{e}}_{t}^{{\text{p}}}} \right\|}}$$
(4)

where sim(h,t) is the cosine similarity between the projected embeddings of h and t, and is negatively correlated with the influence of the relation r over the intent space.

To simplify the threshold setting for the KG filtering strategy, sim(·) is normalized to [0, 1] as follows:

$$\text{sim}^\prime(h,t)=\frac{{1+\text{sim}(h,t)}}{2}$$
(5)

where \(\text{sim}^\prime(h,t)\) is the normalized sim(h,t).

Thereafter, the KG is filtered as follows:

$${\mathbf{G}}^{\prime} = \{ (h,r,t) | {\text{sim}}^{\prime}(h,t) \leq \theta , (h,r,t) \in {\mathbf{G}} \}$$
(6)

where \({\mathbf{G}}^{\prime}\) represents the preserved KG, and θ represents the filtering threshold, a hyperparameter that controls which relations are retained.
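The filtering step can be sketched in a few lines of PyTorch, applying Eqs. (3)-(6) to a batch of triples; `E_V`, `E_P`, and the default threshold are hypothetical names and values for illustration.

```python
import torch
import torch.nn.functional as F

def filter_kg(triples, E_V, E_P, theta=0.8):
    """triples: LongTensor (n, 3) of (h, r, t) ids; E_V: (|V|, d) entity
    embeddings; E_P: (|P|, d) intent embeddings. Returns the preserved KG G'."""
    proj = E_V @ E_P.t()                                  # Eq. (3): project into intent space
    h, t = triples[:, 0], triples[:, 2]
    sim = F.cosine_similarity(proj[h], proj[t], dim=-1)   # Eq. (4)
    sim = (1.0 + sim) / 2.0                               # Eq. (5): map to [0, 1]
    return triples[sim <= theta]                          # Eq. (6): keep low-similarity triples
```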

GNN information aggregation

Following LightGCN-style GNN information aggregation, the representation of user u is generated by recursively integrating the intent-aware information from interacted items. Based on the interactions in the matrix Y, the representation of u is recursively aggregated as follows:

$${\mathbf{e}}_{u}^{{(l+1)}}=\frac{1}{{\left| {{N_u}} \right|}}\sum\nolimits_{{i \in {N_u}}} {\omega (u,p){{\mathbf{e}}_p} \odot {\mathbf{e}}_{i}^{{(l)}}}$$
(7)

where l∈{0, 1, …, L} indexes the aggregation layer, eu, ep and ei are the embedding representations of u, p and i, respectively, \(\odot\) is the elementwise product, \({N_u}=\{ i|{y_{ui}} \in {\mathbf{Y}},{y_{ui}}=1\}\) is the set of items that u has interacted with, and \(\omega (u,p)\) represents the importance of p to u, which is computed as follows:

$$\omega (u,p)=\frac{{\exp ({{\mathbf{e}}_p} \cdot {\mathbf{e}}_{u}^{{\text{T}}})}}{{\sum\nolimits_{{p^\prime \in {\mathbf{P}}}} {\exp ({{\mathbf{e}}_{p^\prime}} \cdot {\mathbf{e}}_{u}^{{\text{T}}})} }}$$
(8)

where \({\mathbf{e}}_{u}^{{\text{T}}}\) is the transpose of eu.

The representation of item i is recursively aggregated as follows:

$${\mathbf{e}}_{i}^{{(l+1)}}=\frac{1}{{\left| {N_{i}^{{\text{g}}}} \right|}}\sum\nolimits_{{v \in N_{i}^{{\text{g}}}}} {{{\mathbf{e}}_r} \odot {\mathbf{e}}_{v}^{{(l)}}}$$
(9)

where \(N_{i}^{{\text{g}}}=\{ v|(i,r,v) \in {\mathbf{G}}^\prime\}\) is the neighbor set of i in the filtered KG, and r is the relation linking i and v in the corresponding triple.

Finally, the embedding of user and item under the global intent view is represented by the summation of the representation of all layers as follows:

$${\mathbf{e}}_{u}^{{\text{g}}}=\sum\limits_{{l=0}}^{L} {{\mathbf{e}}_{u}^{{(l)}}}$$
(10)
$${\mathbf{e}}_{i}^{{\text{g}}}=\sum\limits_{{l=0}}^{L} {{\mathbf{e}}_{i}^{{(l)}}}$$
(11)

where \({\mathbf{e}}_{u}^{{\text{g}}}\) is the embedding of user u under the global intent view, and \({\mathbf{e}}_{i}^{{\text{g}}}\) is the embedding of item i under the global intent view.

It should be noted that \({\mathbf{e}}_{u}^{{(0)}}\), \({\mathbf{e}}_{i}^{{(0)}}\) and \({{\mathbf{e}}_r}\) are trainable weight vectors.
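To make the aggregation concrete, the following is a minimal dense PyTorch sketch of one layer of Eqs. (7)-(9) for a single user and item, assuming the intent term in Eq. (7) denotes the ω-weighted combination of intent embeddings; a practical implementation would use sparse scatter operations over the whole graph, and all names and neighborhoods here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def aggregate_user(e_u, e_item_layer, E_P, N_u):
    """One layer of Eq. (7) for one user; N_u is a LongTensor of item ids."""
    omega = F.softmax(E_P @ e_u, dim=0)   # Eq. (8): importance of each intent to u
    intent_vec = omega @ E_P              # omega-weighted combination of intents
    return (intent_vec * e_item_layer[N_u]).mean(dim=0)   # average over N_u

def aggregate_item(e_entity_layer, e_rel, pairs):
    """One layer of Eq. (9) for one item; pairs is a LongTensor (m, 2) of
    (relation id, tail entity id) drawn from the filtered KG G'."""
    return (e_rel[pairs[:, 0]] * e_entity_layer[pairs[:, 1]]).mean(dim=0)

# Eqs. (10)-(11): the final global-view embeddings sum the outputs of all layers.
```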

Local intent view encoder

The local intent view encoder forms the user and item embeddings from the IG. It undergoes intent injection and GNN information aggregation but not KG filtering, which applies only to the global intent view encoder.

In the intent injection component, an interaction pair (u, i) is decomposed into a set of triples {(u, p, i)|p∈P}, as in the global intent view. Here, the intent embedding ep is a trainable weight vector, since the IG contains only the interaction relation.

The representation of u is recursively aggregated as Eq. (7), and the representation of i is recursively aggregated as follows:

$${\mathbf{e}}_{i}^{{(l+1)}}=\sum\nolimits_{{u \in N_{i}^{{\text{c}}}}} {\frac{1}{{\left| {N_{i}^{{\text{c}}}} \right|}}{\mathbf{e}}_{u}^{{(l)}}}$$
(12)

where \(N_{i}^{{\text{c}}}=\{ u|{y_{ui}} \in {\mathbf{Y}},{y_{ui}}=1\}\) is the set of users who have interacted with i.

Finally, the embedding of user and item under the local intent view is represented by the summation of the representation of all layers as follows:

$${\mathbf{e}}_{u}^{{\text{c}}}=\sum\limits_{{l=0}}^{L} {{\mathbf{e}}_{u}^{{(l)}}}$$
(13)
$${\mathbf{e}}_{i}^{{\text{c}}}=\sum\limits_{{l=0}}^{L} {{\mathbf{e}}_{i}^{{(l)}}}$$
(14)

where \({\mathbf{e}}_{u}^{{(0)}}\) and \({\mathbf{e}}_{i}^{{(0)}}\) are trainable weight vectors.

Contrastive learning

In the training process, the intents injected into the IG converge slowly owing to the single relational constraint, whereas the intents injected into the IG + KG are prone to premature convergence owing to multiple relational constraints. Therefore, \({\mathbf{e}}_{u}^{{\text{g}}}\), \({\mathbf{e}}_{i}^{{\text{g}}}\), \({\mathbf{e}}_{u}^{{\text{c}}}\) and \({\mathbf{e}}_{i}^{{\text{c}}}\) can all be improved by contrastive learning between the global and local intent views.

Inspired by6, sample pairs are defined across the two views. Given i∈I, j∈I, i ≠ j, the pair \((e_{i}^{c},e_{i}^{g})\) is a positive item pair, since its elements are the two embedding vectors of one item under different views; the pair \((e_{i}^{c},e_{j}^{g})\) is a negative item pair, since its elements are the embedding vectors of two distinct items under different views. The same rules apply to user pairs.

The contrastive loss of item i is defined as follows:

$$L_{i}^{c}= - \log \frac{{\exp (\operatorname{sim} (e_{i}^{c},e_{i}^{g})/\tau )}}{{\exp (\operatorname{sim} (e_{i}^{c},e_{i}^{g})/\tau )+\sum\nolimits_{{j \ne i}} {\exp (\operatorname{sim} (e_{i}^{c},e_{j}^{g})/\tau )} }}$$
(15)

where \(L_{i}^{c}\) is the contrastive loss under the local intent view, sim(·) is the cosine similarity, and \(\tau \in {\mathbb{R}}\) is a temperature hyperparameter. The symmetric loss under the global intent view is defined as follows:

$$L_{i}^{g}= - \log \frac{{\exp (\operatorname{sim} (e_{i}^{g},e_{i}^{c})/\tau )}}{{\exp (\operatorname{sim} (e_{i}^{g},e_{i}^{c})/\tau )+\sum\nolimits_{{j \ne i}} {\exp (\operatorname{sim} (e_{i}^{g},e_{j}^{c})/\tau )} }}$$
(16)

where \(L_{i}^{{\text{g}}}\) is the contrastive loss under the global intent view.

The contrastive loss of user u is defined analogously to Eqs. (15) and (16), simply replacing i with u. The total contrastive loss sums the item and user losses under both views, as follows:

$${L_{CG}}=\frac{1}{{\left| {\mathbf{I}} \right|}}\sum\nolimits_{{i \in {\mathbf{I}}}} {(L_{i}^{c}+L_{i}^{g})} +\frac{1}{{\left| {\mathbf{U}} \right|}}\sum\nolimits_{{u \in {\mathbf{U}}}} {(L_{u}^{c}+L_{u}^{g})}$$
(17)

where \({L_{CG}}\) is the total contrastive loss.
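As a batched sketch, and assuming in-batch negatives (a common implementation choice not stated explicitly above), Eqs. (15)-(17) reduce to a symmetric InfoNCE loss over the two views:

```python
import torch
import torch.nn.functional as F

def cross_view_loss(z_c, z_g, tau=0.2):
    """z_c, z_g: (n, d) local-/global-view embeddings of the same n nodes."""
    z_c = F.normalize(z_c, dim=-1)       # so dot products become cosine similarities
    z_g = F.normalize(z_g, dim=-1)
    logits = z_c @ z_g.t() / tau         # sim(z_c[i], z_g[j]) / tau
    labels = torch.arange(z_c.size(0))   # positives sit on the diagonal
    # Row-wise cross-entropy realizes Eq. (15); the transposed term is Eq. (16).
    return F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)

# Eq. (17): L_CG = cross_view_loss(items_c, items_g) + cross_view_loss(users_c, users_g)
```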

Model prediction

The final embedding vector of user u (or item i) is the concatenation of its embedding vectors under the local and global intent views. Thereafter, how likely user u is to adopt item i is evaluated by the inner product of the final embedding vectors of u and i, as follows:

$${\hat {y}_{ui}}={({\mathbf{e}}_{u}^{g} \oplus {\mathbf{e}}_{u}^{{\text{c}}})^{\text{T}}} \bullet ({\mathbf{e}}_{i}^{g} \oplus {\mathbf{e}}_{i}^{{\text{c}}})$$
(18)

where \(\oplus\) denotes vector concatenation.

It is noted that \({\hat {y}_{ui}}\) is not a probability but a score, used to compare which of two items is more likely to be adopted by a user.
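A minimal sketch of Eq. (18), with hypothetical variable names:

```python
import torch

def score(e_u_g, e_u_c, e_i_g, e_i_c):
    """Eq. (18): concatenate both views and take the inner product."""
    u = torch.cat([e_u_g, e_u_c], dim=-1)   # final user embedding
    i = torch.cat([e_i_g, e_i_c], dim=-1)   # final item embedding
    return (u * i).sum(dim=-1)              # ranking score y_hat_ui
```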

Optimizer

The BPR56 loss is used to train the model and is defined as follows:

$${L_{{\text{BPR}}}}=\sum\nolimits_{{\left( {u,i,j} \right) \in O}} { - \ln \sigma \left( {{{\hat {y}}_{ui}} - {{\hat {y}}_{uj}}} \right)}$$
(19)

where \(O=\{ (u,i,j)|(u,i) \in {O^+},(u,j) \in {O^-}\}\) is the training dataset consisting of the observed interactions \({O^+}\) and unobserved counterparts \({O^-}\), and σ(·) is the sigmoid function.

By combining the BPR loss and the total contrastive loss, the following objective function is minimized to learn the model parameter:

$$L={L_{{\text{BPR}}}}+\beta {L_{{\text{CG}}}}+\lambda \left\| \Theta \right\|_{2}^{2}$$
(20)

where \(\Theta =\{ {\mathbf{e}}_{u}^{{(0)}},{\mathbf{e}}_{v}^{{(0)}},{{\mathbf{e}}_r},{{\mathbf{e}}_p},{\mathbf{w}}|u \in {\mathbf{U}},v \in {\mathbf{V}},r \in {\mathbf{Re}},p \in {\mathbf{P}}\}\) is the set of model parameters (note that \({\mathbf{I}} \subset {\mathbf{V}}\), and \({\mathbf{e}}_{u}^{{(0)}}\), \({\mathbf{e}}_{i}^{{(0)}}\) and \({{\mathbf{e}}_p}\) are duplicated in the local and global intent view encoders), and \(\beta\) and \(\lambda\) are two hyperparameters that control the contrastive loss and the L2 regularization term, respectively.
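The full objective of Eqs. (19)-(20) can be sketched as follows, with `beta` and `lam` standing for the hyperparameters β and λ; the function and argument names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def total_loss(y_pos, y_neg, l_cg, params, beta=0.1, lam=1e-5):
    """y_pos / y_neg: scores for observed / sampled unobserved pairs;
    l_cg: the total contrastive loss of Eq. (17); params: the set Theta."""
    l_bpr = -F.logsigmoid(y_pos - y_neg).sum()   # Eq. (19): pairwise BPR loss
    l2 = sum((p ** 2).sum() for p in params)     # squared L2 norm of Theta
    return l_bpr + beta * l_cg + lam * l2        # Eq. (20)
```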

Experiments

The experiments are designed to answer the following research questions:

  • RQ1: How does the proposed DIVCL model perform compared to state-of-the-art recommender models?

  • RQ2: What is the impact of the KG filtering and contrastive learning strategies on the performance improvement of the DIVCL model?

  • RQ3: Does the proposed DIVCL model significantly increase the computational load compared to baseline models?

Experimental settings

Dataset description

Three benchmark datasets for movie, book, and fabric recommendations are used to evaluate the effectiveness of the proposed DIVCL model. The movie and book datasets are MovieLens-1M and Amazon-Book, both released by Recbole (https://recbole.io/dataset_list.html), which are widely used for validating recommendation models. To ensure the quality of interactions, the movie dataset is filtered by removing interactions with user ratings below 3. Similarly, the book dataset is filtered by removing interactions with ratings below 3, users with fewer than 10 interactions, and entities with fewer than 5 relations.

The fabric dataset is self-built and has been released at https://github.com/yzxx667/DIVCL along with the DIVCL code. This dataset was collected from a large fabric wholesale mall, which serves as the motivation for this study. The item classes primarily consist of fabrics, linings, and accessories, while the entity classes include shops, materials, patterns, colors, and weaving techniques. The relation classes correspond to these entity categories. Unlike the movie and book datasets, interactions in the fabric dataset are mostly based on browsing rather than purchases or ratings.

The sparsity of a dataset is calculated as follows:

$$S=\left( {1 - \frac{{\left| {{\mathbf{IG}}} \right|}}{{\left| {\mathbf{U}} \right| \times \left| {\mathbf{I}} \right|}}} \right) \times 100\%$$
(21)

where S is the sparsity of the dataset (a higher value indicates sparser supervised signals), and |IG| is the number of interactions.
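Eq. (21) amounts to one line of code; the numbers below are illustrative, not the dataset statistics of Table 2.

```python
def sparsity(num_interactions, num_users, num_items):
    """Eq. (21): percentage of unobserved user-item pairs."""
    return (1 - num_interactions / (num_users * num_items)) * 100

print(sparsity(1_000_000, 6_000, 4_000))   # -> 95.83...
```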

The statistics of the three datasets are summarized in Table 2. It is evident that the supervised signals are highly sparse across all datasets, particularly in the book and fabric datasets. Additionally, the datasets contain multiple classes of relations, some of which may be redundant, potentially affecting the model’s performance.

Table 2 Statistics of the benchmark datasets.

During the training phase, each observed user-item interaction is treated as a positive instance. For negative sampling, an item that the user has not interacted with is randomly selected and paired with the user. For each dataset, 80% of the data is randomly allocated to the training set, 10% to the validation set, and the remaining 10% to testing. This split ensures a robust evaluation of the model’s performance.

Parameter settings

The proposed DIVCL model is implemented in Python using the PyTorch framework. The experiments were conducted on a workstation equipped with an Intel i7-13700 CPU @ 2.5 GHz, an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory, and 32 GB of DDR RAM, running Ubuntu Linux.

For all approaches, the size of the graph embedding was fixed at 64, the dropout rate was set to 0.6, the optimizer used was Adam, the batch size was 4096, the number of training epochs was 500, and the testing period was set at every 10 epochs. To fine-tune the model, a grid search was performed to determine the optimal hyperparameter settings for DIVCL. The range of hyperparameters and their optimal values are listed in Table 3.

Table 3 Parameter settings.

Baselines

To verify the effectiveness of the proposed DIVCL model, both non-knowledge-aware and knowledge-aware approaches are employed as baselines for comparison. The non-knowledge-aware approaches build collaborative filtering models based solely on the interaction graph (IG), as follows:

  • BPR56: This is a typical CF-based approach that uses pairwise matrix factorization for implicit feedback optimized by the BPR loss.

  • DMF57: This presents a deep neural network that learns a common low-dimensional space for user and item representations, which is then used to predict how likely a user is to adopt an item.

  • DGCF19: This considers user intents as fine-grained representations of user-item interactions and generates disentangled representations using a GNN model.

The knowledge-aware approaches are on the basis of both the IG and KG as follows:

  • CKE25: This is an embedding-based approach that combines structural, textual, and visual knowledge in one framework.

  • KGAT3: This is a GNN-based approach which iteratively integrates neighbors over IG + KG with an attention mechanism to get user/item representations.

  • KGIN7: This is a GNN-based approach with intents, which disentangles user-item interactions at the granularity of user intents, and performs GNN on the IG + KG with intents.

  • KGNN-LS32: This is a GNN-based model that enriches item embeddings with GNN aggregation and label smoothness regularization.

  • MCCLK6: This is a GNN-based approach with contrastive learning, which considers the IG as a collaborative view, the KG as a semantic view, and the IG + KG as a structural view, and then performs contrastive learning across three views, capturing comprehensive graph feature and structure information in a self-supervised manner.

  • VRKG5: This is a GNN-based approach with the consideration of redundant relations, which constructs virtual relational graphs by clustering relations to alleviate the negative impact of long-tail relations on information aggregation.

Evaluation metrics

The all-ranking strategy is employed during the evaluation phase. Positive and negative instances are defined consistently with those in the training phase. For each user, all items are ranked based on their predicted scores. To assess the top-K recommendation performance58, evaluation metrics such as Recall@K and HR@K (Hit Ratio) are used, where K is set to 20 and 50. The average values of these metrics across all users in the test set are reported to provide a comprehensive performance comparison.
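Under the all-ranking protocol, the per-user metrics can be sketched as follows; `ranked` and `pos` are hypothetical names for a user's full item ranking and held-out positives.

```python
def recall_at_k(ranked, pos, k):
    """Fraction of a user's held-out positives that appear in the top-K."""
    hits = len(set(ranked[:k]) & set(pos))
    return hits / len(pos) if pos else 0.0

def hr_at_k(ranked, pos, k):
    """1 if at least one held-out positive appears in the top-K, else 0."""
    return 1.0 if set(ranked[:k]) & set(pos) else 0.0

# Reported numbers average these per-user values over the test set, K in {20, 50}.
```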

Performance comparison (RQ1)

The overall performance over the three datasets is shown in Tables 4 and 5 for different K values. The best results are in boldface and the second-best results are in italics.

Table 4 Overall performance comparison over the three datasets for K = 20.
Table 5 Overall performance comparison over the three datasets for K = 50.

The key observations from these results are as follows:

  • The proposed DIVCL consistently outperforms the baseline models on Recall, which is widely regarded as the most crucial metric in recommendation systems. In addition, DIVCL achieves optimal results on half of the HR measures, another important metric. The results demonstrate the effectiveness of the DIVCL model, with its performance improvements largely attributable to the integration of intent injection, KG filtering, and contrastive learning.

  • The knowledge-aware approaches (CKE, KGAT, KGNNLS, KGIN, MCCLK, VRKG, and DIVCL) do not consistently outperform the non-knowledge-aware approaches (BPR, DMF and DGCF). In fact, BPR achieved the best performance on HR of Fabric@20. This finding suggests that while knowledge graphs (KGs) can provide valuable information for recommendation, they also introduce noise that may negatively affect performance. This further underscores the importance of the KG filtering strategy employed in the DIVCL model, as it helps mitigate the negative impact of irrelevant or redundant relations in the KG.

  • DIVCL outperforms KGIN on all measures. Both use intent injection; the main difference is the KG filtering and contrastive learning added in DIVCL. This result again verifies the benefit of these two strategies.

  • DIVCL does not outperform VRKG and MCCLK on the HR of Book@50 and Fabric@50. Owing to the maturity of research and the complexity of the datasets, it is very difficult to improve the overall performance of recommendation. VRKG retains some features of relations with low intent relevance by clustering relations, and thereby achieved better performance on HR at higher K values. MCCLK addresses sparse supervised signals by multi-level cross-view contrastive learning, and thereby achieved better performance on the extremely sparse Fabric dataset at high K values. Despite these cases, DIVCL still shows superiority in recall metrics and time complexity.

Ablation studies (RQ2)

To evaluate the effectiveness of the dual-intent-view contrastive learning structure and the KG filtering strategy, the DIVCL was compared with the following two variants:

  • DIVCL w/o CL: In this variant, the contrastive learning component is removed, and the other components remain unchanged.

  • DIVCL w/o KG: In this variant, the KG filtering strategy in the global intent view encoder is removed, and the other components remain unchanged.

The comparison among the DIVCL and two variants is shown in Fig. 2.

Fig. 2
figure 2

Comparison among the DIVCL and the two variants.

The following observations further support the effectiveness of the proposed DIVCL model:

  • DIVCL outperforms its two variants across all measures. This result underscores the effectiveness of the dual-intent-view contrastive learning structure combined with the KG filtering strategy. Together, these components significantly enhance model performance.

  • DIVCL without contrastive learning (w/o CL) outperforms DIVCL without KG filtering (w/o KG) across all measures. This can be explained by the respective importance of the interaction graph (IG) and the knowledge graph (KG) for extracting user intent. The IG, as the source of supervised signals, is inherently closer to user intent than the KG, which serves as auxiliary information. The dual-intent-view contrastive learning module primarily addresses the sparse supervised signals, making it more impactful than KG filtering. However, KG filtering remains crucial, as it has also contributed to notable performance improvements.

  • The advantage of DIVCL is most pronounced on the fabric dataset. This is explained by the unique characteristics of the dataset. The interactions in the fabric dataset are predominantly browsing-based, which are less reliable compared to purchase and rating interactions. As a result, the sparse supervised signal problem is more severe in the fabric dataset than in the other two datasets, allowing the dual-intent-view contrastive learning component to yield more substantial improvements. Additionally, the KG for the fabric dataset has a higher density compared to the other datasets, exacerbating the problem of redundant entity relations. Consequently, the KG filtering strategy contributes more significantly to the performance gains on this dataset. These results further demonstrate the combined effectiveness of the dual-intent-view contrastive learning and KG filtering strategies in addressing both sparse signals and redundant entity relations.

Computation load studies (RQ3)

The time complexity of a learning network primarily depends on its structure. However, clarifying the network structure of all baseline models is quite challenging. Therefore, an experimental approach is used to assess the computational load of the models. Setting K = 50, both the training and inference times were recorded during the experiments described in Sect. 5.2, and the average times per epoch are presented in Table 6.

Table 6 The training and inference time (unit: seconds/epoch).

Although DIVCL ranks relatively low in training and inference speed, its times remain of the same order as those of baselines with strong recommendation results, such as KGIN and VRKG. Considering the balance between recommendation performance and computational load, the computational overhead of DIVCL is deemed acceptable.

Comparing DIVCL to KGIN across all datasets, the training time for DIVCL is less than double that of KGIN, while their inference times are quite similar. This difference can be attributed to the number of intent view encoders and the inclusion of the KG filtering strategy in DIVCL. While DIVCL contains two parallel intent view encoders, KGIN only has one. However, the KG filtering strategy in DIVCL simplifies the complexity of information aggregation, which helps mitigate the computational load.

Similarly, when comparing DIVCL to VRKG across the datasets, a similar pattern emerges. This difference can be attributed to the different KG filtering strategies employed. DIVCL removes irrelevant relations based on their influence over the intent space, which only requires projection and similarity calculations. On the other hand, VRKG clusters and merges similar relations, a process that is far more complex. As a result, the computational cost of DIVCL is less than twice that of VRKG.

When comparing DIVCL to MCCLK, the computational load of MCCLK is significantly higher, by an order of magnitude, rendering it impractical under the experimental settings. MCCLK utilizes three parallel intent view encoders and three separate contrastive learning processes, theoretically making its computational complexity two to three times that of DIVCL. However, MCCLK requires information aggregation from three unfiltered graphs, consuming memory well beyond the physical capacity of the experimental machine. This results in frequent internal and external memory swapping, severely slowing down the computation speed.

Conclusion and future work

In this work, a Dual-Intent-View Contrastive Learning network (DIVCL) is proposed to address the challenges of sparse supervised signals and redundant entity relations. DIVCL enhances user and item representation learning by fully utilizing the role of intent. First, supervised signals are represented in a fine-grained manner by inserting a set of intents into each user-item interaction. Second, redundant entity relations are filtered by defining the influence of relations within the intent space. Third, the distributions of user and item embeddings are more aligned with user preferences by incorporating intent into the contrastive learning process. Additionally, the computational load of DIVCL remains manageable by adopting dual-view contrastive learning instead of multi-view contrastive learning, ensuring an efficient balance between performance and resource consumption. The effectiveness of the proposed DIVCL model has been demonstrated through experimental results on three benchmark datasets, particularly the self-built fabric dataset, which highlights its strengths. Furthermore, the experimental results reveal that noise exists not only within the KG but also within the user-item interactions themselves. This finding suggests that future work should explore joint denoising strategies at both the interaction and KG levels to further improve the performance of recommendation systems.