SEMINAR: Search Enhanced Multi-modal Interest Network and Approximate Retrieval for Lifelong Sequential Recommendation

Kaiming Shen (Ant Group, Beijing, China, kaiming.skm@antgroup.com), Xichen Ding (Ant Group, Beijing, China, xichen.dxc@antgroup.com), Zixiang Zheng (Ant Group, Beijing, China, zhengzixiang.zzx@antgroup.com), Yuqi Gong (Ant Group, Beijing, China, gongyuqi.gyq@antgroup.com), Qianqian Li (Ant Group, Beijing, China, zixi.lqq@antgroup.com), Zhongyi Liu (Ant Group, Hangzhou, China, zhongyi.lzy@antgroup.com) and Guannan Zhang (Ant Group, Hangzhou, China, zgn138592@antgroup.com)


Abstract.

The modeling of users’ behaviors is crucial in modern recommendation systems. Much research focuses on modeling users’ lifelong sequences, which can be extremely long and sometimes exceed thousands of items. These models use the target item to search for the most relevant items in the historical sequence. However, training on lifelong sequences in click-through rate (CTR) prediction or personalized search ranking (PSR) is extremely difficult because ID embeddings are insufficiently learned, especially when the IDs in the lifelong sequence features do not appear in the samples of the training dataset. Additionally, existing target attention mechanisms struggle to learn the multi-modal representations of items in the sequence well. The distributions of the multi-modal embeddings (text, image and attributes) of a user’s interacted items are not properly aligned, and they diverge across modalities. We also observe that users’ search query sequences and item browsing sequences can fully depict users’ intents and benefit from each other. To address these challenges, we propose a unified lifelong multi-modal sequence model called SEMINAR (Search Enhanced Multi-Modal Interest Network and Approximate Retrieval). Specifically, a network called the Pretraining Search Unit (PSU) learns the lifelong sequences of multi-modal query-item pairs in a pretraining-finetuning manner with multiple objectives: multi-modal alignment, next query-item pair prediction, query-item relevance prediction, etc. After pretraining, the downstream model, which shares the same target attention structure as PSU, restores the pretrained embeddings as initialization and finetunes the network. To accelerate online retrieval over multi-modal embeddings, we propose a multi-modal codebook-based product quantization strategy that approximates the exact attention calculation and significantly reduces the time complexity.

Keywords.

lifelong sequence modeling, multi-modal retrieval


1. Introduction

User behavior modeling is extremely important in modern commercial recommendation systems, including e-commerce platforms such as Amazon, Taobao, and Alipay, and content platforms such as YouTube and TikTok. As users spend more time shopping online and watching short videos, the length of users’ historical behaviors has grown dramatically in recent years, from a few hundred ( $10^{2}$ ) to more than ten thousand ( $10^{4}$ ). Much recent research focuses on modeling users’ lifelong behaviors, such as Efficient Target Attention (ETA) (Chen et al., 2022), Two-Stage Interest Network (TWIN) (Chang et al., 2023), Query-Dominant Interest Network (QIN) (Guo et al., 2023), etc. These models follow a cascading two-stage paradigm. The first stage uses the target item or target search query as a trigger to retrieve the top-K relevant behaviors from the historical behaviors. The second stage uses target attention to encode the selected behaviors as the user’s interest representation. This paradigm is widely adopted in many search and recommendation tasks, such as click-through rate (CTR) prediction and personalized search ranking. The item representations in the sequence are computed using both the item ID feature and more generic attribute features. One easily neglected problem in existing lifelong behavior modeling is the insufficient learning of ID features in the lifelong sequence, such as historical item ID, author ID, etc. Many historical items in the lifelong sequence cannot be found in the current training dataset, which is collected from the most recent logs of exposures and clicks. The embeddings of these low-frequency IDs are randomly initialized and cannot be learned well from the limited dataset, which harms the accuracy of the target attention calculation.

The second problem in existing lifelong sequence modeling is that it cannot handle multi-modal features of items in the sequence well, such as text and image features. The norms of vectors from different modalities vary if the modalities are not properly aligned in the same embedding space. Existing target attention uses the inner product of query and keys, which may be dominated by modality vectors with large norms. For example, the target item may retrieve only the top-K items from historical behaviors that are visually relevant but semantically very different, which deteriorates the online performance of recommendation.

To tackle these problems, we propose a new model called Search Enhanced Multi-Modal Interest Network and Approximate Retrieval (SEMINAR) to model users’ lifelong historical multi-modal behaviors. The users’ historical behaviors are heterogeneous and include both the sequence of browsed items and the sequence of search queries. We align users’ search query sequence with the browsing item sequence as a unified sequence of query-item pairs, which can be retrieved flexibly by the target item or target search query in both the CTR prediction task and the Personalized Search Ranking (PSR) task. SEMINAR proposes a Pretraining Search Unit (PSU) network to learn the lifelong behavior sequence of historical multi-modal query-item pairs. It introduces multiple pretraining tasks designed to solve the insufficient learning of historical ID features and the multi-modal alignment problem. In downstream tasks, the target attention module restores the learned item representations from PSU, using the pretrained ID embeddings as initialization, and applies a projection weight matrix to get the transformed representation of the behavior sequence. During online serving, calculating exact attention using the inner product of multi-modal vectors in the lifelong sequence has a time complexity of $O(L\times M\times d)$, which is time consuming. Here $L$ denotes the sequence length, $M$ denotes the number of modalities and $d$ denotes the embedding dimension. Different from existing approximate retrieval methods, such as Locality-Sensitive Hashing (LSH) and Hierarchical Navigable Small World (HNSW), we exploit Product Quantization in a multi-modal setting: we express the multi-modal item representations as discrete integer codes using the quantization codebooks, and sum the inner products of sub-vector centroids to approximate the exact attention calculation. During online serving, the attention calculation then reduces to lookups in a pre-computed distance table followed by summation, which can be conducted efficiently.

In summary, the main contributions of our work are as follows:

• We identify the insufficient learning problem of ID features in lifelong behavior modeling and observe that the target attention calculation is dominated by multi-modal features with large norm values. We propose the SEMINAR framework, which includes the Pretraining Search Unit, to alleviate both the insufficient learning of ID embeddings and the multi-modal alignment problem.

• We exploit a product quantization approximation strategy in a multi-modal setting, which reduces the time complexity of retrieving from historical behaviors with the target query-item pair during online serving.

• We conduct extensive experiments on real-world datasets to demonstrate the effectiveness of our proposed model. We also release the code of SEMINAR at https://github.com/paper-submission-coder/SEMINAR to encourage further research.

2. Related Work

Figure 1. Illustration of SEMINAR Model Architecture. $S_{i}$ denotes the i-th behavior of query and item pair in the lifelong sequence. Each behavior has multiple channels of query and multi-modal features of text, image and attributes. PSU denotes the pretraining search unit. GSU and ESU denote the general and exact search unit respectively as the two stage paradigm.

2.1. Long-Term Lifelong User Behavior Modeling

Long-term lifelong user behavior modeling has attracted much research attention in recent years. Typical works include SIM (Pi et al., 2020), ETA (Chen et al., 2022), TWIN (Chang et al., 2023), QIN (Guo et al., 2023), etc. SIM (Pi et al., 2020) introduces the General Search Unit to retrieve the top-K most relevant items from historical behaviors using the target item as a trigger, and the Exact Search Unit (ESU) to calculate the multi-head target attention (MHTA). ETA (Chen et al., 2022) uses a set of hash functions to express the item representation as binary hash embedding and calculates the Hamming distance to approximate the inner product calculation. TWIN (Chang et al., 2023) introduces the CP-GSU as a consistency-preserved lifelong user behavior modeling module to increase the relevance calculation consistency between the two cascading stages. QIN (Guo et al., 2023) uses the search query as a trigger to retrieve the most relevant items from the historical behaviors in the first stage of the cascading models in Personalized Search Ranking. Different from the existing work, we propose the pretraining search unit (PSU) to alleviate the insufficient learning problem of ID features and multi-modal alignment in attention calculation. Furthermore, there is an increasing trend of modeling search and recommendation tasks jointly in a unified framework, such as USER (Yao et al., 2021), SESRec (Si et al., 2023), S&R Foundation (Gong et al., 2023), etc. To model the lifelong behaviors, we align the historical search query sequence and browsing item sequence as a unified sequence of query-item pairs, which can be applied to both CTR prediction in recommendation and personalized search ranking.

2.2. Multi-Modal Alignment in Recommendation and Item Quantization

Multi-modal alignment is a prevalent topic, which aligns multi-modal features such as text and image in a unified embedding space in a contrastive learning manner. Typical works include CLIP (Radford et al., 2021), etc. Some researchers have focused on modeling multi-modal user sequences in recommendation. M5 (Zhao, 2023) applies a multi-modal embedding layer to extract both ID embeddings of show ID and content-graph embeddings initialized from a meta-path pretrained model. To improve the generalization of ID embeddings, some works express item representations as quantized vectors of discrete codes, including Product Quantization (Jégou et al., 2011), VQ-VAE (van den Oord et al., 2017), RQ-VAE (Zeghidour et al., 2021), etc. VQ-Rec (Hou et al., 2023) encodes text as discrete codes using product quantization techniques and uses a transformer to learn cross-domain data in recommendation. TIGER (Rajput et al., 2023) learns the semantic ID from the content information and learns RQ-VAE (Zeghidour et al., 2021) representations for generative retrieval.

Much research focuses on fast retrieval of relevant items from an embedding database. Common methods include approximate nearest neighbor (ANN) search using HNSW (Malkov and Yashunin, 2018), Product Quantization (Jégou et al., 2011), etc. Product quantization is a technique that transforms a $D$-dimensional vector into a lower-dimensional integer vector of $N_{bit}$ centroid IDs from the codebook. It first splits a vector $\mathbb{x}\in\mathbb{R}^{D}$ into $N_{bit}$ sub-vectors and applies a quantization function $q(\cdot)$ to assign each sub-vector $\mathbb{x}_{i}$ to the nearest centroid $c_{i}$ from a codebook $\mathcal{C}$: $\mathbb{x}=[\mathbb{x}_{i}]_{1:N_{bit}}\rightarrow[q(\mathbb{x}_{i})]_{1:N_{bit}}=[c_{i}]_{1:N_{bit}}$. The quantization function is $q(\mathbb{x}_{i})=\arg\min_{c_{i}\in\mathcal{C}}d(\mathbb{x}_{i},e_{c_{i}})$, where $d(\cdot,\cdot)$ is a distance function and $e_{c_{i}}$ denotes the embedding of the $c_{i}$-th centroid.
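The quantization step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the codebooks are random rather than learned (in practice they would typically be fitted, e.g. with k-means), each sub-vector position has its own codebook, the distance $d(\cdot,\cdot)$ is taken to be squared Euclidean, and the names `pq_encode` and `pq_decode` are hypothetical.

```python
import numpy as np

def pq_encode(x, codebooks):
    """Encode a vector as one centroid id per sub-vector.

    x: shape (D,). codebooks: shape (n_sub, n_centroids, D // n_sub).
    """
    n_sub, _, d_sub = codebooks.shape
    sub_vectors = x.reshape(n_sub, d_sub)
    # Squared Euclidean distance from each sub-vector to every centroid
    # of its own codebook (assumed distance); shape (n_sub, n_centroids).
    dists = ((sub_vectors[:, None, :] - codebooks) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def pq_decode(codes, codebooks):
    """Approximate the original vector by concatenating the chosen centroids."""
    return np.concatenate([codebooks[b, c] for b, c in enumerate(codes)])

rng = np.random.default_rng(0)
D, n_sub, n_centroids = 8, 4, 16               # illustrative sizes only
codebooks = rng.normal(size=(n_sub, n_centroids, D // n_sub))  # random, not learned
x = rng.normal(size=D)
codes = pq_encode(x, codebooks)                # n_sub integer centroid ids
x_hat = pq_decode(codes, codebooks)            # quantized approximation of x
```

The vector of 8 floats is stored as 4 small integers; the reconstruction error depends entirely on how well the codebooks fit the data.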

3. Proposed Model

3.1. Problem Formulation

We can split the sequence of users’ behaviors into several heterogeneous sub-sequences, including the sequence of search queries $\mathcal{Q}=\{q_{1},q_{2},\dots,q_{|\mathcal{Q}|}\}$ with explicit intents and the sequence of browsed recommended items $\mathcal{B}=\{i_{1},i_{2},\dots,i_{|\mathcal{B}|}\}$. For search behaviors, users input a query $q\in\mathcal{Q}$ and interact (click or view) with a few items related to the query, resulting in the aligned sequence of query and item pairs $(q_{l},i_{l})$. For the behavior of browsing recommended items, users browse a sequence of items without explicit search intent, and we pad an empty search query $q=\emptyset$ to each item to obtain the query-item pair $(q_{l}=\emptyset,i_{l})$. Finally, we construct a unified sequence of aligned query-item pairs in chronological order with length $L$, denoted as $\{(q_{l},i_{l})\}_{l=1:L}$. In some recommendation scenarios, such as short video recommendations on YouTube and TikTok, each item has multi-modal features such as text (title of video), image, and attributes (authors and categories). We further split the sequence of browsed items $\mathcal{B}$ into $M$ multi-modal sub-sequences, including a sequence of text features $\mathcal{T}=\{T_{1},T_{2},\dots,T_{L}\}$, a sequence of image features $\mathcal{I}=\{I_{1},I_{2},\dots,I_{L}\}$, and a sequence of attribute features $\mathcal{A}=\{A_{1},A_{2},\dots,A_{L}\}$. Finally, we let $[\mathcal{Q},\mathcal{T},\mathcal{I},\mathcal{A}]\in\mathbb{R}^{(M+1)\times L\times d}$ denote the input sequence of multi-modal query-item pairs to the SEMINAR model, where $d$ denotes the dimension of the aligned representations.
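As a minimal sketch of the sequence construction above, the following Python merges a search log and a browsing log into one chronological list of query-item pairs. The `Behavior` record, the function name, the example data, and the choice to keep only the most recent `max_len` pairs are all assumptions made for illustration; `None` stands in for the empty query $\emptyset$.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Behavior:                      # hypothetical log record
    timestamp: int
    item_id: str
    query: Optional[str] = None      # None plays the role of the empty query

def build_pair_sequence(
    search: List[Behavior], browse: List[Behavior], max_len: int
) -> List[Tuple[Optional[str], str]]:
    """Merge both logs chronologically into (query, item) pairs."""
    merged = sorted(search + browse, key=lambda b: b.timestamp)
    pairs = [(b.query, b.item_id) for b in merged]
    # Assumption: truncate to the most recent max_len behaviors.
    return pairs[-max_len:] if max_len > 0 else []

search_log = [Behavior(3, "item_c", "running shoes"), Behavior(7, "item_e", "rain jacket")]
browse_log = [Behavior(1, "item_a"), Behavior(5, "item_d"), Behavior(9, "item_f")]
pairs = build_pair_sequence(search_log, browse_log, max_len=4)
```

Browsing behaviors carry the empty query, so a single sequence can be triggered either by a target item alone or by a target query-item pair.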

3.2. Aligned Lifelong Sequence of Multi-Modal Query-Item Pairs

The aligned sequence of multi-modal query-item pairs passes through the embedding layers. We let $[\mathbb{x}_{l}=(\mathbb{x}^{query}_{l},\mathbb{x}^{item}_{l})]_{l=1:L}$ denote the historical sequence of query and item pairs, with $\mathbb{x}^{query}_{l}\in\mathbb{R}^{d}$ and $\mathbb{x}^{item}_{l}=(\mathbb{x}^{text}_{l},\mathbb{x}^{image}_{l},\mathbb{x}^{attributes}_{l})\in\mathbb{R}^{M\times d}$. In CTR prediction, target attention (TA) is a structure that uses the target item to retrieve the most relevant items from the sequence of historical behaviors. We extend TA from the target item to the target query-item pair to retrieve the top $K$ most relevant pairs from the historical sequence. We denote the target query-item pair as $\mathbb{x}_{t}=(\mathbb{x}^{query}_{t},\mathbb{x}^{text}_{t},\mathbb{x}^{image}_{t},\mathbb{x}^{attributes}_{t})$.

3.3. SEMINAR Model Architecture

Our proposed model SEMINAR, shown in Figure 1, introduces a new network, the Pretraining Search Unit (PSU), which is pretrained on a dataset of lifelong sequences of multi-modal query-item pairs. Section 3.3.1 introduces the PSU and the corresponding pretraining tasks. Section 3.3.2 introduces how the recommendation model restores the pretrained query and item representations from PSU as initialization and applies a projection matrix to get the transformed representation of the sequence. The top-K relevant pairs are retrieved by the target pair and participate in the multi-head target attention (MHTA) calculation. Section 3.3.3 introduces the multi-modal product quantization approximation.

3.3.1. Pretraining Search Unit

The input to PSU is the aligned sequence of query-item pairs $[\mathbb{x}_{l}]_{l=1:L}$. $L$ denotes the length of the aligned sequence and $\mathbb{x}_{l}=(\mathbb{x}^{query}_{l},\mathbb{x}^{text}_{l},\mathbb{x}^{image}_{l},\mathbb{x}^{attributes}_{l})$ represents the $l$-th behavior in the sequence, which consists of the query embedding and the multi-modal embeddings. The query $q\in\mathcal{Q}$ passes through the query feature encoder $f(\cdot)$, resulting in $Q=f(\mathbb{x}^{query}_{1:L})\in\mathbb{R}^{L\times d_{query}}$. Following the multi-modal alignment literature such as CLIP (Radford et al., 2021), we use a Transformer (Vaswani et al., 2023) to encode the text features as $T=\text{Encoder}_{text}(\mathbb{x}^{text}_{1:L})\in\mathbb{R}^{L\times d_{text}}$ and ViT (Dosovitskiy et al., 2020) to encode the image features as $I=\text{Encoder}_{image}(\mathbb{x}^{image}_{1:L})\in\mathbb{R}^{L\times d_{image}}$. Additionally, we encode the attribute features using the function $g(\cdot)$. $A=g(\mathbb{x}^{attributes}_{1:L})\in\mathbb{R}^{L\times d_{attributes}}$ is treated as one channel of the sub-sequence, which participates in the multi-modal alignment of the item sequence. To project the representations of different channels to the same dimension $d$, we further multiply them by linear weight matrices $\{W_{q},W_{t},W_{i},W_{a}\}$ and get the stacked input sequence of multi-modal query-item pairs as follows: $\mathbb{x}=[QW_{q},TW_{t},IW_{i},AW_{a}]\in\mathbb{R}^{(M+1)\times L\times d}$.

Next Pair Prediction and Multi-Head Target Attention

The intuition behind PSU is to design a pretraining network that learns from the lifelong behavior sequence and shares the same multi-head target attention structure as the cascading two-stage downstream model, such as ETA (Chen et al., 2022) and TWIN (Chang et al., 2023). The downstream model restores the pretrained query and item embeddings as parameter initialization and fine-tunes the network. Different from the masked language model (MLM) in BERT (Devlin et al., 2019), which uses tokens from the context window to predict the masked token, we use next-pair prediction as a pretraining task to predict the correct last query-item pair. We intentionally leave out the last query-item pair in the sequence, $\mathbb{x}_{L}=[\mathbb{x}^{(m)}_{L}],m\in\{Q,T,I,A\}$, and treat it as the target query-item pair that retrieves from the previous $(L-1)$ behaviors using multi-head target attention. To pad the sequence length from $L-1$ to $L$, we append a special token $<EOS>$ to the end of the previous $L-1$ items $\mathbb{x}_{1:L-1}$. The next query-item pair prediction task is formulated as a classification task, $y=p(\mathbb{x}^{1:M+1}_{L}|\mathbb{x}^{1:M+1}_{1:L-1};\mathbb{x}^{<EOS>})$, with the loss $\mathcal{L}^{pair}_{next}$. A positive label is assigned to the correct last pair, and negative labels are assigned to negatively sampled query-item pairs.

To better represent the historical behaviors and target query-item pair, we need to fuse the query and multi-modal item representations into a single vector as:

$$\mathbb{x}=\lambda\mathbb{x}^{query}+(1-\lambda)\sum_{i}w_{i}{\mathbb{x}^{item}}^{(i)}=\sum_{m=1}^{M+1}\gamma_{m}\mathbb{x}^{(m)}$$

$\lambda\in[0,1]$ and $(1-\lambda)$ denote the weights used to merge the query and item representations respectively, and $w_{i}$ denotes the weight used to merge the multi-modal item representations. To simplify the notation, we use a single vector $[\gamma_{m}]^{1:M+1}\in\mathbb{R}^{M+1}$ to represent the weights of all $(M+1)$ channels, and the weights sum to 1, i.e. $\sum_{m}\gamma_{m}=1$. The weight vector $\gamma_{m}$ can be learned dynamically as the softmax output of a gating network. The attention is calculated as the inner product of the queries and keys of the merged multi-channel representations. The final attention score will be dominated by the modalities with large norm values $|x^{(m)}|$ and large weights $\gamma_{m}$, and the information from the other modalities will be easily ignored. We therefore decompose the attention score calculation into the norm value part $|x^{(m)}|$ and the unit vector part $\hat{x}^{(m)}$.

We let $q_{t}$ denote the representation of the target query-item pair, $q_{t}=\sum_{i}\gamma_{i}\mathbb{x}^{(i)}_{t}=\sum_{i}\gamma_{i}|\mathbb{x}^{(i)}_{t}|\hat{\mathbb{x}}^{(i)}_{t}$. Here $|\mathbb{x}^{(i)}_{t}|$ denotes the norm of the $i$-th channel of the target pair and $\hat{\mathbb{x}}^{(i)}_{t}$ is a unit vector. Similarly, we can express the $l$-th historical behavior $k_{l}\in K$ as $k_{l}=\sum_{j}\gamma_{j}\mathbb{x}^{(j)}_{l}=\sum_{j}\gamma_{j}|\mathbb{x}^{(j)}_{l}|\hat{\mathbb{x}}^{(j)}_{l}$. Note that the unit vectors of the multi-modal sequence representations will participate in the multi-modal alignment task in the next section.

The $h$-th head in the multi-head attention is represented as $head^{PSU_{h}}=\text{Attention}_{h}(q_{t},K^{PSU},V^{PSU})$, and the attention score $\alpha^{PSU}_{h}$ is calculated as the inner product of the $d$-dimensional query and key vectors multiplied by a scaling factor $\frac{1}{\sqrt{d}}$.

$$\alpha_{h}^{PSU}=\frac{(q_{t}W^{PSU_{Q}}_{h})(K^{PSU}W^{PSU_{K}}_{h})^{T}}{\sqrt{d}}=\left[\frac{1}{\sqrt{d}}\sum_{i}\sum_{j}\gamma_{ij}(\hat{\mathbb{x}}^{PSU(i)}_{t}W^{PSU_{Q}}_{h})(\hat{\mathbb{x}}^{PSU(j)}_{l}W^{PSU_{K}}_{h})^{T}\right]^{L}_{l=1}$$

$$\gamma_{ij}=\gamma_{i}\gamma_{j}|\mathbb{x}^{PSU(i)}_{t}||\mathbb{x}^{PSU(j)}_{l}|$$

In this formulation, $[\mathbb{x}^{PSU(1:M+1)}_{l}]^{L}_{l=1}\in\mathbb{R}^{L\times(M+1)\times d}$ denotes the multi-modal embeddings of the items in the PSU sequence, with $\mathbb{x}^{PSU(1:M+1)}=[Q,T,I,A]$, and $K^{PSU}=[\sum_{i}\gamma_{i}\mathbb{x}^{PSU(i)}_{l}]^{L}_{l=1}\in\mathbb{R}^{L\times d}$ denotes the merged representations of the input sequence. $W^{PSU_{Q}}_{h}\in\mathbb{R}^{d\times d}$ and $W^{PSU_{K}}_{h}\in\mathbb{R}^{d\times d}$ denote the projection weight matrices of the query and keys in the $h$-th head, and $\gamma_{ij}$ denotes the weight of the cross-modal interaction between a unit query vector and a unit key vector in the sequence. $\gamma_{ij}$ equals the product of $\gamma_{i}$, $\gamma_{j}$, the norm of the query vector $|\mathbb{x}^{PSU(i)}_{t}|$ and the norm of the key vector $|\mathbb{x}^{PSU(j)}_{l}|$.
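The decomposition into norms and unit vectors can be checked numerically. The sketch below is a simplification under stated assumptions: the projection matrices are taken to be the identity and the $\frac{1}{\sqrt{d}}$ scale is dropped, the channel weights and embeddings are made-up values, and the third channel (standing in for the image modality) is deliberately inflated to show how one large-norm modality dominates the weights $\gamma_{ij}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, d = 4, 8                     # (M + 1) channels: query, text, image, attributes
gamma = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical channel weights, sum to 1

x_t = rng.normal(size=(n_channels, d))   # target pair, one row per channel
x_l = rng.normal(size=(n_channels, d))   # one historical behavior
x_l[2] *= 50.0                           # inflate the norm of the "image" channel

def split_norm(x):
    """Return per-channel norms and the corresponding unit vectors."""
    norms = np.linalg.norm(x, axis=-1)
    return norms, x / norms[:, None]

norm_t, unit_t = split_norm(x_t)
norm_l, unit_l = split_norm(x_l)

# Direct score: inner product of the merged representations q_t and k_l.
q_t = gamma @ x_t
k_l = gamma @ x_l
direct = q_t @ k_l

# Decomposed score: sum_ij gamma_ij * <unit_t[i], unit_l[j]>,
# with gamma_ij = gamma_i * gamma_j * |x_t^(i)| * |x_l^(j)|.
gamma_ij = np.outer(gamma * norm_t, gamma * norm_l)
cosines = unit_t @ unit_l.T
decomposed = (gamma_ij * cosines).sum()

# Fraction of the total cross-modal weight carried by the inflated channel.
image_share = gamma_ij[:, 2].sum() / gamma_ij.sum()
```

Both scores agree up to floating-point error, and `image_share` shows that most of the cross-modal weight comes from the single inflated channel even though its gate weight is only 0.2, which is the imbalance the alignment task is meant to reduce.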

Multi-Modal Alignment and Query-Item Relevance

Multi-modal alignment is a crucial task, which learns the multi-modal representations in the same embedding space. Typical alignment models, such as CLIP (Radford et al., 2021), maximize the cosine similarity of the $N$ correct (text-image) pairs and minimize the cosine similarity of the $N^{2}-N$ mismatched pairs. We simultaneously train multiple multi-modal alignment tasks, including text-image, image-attributes and text-attributes, with the cross-entropy loss over $N$ pairs.

$$\mathcal{L}_{align}=\sum_{i}\sum_{j\neq i}\mathcal{L}_{\text{CLIP}}(\hat{\mathbb{x}}^{(i)}_{1:L},\hat{\mathbb{x}}^{(j)}_{1:L}),\quad i,j\in\{T,I,A\}$$

The sequence length $L$ is usually large and the alignment has a complexity of $O(L^{2})$. To reduce the complexity, we further split the sequence into $N_{ch}$ chunks. Each chunk is a sub-sequence of length $L_{sub}=\frac{L}{N_{ch}}$. The alignment loss is the sum of the losses within chunks, $\mathcal{L}_{\text{CLIP}}(\hat{\mathbb{x}}^{(i)}_{1:L},\hat{\mathbb{x}}^{(j)}_{1:L})=\sum_{k=1}^{N_{ch}}\mathcal{L}_{\text{CLIP}}(\hat{\mathbb{x}}^{(i)}_{L_{k}:L_{k+1}},\hat{\mathbb{x}}^{(j)}_{L_{k}:L_{k+1}})$, with the complexity reduced to $O(L^{2}/N_{ch})$.
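A possible NumPy sketch of the chunked alignment loss follows. It assumes the symmetric cross-entropy form popularized by CLIP with a fixed temperature of 0.07 (the source does not state the temperature or whether it is learned), uses random unit vectors in place of real embeddings, and counts similarity entries to make the $O(L^{2}/N_{ch})$ cost visible. The function names are hypothetical.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(a, b, temperature=0.07):
    """Symmetric cross-entropy over n aligned rows of unit vectors a and b."""
    logits = a @ b.T / temperature               # (n, n) similarity matrix
    diag = np.arange(len(a))                     # row k of a matches row k of b
    loss_ab = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_ba = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_ab + loss_ba)

def chunked_clip_loss(a, b, n_chunks):
    """Sum of per-chunk losses, plus the number of similarity entries computed."""
    total, n_entries = 0.0, 0
    for a_k, b_k in zip(np.array_split(a, n_chunks), np.array_split(b, n_chunks)):
        total += clip_loss(a_k, b_k)
        n_entries += len(a_k) * len(b_k)
    return total, n_entries

rng = np.random.default_rng(0)
L, d = 1000, 16                                  # illustrative sizes only
text_emb = rng.normal(size=(L, d))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb = rng.normal(size=(L, d))
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

loss_full, entries_full = chunked_clip_loss(text_emb, image_emb, n_chunks=1)
loss_chunked, entries_chunked = chunked_clip_loss(text_emb, image_emb, n_chunks=10)
```

With $L=1000$ and $N_{ch}=10$, the number of similarity entries drops from $L^{2}$ to $L^{2}/N_{ch}$. The trade-off is that each positive pair is contrasted only against negatives from its own chunk.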

Additionally, query-item relevance prediction is a typical search task, usually modelled as binary classification to distinguish the correct query-item pair from irrelevant query-item pairs. Each pair of query and item is represented as $[\mathbb{x}^{query}_{l};\mathbb{x}^{item}_{l}=\sum_{m\in M}\gamma_{m}\mathbb{x}^{item^{(m)}}_{l}]$. The loss for the query-item relevance binary classification task is $\mathcal{L}_{query-item}=\sum_{l}\mathcal{L}_{ce}(y^{qi}_{l};\mathbb{x}^{query}_{l},\sum_{m\in M}\gamma_{m}\mathbb{x}^{item^{(m)}}_{l})$, where $y^{qi}_{l}$ denotes the relevance label of the $l$-th pair in the sequence. A positive label is assigned to the correct query-item pair and a negative label is assigned to a randomly sampled irrelevant query-item pair.

Loss of Pretraining Search Unit

The objective of the Pretraining Search Unit (PSU) consists of three parts: the next query-item pair prediction loss $\mathcal{L}^{pair}_{next}$, the multi-modal alignment loss $\mathcal{L}_{align}$ and the query-item relevance prediction loss $\mathcal{L}_{query-item}$, i.e. $\mathcal{L}_{PSU}=\mathcal{L}^{pair}_{next}+\mathcal{L}_{align}+\mathcal{L}_{query-item}$.

3.3.2. Fine-tuning the projection weight

Existing lifelong sequence modeling methods follow a cascading two-stage paradigm. In the first stage, the target item or query is used as a trigger to retrieve the top-K most relevant items from the user’s long behavior sequence and reduce the sequence length from $L$ to $K$, as in the General Search Unit (GSU) in SIM (Pi et al., 2020) and TWIN (Chang et al., 2023), and the Relevance Search Unit (RSU) in QIN (Guo et al., 2023). In the second stage, a multi-head target attention (MHTA) unit in the Exact Search Unit (ESU) is applied to encode the selected $K$ relevant items as the representation of the user’s behavior sequence. However, the existing cascading two-stage paradigm suffers from the insufficient learning of ID embeddings in the lifelong sequence ( $[\mathbb{x}^{query},\mathbb{x}^{text},\mathbb{x}^{image},\mathbb{x}^{attributes}]$ ). The downstream model, e.g. CTR prediction, lacks enough training data to learn the embeddings in the sequence well, especially when low-frequency items in the sequence date from a long time ago (more than one year) and do not appear in the training data, which is collected from the most recent user logs.

To help alleviate the insufficient learning of ID embeddings in the lifelong sequence, the general search unit (GSU) in our proposed SEMINAR model shares the same multi-head target attention structure, $head^{GSU_{h}}=\text{Attention}_{h}(q_{t},K^{GSU},V^{GSU})$, as PSU, $head^{PSU_{h}}=\text{Attention}_{h}(q_{t},K^{PSU},V^{PSU})$. It restores the pretrained embedding from PSU and applies a GSU-specific projection weight matrix $G^{(j)}\in\mathbb{R}^{d\times d}$ to it. After the first-stage retrieval reduces the sequence length from $L$ to $K$, the second-stage ESU also shares the same multi-head target attention structure, $head^{ESU_{h}}=\text{Attention}_{h}(q_{t},K^{ESU},V^{ESU})$, with GSU and PSU, and has its own projection weight matrices $W^{Q}_{h},W^{K}_{h},W^{V}_{h}$ for each head.

GSU restores the pretrained query item multi-modal embedding $[E^{PSU(Q)},E^{PSU(T)},E^{PSU(I)},E^{PSU(A)}]$ from PSU, and applies projection matrix $G^{(j)}\in\mathbb{R}^{d\times d}$ to get the projected embedding in GSU as $\mathbb{x}^{GSU(j)}$ . $E^{PSU(*)}$ denotes the pretrained multi-modal embedding. And the attention score $\alpha_{h}^{GSU}$ in the $h$ -th head of GSU’s multi-head target attention is calculated as:

$$\alpha_{h}^{GSU}=\frac{(q_{t}W^{GSU_{Q}}_{h})(K^{GSU}W^{GSU_{K}}_{h})^{T}}{\sqrt{d}}$$

$$\mathbb{x}^{GSU(j)}=\mathbb{x}^{PSU(j)}G^{(j)},\quad\forall j\in\{1,\dots,M+1\}$$

Comparing the GSU attention $\alpha^{GSU}_{h}$ with the pretrained PSU attention $\alpha^{PSU}_{h}$, we can see that the structures of the multi-head target attention are exactly the same. The per-head projection weights $W^{GSU_{Q}}_{h}$ and $W^{GSU_{K}}_{h}$ of the queries and keys are separate parameters from $W^{PSU_{Q}}_{h}$ and $W^{PSU_{K}}_{h}$, and the embedding projection weight matrix $G^{(j)}$ is unique to GSU.

In the second stage, the top-K relevant query-item pairs are selected from GSU and fed to Exact Search Unit (ESU) as $head^{ESU_{h}}=\text{Attention}_{h}(q_{t},K^{ESU},V^{ESU})$ .

In ESU, $K^{ESU}=\text{TopK}(K^{GSU})\in\mathbb{R}^{(M+1)\times K\times d}$ represents the sequence of retrieved top-K representations from $K^{GSU}\in\mathbb{R}^{(M+1)\times L\times d}$. The attention score in ESU is denoted as $\alpha^{ESU}_{h}$ and the ID embedding in ESU is denoted as $\mathbb{x}^{ESU(j)}$.

$$\alpha^{ESU}_{h}=\frac{(q_{t}W^{ESU_{Q}}_{h})(K^{ESU}W^{ESU_{K}}_{h})^{T}}{\sqrt{d}}$$

$$\mathbb{x}^{ESU(j)}=\mathbb{x}^{GSU(j)}=\mathbb{x}^{PSU(j)}G^{(j)},\quad\forall j\in\{1,\dots,M+1\}$$

Finally, the user’s lifelong sequence representation $\mathbb{x}_{\text{lifelong\_seq}}$ is calculated as $\mathbb{x}_{\text{lifelong\_seq}}=\text{Concat}(\text{head}^{ESU_{1}},...,\text{head}^{ESU_{H}})W^{ESU}$. $\mathbb{x}_{\text{lifelong\_seq}}$ is then concatenated with other user, item, user-item interaction (u2i) and context features and participates in CTR prediction. $\hat{y}_{i}=f_{\theta_{i}}(\mathbb{x}_{\text{lifelong\_seq}},\mathbb{x}_{u},\mathbb{x}_{i},\mathbb{x}_{\text{u2i}},\mathbb{x}_{\text{context}})$ denotes the predicted value and $y_{i}$ denotes the actual label value. The final loss of CTR prediction is $\mathcal{L}_{ctr}=\sum_{i}\mathcal{L}_{ce}(y_{i},\hat{y}_{i})$.
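The two-stage flow (GSU retrieval from $L$ to $K$, then ESU attention over the $K$ survivors) can be sketched as below. This is a single-head simplification under stated assumptions: keys are already merged into one $d$-dimensional vector per behavior, values are taken equal to keys, all weights are random stand-ins for learned parameters, and the helper names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def gsu_top_k(q_t, keys, w_q, w_k, k):
    """Stage 1: score all L behaviors and keep the indices of the K highest."""
    d = keys.shape[1]
    scores = (q_t @ w_q) @ (keys @ w_k).T / np.sqrt(d)     # shape (L,)
    top = np.argsort(-scores)[:k]
    return np.sort(top)                                    # restore chronological order

def esu_head(q_t, keys, values, w_q, w_k, w_v):
    """Stage 2: one head of target attention over the K retrieved behaviors."""
    d = keys.shape[1]
    scores = (q_t @ w_q) @ (keys @ w_k).T / np.sqrt(d)     # shape (K,)
    return softmax(scores) @ (values @ w_v)                # shape (d,)

rng = np.random.default_rng(0)
L, d, K = 200, 8, 10                                       # illustrative sizes only
keys = rng.normal(size=(L, d))                             # merged behavior representations
q_t = rng.normal(size=d)                                   # merged target query-item pair
w_gsu_q, w_gsu_k, w_esu_q, w_esu_k, w_esu_v = (
    rng.normal(size=(d, d)) for _ in range(5)
)

idx = gsu_top_k(q_t, keys, w_gsu_q, w_gsu_k, K)
head = esu_head(q_t, keys[idx], keys[idx], w_esu_q, w_esu_k, w_esu_v)
```

In a full model, several such heads would be concatenated and projected to form the lifelong sequence representation.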

3.3.3. Approximate Retrieval of Multi-Modal Query-Item Pair

The exact calculation of the attention score between the target query-item pair $q_{t}$ and the $l$ -th query-item behavior $k_{l}$ is the inner product of the weighted sum of multiple vectors as:

$${q_{t}}^{T}k_{l}=\Big(\sum_{i=1}^{M+1}\gamma_{i}\mathbb{x}^{(i)}_{t}\Big)^{T}\Big(\sum_{j=1}^{M+1}\gamma_{j}\mathbb{x}^{(j)}_{l}\Big),\quad\forall l\in\{1,2,\dots,L\}$$

The exact calculation has the time complexity of $O(L\times M\times d)$ . $L$ denotes the sequence length, $M$ denotes the number of weighted sum operations of multi-modal embedding vectors of dimension $d$ . The calculation becomes time-consuming when $L$ is very large ( $10^{4}$ ) in the lifelong sequence of multi-modal query-item pairs setting.

One straightforward method for fast retrieval of the $K$ nearest vectors to an input query vector $q$ is to build an embedding index, such as HNSW (Malkov and Yashunin, 2018), and conduct ANN (Approximate Nearest Neighbors) search. However, it is difficult to build an embedding index for retrieving from the sequence of multi-modal query-item pairs with the target query-item pair. To search the behavior vectors given the input target query-item pair $q_{t}$, as in the exact attention calculation, we would build a vector index that assigns a primary key to each vector, such as Item ID, Query ID, etc. However, in our aligned sequence of query-item pairs, each merged query-item representation has the joint key (query_id, item_id), so the required amount of storage increases from the item set size $|\mathcal{B}|$ to the size of the Cartesian product of the query set and the item set, $|\mathcal{Q}||\mathcal{B}|$. This makes it almost infeasible to store the merged query-item pairs in a single index directly.

An alternative cascading cross-modal strategy is considered to retrieve top-K relevant query-item pairs. Firstly, we build two separate vector indexes of the query set with size $|\mathcal{Q}|$ and the item set with size $|\mathcal{B}|$ . During the online retrieval of target query-item pairs, we conduct vector retrieval four times, including query-to-item, query-to-query, item-to-query, and item-to-item. Each retrieval keeps the top-K items with the maximum inner product. The filter in the first-stage cross-modal retrieval is $L\rightarrow 4K$ . Given the potential $4K$ items, we conduct an exact attention calculation on these items to obtain the final top-K items, and the filter is $4K\rightarrow K$ . The problem with the cascading cross-modal retrieval strategy is that it may achieve a suboptimal solution compared to exact full attention calculation. This is because the final inner product is a weighted average of all modalities. Additionally, top-K relevant items from one modality (e.g., query-to-query relevance) may have very low relevance in other modalities, such as query-to-item (text) or query-to-item (image), thus the overall inner product score is not optimal. $Recall@K$ can evaluate the performance of the greedy strategy compared to exact calculation.

To increase recall while keeping retrieval fast, the key is to reduce the cardinality of the query set $\mathcal{Q}$ and the item set $\mathcal{B}$. We argue that product quantization is a good approximation strategy: it splits each vector into $N_{bit}$ sub-vectors and assigns each sub-vector to its nearest centroid, which reduces the cardinality. In our formulation, we first use a set of $N_{bit}$ separate quantization functions $[q^{(m)}_{1},q^{(m)}_{2},...,q^{(m)}_{N_{bit}}]$ to encode the embedding vector of the $m$-th modal channel $\mathbb{x}^{(m)}$ as an $N_{bit}$-dimensional integer vector, $q({\mathbb{x}}^{(m)})=[c^{(m)}_{1},c^{(m)}_{2},...,c^{(m)}_{N_{bit}}]\in\mathbb{R}^{N_{bit}}$. The representation of each multi-modal query-item pair is then expressed as:

[{\mathbb{x}}^{(1)},...,{\mathbb{x}}^{(M)}]\rightarrow[q({\mathbb{x}}^{(1)}),...,q({\mathbb{x}}^{(M)})]\in\mathbb{R}^{M\times N_{bit}}

We pre-compute the inner products between all pairs of centroids and store the values in memory. The space complexity of the storage is $O(M^{2}|\mathcal{C}|^{2}N_{bit})$, where $M$ denotes the number of modalities, $|\mathcal{C}|$ denotes the number of centroids, and $N_{bit}$ denotes the number of sub-vectors in the codebook of each modality $m$. During online serving, computing the inner product ${q_{t}}^{T}k_{l}$ reduces to $O(M^{2}N_{bit})$ distance lookup operations, and the final score is the weighted sum of these distances. Here, $c^{(i)}_{b}$ and $c^{(j)}_{b}$ denote the centroid IDs of the $b$-th sub-vector of $\mathbb{x}^{(i)}_{t}$ and $\mathbb{x}^{(j)}_{l}$, respectively.

{q_{t}}^{T}k_{l}=\sum_{i}\sum_{j}\gamma_{i}\gamma_{j}\mathbb{x}^{(i)}_{t}\mathbb{x}^{(j)}_{l}\approx\sum_{i}\sum_{j}\gamma_{i}\gamma_{j}\sum_{b\in N_{bit}}\text{dist}(c^{(i)}_{b},c^{(j)}_{b})
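The lookup-based approximation above can be illustrated with a small NumPy sketch. It is not the faiss-based implementation used in the experiments: the codebooks are trained with a plain k-means written here for illustration, $\text{dist}$ is taken to be the inner product between centroid vectors, all modalities are assumed to share the same dimension, and the function names are hypothetical.

```python
import numpy as np

def _nearest(S, C):
    """Index of the nearest centroid in C (n_centroids, d_sub) for each row of S."""
    return np.argmin(((S[:, None, :] - C[None]) ** 2).sum(-1), axis=1)

def train_codebooks(X, n_sub, n_centroids, n_iter=10, seed=0):
    """Plain k-means per sub-vector. X: (n, d) -> (n_sub, n_centroids, d // n_sub)."""
    rng = np.random.default_rng(seed)
    books = []
    for S in np.split(X, n_sub, axis=1):
        C = S[rng.choice(len(S), n_centroids, replace=False)].copy()
        for _ in range(n_iter):
            assign = _nearest(S, C)
            for c in range(n_centroids):
                members = S[assign == c]
                if len(members):
                    C[c] = members.mean(axis=0)
        books.append(C)
    return np.stack(books)

def encode(X, books):
    """Map each row of X to n_sub centroid IDs. Returns an (n, n_sub) int array."""
    subs = np.split(X, books.shape[0], axis=1)
    return np.stack([_nearest(S, C) for S, C in zip(subs, books)], axis=1)

def build_tables(books_by_modal):
    """tables[i, j, b, u, v] = <centroid u of modal i, centroid v of modal j>
    in sub-space b; the storage is M^2 * |C|^2 * N_bit values."""
    stacked = np.stack(books_by_modal)  # (M, n_sub, n_centroids, d_sub)
    return np.einsum("ibud,jbvd->ijbuv", stacked, stacked)

def approx_scores(target_codes, seq_codes, tables, gamma):
    """target_codes: (M, n_sub); seq_codes: (L, M, n_sub); gamma: (M,) -> (L,)."""
    M, n_sub = target_codes.shape
    b = np.arange(n_sub)
    scores = np.zeros(seq_codes.shape[0])
    for i in range(M):
        for j in range(M):
            # n_sub lookups per (i, j) pair, i.e. M^2 * N_bit lookups per key.
            looked_up = tables[i, j][b, target_codes[i], seq_codes[:, j, :]]
            scores += gamma[i] * gamma[j] * looked_up.sum(axis=1)
    return scores
```

Summing the centroid inner products over the sub-vectors equals the inner product of the reconstructed (quantized) vectors, so the approximation error comes only from the quantization step, not from the lookup itself.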

Our proposed multi-modal product quantization strategy works well in real-world settings, achieving the highest Recall@K among the compared methods on the industrial Alipay dataset (Table 4). We also compare the time complexity of different strategies, such as cascading ANN (HNSW), locality-sensitive hashing (LSH) and our proposed multi-modal product quantization (PQ) approximation. Our multi-modal PQ method has a time complexity of $O(L\times M^{2}\times N_{bit})$: each attention calculation requires $M^{2}N_{bit}$ distance lookups of $O(1)$ each, and the final score is the sum of these distances, which is far cheaper than the exact calculation of the inner product of multiple vectors, $O(L\times M\times d)$. The two-stage cascading ANN (HNSW) method retrieves query-item pairs with two filters: the first stage retrieves $M^{2}K$ cross-modal candidates from the sequence of length $L$ ($L\rightarrow M^{2}K$), and the second stage retrieves the final top-$K$ items from the first-stage candidates ($M^{2}K\rightarrow K$). The total time complexity of the cascading ANN method is $O(M^{2}\log(L)d+M^{2}Kd)$, which is faster than our PQ strategy but may yield sub-optimal recall, as reported for multiple experiments in Figure 2.
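To make the three complexity expressions concrete, the sketch below plugs in the Alipay settings from Section 4.1 ($L=2000$, $M=4$, $N_{bit}=8$, $d=64$, $K=200$). The counts drop all constant factors, and the base-2 logarithm is an assumption, so they indicate scaling only; table lookups and multiply-adds are also different kinds of operation, so the numbers are not a wall-clock comparison. The function name `op_counts` is hypothetical.

```python
import math

def op_counts(L, M, n_bit, d, K):
    """Leading-order operation counts from the complexity expressions."""
    return {
        "exact_attention": L * M * d,                       # O(L * M * d)
        "multi_modal_pq": L * M**2 * n_bit,                 # O(L * M^2 * N_bit)
        "cascading_ann": M**2 * math.log2(L) * d + M**2 * K * d,
    }

counts = op_counts(L=2000, M=4, n_bit=8, d=64, K=200)
for name, value in counts.items():
    print(f"{name}: {value:,.0f}")
```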

4. Experiment

4.1. Experimental Settings

Dataset We evaluate our proposed SEMINAR model on three datasets: two public datasets, namely the Amazon review dataset (Movies and TV subset, https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/) and the KuaiSAR search and recommendation dataset (https://zenodo.org/records/8181109), and one industrial dataset, the Alipay short video dataset. The average length of users’ sequences is on the order of $L=2000,1000,100$ for the Alipay, KuaiSAR and Amazon datasets, respectively. The detailed statistics can be found in Table 1.

• Amazon Reviews We select the Movies and TV subset of the public Amazon reviews dataset for our experiments. The meta-information of items is also provided in the dataset. We use the image thumbnails as the inputs to the sequence of the image modality. To get the aligned query sequence, we generate a query relevant to each item from its description in the meta-information, as in (Ai et al., 2017) and (Guo et al., 2023).
• KuaiSAR (Sun et al., 2023) KuaiSAR is a real-world, public, large-scale dataset containing both search and recommendation behaviors collected from Kuaishou (https://www.kuaishou.com/en), a leading short-video app. We construct a unified sequence of query and item pairs to compare different lifelong behavior sequence models.
• Alipay Short Video The Alipay short video dataset is a real-world industrial dataset collected from the exposure and click logs of the short-video recommendation and search ranking scenarios of the Alipay app. We use the title of each short video as the input to the text modality, and its image thumbnail as the input to the image modality. Users’ search queries are collected and aligned to the corresponding viewed items.

We process the Amazon and KuaiSAR datasets as in the literature (Zhou et al., 2020) and the accompanying repository (https://github.com/RUCAIBox/CIKM2020-S3Rec). A user with $N$ actions generates $N-1$ samples: we use the first $i-1$ actions to predict whether the user will interact with the $i$-th item ($1<i\leq N$). Additionally, we apply the leave-one-out strategy, using the $(N-1)$-th action as the validation set, and the $N$-th action as the positive in the test set together with randomly sampled negatives. The remaining samples are used as the training and pretraining set. In the industrial Alipay short video dataset, exposed clicks are treated as positive samples and exposed non-clicks as negative samples. The training and validation sets are randomly split from the data of the past $[0,T-1]$ days ($T=60$), and the test set comes from the $T$-th day.
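The leave-one-out protocol for the public datasets can be sketched as below. This is an illustrative reading of the description above rather than the preprocessing code of the referenced repository; the function names and the number of sampled negatives (`n_neg`) are hypothetical, since the source does not state how many negatives are drawn.

```python
import random

def leave_one_out_split(actions):
    """actions: one user's chronologically ordered item IDs (N >= 3).

    Builds N-1 (history, target) samples, then holds out the last one for
    testing and the second-to-last one for validation.
    """
    if len(actions) < 3:
        raise ValueError("need at least 3 actions per user")
    samples = [(actions[:i], actions[i]) for i in range(1, len(actions))]
    return samples[:-2], samples[-2], samples[-1]

def sample_negatives(item_pool, seen, n_neg, seed=0):
    """Randomly sample test negatives from items the user has not interacted with."""
    rng = random.Random(seed)
    candidates = sorted(set(item_pool) - set(seen))
    return rng.sample(candidates, min(n_neg, len(candidates)))

train, valid, test = leave_one_out_split(["a", "b", "c", "d", "e"])
```

In this toy example the user has $N=5$ actions, so two samples go to training, the sample targeting the fourth action is used for validation, and the sample targeting the fifth action is the test positive.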

Table 1. Statistics of the Amazon Movies and TV, the Alipay Short Video and KuaiSAR datasets. K denotes thousand.

Dataset	User	Item	Query	U-I
Amazon Movies & TV	297 K	181 K	-	3,293 K
Alipay Short Video	35,065 K	1,132 K	51 K	62,948 K
KuaiSAR	25,877	6,890,707	453,667	19,664,885

Table 2. Results of lifelong behavior sequence modeling on the KuaiSAR dataset, the Amazon Review dataset and the Alipay short video recommendation dataset. * indicates the best performing model.

	KuaiSAR			Amazon Movies and TV			Alipay Short Video
Method	NDCG@5	NDCG@10	NDCG@50	NDCG@5	NDCG@10	NDCG@50	AUC
SIM	0.2523	0.2661	0.3293	0.3573	0.3959	0.4577	0.7382
QIN	0.2535	0.2672	0.3312	0.3650	0.4038	0.4630	0.7239
ETA	0.2642	0.2756	0.3313	0.3626	0.4008	0.4607	0.7262
TWIN	0.2558	0.2709	0.3294	0.3627	0.4017	0.4605	0.7376
SEMINAR	*0.2816	*0.2969	*0.3457	*0.3661	*0.4041	*0.4636	*0.7503
Absolute Impr.	+0.0292	+0.0308	+0.0164	+0.0088	+0.0082	+0.0059	+0.0264

Table 3. Ablation Studies of Different PSU pretraining tasks on KuaiSAR Dataset. N@K denotes NDCG@K.

Method	N@5	N@10	N@50
SEMINAR	0.2816	0.2969	0.3457
w/o pretraining	0.2564	0.2738	0.3310
w. align, w/o next-predict,q-i relev.	0.2702	0.2832	0.3420
w. next-predict, w/o align,q-i relev.	0.2675	0.2813	0.3408
w. q-i relev., w/o align, next-predict	0.2633	0.2754	0.3357

To evaluate the recall performance of different fast approximate retrieval methods, we conduct experiments on two datasets: the multi-modal embeddings of the Alipay short video dataset with sequence length $L=2,000$, and a synthetic dataset. The purpose of the synthetic dataset is to test the performance of different retrieval methods on extremely long sequences (e.g., $L=10,000$), which are not available in public datasets. The synthetic dataset consists of query, text, image, and attribute vectors drawn i.i.d. from normal distributions $N(\mu,\sigma^{2})$ with different values of the mean $\mu$ and variance $\sigma^{2}$, to imitate the various norm values of the multi-modal vectors of query-item pairs.

Table 4. Recall@K Evaluation of Approximate Retrieval Methods on Alipay Short Video Recommendation Dataset

Method	R@32	R@64	R@128	R@256
ANN (HNSW)	0.7881	0.8603	0.9288	0.9409
LSH	0.7528	0.8175	0.8721	0.9257
RQ-VAE	0.8225	0.8422	0.8633	0.8995
Multi-Modal PQ	0.9638	0.9769	0.9797	0.9874

Comparison Methods

We compared several strong lifelong sequence modeling baselines with our proposed SEMINAR model:

• SIM (Pi et al., 2020) SIM adopts cascading search unit GSU and ESU to extract the relevant behaviors of the candidate item and applies multi-head target attention to model users’ interest.
• ETA (Chen et al., 2022) Efficient Target Attention encodes query and keys as binary hash vectors using a multi-round random projection matrix. The retrieval is calculated as the Hamming distance between the target item and the items in the sequence.
• TWIN (Chang et al., 2023) Two-Stage Interest Network adopts the same relevance metric between the target behavior and historical behaviors as the target attention in two cascading stages GSU and ESU.
• QIN (Guo et al., 2023) QIN network uses the query as first trigger to retrieve top $K_{1}$ behaviors, and target item as the second trigger to retrieve top $K_{2}$ relevant items afterwards.
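The hashing-based retrieval described for the ETA baseline above can be sketched in a few lines. This is a simplified illustration, not the ETA implementation: a single Gaussian random projection followed by a sign function is assumed in place of the multi-round projection, and the function names are hypothetical.

```python
import numpy as np

def simhash(X, projection):
    """Sign bits of random projections. X: (n, d), projection: (d, n_bits)."""
    return (X @ projection) > 0

def hamming_top_k(target_code, seq_codes, k):
    """Indices of the k sequence items with the smallest Hamming distance."""
    dist = (seq_codes != target_code).sum(axis=1)
    return np.argsort(dist, kind="stable")[:k]
```

Vectors with a small angle between them agree on most sign bits, so a small Hamming distance serves as a cheap proxy for a large inner product.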

The input features to all baseline models are the same, including the query and multi-modal item features in all datasets.

For the online approximate retrieval performance, we compared our proposed multi-modal product quantization (Jégou et al., 2011) strategy with some widely adopted vector retrieval methods in the query-item multi-modal pairs retrieval setting, including:

• HNSW (Malkov and Yashunin, 2018) Hierarchical Navigable Small World graphs
• LSH Locality Sensitive Hashing
• RQ-VAE (Zeghidour et al., 2021) Residual Vector Quantization VAE (RQ-VAE) follows an encoder-decoder structure and uses the multi-stage vector quantizer to regress original inputs, with multi-scale spectral reconstruction loss as constraints.

Evaluation Metrics We use NDCG@K to evaluate the recommendation performance on the Amazon review and KuaiSAR datasets, and AUC (Area Under the Curve) to evaluate the CTR prediction performance on exposures and clicks in the Alipay short video dataset. To evaluate the performance of multi-modal query-item retrieval, we calculate the exact attention over all items in the sequence using the target query-item pair as the trigger, and regard the real top-K relevant items as ground truth. Each fast retrieval strategy is evaluated by Recall@K at different $K$ levels, which measures the fraction of the top-K relevant ground-truth items recalled by the approximation strategy.
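The two ranking metrics can be written down compactly. The sketch below assumes the common single-positive form of NDCG@K used with leave-one-out evaluation (the exact formula is not spelled out in the text), and defines Recall@K against the exact attention scores as described above; the function names are hypothetical.

```python
import numpy as np

def recall_at_k(exact_scores, approx_scores, k):
    """Fraction of the exact top-k items that the approximate ranking also returns."""
    truth = set(np.argsort(-np.asarray(exact_scores))[:k].tolist())
    retrieved = set(np.argsort(-np.asarray(approx_scores))[:k].tolist())
    return len(truth & retrieved) / len(truth)

def ndcg_at_k(ranked_items, positive_item, k):
    """NDCG@k with one positive: DCG = 1 / log2(rank + 1), ideal DCG = 1."""
    top = list(ranked_items[:k])
    if positive_item not in top:
        return 0.0
    rank = top.index(positive_item) + 1
    return float(1.0 / np.log2(rank + 1))
```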

Implementation Details We implement the baseline methods and our proposed SEMINAR model using PyTorch. The approximate retrieval baseline ANN (HNSW) and our Multi-Modal Product Quantization are implemented using the Python library faiss (Douze et al., 2024) (https://github.com/facebookresearch/faiss), and RQ-VAE (Zeghidour et al., 2021) is implemented using the Python library vector_quantize_pytorch (https://github.com/lucidrains/vector-quantize-pytorch). The code of SEMINAR is available at https://github.com/paper-submission-coder/SEMINAR, and the public datasets Amazon and KuaiSAR can be downloaded by following the instructions in the README file.

For the hyperparameter settings of the recommendation models, we set the sequence length $L$ to 2000, 1000 and 100 for the Alipay Short Video, KuaiSAR and Amazon Review datasets, respectively, and retrieve the $K=200,200,50$ most relevant items, based on users’ average interaction length in each dataset. The embeddings of the text and image channels are the outputs of the pretrained CLIP ViT-B/32 model (https://github.com/openai/CLIP) with original dimension 512, which are then linearly projected to dimension 64. The weight of the query representation $\lambda$ in Section 3.3.1 is set to 0.5 to fuse the query embedding and the multi-modal item embedding. We also compare different $\lambda$ values ($\lambda=0.1,0.3,0.5,0.7,0.9$) in the ablation study below. For the multi-head target attention, we set the number of heads to 4. The batch size is set to 256, and we use the Adam optimizer with a learning rate of 0.001. The number of pretraining epochs is set to $5,1,1$ on the KuaiSAR, Amazon and Alipay datasets respectively, and the number of training epochs is the same for all models in the comparison. The checkpoint is exported by the best NDCG on the evaluation dataset. For our multi-modal product quantization, the number of modalities $M$ is set to 4, and the original 64-dimensional dense embedding vectors are expressed as vectors of $N_{bit}=8$ integer codes. Each entry of the integer vector represents the codebook assignment of a centroid $c^{(m)}_{i}\in\{1,2,...,|\mathcal{C}_{m}|\}$. The cardinality of each dimension $|\mathcal{C}_{m}|$ is set to 512. To generate the synthetic dataset with extremely long sequences ($L=10,000$), we generate multi-modal sequence embeddings with different norm values across modalities as normally distributed variables $N(\mu,\sigma^{2})$. We set $\mu=0.25,0.5,1.0,2.0$ and $\sigma=1.0$ for the query, attribute, text and image modalities respectively. To investigate the influence of the fusion weights $\gamma_{m}$ of the multi-modal embeddings, we run experiments with equal weights $\gamma_{m}=[0.25,0.25,0.25,0.25]$ and with different weights $\gamma_{m}=[0.1,0.2,0.3,0.4]$.
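The synthetic setting can be reproduced in spirit with the following sketch, which draws the four modal channels with the stated means and $\sigma=1.0$ and computes the exact top-K ground truth under a given weight vector $\gamma_{m}$. The embedding dimension of 64 follows the text; the random seed, the function names, and the choice of target pair are assumptions made for illustration.

```python
import numpy as np

MODALS = ("query", "attribute", "text", "image")
MU = (0.25, 0.5, 1.0, 2.0)  # per-modality means, in the order of MODALS

def make_synthetic_sequence(L=10_000, d=64, sigma=1.0, seed=0):
    """Returns an (L, M, d) array; channel m is drawn i.i.d. from N(MU[m], sigma^2)."""
    rng = np.random.default_rng(seed)
    return np.stack([rng.normal(mu, sigma, size=(L, d)) for mu in MU], axis=1)

def exact_top_k(target, seq, gamma, k):
    """Exact ground truth: rank keys by the inner product of the gamma-weighted sums.

    target: (M, d), seq: (L, M, d), gamma: (M,)
    """
    fused_target = np.einsum("m,md->d", gamma, target)
    fused_seq = np.einsum("m,lmd->ld", gamma, seq)
    return np.argsort(-(fused_seq @ fused_target))[:k]
```

Larger means produce larger vector norms, so with these settings the image channel dominates the fused inner product unless the weights $\gamma_{m}$ compensate for it.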

4.2. Experimental Results

4.2.1. Lifelong User Behavior Modeling

We report the performance on datasets from multiple domains in Table 2: NDCG@K on the KuaiSAR dataset and the Movies and TV subset of the Amazon review dataset, and AUC on the Alipay short video recommendation dataset. The asterisk (*) denotes the best performance achieved in each task. SEMINAR achieves the best performance on the KuaiSAR dataset, with improvements of +0.0292, +0.0308 and +0.0164 in NDCG@K for $K=5,10,50$, and improvements of +0.0088, +0.0082 and +0.0059 in NDCG@K for $K=5,10,50$ on the Amazon dataset, compared to SIM. SEMINAR also achieves the best AUC on the Alipay short video recommendation dataset, with an improvement of up to +0.0264 over the strong SOTA baselines.

4.2.2. Multi-Modal Query-Item Pairs Approximate Retrieval

To compare multi-modal query-item approximate retrieval methods, we report Recall@K on the industrial Alipay Short Video dataset in Table 4 and on the synthetic dataset in Figure 2. On the Alipay Short Video dataset, our proposed Multi-Modal Product Quantization strategy achieves the highest Recall@K among the approximation methods for all values of $K=[32,64,128,256]$ with $L=2000$. On the synthetic dataset, the experimental groups are designed with different sequence lengths $L=2000,5000,10000$, different settings of the norm values $|\mathbb{x}^{(m)}|$ across modalities, and different settings of the weights $\gamma_{m}$ across modalities, as described in the implementation details of Section 4.1. Our multi-modal product quantization strategy achieves the best Recall@K at different $K$ levels ($K=[32,64,128,256]$), with only a few exceptions where it falls behind the cascading ANN (HNSW) method for large $K$ values under $L=2000,5000$. The greedy cascading ANN (HNSW) strategy achieves poor results at small values of $K$ (e.g., $L=10000$, Recall@32=0.2719), and its performance increases dramatically as $K$ increases to 256 ($L=10000$, Recall@256=0.7289). This aligns with our expectation that, when the score is a weighted sum over multiple vectors, the real top-K relevant pairs to the target pair are more likely to be recalled by the greedy $M^{2}$ cascading cross-modal ANN retrieval as $K$ increases.

To analyze the effect of different variables on approximate retrieval, e.g., the merging weights $\gamma_{m}$ of the modalities and the norm values $|\mathbb{x}^{(m)}|$, we plot Recall@K as $K$ increases in Figure 2. The chart shows that the norm values of the multi-modal vectors influence the overall Recall@K dramatically. Under the same sequence length $L=10000$ and $K=256$, the Recall@K of the group with different norm values is on average 0.14 lower than that of the group with the same norm values. Varied norm values across modalities therefore make it more challenging to achieve high Recall@K than equal norm values do.

4.3. Discussion and Ablation Study

4.3.1. Influence of Pretraining Epochs Number and Ablation of Pretraining Tasks

To investigate the influence of the number of pretraining epochs, including the case without pretraining, we report the NDCG@K performance of the SEMINAR model on the KuaiSAR dataset with sequence length 1000 in Figure 3. The SEMINAR model achieves its largest improvement over the no-pretraining group within the first 5 pretraining epochs, and additional pretraining epochs up to 10 contribute only marginally to the performance.

For the ablation study of the pretraining tasks of SEMINAR, we train different models on the KuaiSAR dataset: one without pretraining, and three that each keep only one pretraining task and drop the other two (e.g., with alignment but without next pair prediction and query-item relevance). The results are reported in Table 3. Compared to SEMINAR without pretraining (NDCG@5 of 0.2564), the multi-modal alignment task contributes most to the performance improvement (0.2702), followed by next pair prediction (0.2675) and query-item relevance (0.2633).

4.3.2. Query-Item Representation Fusion Weight

To investigate the influence of the weight $\lambda$ used to fuse the query and item representations, we run experiments with $\lambda=[0.1,0.3,0.5,0.7,0.9]$ for the SEMINAR model on the KuaiSAR dataset. The results are reported in Figure 4. The best performance is achieved at $\lambda=0.3$. We speculate that the optimal value of the fusion weight $\lambda$ depends on the distribution of search and recommendation behaviors in the unified sequence of query-item pairs. For example, in the KuaiSAR dataset, search actions account for 25.7% of users’ overall actions, and recommendation actions account for 74.3%, as reported in (Sun et al., 2023). The optimal value of $\lambda$ may vary across domains and datasets, which needs further investigation in future research.

5. Conclusion

In this paper, we proposed SEMINAR to model users’ lifelong behavior sequence of query and item pairs. We introduced the Pretraining Search Unit to help alleviate the issues of insufficient learning of ID embeddings in lifelong sequence and multi-modal alignment. For online fast approximate retrieval, a multi-modal product-quantization based strategy is also proposed. Extensive evaluations on multiple datasets demonstrate the effectiveness of our method.

References

Ai et al. (2017) Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and W. Bruce Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (Shinjuku, Tokyo, Japan) (SIGIR ’17). Association for Computing Machinery, New York, NY, USA, 645–654. https://doi.org/10.1145/3077136.3080813
Chang et al. (2023) Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, and Kun Gai. 2023. TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery, New York, NY, USA, 3785–3794. https://doi.org/10.1145/3580305.3599922
Chen et al. (2022) Qiwei Chen, Yue Xu, Changhua Pei, Shanshan Lv, Tao Zhuang, and Junfeng Ge. 2022. Efficient Long Sequential User Data Modeling for Click-Through Rate Prediction. arXiv:2209.12212 [cs.IR]
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. CoRR abs/2010.11929 (2020). arXiv:2010.11929 https://arxiv.org/abs/2010.11929
Douze et al. (2024) Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]
Gong et al. (2023) Yuqi Gong, Xichen Ding, Yehui Su, Kaiming Shen, Zhongyi Liu, and Guannan Zhang. 2023. An Unified Search and Recommendation Foundation Model for Cold-Start Scenario. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (Birmingham, United Kingdom) (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 4595–4601. https://doi.org/10.1145/3583780.3614657
Guo et al. (2023) Tong Guo, Xuanping Li, Haitao Yang, Xiao Liang, Yong Yuan, Jingyou Hou, Bingqing Ke, Chao Zhang, Junlin He, Shunyu Zhang, Enyun Yu, and Wenwu Ou. 2023. Query-dominant User Interest Network for Large-Scale Search Ranking. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (Birmingham, United Kingdom) (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 629–638. https://doi.org/10.1145/3583780.3615022
Hou et al. (2023) Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023. Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. arXiv:2210.12316 [cs.IR]
Jégou et al. (2011) Herve Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
Malkov and Yashunin (2018) Yu. A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv:1603.09320 [cs.DS]
Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 2685–2692. https://doi.org/10.1145/3340531.3412744
Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
Rajput et al. (2023) Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy. 2023. Recommender Systems with Generative Retrieval. arXiv:2305.05065 [cs.IR]
Si et al. (2023) Zihua Si, Zhongxiang Sun, Xiao Zhang, Jun Xu, Xiaoxue Zang, Yang Song, Kun Gai, and Ji-Rong Wen. 2023. When Search Meets Recommendation: Learning Disentangled Search Representation for Recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 1313–1323. https://doi.org/10.1145/3539618.3591786
Sun et al. (2023) Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Dewei Leng, Yanan Niu, Yang Song, Xiao Zhang, and Jun Xu. 2023. KuaiSAR: A Unified Search And Recommendation Dataset. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (Birmingham, United Kingdom) (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 5407–5411. https://doi.org/10.1145/3583780.3615123
van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6309–6318.
Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
Yao et al. (2021) Jing Yao, Zhicheng Dou, Ruobing Xie, Yanxiong Lu, Zhiping Wang, and Ji-Rong Wen. 2021. USER: A Unified Information Search and Recommendation Model based on Integrated Behavior Sequence. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 2373–2382. https://doi.org/10.1145/3459637.3482489
Zeghidour et al. (2021) Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312 [cs.SD]
Zhao (2023) Pengyu Zhao et al. 2023. M5: Multi-Modal Multi-Interest Multi-Scenario Matching for Over-the-Top Recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery, New York, NY, USA, 3785–3794. https://doi.org/10.1145/3580305.3599863
Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 1893–1902. https://doi.org/10.1145/3340531.3411954