SEMINAR: Search Enhanced Multi-modal Interest Network and Approximate Retrieval for Lifelong Sequential Recommendation

Kaiming Shen (Ant Group, Beijing, China, kaiming.skm@antgroup.com), Xichen Ding (Ant Group, Beijing, China, xichen.dxc@antgroup.com), Zixiang Zheng (Ant Group, Beijing, China, zhengzixiang.zzx@antgroup.com), Yuqi Gong (Ant Group, Beijing, China, gongyuqi.gyq@antgroup.com), Qianqian Li (Ant Group, Beijing, China, zixi.lqq@antgroup.com), Zhongyi Liu (Ant Group, Hangzhou, China, zhongyi.lzy@antgroup.com), and Guannan Zhang (Ant Group, Hangzhou, China, zgn138592@antgroup.com)
Abstract.

Modeling users’ behaviors is crucial in modern recommendation systems. Much research focuses on modeling users’ lifelong sequences, which can be extremely long and sometimes exceed thousands of items. These models use the target item to search for the most relevant items in the historical sequence. However, training on lifelong sequences for click-through rate (CTR) prediction or personalized search ranking (PSR) is extremely difficult because ID embeddings are insufficiently learned, especially when the IDs in the lifelong sequence features do not appear in the samples of the training dataset. Additionally, existing target attention mechanisms struggle to learn the multi-modal representations of items in the sequence. The distributions of the multi-modal embeddings (text, image, and attributes) of a user’s interacted items are not properly aligned, and they diverge across modalities. We also observe that users’ search query sequences and item browsing sequences can fully depict users’ intents and benefit from each other. To address these challenges, we propose a unified lifelong multi-modal sequence model called SEMINAR (Search Enhanced Multi-Modal Interest Network and Approximate Retrieval). Specifically, a network called the Pretraining Search Unit (PSU) learns the lifelong sequences of multi-modal query-item pairs in a pretraining-finetuning manner with multiple objectives, including multi-modal alignment, next query-item pair prediction, and query-item relevance prediction. After pretraining, the downstream model, which shares the same target attention structure as the PSU, restores the pretrained embeddings as initialization and fine-tunes the network. To accelerate online retrieval over multi-modal embeddings, we propose a multi-modal codebook-based product quantization strategy that approximates the exact attention calculation and significantly reduces the time complexity.


Keywords: lifelong sequence modeling, multi-modal retrieval
CCS Concepts: Information systems → Recommender systems; Information systems → Data mining

1. Introduction

User behavior modeling is extremely important in modern commercial recommendation systems, including e-commerce platforms such as Amazon, Taobao, and Alipay, and content platforms such as YouTube and TikTok. As users spend more time shopping online and watching short videos, the length of users’ historical behavior sequences has grown dramatically in recent years, from a few hundred ($10^{2}$) to more than ten thousand ($10^{4}$). Much recent research focuses on modeling users’ lifelong behaviors, such as Efficient Target Attention (ETA) (Chen et al., 2022), Two-Stage Interest Network (TWIN) (Chang et al., 2023), and Query-Dominant Interest Network (QIN) (Guo et al., 2023). These models follow a cascading two-stage paradigm. The first stage uses the target item or target search query as a trigger to retrieve the top-K relevant behaviors from the historical behaviors. The second stage uses target attention to encode the selected behaviors as the user’s interest representation. This paradigm is widely adopted in many search and recommendation tasks, such as click-through rate (CTR) prediction and personalized search ranking. The item representations in the sequence are computed from both the item ID feature and more generic attribute features. One easily neglected problem in existing lifelong behavior modeling is the insufficient learning of ID features in the lifelong sequence, such as the historical item ID and author ID. Many historical items in the lifelong sequence cannot be found in the current training dataset, which is collected from the most recent logs of exposures and clicks. After random initialization, the embeddings of these low-frequency IDs cannot be learned well from the limited dataset, which harms the accuracy of the target attention calculation.

The second problem in existing lifelong sequence modeling is that it cannot handle the multi-modal features of items in the sequence well, such as text and image features. The norms of vectors from different modalities vary if the modalities are not properly aligned in the same embedding space. Existing target attention uses the inner product of queries and keys, which may be dominated by the modality vectors with large norms. For example, the target item may retrieve only the top-K items that are visually relevant but semantically very different from it, which degrades the online performance of recommendation.
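To make the norm issue concrete, the following sketch scores two historical items against a target by summing per-modality inner products. All vectors are made up for illustration and do not come from the paper: the image vectors are given a norm ten times larger than the text vectors, and the helper names are hypothetical. The sketch uses plain Python lists to stay dependency-free.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale v to unit length (assumes v is non-zero)."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

# Hypothetical embeddings: text vectors have norm 1, image vectors norm 10.
target = {"text": [1.0, 0.0], "image": [10.0, 0.0]}
history = {
    # Same meaning as the target, but looks different.
    "semantic_match": {"text": [1.0, 0.0], "image": [0.0, 10.0]},
    # Looks like the target, but means something else.
    "visual_match": {"text": [0.0, 1.0], "image": [10.0, 0.0]},
}

def raw_score(a, b):
    """Sum of per-modality inner products on the unnormalized vectors."""
    return sum(dot(a[m], b[m]) for m in a)

def normalized_score(a, b):
    """Same sum, but each modality is first scaled to unit length."""
    return sum(dot(unit(a[m]), unit(b[m])) for m in a)

raw = {name: raw_score(target, item) for name, item in history.items()}
normalized = {name: normalized_score(target, item) for name, item in history.items()}
```

With the raw vectors, the visually similar item scores 100 against 1 for the semantically similar one, so a top-K retrieval would be decided by the image channel alone. After normalizing each modality, both items score 1, and neither modality overrides the other.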

To tackle these problems, we propose a new model called Search Enhanced Multi-Modal Interest Network and Approximate Retrieval (SEMINAR) to model users’ lifelong historical multi-modal behaviors. The users’ historical behaviors are heterogeneous and include both the sequence of browsed items and the sequence of search queries. We align a user’s search query sequence with the browsing item sequence to form a unified sequence of query-item pairs, which can be retrieved flexibly by the target item or target search query in both the CTR prediction task and the Personalized Search Ranking (PSR) task. SEMINAR introduces a Pretraining Search Unit (PSU) network to learn the lifelong behavior sequence of historical multi-modal query-item pairs. The PSU uses multiple pretraining tasks designed to address the insufficient learning of historical ID features and the multi-modal alignment problem. In downstream tasks, the target attention module restores the learned item representations from the PSU, using the pretrained ID embeddings as initialization, and applies a projection weight matrix to obtain the transformed representation of the behavior sequence. During online serving, calculating exact attention using the inner product of multi-modal vectors in the lifelong sequence has a time complexity of $O(L\times M\times d)$, which is time-consuming. Here $L$ denotes the sequence length, $M$ denotes the number of modalities, and $d$ denotes the embedding dimension. Different from existing approximate retrieval methods, such as Locality-Sensitive Hashing (LSH) and Hierarchical Navigable Small World (HNSW), we apply Product Quantization in a multi-modal setting: we express the multi-modal item representations as discrete integer codes using the quantization codebooks, and sum the inner products of centroids over the sub-vectors to approximate the exact attention calculation. During online serving, the attention calculation then reduces to lookups in a pre-computed distance table followed by summation, which can be conducted efficiently.
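One way to read the lookup-and-sum description above is sketched below in plain Python. It assumes that both the target and each historical behavior have already been encoded as centroid ids, one per sub-vector, and that the inner products between centroids are pre-computed offline. The codebook values and the function names `build_tables` and `approx_score` are hypothetical, and the paper's multi-modal variant may differ in detail.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_tables(codebooks):
    """Pre-compute tables[i][a][b] = <centroid a, centroid b> in sub-space i."""
    return [[[dot(ea, eb) for eb in book] for ea in book] for book in codebooks]

def approx_score(target_codes, item_codes, tables):
    """Approximate an inner product with one lookup per sub-space, then a sum."""
    return sum(
        tables[i][a][b]
        for i, (a, b) in enumerate(zip(target_codes, item_codes))
    )

# Toy codebooks: 2 sub-spaces with 2 centroids each (values are made up).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # sub-space 0
    [[1.0, 0.0], [0.0, 1.0]],   # sub-space 1
]
tables = build_tables(codebooks)            # done once, offline

target_codes = [1, 0]                       # quantized target
history_codes = [[1, 0], [0, 1], [1, 1]]    # quantized historical behaviors
scores = [approx_score(target_codes, codes, tables) for codes in history_codes]
```

In this sketch the per-behavior cost at serving time is one table lookup per sub-space instead of $d$ multiplications, at the price of the quantization error introduced when vectors are replaced by centroids.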

In summary, the main contributions of our work are as follows:

  • We identify the insufficient learning problem of ID features in lifelong behavior modeling and observe that the target attention calculation is dominated by multi-modal features with large norm values. We propose the SEMINAR framework, which includes the Pretraining Search Unit to alleviate both the insufficient learning of ID embeddings and the multi-modal alignment problem.

  • We apply a product quantization approximation strategy in a multi-modal setting, which reduces the time complexity of retrieving from historical behaviors with the target query-item pair during online serving.

  • We conduct extensive experiments on real-world datasets to demonstrate the effectiveness of our proposed model. We also release the code of SEMINAR at https://github.com/paper-submission-coder/SEMINAR to encourage further research.

2. Related Work

Figure 1. Illustration of the SEMINAR model architecture. $S_{i}$ denotes the $i$-th behavior (a query-item pair) in the lifelong sequence. Each behavior has multiple channels: the query and the multi-modal item features of text, image, and attributes. PSU denotes the Pretraining Search Unit. GSU and ESU denote the General Search Unit and the Exact Search Unit, respectively, which form the two-stage paradigm.

2.1. Long-Term Lifelong User Behavior Modeling

Long-term lifelong user behavior modeling has attracted much research attention in recent years. Typical works include SIM (Pi et al., 2020), ETA (Chen et al., 2022), TWIN (Chang et al., 2023), and QIN (Guo et al., 2023). SIM (Pi et al., 2020) introduces the General Search Unit (GSU) to retrieve the top-K most relevant items from historical behaviors using the target item as a trigger, and the Exact Search Unit (ESU) to calculate the multi-head target attention (MHTA). ETA (Chen et al., 2022) uses a set of hash functions to express the item representation as a binary hash embedding and calculates the Hamming distance to approximate the inner product. TWIN (Chang et al., 2023) introduces CP-GSU, a consistency-preserved lifelong user behavior modeling module, to increase the consistency of the relevance calculation between the two cascading stages. QIN (Guo et al., 2023) uses the search query as a trigger to retrieve the most relevant items from the historical behaviors in the first stage of the cascading model for personalized search ranking. Different from existing work, we propose the Pretraining Search Unit (PSU) to alleviate the insufficient learning of ID features and the multi-modal alignment problem in attention calculation. Furthermore, there is an increasing trend of modeling search and recommendation tasks jointly in a unified framework, such as USER (Yao et al., 2021), SESRec (Si et al., 2023), and S&R Foundation (Gong et al., 2023). To model lifelong behaviors, we align the historical search query sequence and browsing item sequence as a unified sequence of query-item pairs, which can be applied to both CTR prediction in recommendation and personalized search ranking.
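The cascading paradigm shared by these models can be sketched as follows. This is a simplified single-head illustration in plain Python with toy vectors. The function names `general_search_unit` and `exact_search_unit` are hypothetical, and the sketch leaves out the learned projections, the hashing used by ETA, and the consistency mechanism of TWIN.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def general_search_unit(target, behaviors, k):
    """Stage 1: indices of the top-k behaviors by inner product with the target."""
    scores = [dot(target, b) for b in behaviors]
    ranked = sorted(range(len(behaviors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def exact_search_unit(target, behaviors, indices):
    """Stage 2: softmax target attention over the retrieved behaviors only."""
    d = len(target)
    logits = [dot(target, behaviors[i]) / math.sqrt(d) for i in indices]
    peak = max(logits)                      # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted sum of the selected behaviors.
    interest = [
        sum(w * behaviors[i][j] for w, i in zip(weights, indices))
        for j in range(d)
    ]
    return weights, interest

# Toy history of four 2-dimensional behavior embeddings.
behaviors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
target = [1.0, 0.0]

top_k = general_search_unit(target, behaviors, k=2)
weights, interest = exact_search_unit(target, behaviors, top_k)
```

Only the behaviors selected in the first stage reach the attention step, which is why the quality of the first-stage relevance score matters for the final interest representation.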

2.2. Multi-Modal Alignment in Recommendation and Item Quantization

Multi-modal alignment is a widely studied topic that aligns multi-modal features, such as text and image, in a unified embedding space through contrastive learning. Typical works include CLIP (Radford et al., 2021). Some researchers have focused on modeling multi-modal user sequences in recommendation. M5 (Zhao, 2023) applies a multi-modal embedding layer to extract both ID embeddings of the show ID and content-graph embeddings initialized from a meta-path pretrained model. To improve the generalization of ID embeddings, some works express item representations as quantized vectors of discrete codes, including Product Quantization (Jégou et al., 2011), VQ-VAE (van den Oord et al., 2017), and RQ-VAE (Zeghidour et al., 2021). VQ-Rec (Hou et al., 2023) encodes text as discrete codes using product quantization and uses a Transformer to learn from cross-domain data in recommendation. TIGER (Rajput et al., 2023) learns semantic IDs from content information and learns RQ-VAE (Zeghidour et al., 2021) representations for generative retrieval.

Much research focuses on fast retrieval of relevant items from an embedding database. Common methods include approximate nearest neighbor (ANN) search using HNSW (Malkov and Yashunin, 2018) and Product Quantization (Jégou et al., 2011). Product quantization is a technique that transforms a $D$-dimensional vector into a low-dimensional $N_{bit}$-length integer vector of centroid ids in the codebook. It first splits a vector $\mathbb{x}\in\mathbb{R}^{D}$ into $N_{bit}$ sub-vectors and applies a quantization function $q(x)$ to assign each sub-vector $\mathbb{x}_{i}$ to the nearest centroid $c_{i}$ from a codebook $\mathcal{C}$, as $\mathbb{x}=[\mathbb{x}_{i}]_{1:N_{bit}}\rightarrow[q(\mathbb{x}_{i})]_{1:N_{bit}}=[c_{i}]_{1:N_{bit}}$. The quantization function is $q(\mathbb{x}_{i})=\arg\min_{c_{i}\in\mathcal{C}}d(\mathbb{x}_{i},e_{c_{i}})$, where $e_{c_{i}}$ denotes the centroid embedding of the $c_{i}$-th centroid.
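The quantization function can be sketched in a few lines of plain Python. The sketch makes three assumptions that the text does not fix: the distance $d$ is the squared Euclidean distance, each sub-vector position has its own codebook (as in standard product quantization), and the codebooks are given (in practice they are usually learned, for example with k-means). The toy centroid values and the function names are hypothetical.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def split(x, n_sub):
    """Split x into n_sub contiguous sub-vectors of equal length."""
    size = len(x) // n_sub
    return [x[i * size:(i + 1) * size] for i in range(n_sub)]

def quantize(x, codebooks):
    """Map each sub-vector x_i to the id c_i of its nearest centroid."""
    subs = split(x, len(codebooks))
    return [
        min(range(len(book)), key=lambda c: squared_distance(sub, book[c]))
        for sub, book in zip(subs, codebooks)
    ]

def reconstruct(codes, codebooks):
    """Concatenate the centroid embeddings e_{c_i} selected by the codes."""
    return [value for c, book in zip(codes, codebooks) for value in book[c]]

# Toy codebooks: 2 sub-spaces with 2 centroids each (values are made up).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
x = [0.9, 1.1, 0.2, 0.8]
codes = quantize(x, codebooks)
x_hat = reconstruct(codes, codebooks)
```

Here the four-dimensional vector is stored as two small integers, and `x_hat` is the centroid-based approximation that stands in for `x` in later inner-product calculations.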

3. Proposed Model

3.1. Problem Formulation

We can split the sequence of users’ behaviors into several heterogeneous sub-sequences, including the sequence of search queries $\mathcal{Q}=\{q_{1},q_{2},\ldots,q_{|\mathcal{Q}|}\}$, which express explicit intents, and the sequence of browsed recommended items $\mathcal{B}=\{i_{1},i_{2},\ldots,i_{|\mathcal{B}|}\}$. For search behaviors, users input a query $q\in\mathcal{Q}$ and interact (click or view) with a few items related to the query, resulting in the aligned sequence of query-item pairs $(q_{l},i_{l})$. For the behavior of browsing recommended items, users browse a sequence of items without explicit search intent, and we pad an empty search query $q=\emptyset$ to each item to obtain the query-item pair $(q_{l}=\emptyset,i_{l})$. Finally, we construct a unified sequence of aligned query-item pairs in chronological order with length $L$, denoted as $\{(q_{l},i_{l})\}_{l=1:L}$. In some recommendation scenarios, such as short video recommendation on YouTube and TikTok, each item has multi-modal features such as text (the title of the video), image, and attributes (authors and categories). We further split the sequence of browsed items $\mathcal{B}$ into $M$ multi-modal sub-sequences, including a sequence of text features $\mathcal{T}=\{T_{1},T_{2},\ldots,T_{L}\}$, a sequence of image features $\mathcal{I}=\{I_{1},I_{2},\ldots,I_{L}\}$, and a sequence of attribute features $\mathcal{A}=\{A_{1},A_{2},\ldots,A_{L}\}$. Finally, we let $[\mathcal{Q},\mathcal{T},\mathcal{I},\mathcal{A}]\in\mathbb{R}^{(M+1)\times L\times d}$ denote the input sequence of multi-modal query-item pairs to the SEMINAR model, where $d$ denotes the dimension of the aligned representations.
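The construction of the unified sequence can be illustrated with a small sketch. The log formats, the `Behavior` record, and the truncation to the most recent `max_len` pairs are assumptions made for this example, and `None` stands in for the empty query $\emptyset$.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Behavior:
    timestamp: int
    item_id: str
    query: Optional[str] = None   # None plays the role of the empty query

def build_unified_sequence(
    search_logs: List[Tuple[int, str, str]],   # (timestamp, query, item_id)
    browse_logs: List[Tuple[int, str]],        # (timestamp, item_id)
    max_len: int,
) -> List[Behavior]:
    """Merge search and browse behaviors into one chronological sequence."""
    pairs = [Behavior(ts, item, query) for ts, query, item in search_logs]
    # Browsed items carry no explicit intent, so they are padded with an empty query.
    pairs += [Behavior(ts, item, None) for ts, item in browse_logs]
    pairs.sort(key=lambda b: b.timestamp)
    return pairs[-max_len:]                    # keep the most recent max_len pairs

search_logs = [(2, "running shoes", "item_17"), (5, "rain jacket", "item_42")]
browse_logs = [(1, "item_03"), (3, "item_17"), (4, "item_88")]
sequence = build_unified_sequence(search_logs, browse_logs, max_len=4)
```

In this toy input the same item appears once after a search and once while browsing, and the two occurrences stay distinct pairs because their queries differ.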

3.2. Aligned Lifelong Sequence of Multi-Modal Query-Item Pairs

The aligned sequence of multi-modal query-item pairs passes through the embedding layers. We let $[\mathbb{x}_{l}=(\mathbb{x}^{query}_{l},\mathbb{x}^{item}_{l})]_{l=1:L}$ denote the historical sequence of query-item pairs, where $\mathbb{x}^{query}_{l}\in\mathbb{R}^{d}$ and $\mathbb{x}^{item}_{l}=(\mathbb{x}^{text}_{l},\mathbb{x}^{image}_{l},\mathbb{x}^{attributes}_{l})\in\mathbb{R}^{M\times d}$. In CTR prediction, target attention (TA) is a structure that uses the target item to retrieve the most relevant items from the sequence of historical behaviors. We extend TA from the target item to the target query-item pair, which retrieves the top $K$ most relevant pairs from the historical sequence. We denote the target query-item pair as $\mathbb{x}_{t}=(\mathbb{x}^{query}_{t},\mathbb{x}^{text}_{t},\mathbb{x}^{image}_{t},\mathbb{x}^{attributes}_{t})$.

3.3. SEMINAR Model Architecture

Our proposed model SEMINAR, shown in Figure 1, introduces a new network, the Pretraining Search Unit (PSU), which is pretrained on a dataset of lifelong sequences of multi-modal query-item pairs. Section 3.3.1 introduces the PSU and the corresponding pretraining tasks. Section 3.3.2 introduces how the recommendation model restores the pretrained query and item representations from the PSU as initialization and applies a projection matrix to obtain the transformed representation of the sequence. The top-K relevant pairs are retrieved by the target pair and participate in the multi-head target attention (MHTA) calculation. Section 3.3.3 introduces the multi-modal product quantization approximation.

3.3.1. Pretraining Search Unit

The input to the PSU is the aligned sequence of query-item pairs $[\mathbb{x}_{l}]_{l=1:L}$. $L$ denotes the length of the aligned sequence and $\mathbb{x}_{l}=(\mathbb{x}^{query}_{l},\mathbb{x}^{text}_{l},\mathbb{x}^{image}_{l},\mathbb{x}^{attributes}_{l})$ represents the $l$-th behavior in the sequence, which consists of the query embedding and the multi-modal item embeddings. The query $q\in\mathcal{Q}$ passes through the query feature encoder $f(\cdot)$, resulting in $Q=f(\mathbb{x}^{query}_{1:L})\in\mathbb{R}^{L\times d_{query}}$. Following the multi-modal alignment literature such as CLIP (Radford et al., 2021), we use a Transformer (Vaswani et al., 2023) to encode the text features as $T=\text{Encoder}_{text}(\mathbb{x}^{text}_{1:L})\in\mathbb{R}^{L\times d_{text}}$ and ViT (Dosovitskiy et al., 2020) to encode the image features as $I=\text{Encoder}_{image}(\mathbb{x}^{image}_{1:L})\in\mathbb{R}^{L\times d_{image}}$. Additionally, we encode the attribute features using the function $g(\cdot)$. The result $A=g(\mathbb{x}^{attributes}_{1:L})\in\mathbb{R}^{L\times d_{attribute}}$ is treated as one channel of the sub-sequences and participates in the multi-modal alignment of the item sequence. To project the representations of the different channels to the same dimension $d$, we further multiply them by the linear weight matrices $\{W_{q},W_{t},W_{i},W_{a}\}$ and obtain the stacked input sequence of multi-modal query-item pairs as $\mathbb{x}=[QW_{q},TW_{t},IW_{i},AW_{a}]\in\mathbb{R}^{(M+1)\times L\times d}$.
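The projection and stacking step can be sketched as follows, using nested Python lists in place of tensors. The encoder outputs and weight matrices below are hand-written toy values, not learned parameters, and the per-channel widths (3, 2, 4, and 1) are arbitrary choices for illustration with $L=2$ and $d=2$.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x d) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def project_and_stack(channels, weights):
    """Project each channel (L x d_m) with its own matrix (d_m x d) and stack."""
    return [matmul(channels[name], weights[name]) for name in channels]

# Toy encoder outputs with L = 2 and different widths per channel.
channels = {
    "Q": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],            # L x d_query
    "T": [[1.0, 2.0], [3.0, 4.0]],                      # L x d_text
    "I": [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]],  # L x d_image
    "A": [[2.0], [3.0]],                                # L x d_attribute
}
# Toy projection matrices mapping every channel to d = 2.
weights = {
    "Q": [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
    "T": [[1.0, 0.0], [0.0, 1.0]],
    "I": [[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]],
    "A": [[1.0, -1.0]],
}

stacked = project_and_stack(channels, weights)
shape = (len(stacked), len(stacked[0]), len(stacked[0][0]))   # (M+1, L, d)
```

After projection every channel has the same width, so the four channels can be stacked into a single $(M+1)\times L\times d$ input with $M=3$.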

Next Pair Prediction and Multi-Head Target Attention

The intuition behind the PSU is to design a pretraining network that learns from the lifelong behavior sequence and shares the same multi-head target attention structure as the cascading two-stage downstream model, such as ETA (Chen et al., 2022) and TWIN (Chang et al., 2023). The downstream model restores the pretrained query and item embeddings as the initialization of its parameters and fine-tunes the network. Different from the masked language model (MLM) in BERT (Devlin et al., 2019), which uses tokens from the context window to predict the masked token, we use next-pair prediction as a pretraining task to predict the correct last query-item pair. We intentionally leave out the last query-item pair in the sequence, $\mathbb{x}_{L}=[\mathbb{x}^{(m)}_{L}],m\in[Q,T,I,A]$, and treat it as the target query-item pair that retrieves from the previous $(L-1)$ behaviors using multi-head target attention. To pad the sequence length from $L-1$ back to $L$, we append a special token $<EOS>$ to the end of the previous $L-1$ items in the sequence $\mathbb{x}_{1:L-1}$. The next query-item pair prediction task is formulated as a classification task, $y=p(\mathbb{x}^{1:M+1}_{L}|\mathbb{x}^{1:M+1}_{1:L-1};\mathbb{x}^{<EOS>})$, with the loss $\mathcal{L}^{pair}_{next}$. A positive label is assigned to the correct last pair, and negative labels are assigned to negatively sampled query-item pairs.
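The way training examples for next-pair prediction might be assembled is sketched below. Pairs are represented as `(query, item_id)` tuples with an empty string for the empty query. Uniform sampling of negatives from a candidate pool is an assumption of this sketch, since the text does not specify the negative sampling strategy, and the function name is hypothetical.

```python
import random

EOS = "<EOS>"

def make_next_pair_examples(sequence, candidate_pool, num_negatives, rng):
    """Hold out the last pair as the target and sample negatives from the pool."""
    *history, positive = sequence
    context = history + [EOS]            # pad from L-1 back to length L
    pool = [pair for pair in candidate_pool if pair != positive]
    negatives = rng.sample(pool, num_negatives)
    examples = [(context, positive, 1)]  # label 1 for the true last pair
    examples += [(context, negative, 0) for negative in negatives]
    return examples

sequence = [("", "item_a"), ("shoes", "item_b"), ("", "item_c")]
candidate_pool = [("", "item_c"), ("hat", "item_d"), ("", "item_e"), ("shoes", "item_f")]
examples = make_next_pair_examples(
    sequence, candidate_pool, num_negatives=2, rng=random.Random(0)
)
```

Each example keeps the same padded context and differs only in the candidate pair and its label, which matches the classification view of the task.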

To better represent the historical behaviors and target query-item pair, we need to fuse the query and multi-modal item representations into a single vector as:

$$\mathbb{x}=\lambda\mathbb{x}^{query}+(1-\lambda)\sum_{i}w_{i}{\mathbb{x}^{item}}^{(i)}=\sum_{m\in M+1}\gamma_{m}\mathbb{x}^{(m)}$$

$\lambda$ and $(1-\lambda)$ denote the weights used to merge the query and item representations, respectively, with $\lambda\in[0,1]$, and $w_{i}$ denotes the weight used to merge the multi-modal item representations. To simplify the notation, we use a single vector $[\gamma_{m}]^{1:M+1}\in\mathbb{R}^{M+1}$ to represent the weights of all $(M+1)$ channels, and the weights sum to 1, i.e. $\sum\gamma_{m}=1$. The weight vector $\gamma_{m}$ can be learned dynamically as the softmax output of a gating network. The attention is calculated as the inner product of queries and keys of the merged multi-channel representations. The final attention score will be dominated by the modalities with large norm values $|x^{(m)}|$ and large weights $\gamma_{m}$, and the information from the other modalities will be easily ignored. We therefore decompose the attention score calculation into the norm value part $|x^{(m)}|$ and the unit vector part $\hat{x}^{(m)}$.
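The fusion step can be illustrated with a small sketch. It assumes the gating network has already produced one logit per channel (the logits and the toy channel vectors below are hypothetical), and it only shows the softmax weighting and the weighted sum $\sum_{m}\gamma_{m}\mathbb{x}^{(m)}$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(channels, gate_logits):
    """Return (sum_m gamma_m * x^(m), gamma) with gamma = softmax(gate_logits)."""
    gamma = softmax(gate_logits)
    dim = len(channels[0])
    fused = [sum(g * x[k] for g, x in zip(gamma, channels)) for k in range(dim)]
    return fused, gamma


# Four channels [Q, T, I, A], each a d=3 vector (toy values).
channels = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
gate_logits = [0.0, 0.0, 0.0, 0.0]  # hypothetical gating-network output
fused, gamma = fuse_channels(channels, gate_logits)
```

With equal logits every channel receives weight 0.25, so the channel with the largest norm (here the image channel) contributes the largest component, which is the dominance effect the decomposition is meant to expose.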

We let $q_{t}$ denote the representation of the target query-item pair as $q_{t}=\sum_{i}\gamma_{i}\mathbb{x}^{(i)}_{t}=\sum_{i}\gamma_{i}|\mathbb{x}^{(i)}_{t}|\hat{\mathbb{x}}^{(i)}_{t}$. Note that $|\mathbb{x}^{(i)}_{t}|$ denotes the norm value of the $i$-th channel of the target item and $\hat{\mathbb{x}}^{(i)}_{t}$ is a unit vector.
Similarly, we can express the $l$-th historical behavior $k_{l}\in K$ as $k_{l}=\sum_{j}\gamma_{j}\mathbb{x}^{(j)}_{l}=\sum_{j}\gamma_{j}|\mathbb{x}^{(j)}_{l}|\hat{\mathbb{x}}^{(j)}_{l}$. Note that the unit vectors of the multi-modal sequence representations will participate in the multi-modal alignment task in the next section.

The $h$-th head in the multi-head attention is represented as $head^{PSU_{h}}=\text{Attention}_{h}(q_{t},K^{PSU},V^{PSU})$, and the attention score $\alpha^{PSU}_{h}$ is calculated as the inner product of the $d$-dimensional query and key vectors multiplied by a scaling factor $\frac{1}{\sqrt{d}}$.

$$
\begin{aligned}
\alpha_{h}^{PSU} &= \frac{(q_{t}W^{PSU_{Q}}_{h})(K^{PSU}W^{PSU_{K}}_{h})^{T}}{\sqrt{d}} \\
&= \frac{1}{\sqrt{d}}\Big[\sum_{i}\sum_{j}\gamma_{ij}(\hat{\mathbb{x}}^{PSU(i)}_{t}W^{PSU_{Q}}_{h})(\hat{\mathbb{x}}^{PSU(j)}_{l}W^{PSU_{K}}_{h})^{T}\Big]^{L}_{l=1} \\
\gamma_{ij} &= \gamma_{i}\gamma_{j}|\mathbb{x}^{PSU(i)}_{t}||\mathbb{x}^{PSU(j)}_{l}|
\end{aligned}
$$

In this formulation, $[\mathbb{x}^{PSU(1:M+1)}_{l}]^{L}_{l=1}\in\mathbb{R}^{L\times(M+1)\times d}$ denotes the multi-modal embedding of the items in the sequence of PSU, with $\mathbb{x}^{PSU(1:M+1)}=[Q,T,I,A]$, and $K^{PSU}=[\sum_{i}\gamma_{i}\mathbb{x}^{PSU(i)}_{l}]^{L}_{l=1}\in\mathbb{R}^{L\times d}$ denotes the merged representations of the input sequence.
$W^{PSU_{Q}}_{h}\in\mathbb{R}^{d\times d}$ and $W^{PSU_{K}}_{h}\in\mathbb{R}^{d\times d}$ denote the projection weight matrices of the query and keys in the $h$-th head, and $\gamma_{ij}$ denotes the weight of the cross-modal interaction between the unit query vector and the unit key vector in the sequence. $\gamma_{ij}$ equals the scalar product of $\gamma_{i}$, $\gamma_{j}$, the norm value of the query vector $|\mathbb{x}^{PSU(i)}_{t}|$ and the norm value of the key vector $|\mathbb{x}^{PSU(j)}_{l}|$.
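The decomposition can be checked numerically for a single target and a single key. The sketch below is illustrative only: the two-channel toy vectors, the weights `gamma` and the projection matrices `WQ` and `WK` are made-up values, and the $\frac{1}{\sqrt{d}}$ scaling is applied to both forms. It compares the score computed from the merged vectors $q_{t}$ and $k_{l}$ with the sum over channel pairs $(i,j)$ of $\gamma_{ij}$ times the projected unit-vector inner product.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def matvec(x, W):
    """Row vector x (length d) times matrix W (d x d)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]


def merged_score(gamma, target, key, WQ, WK):
    """Scaled score computed from the merged vectors q_t and k_l."""
    d = len(target[0])
    q = [sum(g * x[k] for g, x in zip(gamma, target)) for k in range(d)]
    kl = [sum(g * x[k] for g, x in zip(gamma, key)) for k in range(d)]
    return dot(matvec(q, WQ), matvec(kl, WK)) / math.sqrt(d)


def decomposed_score(gamma, target, key, WQ, WK):
    """Same score as a gamma_ij-weighted sum of unit-vector interactions."""
    d = len(target[0])
    total = 0.0
    for gi, xi in zip(gamma, target):
        for gj, xj in zip(gamma, key):
            ni, nj = norm(xi), norm(xj)  # assumes non-zero channel vectors
            gamma_ij = gi * gj * ni * nj
            ui = [v / ni for v in xi]
            uj = [v / nj for v in xj]
            total += gamma_ij * dot(matvec(ui, WQ), matvec(uj, WK))
    return total / math.sqrt(d)


# Two channels with d=2 (toy values).
gamma = [0.7, 0.3]
target = [[3.0, 4.0], [1.0, 0.0]]
key = [[0.0, 2.0], [1.0, 1.0]]
WQ = [[1.0, 2.0], [0.0, 1.0]]
WK = [[1.0, 0.0], [1.0, 1.0]]
merged = merged_score(gamma, target, key, WQ, WK)
decomposed = decomposed_score(gamma, target, key, WQ, WK)
```

The two values agree up to floating-point error because the projection is linear. The decomposed form makes explicit that the norms enter only through $\gamma_{ij}$.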

Multi-Modal Alignment and Query-Item Relevance

Multi-modal alignment is a crucial task, which learns the multi-modal representations in the same embedding space. Typical alignment models, such as CLIP (Radford et al., 2021), maximize the cosine similarity of the $N$ correct (text-image) pairs and minimize the cosine similarity of the $N^{2}-N$ mismatched pairs. We simultaneously train multiple multi-modal alignment tasks, including text-image, image-attributes and text-attributes, with the cross-entropy loss over $N$ pairs.

$$\mathcal{L}_{align}=\sum_{i\in M}\sum_{j\in M,j\neq i}\mathcal{L}_{\text{CLIP}}(\hat{\mathbb{x}}^{(i)}_{1:L},\hat{\mathbb{x}}^{(j)}_{1:L}),\quad i,j\in\{T,I,A\}$$

The sequence length $L$ is usually large and the alignment has a complexity of $O(L^{2})$. To reduce the complexity, we further split the sequence into $N_{ch}$ chunks. Each chunk is a sub-sequence of length $L_{sub}=\frac{L}{N_{ch}}$. The alignment loss is the sum of the losses within chunks, $\mathcal{L}_{\text{CLIP}}(\hat{\mathbb{x}}^{(i)}_{1:L},\hat{\mathbb{x}}^{(j)}_{1:L})=\sum_{k\in N_{ch}}\mathcal{L}_{\text{CLIP}}(\hat{\mathbb{x}}^{(i)}_{L_{k}:L_{k+1}},\hat{\mathbb{x}}^{(j)}_{L_{k}:L_{k+1}})$, with the complexity reduced to $O(L^{2}/N_{ch})$.
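The chunking idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `clip_loss` is a plain symmetric cross-entropy over cosine-similarity logits without the learnable temperature used in CLIP, the inputs are assumed to be unit vectors already, and $L$ is assumed to be divisible by $N_{ch}$. The helper `pair_count` is hypothetical and only counts the similarity entries that are evaluated.

```python
import math


def clip_loss(a, b):
    """Symmetric cross-entropy over the n x n similarity matrix of n aligned pairs."""
    n = len(a)
    sim = [[sum(x * y for x, y in zip(a[r], b[c])) for c in range(n)] for r in range(n)]

    def ce_rows(matrix):
        # Row r should put its probability mass on the diagonal entry (r, r).
        loss = 0.0
        for r in range(n):
            log_z = math.log(sum(math.exp(v) for v in matrix[r]))
            loss += log_z - matrix[r][r]
        return loss / n

    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (ce_rows(sim) + ce_rows(sim_t))


def chunked_clip_loss(a, b, num_chunks):
    """Sum of per-chunk losses; assumes len(a) is divisible by num_chunks."""
    sub = len(a) // num_chunks
    return sum(clip_loss(a[s:s + sub], b[s:s + sub]) for s in range(0, len(a), sub))


def pair_count(length, num_chunks):
    """Number of similarity entries evaluated: N_ch * (L / N_ch)^2 = L^2 / N_ch."""
    sub = length // num_chunks
    return num_chunks * sub * sub


# Four aligned unit vectors per modality (toy values).
text = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
full = chunked_clip_loss(text, image, 1)
chunked = chunked_clip_loss(text, image, 2)
```

For $L=8$, `pair_count(8, 1)` evaluates 64 entries while `pair_count(8, 4)` evaluates 16, matching the $O(L^{2}/N_{ch})$ reduction. The trade-off is that negatives are only drawn from within the same chunk.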

Additionally, query-item relevance prediction is a typical search task, usually modelled as binary classification that distinguishes the correct query-item pair from irrelevant query-item pairs. Each pair of query and item is represented as $[\mathbb{x}^{query}_{l};\mathbb{x}^{item}_{l}=\sum_{m\in M}\gamma_{m}\mathbb{x}^{item^{(m)}}_{l}]$.
The loss for the query-item relevance binary classification task is $\mathcal{L}_{query-item}=\sum\mathcal{L}_{ce}(y^{qi}_{l};\mathbb{x}^{query}_{l},\sum_{m\in M}\gamma_{m}\mathbb{x}^{item^{(m)}}_{l})$, where $y^{qi}_{l}$ denotes the relevance label of the $l$-th pair in the sequence. A positive label is assigned to the correct query-item pair and a negative label is assigned to a randomly sampled irrelevant query-item pair.
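A minimal sketch of the relevance loss is given below. The scoring head is an assumption: the text only states that the pair is represented by the concatenation $[\mathbb{x}^{query}_{l};\mathbb{x}^{item}_{l}]$ and trained with cross entropy, so the single linear layer with `weights` and `bias` followed by a sigmoid is a hypothetical stand-in for whatever classifier is actually used.

```python
import math


def bce(label, prob, eps=1e-12):
    """Binary cross entropy for one example."""
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def relevance_loss(pairs, labels, weights, bias=0.0):
    """Sum of cross-entropy losses over (query_vec, fused_item_vec) pairs."""
    total = 0.0
    for (query_vec, item_vec), y in zip(pairs, labels):
        features = query_vec + item_vec  # concatenation [x_query ; x_item]
        # Hypothetical linear scoring head; the real classifier is not specified.
        logit = sum(w * f for w, f in zip(weights, features)) + bias
        prob = 1.0 / (1.0 + math.exp(-logit))
        total += bce(y, prob)
    return total


# One positive pair and one randomly sampled negative pair (toy values).
pairs = [([1.0, 0.0], [0.5, 0.5]), ([0.0, 1.0], [1.0, -1.0])]
labels = [1, 0]
untrained = relevance_loss(pairs, labels, weights=[0.0, 0.0, 0.0, 0.0])
```

With all-zero weights every pair is scored at probability 0.5, so the loss is $2\ln 2$, the value an uninformed classifier would obtain on two examples.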

Loss of Pretraining Search Unit

The objective of the Pretraining Search Unit (PSU) consists of three parts: the next query-item pair prediction loss $\mathcal{L}^{pair}_{next}$, the multi-modal alignment loss $\mathcal{L}_{align}$ and the query-item relevance prediction loss $\mathcal{L}_{query-item}$, i.e. $\mathcal{L}_{PSU}=\mathcal{L}^{pair}_{next}+\mathcal{L}_{align}+\mathcal{L}_{query-item}$.

3.3.2. Fine-tuning the projection weight

Existing lifelong sequence modeling methods follow a cascading two-stage paradigm. In the first stage, the target item or query is used as a trigger to retrieve the most relevant top-K items from the user's long behavior sequence and reduce the sequence length from $L$ to $K$, such as the General Search Unit (GSU) in SIM (Pi et al., 2020) and TWIN (Chang et al., 2023), and the Relevance Search Unit (RSU) in QIN (Guo et al., 2023). In the second stage, a multi-head target attention (MHTA) unit in the Exact Search Unit (ESU) is applied to encode the selected $K$ relevant items as the representation of the user's behavior sequence. However, the existing cascading two-stage paradigm suffers from the insufficient learning problem of ID embedding in the lifelong sequence ($[\mathbb{x}^{query},\mathbb{x}^{text},\mathbb{x}^{image},\mathbb{x}^{attributes}]$). The downstream model, e.g. CTR prediction, lacks enough training data to learn the embeddings in the sequence well. The problem is especially severe when low-frequency items in the sequence were interacted with a long time ago (more than one year) and do not appear in the training data, which is collected from the most recent user logs.

To help alleviate the insufficient learning problem of ID embedding in the lifelong sequence, the General Search Unit (GSU) in our proposed SEMINAR model shares the same multi-head target attention structure, $head^{GSU_{h}}=\text{Attention}_{h}(q_{t},K^{GSU},V^{GSU})$, with the structure in PSU, $head^{PSU_{h}}=\text{Attention}_{h}(q_{t},K^{PSU},V^{PSU})$. It restores the pretrained embedding from PSU and applies a specific projection weight matrix $G^{(j)}\in\mathbb{R}^{d\times d}$ to the pretrained embedding.
After the first-stage retrieval reduces the sequence length from $L$ to $K$, the second-stage ESU also shares the same multi-head target attention structure, $head^{ESU_{h}}=\text{Attention}_{h}(q_{t},K^{ESU},V^{ESU})$, with GSU and PSU, and has specific projection weight matrices $W^{Q}_{h},W^{K}_{h},W^{V}_{h}$ for each head.

GSU restores the pretrained query-item multi-modal embedding $[E^{PSU(Q)},E^{PSU(T)},E^{PSU(I)},E^{PSU(A)}]$ from PSU, where $E^{PSU(*)}$ denotes the pretrained multi-modal embedding, and applies the projection matrix $G^{(j)}\in\mathbb{R}^{d\times d}$ to get the projected embedding in GSU, $\mathbb{x}^{GSU(j)}$. The attention score $\alpha_{h}^{GSU}$ in the $h$-th head of GSU's multi-head target attention is calculated as:

$$\alpha_{h}^{GSU}=\frac{(q_{t}W^{GSU_{Q}}_{h})(K^{GSU}W^{GSU_{K}}_{h})^{T}}{\sqrt{d}}$$
$$\mathbb{x}^{GSU(j)}=\mathbb{x}^{PSU(j)}G^{(j)},\forall j\in M+1$$

Comparing the GSU attention $\alpha^{GSU}_{h}$ with the pretrained PSU attention $\alpha^{PSU}_{h}$, we can see that the structures of the multi-head target attention are exactly the same. The projection weights $W^{GSU_{Q}}_{h}$ and $W^{GSU_{K}}_{h}$ of the queries and keys for each head in the multi-head attention are different from $W^{PSU_{Q}}_{h}$ and $W^{PSU_{K}}_{h}$, and the embedding projection weight matrix $G^{(j)}$ is unique to GSU.
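The restore-and-project step can be sketched as follows, assuming the pretrained PSU embeddings are available as one vector per channel. The dictionary layout, the function name `project_channels` and the toy matrices are illustrative assumptions; only the relation $\mathbb{x}^{GSU(j)}=\mathbb{x}^{PSU(j)}G^{(j)}$ comes from the text.

```python
def project_channels(pretrained, projections):
    """Apply x^{GSU(j)} = x^{PSU(j)} G^{(j)} to every channel j."""
    projected = {}
    for name, x in pretrained.items():
        G = projections[name]
        projected[name] = [
            sum(x[r] * G[r][c] for r in range(len(x))) for c in range(len(G[0]))
        ]
    return projected


# Pretrained PSU embeddings for channels [Q, T, I, A], d=2 (toy values).
pretrained = {"Q": [1.0, 2.0], "T": [0.5, -1.0], "I": [3.0, 0.0], "A": [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]
# Hypothetical per-channel projection matrices G^{(j)}.
projections = {"Q": identity, "T": swap, "I": identity, "A": swap}
gsu = project_channels(pretrained, projections)
```

A channel whose $G^{(j)}$ is the identity keeps its pretrained embedding unchanged, so in this sketch the projection is the only place where the downstream model departs from the pretrained representation.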

In the second stage, the top-K relevant query-item pairs are selected from GSU and fed to the Exact Search Unit (ESU) as $head^{ESU_{h}}=\text{Attention}_{h}(q_{t},K^{ESU},V^{ESU})$.

In ESU, $K^{ESU}=\text{TopK}(K^{GSU})\in\mathbb{R}^{(M+1)\times K\times D}$ represents the sequence of top-K representations retrieved from $K^{GSU}\in\mathbb{R}^{(M+1)\times L\times D}$. The attention score in ESU is denoted as $\alpha^{ESU}_{h}$ and the ID embedding in ESU is denoted as $\mathbb{x}^{ESU(j)}$.

$$\alpha^{ESU}_{h}=\frac{(q_{t}W^{ESU_{Q}}_{h})(K^{ESU}W^{ESU_{K}}_{h})^{T}}{\sqrt{d}}$$

$$\mathbb{x}^{ESU(j)}=\mathbb{x}^{GSU(j)}=\mathbb{x}^{PSU(j)}G^{(j)},\quad\forall j\in M+1.$$

Finally, users’ lifelong sequence representation $\mathbb{x}_{\text{lifelong\_seq}}$ is calculated as $\mathbb{x}_{\text{lifelong\_seq}}=\text{Concat}(\text{head}^{ESU_{1}},...,\text{head}^{ESU_{H}})W^{ESU}$. It is then concatenated with the other user, item, user-item interaction (u2i) and context features and participates in CTR prediction.
$\hat{y}_{i}=f_{\theta_{i}}(\mathbb{x}_{\text{lifelong\_seq}},\mathbb{x}_{u},\mathbb{x}_{i},\mathbb{x}_{\text{u2i}},\mathbb{x}_{\text{context}})$ denotes the predicted value and $y_{i}$ denotes the actual label value. The final loss of CTR prediction is $\mathcal{L}_{ctr}=\sum_{i}\mathcal{L}_{ce}(y_{i},\hat{y}_{i})$.
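The two-stage search described above can be illustrated with a minimal single-head sketch. It assumes identity projection matrices (so the GSU and ESU scores coincide) and plain Python lists; the function names `two_stage_attention`, `dot` and `softmax` are hypothetical and not taken from the released code.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def two_stage_attention(q_t, keys, values, top_k):
    """GSU-style top-K filtering followed by ESU-style attention.

    Assumption: projection weights are the identity, so both stages
    share the same scaled dot-product score.
    """
    d = len(q_t)
    # Stage 1 (GSU): score all L behaviors, keep the K highest-scoring indices.
    scores = [dot(q_t, k) / math.sqrt(d) for k in keys]
    top_idx = sorted(range(len(keys)), key=lambda l: scores[l], reverse=True)[:top_k]
    # Stage 2 (ESU): softmax attention restricted to the retrieved behaviors.
    weights = softmax([scores[l] for l in top_idx])
    out = [0.0] * len(values[0])
    for w, l in zip(weights, top_idx):
        for c, v in enumerate(values[l]):
            out[c] += w * v
    return top_idx, out
```

In the full model each head would apply its own learned projections before scoring, and the head outputs would be concatenated and projected by $W^{ESU}$.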

3.3.3. Approximate Retrieval of Multi-Modal Query-Item Pair

The exact calculation of the attention score between the target query-item pair $q_{t}$ and the $l$-th query-item behavior $k_{l}$ is the inner product of two weighted sums of multi-modal vectors:

$${q_{t}}^{T}k_{l}=\Big(\sum_{i\in M+1}\gamma_{i}x^{(i)}_{t}\Big)^{T}\Big(\sum_{j\in M+1}\gamma_{j}x^{(j)}_{l}\Big),\quad\forall l\in\{1,2,...,L\}$$

The exact calculation has a time complexity of $O(L\times M\times d)$, where $L$ denotes the sequence length and $M$ denotes the number of multi-modal embedding vectors of dimension $d$ that enter each weighted sum. The calculation becomes time-consuming when $L$ is very large (on the order of $10^{4}$) in the setting of lifelong sequences of multi-modal query-item pairs.
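As a small illustration, the exact score can be computed either by fusing the modality vectors first or by expanding the product into a double sum over modality pairs $(i,j)$. The expanded form is what a lookup-based approximation operates on. The helper names below (`fuse`, `exact_score`, `expanded_score`) are hypothetical, and vectors are plain Python lists.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(gammas, modal_vecs):
    """Weighted sum over modalities: sum_m gamma_m * x^(m)."""
    d = len(modal_vecs[0])
    return [sum(g * x[c] for g, x in zip(gammas, modal_vecs)) for c in range(d)]


def exact_score(gammas, target, behavior):
    """q_t^T k_l computed from the two fused vectors."""
    return dot(fuse(gammas, target), fuse(gammas, behavior))


def expanded_score(gammas, target, behavior):
    """The same score written as a double sum over modality pairs (i, j)."""
    return sum(
        gi * gj * dot(xi, xj)
        for gi, xi in zip(gammas, target)
        for gj, xj in zip(gammas, behavior)
    )
```

Fusing one behavior costs on the order of $M\times d$ operations, and repeating this for all $L$ behaviors gives the $O(L\times M\times d)$ term above.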

One straightforward method to quickly retrieve the $K$ nearest vectors given an input query vector $q$ is to build an embedding index, such as HNSW (Malkov and Yashunin, 2018), and conduct ANN (Approximate Nearest Neighbors) search. However, there are difficulties in building an embedding index to retrieve the target query-item pair from the sequence of multi-modal query-item pairs. To search the vectors of behaviors given the input target query-item pair $q_{t}$ as in the exact attention calculation, we build a vector index which assigns a primary key, such as Item ID or Query ID, to each vector. However, in our aligned sequence of query-item pairs, each merged query-item representation has the joint key (query_id, item_id), and the required amount of storage increases from the item set size $|\mathcal{B}|$ to the size of the Cartesian product of the query set and the item set, $|\mathcal{Q}||\mathcal{B}|$, which makes it almost infeasible to store the merged query-item pairs in a single index directly.

An alternative cascading cross-modal strategy is considered to retrieve the top-K relevant query-item pairs. Firstly, we build two separate vector indexes, one for the query set with size $|\mathcal{Q}|$ and one for the item set with size $|\mathcal{B}|$. During the online retrieval of target query-item pairs, we conduct vector retrieval four times: query-to-item, query-to-query, item-to-query, and item-to-item. Each retrieval keeps the top-K items with the maximum inner product, so the filter in the first-stage cross-modal retrieval is $L\rightarrow 4K$. Given the potential $4K$ items, we conduct an exact attention calculation on these items to obtain the final top-K items, and the filter is $4K\rightarrow K$. The problem with the cascading cross-modal retrieval strategy is that it may achieve a suboptimal solution compared to the exact full attention calculation. This is because the final inner product is a weighted average over all modalities. Additionally, the top-K relevant items from one modality (e.g., query-to-query relevance) may have very low relevance in other modalities, such as query-to-item (text) or query-to-item (image), so the overall inner product score is not optimal. $Recall@K$ can evaluate the performance of the greedy strategy compared to the exact calculation.
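The cascading strategy can be sketched as follows. This is only an illustration: a brute-force scan stands in for the per-modality ANN indexes, each behavior is a list of per-modality vectors, and the function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def exact_score(gammas, target, behavior):
    """Exact weighted score over all modality pairs (i, j)."""
    return sum(
        gi * gj * dot(xi, xj)
        for gi, xi in zip(gammas, target)
        for gj, xj in zip(gammas, behavior)
    )


def top_k(scores, k):
    """Indices of the k largest scores (ties broken by lower index)."""
    return sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:k]


def cascading_retrieve(gammas, target, behaviors, k):
    """Stage 1: one top-k retrieval per modality pair; stage 2: exact re-ranking."""
    m = len(target)
    candidates = set()
    for i in range(m):
        for j in range(m):
            # Brute force stands in for an ANN index lookup (assumption).
            scores = [dot(target[i], beh[j]) for beh in behaviors]
            candidates.update(top_k(scores, k))
    cand = sorted(candidates)  # at most m * m * k candidates
    exact = [exact_score(gammas, target, behaviors[l]) for l in cand]
    return [cand[p] for p in top_k(exact, k)]
```

With two modalities (query and item) the first stage performs the four retrievals described above. A behavior that scores moderately in every modality pair but never enters any single top-k list is lost in stage 1, which is the source of the recall gap.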

To help increase the recall performance while considering the retrieval speed, the key is to reduce the cardinality of the query set $\mathcal{Q}$ and the item set $\mathcal{B}$. We argue that product quantization is a good approximation strategy, which splits vectors into $N_{bit}$ sub-vectors, assigns each sub-vector to the nearest centroid, and reduces the cardinality. In our formulation, we first use a set of $N_{bit}$ separate quantization functions $[q^{(m)}_{1},q^{(m)}_{2},...,q^{(m)}_{N_{bit}}]$ to encode the embedding vectors from the $m$-th modal channel $\mathbb{x}^{(m)}$ as $N_{bit}$-dimensional integer vectors, $q(\mathbb{x}^{(m)})=[c^{(m)}_{1},c^{(m)}_{2},...,c^{(m)}_{N_{bit}}]\in\mathbb{R}^{N_{bit}}$. Each representation of a multi-modal query-item pair is expressed as:

$$[\mathbb{x}^{(1)},...,\mathbb{x}^{(M)}]\rightarrow[q(\mathbb{x}^{(1)}),...,q(\mathbb{x}^{(M)})]\in\mathbb{R}^{M\times N_{bit}}.$$
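The encoding step can be sketched as below, assuming the codebooks are already given (in standard product quantization they would be learned, for example by k-means on each sub-space). The function names are hypothetical, and centroid IDs are 0-based here for convenience.

```python
def nearest_centroid(sub, centroids):
    """Index of the centroid with the smallest squared Euclidean distance to `sub`."""
    best, best_dist = 0, float("inf")
    for idx, c in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(sub, c))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


def pq_encode(x, codebooks):
    """Split x into len(codebooks) equal sub-vectors; return one centroid ID per sub-vector."""
    n_bit = len(codebooks)
    sub_dim = len(x) // n_bit
    return [
        nearest_centroid(x[b * sub_dim:(b + 1) * sub_dim], codebooks[b])
        for b in range(n_bit)
    ]


def encode_pair(modal_vecs, modal_codebooks):
    """Encode the M modality vectors of one query-item pair as an M x N_bit code matrix."""
    return [pq_encode(x, cb) for x, cb in zip(modal_vecs, modal_codebooks)]
```

Each modality keeps its own set of codebooks, so a query-item pair is stored as $M\times N_{bit}$ small integers instead of $M$ dense vectors.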

We pre-compute the inner product between different pairs of centroids and store the values in memory. The space complexity of the storage is $O(M^{2}|\mathcal{C}|^{2}N_{bit})$, where $M$ denotes the number of modalities, $|\mathcal{C}|$ denotes the number of centroids, and $N_{bit}$ denotes the number of sub-vectors split in the codebook of modal $m$. During online serving, the inner product ${q_{t}}^{T}k_{l}$ is reduced to $O(M^{2}N_{bit})$ distance lookup operations, and the final score is calculated as the weighted sum of these distances.
Here, $c^{(i)}_{b}$ and $c^{(j)}_{b}$ denote the centroid IDs of the $b$-th sub-vector of $\mathbb{x}^{(i)}_{t}$ and $\mathbb{x}^{(j)}_{l}$ respectively.

$${q_{t}}^{T}k_{l}=\sum_{i}\sum_{j}\gamma_{i}\gamma_{j}\mathbb{x}^{(i)}_{t}\mathbb{x}^{(j)}_{l}\approx\sum_{i}\sum_{j}\gamma_{i}\gamma_{j}\sum_{b\in N_{bit}}\text{dist}(c^{(i)}_{b},c^{(j)}_{b})$$
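A minimal sketch of the lookup-based score is given below. It assumes that all modalities share the same sub-vector dimension, that `dist` is the pre-computed inner product between centroids, and that codes are 0-based; `build_tables` and `pq_score` are hypothetical names.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_tables(modal_codebooks):
    """Pre-compute centroid inner products.

    tables[i][j][b][ci][cj] is the inner product between centroid ci of
    modality i and centroid cj of modality j in sub-space b, giving
    M * M * N_bit tables of size |C| x |C|.
    """
    m = len(modal_codebooks)
    n_bit = len(modal_codebooks[0])
    return [
        [
            [
                [[dot(ci, cj) for cj in modal_codebooks[j][b]]
                 for ci in modal_codebooks[i][b]]
                for b in range(n_bit)
            ]
            for j in range(m)
        ]
        for i in range(m)
    ]


def pq_score(gammas, target_codes, behavior_codes, tables):
    """Approximate q_t^T k_l with M * M * N_bit table lookups."""
    score = 0.0
    for i, gi in enumerate(gammas):
        for j, gj in enumerate(gammas):
            for b, (ci, cj) in enumerate(zip(target_codes[i], behavior_codes[j])):
                score += gi * gj * tables[i][j][b][ci][cj]
    return score
```

When every sub-vector coincides with a centroid the lookup score equals the exact inner product; otherwise the error comes only from the quantization of each sub-vector.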

Our proposed multi-modal product quantization strategy works quite well in real-world settings. We also compare the time complexity of different strategies, such as cascading ANN (HNSW), locality-sensitive hashing (LSH) and our proposed Multi-Modal Product Quantization approximation. Our proposed multi-modal PQ method has a time complexity of $O(L\times M^{2}\times N_{bit})$. In each attention calculation, there are $M^{2}N_{bit}$ distance look-up operations of $O(1)$, and the final score is calculated as the sum of these distances, which is far less than the exact calculation of the inner product of multiple vectors, $O(L\times M\times d)$. As for the two-stage cascading ANN (HNSW) method of retrieving query-item pairs with two filters, the first stage retrieves $M^{2}K$ cross-modal candidates from the sequence of length $L$ as $L\rightarrow M^{2}K$, and the second stage retrieves the final top $K$ items from the first stage as $M^{2}K\rightarrow K$.
The total time complexity of the cascading ANN method is $O(M^{2}\log(L)d+M^{2}Kd)$, which is faster than our PQ strategy but may achieve sub-optimal recall performance, as reported in multiple experiments in Figure 2.
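To make the three complexity expressions concrete, the sketch below plugs in values used elsewhere in the paper ($L=10^{4}$, $M=4$, $d=64$, $N_{bit}=8$, $K=256$). These are term counts with constants dropped, not latency measurements, and the base-2 logarithm is an assumption since the text does not fix the base.

```python
import math


def exact_cost(seq_len, n_modal, dim):
    """Term count for O(L * M * d)."""
    return seq_len * n_modal * dim


def pq_cost(seq_len, n_modal, n_bit):
    """Term count for O(L * M^2 * N_bit)."""
    return seq_len * n_modal * n_modal * n_bit


def cascading_ann_cost(seq_len, n_modal, dim, k):
    """Term count for O(M^2 * log(L) * d + M^2 * K * d), using log base 2 (assumption)."""
    return n_modal ** 2 * math.log2(seq_len) * dim + n_modal ** 2 * k * dim


if __name__ == "__main__":
    L, M, d, n_bit, K = 10_000, 4, 64, 8, 256
    print(exact_cost(L, M, d), pq_cost(L, M, n_bit), cascading_ann_cost(L, M, d, K))
```

Under these assumed values the counts are 2,560,000 for the exact calculation, 1,280,000 for multi-modal PQ, and about $2.76\times10^{5}$ for cascading ANN, which matches the ordering stated in the text.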

4. Experiment

4.1. Experimental Settings

Dataset We evaluate our proposed SEMINAR model on three datasets: two public datasets, the Amazon review dataset (Movies and TV subset, https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/) and the KuaiSAR search and recommendation dataset (https://zenodo.org/records/8181109), and one industrial dataset, the Alipay short video dataset. The average length of users’ sequences is on the order of $L=2000,1000,100$ for the Alipay, KuaiSAR and Amazon datasets respectively. The detailed statistics can be found in Table 1.

  • Amazon Reviews We select the Movies and TV subset of the public Amazon reviews dataset for the experiments. The meta-information of items is also provided in the dataset. We use the image thumbnails as the inputs to the sequence of the image modal. To get the aligned query sequence, we generate a query relevant to each item from its description in the meta-information as in (Ai et al., 2017) and (Guo et al., 2023).

  • KuaiSAR (Sun et al., 2023) KuaiSAR is a real-world public large-scale dataset containing both search and recommendation behaviors collected from Kuaishou (https://www.kuaishou.com/en), a leading short-video app. We construct a unified sequence of query and item pairs to compare different lifelong behavior sequence models.

  • Alipay Short Video The Alipay short video dataset is a real-world industrial dataset collected from the exposure and click logs of the short-video recommendation and search ranking scenarios of the Alipay app. We use the title of the short video as the input to the text modal, and the image thumbnails as the input to the image modal. Users’ search queries are collected and aligned to the corresponding viewed items.

We process the Amazon and KuaiSAR datasets as in the literature (Zhou et al., 2020) and its repo (https://github.com/RUCAIBox/CIKM2020-S3Rec). A user with $N$ actions will generate $N-1$ samples. We use the first $i-1$ actions to predict whether the user will interact with the $i$-th item ($0<i\leq N$). Additionally, we apply the leave-one-out strategy, using the $(N-1)$-th action as the validation set, and the $N$-th action as the positive in the test set together with randomly sampled negatives. The remaining samples are used as the training and pretraining set. In the industrial Alipay short video dataset, exposed clicks are treated as positive samples and exposed non-clicks are considered as negative samples. The training and validation sets are randomly split using data from the past [0, T-1] days (T=60), and the test set comes from the $T$-th day.
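The leave-one-out protocol for the public datasets can be sketched as follows. This is an illustration rather than the preprocessing code of the referenced repo: the function name is hypothetical, negative sampling is omitted, and a user is assumed to have at least four actions so that all three splits are non-empty.

```python
def leave_one_out_split(actions):
    """Split one user's chronologically ordered actions into (train, valid, test).

    Each sample is a (history, target) tuple; a user with N actions
    yields N - 1 samples, the last for testing and the one before it
    for validation.
    """
    samples = [(actions[:i], actions[i]) for i in range(1, len(actions))]
    if len(samples) < 3:
        raise ValueError("need at least 4 actions to fill train, valid and test")
    return samples[:-2], samples[-2], samples[-1]
```

For a user with actions `["a", "b", "c", "d", "e"]`, the test sample predicts `"e"` from the first four actions and the validation sample predicts `"d"` from the first three.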

Table 1. Statistics of the Amazon Movies and TV, the Alipay Short Video and KuaiSAR datasets. K denotes thousand.

| Dataset | User | Item | Query | U-I |
| --- | --- | --- | --- | --- |
| Amazon Movies & TV | 297 K | 181 K | - | 3,293 K |
| Alipay Short Video | 35,065 K | 1,132 K | 51 K | 62,948 K |
| KuaiSAR | 25,877 | 6,890,707 | 453,667 | 19,664,885 |

Table 2. Results of lifelong behavior sequence modeling on the KuaiSAR dataset, the Amazon Review dataset and the Alipay short video recommendation dataset. * indicates the best performing model.

| Method | KuaiSAR NDCG@5 | KuaiSAR NDCG@10 | KuaiSAR NDCG@50 | Amazon Movies and TV NDCG@5 | Amazon Movies and TV NDCG@10 | Amazon Movies and TV NDCG@50 | Alipay Short Video AUC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SIM | 0.2523 | 0.2661 | 0.3293 | 0.3573 | 0.3959 | 0.4577 | 0.7382 |
| QIN | 0.2535 | 0.2672 | 0.3312 | 0.3650 | 0.4038 | 0.4630 | 0.7239 |
| ETA | 0.2642 | 0.2756 | 0.3313 | 0.3626 | 0.4008 | 0.4607 | 0.7262 |
| TWIN | 0.2558 | 0.2709 | 0.3294 | 0.3627 | 0.4017 | 0.4605 | 0.7376 |
| SEMINAR | *0.2816 | *0.2969 | *0.3457 | *0.3661 | *0.4041 | *0.4636 | *0.7503 |
| Absolute Impr. | +0.0292 | +0.0308 | +0.0164 | +0.0088 | +0.0082 | +0.0059 | +0.0264 |

Table 3. Ablation Studies of Different PSU pretraining tasks on KuaiSAR Dataset. N@K denotes NDCG@K.

| Method | N@5 | N@10 | N@50 |
| --- | --- | --- | --- |
| SEMINAR | 0.2816 | 0.2969 | 0.3457 |
| w/o pretraining | 0.2564 | 0.2738 | 0.3310 |
| w. align, w/o next-predict, q-i relev. | 0.2702 | 0.2832 | 0.3420 |
| w. next-predict, w/o align, q-i relev. | 0.2675 | 0.2813 | 0.3408 |
| w. q-i relev., w/o align, next-predict | 0.2633 | 0.2754 | 0.3357 |

To evaluate the recall performance of different approximate fast retrieval methods, we conduct experiments on two datasets: the multi-modal embedding of the Alipay short video dataset with sequence length $L=2,000$ and a synthetic dataset. The purpose of the synthetic dataset is to test the performance of different retrieval methods on extremely long sequences (e.g. $L=10,000$), which are not available in public datasets. The synthetic dataset consists of query, text, image, and attribute vectors generated from i.i.d. normal distributions $N(\mu,\sigma^{2})$ with different values of mean $\mu$ and variance $\sigma^{2}$, to imitate various norm values of multi-modal vectors of query-item pairs.

Table 4. Recall@K Evaluation of Approximate Retrieval Methods on Alipay Short Video Recommendation Dataset

| Method | R@32 | R@64 | R@128 | R@256 |
| --- | --- | --- | --- | --- |
| ANN (HNSW) | 0.7881 | 0.8603 | 0.9288 | 0.9409 |
| LSH | 0.7528 | 0.8175 | 0.8721 | 0.9257 |
| RQ-VAE | 0.8225 | 0.8422 | 0.8633 | 0.8995 |
| Multi-Modal PQ | 0.9638 | 0.9769 | 0.9797 | 0.9874 |

Comparison Methods

We compared several strong lifelong sequence modeling baselines with our proposed SEMINAR model:

  • SIM (Pi et al., 2020) SIM adopts cascading search unit GSU and ESU to extract the relevant behaviors of the candidate item and applies multi-head target attention to model users’ interest.

  • ETA (Chen et al., 2022) Efficient Target Attention encodes query and keys as binary hash vectors using a multi-round random projection matrix. The retrieval is calculated as the Hamming distance between the target item and the items in the sequence.

  • TWIN (Chang et al., 2023) Two-Stage Interest Network adopts the same relevance metric between the target behavior and historical behaviors as the target attention in two cascading stages GSU and ESU.

  • QIN (Guo et al., 2023) The QIN network uses the query as the first trigger to retrieve the top $K_{1}$ behaviors, and the target item as the second trigger to retrieve the top $K_{2}$ relevant items afterwards.
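Among the baselines above, the hashing-based retrieval attributed to ETA can be illustrated with a generic sign-random-projection sketch. This is not ETA's implementation: the Gaussian projection matrix, the function names and the brute-force scan are assumptions made for illustration only.

```python
import random


def make_projection(dim, n_bits, seed=0):
    """Random Gaussian projection matrix with one row per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sign_hash(x, projection):
    """One bit per random hyperplane: 1 if the projection of x is non-negative, else 0."""
    return [1 if sum(p * v for p, v in zip(row, x)) >= 0 else 0 for row in projection]


def hamming(a, b):
    """Number of positions at which two bit vectors differ."""
    return sum(u != v for u, v in zip(a, b))


def hamming_top_k(target, sequence, projection, k):
    """Indices of the k sequence vectors whose hash is closest to the target's hash."""
    t = sign_hash(target, projection)
    dists = [hamming(t, sign_hash(x, projection)) for x in sequence]
    return sorted(range(len(sequence)), key=lambda l: dists[l])[:k]
```

Vectors pointing in similar directions agree on most bits, so a small Hamming distance serves as a cheap proxy for a large inner product between normalized vectors.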

The input features to all baseline models are the same, including the query and multi-modal item features in all datasets.

For the online approximate retrieval performance, we compared our proposed multi-modal product quantization (Jégou et al., 2011) strategy with some widely adopted vector retrieval methods in the query-item multi-modal pairs retrieval setting, including:

  • HNSW (Malkov and Yashunin, 2018) Hierarchical Navigable Small World graphs

  • LSH Locality Sensitive Hashing

  • RQ-VAE (Zeghidour et al., 2021) Residual Vector Quantization VAE (RQ-VAE) follows an encoder-decoder structure and uses the multi-stage vector quantizer to regress original inputs, with multi-scale spectral reconstruction loss as constraints.

Evaluation Metrics We use NDCG@K to evaluate the recommendation performance on the Amazon review and KuaiSAR datasets, and AUC (Area Under the Curve) to evaluate the CTR prediction performance of exposures and clicks on the Alipay short video dataset. Secondly, to evaluate the performance of multi-modal query-item retrieval, we calculate the exact attention on all the items in the sequence using the target query-item pair as a trigger and regard the real top-K relevant items as ground truth. Different fast retrieval strategies are evaluated by Recall@K at different $K$ levels, which measures how many of the top-K relevant ground truth items are recalled by the approximation strategy.
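The two ranking metrics can be written compactly as below. This sketch assumes binary relevance and the common $\log_{2}$ position discount for NDCG, and it normalizes Recall@K by $K$ because the ground truth is itself the exact top-K list; the function names are hypothetical.

```python
import math


def recall_at_k(retrieved, ground_truth, k):
    """Fraction of the exact top-k items that also appear in the approximate top-k."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k


def ndcg_at_k(ranked_items, relevant, k):
    """Binary-relevance NDCG@k with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

Under the leave-one-out protocol there is a single relevant item per test sample, so NDCG@K reduces to $1/\log_{2}(\text{rank}+1)$ when the item is ranked within the top $K$ and to 0 otherwise.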

Implementation Details We implement the baseline methods and our proposed SEMINAR model using PyTorch. Secondly, the baselines of different approximate retrieval methods, ANN (HNSW), and our Multi-Modal Product Quantization are implemented using the Python library faiss (Douze et al., 2024) (https://github.com/facebookresearch/faiss), and the RQ-VAE (Zeghidour et al., 2021) is implemented using the Python library vector_quantize_pytorch (https://github.com/lucidrains/vector-quantize-pytorch). The code of SEMINAR is available at the repo https://github.com/paper-submission-coder/SEMINAR, and the public datasets Amazon and KuaiSAR can be downloaded following the instructions in the README file.

For the hyperparameter settings of recommendation models, we set the sequence length $L$ to 2000, 1000 and 100 for the Alipay Short Video, KuaiSAR, and Amazon Review datasets and retrieve the $K=200,200,50$ most relevant items respectively, based on users’ average interaction length in different datasets. The embeddings of the multi-modal text and image channels are outputs of the pretrained ViT-B/32 model of CLIP (https://github.com/openai/CLIP) with original dimension 512, then linearly projected to dimension 64. The weight of the query representation $\lambda$ in section 3.3.1 is set to 0.5 to fuse the query embedding and the multi-modal item embedding. We also compare different $\lambda$ values ($\lambda=0.1,0.3,0.5,0.7,0.9$) in the following section of ablation study. For the multi-head target attention, we set the number of heads to 4. The batch size is set to 256 and we use the Adam optimizer with the learning rate set to 0.001. The number of pretraining epochs is set to $5,1,1$ on the KuaiSAR, Amazon and Alipay datasets respectively, and the number of training epochs is the same for all models in comparison. The checkpoint is exported by the best NDCG metrics on the evaluation dataset. For the implementation of our multi-modal product quantization, the number of modals $M$ is set to 4, and the original 64-dimension dense embedding vectors are expressed as $N_{bit}=8$ bit vectors of integer codes.
Each bit of the integer vectors represents the codebook assignment of centroids $c^{(m)}_{i}\in\{1,2,...,|\mathcal{C}_{m}|\}$. The cardinality of each dimension $|\mathcal{C}_{m}|$ is set to 512. To generate the synthetic dataset with extremely long sequence $L=10,000$, we generate the multi-modal embeddings of the sequence with different norm values across modals as normally distributed variables $N(\mu,\sigma^{2})$. We set $\mu=0.25,0.5,1.0,2.0$ and $\sigma=1.0$ for the query, attribute, text and image modals respectively. To investigate the influence of different fusion weights $\gamma_{m}$ of the multi-modal embedding, we conduct experiments with equal weights $\gamma_{m}=[0.25,0.25,0.25,0.25]$ and with different weights $\gamma_{m}=[0.1,0.2,0.3,0.4]$.
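The synthetic sequence with different norms across modalities could be generated along the following lines. This is a sketch under stated assumptions: the per-modality means and $\sigma$ follow the values above, while the dictionary layout, the seed handling and the function names are hypothetical.

```python
import math
import random

# Per-modality means as stated in the text; sigma is shared across modalities.
MODAL_MEANS = {"query": 0.25, "attribute": 0.5, "text": 1.0, "image": 2.0}


def make_synthetic_sequence(length, dim=64, sigma=1.0, seed=0):
    """Generate `length` behaviors, each holding one i.i.d. normal vector per modality."""
    rng = random.Random(seed)
    return [
        {name: [rng.gauss(mu, sigma) for _ in range(dim)]
         for name, mu in MODAL_MEANS.items()}
        for _ in range(length)
    ]


def l2_norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))
```

A larger mean shifts every coordinate away from zero, so the image vectors end up with a noticeably larger expected norm than the query vectors, which is the imbalance the experiment is meant to imitate.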

4.2. Experimental Results

Figure 2. Recall@K Evaluation of Different Approximate Fast Retrieval Methods on Synthetic Dataset of the Multi-Modal Lifelong Sequence. Plots in the first row denote the group of the same norm $|x^{(m)}|$ and same weight $|\gamma_{m}|$, plots in the second row denote the group of different norm $|x^{(m)}|$ and same weight $|\gamma_{m}|$, and plots in the third row denote the group of the same norm $|x^{(m)}|$ and different weight $|\gamma_{m}|$.

4.2.1. Lifelong User Behavior Modeling

We report the performance on different datasets from multiple domains in Table 2, including NDCG@K on the KuaiSAR dataset and the Movies and TV subset of the Amazon review dataset, and AUC on the Alipay short video recommendation dataset. The asterisk (*) denotes the best performance achieved in each task. We can see that SEMINAR achieved the best performance on the KuaiSAR dataset with improvements of +0.0292, +0.0308, +0.0164 in NDCG@K for $K=5,10,50$ and improvements of +0.0088, +0.0082, +0.0059 in NDCG@K for $K=5,10,50$ on the Amazon dataset compared to SIM. Additionally, SEMINAR also achieved the best AUC performance on the Alipay short video recommendation dataset with an improvement of +0.0264 compared to multiple strong SOTA baselines.

4.2.2. Multi-Modal Query-Item Pairs Approximate Retrieval

To compare multi-modal query-item approximate retrieval methods, we report Recall@K on the industrial Alipay Short Video dataset in Table 4 and on the synthetic dataset in Figure 2. On the Alipay Short Video dataset, our proposed Multi-Modal Product Quantization strategy achieves the highest Recall@K among the approximation methods for $K = [32, 64, 128, 256]$ and $L = 2000$. On the synthetic dataset, the experimental groups are designed with different sequence lengths $L = 2000, 5000, 10000$, different settings of the norm values $|\mathbb{x}^{(m)}|$ across modalities, and different settings of the weights $\gamma_m$ across modalities, as described in the detailed implementation in Section 4. Our proposed multi-modal product quantization strategy achieves the best Recall@K at nearly all $K$ levels ($K = [32, 64, 128, 256]$), with only a few exceptions in which it falls behind the cascading ANN (HNSW) method for large $K$ under $L = 2000, 5000$. For the first method, cascading ANN (HNSW), the greedy strategy achieves poor results at small values of $K$ (e.g., $L = 10000$, Recall@32 = 0.2719), and its performance increases dramatically as $K$ increases to 256 ($L = 10000$, Recall@256 = 0.7289). This aligns with our expectation that, when the score is a weighted sum over multiple vectors, the true top-K pairs most relevant to the target pair have a higher probability of being recalled by the greedy strategy of $M^2$ cascading cross-modal ANN retrieval as $K$ increases.
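The Recall@K comparison above can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the exact relevance score is a weighted sum of same-modality inner products with weights `gammas` (standing in for $\gamma_m$), whereas the paper's retrieval also considers cross-modal terms. All function names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_score(target, candidate, gammas):
    """Weighted sum of per-modality inner products (assumed scoring rule).

    target and candidate are lists with one vector per modality.
    """
    return sum(g * dot(t, c) for g, t, c in zip(gammas, target, candidate))


def exact_top_k(target, sequence, gammas, k):
    """Indices of the k highest-scoring pairs in the sequence, by brute force."""
    scores = [weighted_score(target, cand, gammas) for cand in sequence]
    return sorted(range(len(sequence)), key=lambda i: scores[i], reverse=True)[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-K indices that the approximate method returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)
```

In this setup, `exact_ids` comes from `exact_top_k` over the full length-$L$ sequence, and `approx_ids` is whatever an approximate method (product quantization, cascading ANN, and so on) returns for the same target pair.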

To analyze the effect of different variables on approximate retrieval, e.g., the merging weights $\gamma_m$ of the modalities and the norm values $|\mathbb{x}^{(m)}|$, we plot Recall@K as $K$ increases in Figure 2. The chart shows that the norm values of the multi-modal vectors influence the overall Recall@K dramatically. Under the same sequence length $L = 10000$ and $K = 256$, the Recall@K of the group with different norm values is on average 0.14 lower than that of the group with the same norm values. Varied norm values across modalities make it more challenging to achieve high Recall@K than equal norm values do.
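One way to read the effect of the norms is sketched below. It assumes that each modality contributes an inner-product term weighted by $\gamma_m$; the exact scoring rule is defined earlier in the paper and is not restated here. The symbols $\mathbb{q}^{(m)}$ (the target vector in modality $m$) and $\theta_m$ (the angle between the two vectors) are notation introduced only for this illustration.

```latex
\gamma_m \langle \mathbb{q}^{(m)}, \mathbb{x}^{(m)} \rangle
  = \gamma_m \, |\mathbb{q}^{(m)}| \, |\mathbb{x}^{(m)}| \cos\theta_m
```

Under this assumption, a modality whose vectors have larger norms contributes proportionally more to the weighted sum regardless of angular similarity. This may be one reason why retrieval that treats modalities separately recovers fewer of the true top-K pairs when the norms differ.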

Figure 3. Influence of Number of Pretraining Epochs on NDCG@K performance of KuaiSAR Dataset
Figure 4. Influence of Query and Item Representation Fusion Weight $\lambda$ on NDCG@K performance of KuaiSAR Dataset

4.3. Discussion and Ablation Study

4.3.1. Influence of Pretraining Epochs Number and Ablation of Pretraining Tasks

To investigate the influence of the number of pretraining epochs, and of omitting pretraining altogether, on the SEMINAR model, we report NDCG@K on the KuaiSAR dataset with sequence length 1000 in Figure 3. The SEMINAR model achieves its largest improvement over the group without pretraining in the first 5 pretraining epochs; additional pretraining epochs up to 10 contribute only marginally to performance.

For the ablation study of the pretraining tasks of SEMINAR, we trained several models on the KuaiSAR dataset: one without pretraining, and models with only one pretraining task and without the other two (e.g., with alignment and without next pair prediction and query-item relevance). The results of the ablation study are reported in Table 3. Compared to SEMINAR without pretraining, the multi-modal alignment task contributes the most to the performance improvement, followed by next pair prediction and query-item relevance.

4.3.2. Query-Item Representation Fusion Weight

To investigate the influence of the weight $\lambda$ used to fuse the query and item representations, we run experiments with $\lambda = [0.1, 0.3, 0.5, 0.7, 0.9]$ for the SEMINAR model on the KuaiSAR dataset. The results are reported in Figure 4. The best performance is achieved at $\lambda = 0.3$. We speculate that the optimal value of the fusion weight $\lambda$ depends on the distribution of search and recommendation behaviors in the unified sequence of query-item pairs. For example, in the KuaiSAR dataset, search actions account for 25.7% of users’ overall actions and recommendation actions account for 74.3% (Sun et al., 2023). The optimal value of $\lambda$ may vary across domains and datasets, which needs further investigation in future research.
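As a rough illustration of what a fusion weight does, the sketch below assumes a simple convex combination in which `lam` (standing in for $\lambda$) weights the query representation and `1 - lam` weights the item representation. Both the form of the fusion and the direction of the weighting are assumptions made for illustration; the fusion actually used by SEMINAR is defined earlier in the paper. The function name `fuse` is hypothetical.

```python
def fuse(query_vec, item_vec, lam):
    """Convex combination of a query vector and an item vector (assumed form)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(query_vec) != len(item_vec):
        raise ValueError("query and item vectors must have the same dimension")
    return [lam * q + (1.0 - lam) * i for q, i in zip(query_vec, item_vec)]


# Sweep the same grid of weights as the experiment in this section.
for lam in (0.1, 0.3, 0.5, 0.7, 0.9):
    fused = fuse([1.0, 0.0], [0.0, 1.0], lam)
```

Under this assumed form, a small weight such as 0.3 lets the item side dominate the fused representation, which would be in line with a sequence in which recommendation actions outnumber search actions.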

5. Conclusion

In this paper, we proposed SEMINAR to model users’ lifelong behavior sequences of query-item pairs. We introduced the Pretraining Search Unit to alleviate the insufficient learning of ID embeddings in lifelong sequences and to improve multi-modal alignment. For fast online approximate retrieval, we also proposed a multi-modal product-quantization-based strategy. Extensive evaluations on multiple datasets demonstrate the effectiveness of our method.

References

  • Ai et al. (2017) Qingyao Ai, Yongfeng Zhang, Keping Bi, Xu Chen, and W. Bruce Croft. 2017. Learning a Hierarchical Embedding Model for Personalized Product Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (Shinjuku, Tokyo, Japan) (SIGIR ’17). Association for Computing Machinery, New York, NY, USA, 645–654. https://doi.org/10.1145/3077136.3080813
  • Chang et al. (2023) Jianxin Chang, Chenbin Zhang, Zhiyi Fu, Xiaoxue Zang, Lin Guan, Jing Lu, Yiqun Hui, Dewei Leng, Yanan Niu, Yang Song, and Kun Gai. 2023. TWIN: TWo-stage Interest Network for Lifelong User Behavior Modeling in CTR Prediction at Kuaishou. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery, New York, NY, USA, 3785–3794. https://doi.org/10.1145/3580305.3599922
  • Chen et al. (2022) Qiwei Chen, Yue Xu, Changhua Pei, Shanshan Lv, Tao Zhuang, and Junfeng Ge. 2022. Efficient Long Sequential User Data Modeling for Click-Through Rate Prediction. arXiv:2209.12212 [cs.IR]
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL]
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. CoRR abs/2010.11929 (2020). arXiv:2010.11929 https://arxiv.org/abs/2010.11929
  • Douze et al. (2024) Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. The Faiss library. (2024). arXiv:2401.08281 [cs.LG]
  • Gong et al. (2023) Yuqi Gong, Xichen Ding, Yehui Su, Kaiming Shen, Zhongyi Liu, and Guannan Zhang. 2023. An Unified Search and Recommendation Foundation Model for Cold-Start Scenario. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (Birmingham, United Kingdom) (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 4595–4601. https://doi.org/10.1145/3583780.3614657
  • Guo et al. (2023) Tong Guo, Xuanping Li, Haitao Yang, Xiao Liang, Yong Yuan, Jingyou Hou, Bingqing Ke, Chao Zhang, Junlin He, Shunyu Zhang, Enyun Yu, and Wenwu Ou. 2023. Query-dominant User Interest Network for Large-Scale Search Ranking. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (Birmingham, United Kingdom) (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 629–638. https://doi.org/10.1145/3583780.3615022
  • Hou et al. (2023) Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023. Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. arXiv:2210.12316 [cs.IR]
  • Jégou et al. (2011) Herve Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 1 (2011), 117–128. https://doi.org/10.1109/TPAMI.2010.57
  • Malkov and Yashunin (2018) Yu. A. Malkov and D. A. Yashunin. 2018. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv:1603.09320 [cs.DS]
  • Pi et al. (2020) Qi Pi, Guorui Zhou, Yujing Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 2685–2692. https://doi.org/10.1145/3340531.3412744
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
  • Rajput et al. (2023) Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy. 2023. Recommender Systems with Generative Retrieval. arXiv:2305.05065 [cs.IR]
  • Si et al. (2023) Zihua Si, Zhongxiang Sun, Xiao Zhang, Jun Xu, Xiaoxue Zang, Yang Song, Kun Gai, and Ji-Rong Wen. 2023. When Search Meets Recommendation: Learning Disentangled Search Representation for Recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 1313–1323. https://doi.org/10.1145/3539618.3591786
  • Sun et al. (2023) Zhongxiang Sun, Zihua Si, Xiaoxue Zang, Dewei Leng, Yanan Niu, Yang Song, Xiao Zhang, and Jun Xu. 2023. KuaiSAR: A Unified Search And Recommendation Dataset. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (Birmingham, United Kingdom) (CIKM ’23). Association for Computing Machinery, New York, NY, USA, 5407–5411. https://doi.org/10.1145/3583780.3615123
  • van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. 2017. Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems (Long Beach, California, USA) (NIPS’17). Curran Associates Inc., Red Hook, NY, USA, 6309–6318.
  • Vaswani et al. (2023) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs.CL]
  • Yao et al. (2021) Jing Yao, Zhicheng Dou, Ruobing Xie, Yanxiong Lu, Zhiping Wang, and Ji-Rong Wen. 2021. USER: A Unified Information Search and Recommendation Model based on Integrated Behavior Sequence. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (Virtual Event, Queensland, Australia) (CIKM ’21). Association for Computing Machinery, New York, NY, USA, 2373–2382. https://doi.org/10.1145/3459637.3482489
  • Zeghidour et al. (2021) Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi. 2021. SoundStream: An End-to-End Neural Audio Codec. arXiv:2107.03312 [cs.SD]
  • Zhao (2023) Pengyu Zhao et al. 2023. M5: Multi-Modal Multi-Interest Multi-Scenario Matching for Over-the-Top Recommendation. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Long Beach, CA, USA) (KDD ’23). Association for Computing Machinery, New York, NY, USA, 3785–3794. https://doi.org/10.1145/3580305.3599863
  • Zhou et al. (2020) Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management (Virtual Event, Ireland) (CIKM ’20). Association for Computing Machinery, New York, NY, USA, 1893–1902. https://doi.org/10.1145/3340531.3411954