Towards LLM-RecSys Alignment with Textual ID Learning

Juntao Tan, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Zelong Li and Yongfeng Zhang
Department of Computer Science, Rutgers University, NJ 08854, US
juntao.tan,shuyuan.xu,wenyue.hua,yingqiang.ge,zelong.li,yongfeng.zhang@rutgers.edu

arXiv:2403.19021v1 [cs.IR] 27 Mar 2024

ABSTRACT
Generative recommendation based on Large Language Models (LLMs) has transformed the traditional ranking-based recommendation style into a text-to-text generation paradigm. This approach has attracted significant attention. However, in contrast to standard natural language processing (NLP) tasks that inherently operate on human vocabulary, current research in generative recommendation struggles to effectively encode recommendation items within the text-to-text framework using concise yet meaningful ID representations. Due to this unresolved issue, the potential of LLM-based generative recommendation systems remains largely unexplored. To better align LLMs with recommendation needs, we propose IDGenRec, representing each item as a unique, concise, semantically rich, platform-agnostic textual ID using human language tokens. This is achieved by training a textual ID generator alongside the LLM-based recommender, enabling seamless integration of personalized recommendations into natural language generation. Notably, as user history is expressed in natural language and decoupled from the original dataset, our approach suggests the potential for a foundational generative recommendation model. Experiments show that our framework consistently surpasses existing models in sequential recommendation under the standard experimental setting. Then, we explore the possibility of training a foundation recommendation model with the proposed method on data collected from 19 different datasets and test its recommendation performance on 6 unseen datasets across different platforms under a completely zero-shot setting. The results show that the zero-shot performance of the pre-trained foundation model is comparable to or even better than some traditional recommendation models based on supervised training, showing the potential of the IDGenRec paradigm to serve as the foundation model for generative recommendation. Code and data are open-sourced at https://github.com/agiresearch/IDGenRec.

CCS CONCEPTS
• Information systems → Recommender systems; • Computing methodologies → Natural language generation; Machine learning.

KEYWORDS
Recommender System, Information Retrieval, Neural Language Processing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
SIGIR’24, July 14–18, 2024, Washington D.C., USA.
© 2024 Association for Computing Machinery.
ACM ISBN 978-1-4503-XXXX-X/18/06...$15.00
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Generative models with LLMs pre-trained on extensive amounts of information [1, 2, 27, 31] are continually revolutionizing the field of machine learning. Due to their success in understanding complex instructions and generating creative, contextually relevant predictions, their usage has quickly extended beyond NLP tasks to provide a foundation for applications across diverse research areas. One such example is generative recommendation.

While traditional methods treat recommendation as a retrieval (candidate selection) and ranking process, generative recommendation interprets it as a direct text-to-text generation task: a user’s history is expressed as a textual prompt, and the target recommendation is generated in natural language form. However, unlike NLP tasks that solely read and generate human language tokens, items in recommendation platforms are individual entities in an ever-growing universe. Therefore, how to encode items as language tokens (i.e., item IDs) that can be easily integrated into the text-to-text paradigm is a unique and crucial problem within generative recommendation research.

A few attempts have been made to tackle this problem. P5 [8], one of the first works in generative recommendation, proposes allocating out-of-vocabulary (OOV) tokens to items within the recommendation platform. These assigned IDs are fixed-length numerical tokens roughly created based on their sequential appearance in the dataset (e.g., 1001 for the first item, 1002 for the second item). Later, some follow-up research [14] evaluates how recommendation performance can be further improved by using different strategies to initialize the numerical IDs.

However, while these pioneering works achieve considerable performance in standard recommendation settings, the true capability of LLMs in recommendation remains largely unexplored. First, these methods overlook the wealth of semantic information contained in textual descriptions of items, which undermines one of the primary motivations for using LLMs: harnessing the semantic knowledge gained during their pre-training phase. Second, the numbers assigned to items are meaningless tokens that lack any real contextual meaning. Training such models with a recommendation objective does not lead them to learn the general characteristics of the items. Instead, they merely learn the co-occurrence patterns of these IDs within each dataset. Therefore, although these models present themselves as text-to-text, in essence, they do nothing more than learn a representation of each item in a traditional key-value dictionary style, which imposes a ceiling on the quality of the generated recommendations. At the same time, since the learned ID representations lack general meanings, the knowledge they acquire is non-transferable across datasets. This means that pre-trained recommendation models lack any zero-shot recommendation ability on unseen data. Consequently, a foundational recommendation model, which has long been pursued in the recommendation community, cannot be achieved by any of the aforementioned methods.

We propose that the limitations mentioned above primarily arise from inadequate item encoding. Consider tasks in human language, such as question answering and machine translation, where every piece of knowledge is represented within a finite set of tokens. This allows a foundational model to easily learn universally applicable knowledge from training on large text corpora and to adapt effortlessly to any downstream task. If, however, items in recommendation systems were also fully represented using human vocabulary, with each item described by a specific set of natural language tokens, then the capabilities of LLMs could more closely align with the requirements of recommendation systems. In this way, by training on recommendation-specific corpora, LLMs would be able to learn genuine recommendation-related knowledge, which could significantly improve the models’ accuracy and generalizability in recommendation tasks.

Hence, we suggest that the ideal IDs in generative recommendation should possess the following properties: 1) They should be textual IDs composed of tokens originally processed by the pre-trained LLMs; 2) They should be meaningful, informative, and suitable for recommendation purposes; 3) The generated IDs should be short yet unique, effectively identifying the recommendation items.

However, IDs that meet such stringent requirements are clearly not available in existing item information. Therefore, in this paper, we propose training an ID generator that automatically learns a textual ID for each item that fulfills the above criteria. The new framework, named IDGenRec, treats ID generation as another text-to-text process. As shown in Figure 1, the ID generator, which is also a language model, takes an item’s metadata (i.e., all available textual information about the item) and produces qualified textual IDs. Consequently, the user’s history and the target item for recommendation can be represented in natural language, without any “uncontextualized” tokens, thus making it suitable for training an LLM-based generative recommender. This overall process is illustrated in Figure 2. Notably, by considering all items’ text in the user’s history, the same ID generator can produce another textual ID, serving as the user’s ID that represents a “high-level profile” of the user’s preferences. The creation of a user ID is optional, and we provide ablation study results in the experiments.

Figure 1: The ID generator takes plain text from each item’s meta textual information and generates abstractive textual IDs for the item’s representation.

Many challenges lie in this work, and we propose related strategies to address each of them in the paper, including:

(1) The ID generator should understand lengthy metadata that may include unnecessary information, and should generate tokens that cover the crucial details of the item which are important for recommendations. For this purpose, we have selected a T5 model originally trained for article tag generation and fine-tuned it with recommendation objectives.
(2) The generated IDs should be short yet unique, suitable for identifying the recommendation items. However, the automatically generated IDs may not always satisfy the uniqueness criterion, especially as the number of items increases. Therefore, we propose a diverse ID generation algorithm to always ensure each item has a unique ID allocated.
(3) Since the framework relies on collaboration between two LLMs (the ID generator and the base recommender), a meticulously designed training strategy is required to enable seamless collaboration between them. We propose an alternate training strategy that trains the LLM-based ID generator and the base recommender asynchronously, ensuring that their learned knowledge is well-aligned.

We note that some recent LLM-based recommendation models employ an encoder-only structure, similar to BERT, as their representation function. Some of these models [12, 18] also possess (or partially possess) characteristics of a foundational recommendation model, though they do not function as generative models. The distinctions among these models are depicted in Table 1. However, generative models present several advantages over discriminative methods. These include transforming the retrieval and ranking processes into a more streamlined generative process, eliminating the need for one-by-one item score calculation, and leveraging the extensive knowledge embedded in pre-trained generative LLMs. Nonetheless, these encoder-only methods remain valuable as baselines for zero-shot evaluation.

Table 1: Comparison of LLM-based recommendation models: P5 and its variant versions are generative models but not foundation models due to the use of OOV tokens. UniSRec and Recformer have encoder-only structures and are, therefore, not generative models. Additionally, Recformer employs a rigorously defined item text template that is specifically designed for the Amazon dataset only, and is thus classified as partly a foundation model.

                    P5    P5-variants    UniSRec    Recformer     IDGenRec
Generative Model    ✓     ✓              ✗          ✗             ✓
Foundation Model    ✗     ✗              ✓          ✓ (Partly)    ✓

We conduct two types of experiments to demonstrate the effectiveness of our proposed method. First, we evaluate the method under a standard sequential recommendation setting. Experiments on 4 widely-used public datasets show significant improvements over sequential recommendation baselines, including both traditional and generative models. Then, to explore the possibility of training a foundational generative model that learns general recommendation knowledge, inspired by the training paradigm of LLMs in standard NLP tasks, we compile user histories from 19 datasets from the Amazon Review Datasets, encompassing a diverse range of recommendation domains, to build a massive recommendation training corpus. After training the model on this

extensive dataset, we directly apply the foundational model to 6 unseen datasets, either intra- or inter-platform, and evaluate the recommendation performance in a completely zero-shot setup. The results show very promising recommendation performance, even surpassing many traditional recommendation models that are based on supervised training.

Figure 2: A real example showing the generative recommendation workflow. The ID generator generates item IDs for items from the user history by taking their plain text. Then, the generated IDs are interpolated into the template. Additionally, the user ID is generated by using all items’ text in the user’s history, showing a “high-level profile” of the user’s preference. The position embeddings are subsequently combined with token embeddings to capture the sequence of interactions. Finally, the base recommender generates the ID of the recommended item based on constrained decoding.

2 APPROACH
We first introduce the generation process, including how the prompts are constructed and how the generated IDs are integrated into the text-to-text format, as discussed in Section 2.1. Then, in Sections 2.2 and 2.3, we describe how the IDs are generated by utilizing the metadata of items, employing a diverse ID generation algorithm to ensure that the IDs are unique for each item. With these generated IDs, the base recommender system is presented in Section 2.4. Finally, in Section 2.5, we demonstrate how the ID generator and recommender are trained alternately with respect to the recommendation objective, ensuring that they work collaboratively and effectively.

2.1 Generative Process
In this study, we introduce the foundational generative model within the context of sequential recommendation systems. This approach is particularly apt as it aligns naturally with the sequential representation of user history, serving as an input prompt. Building upon existing work in generative recommendation systems [8, 14, 37, 39], our model requires predefined prompt templates for the generation process. For instance, a typical template could be: “User [user_ID] has purchased items [item_ID], [item_ID], ..., [item_ID]; predict the next possible item to be bought by the user.” In this template, all item IDs from the user’s history are interpolated at the placeholders, preserving their sequential order. In addition, a user ID can be produced by the same ID generator, taking the meta-information of all items in the user history. Generating a user ID is optional and proved to be beneficial to the recommendation performance in experiments. We have developed 10 such templates, each with minor differences from the others, randomly selecting one of them for each training instance to ensure the model focuses more on recommendation-relevant information such as the item sequence and less on the exact format of the prompt. An example of a completed prompt is illustrated in Figure 2. The decoder then generates the target item “[target_item_ID]” token by token, which uniquely identifies the target recommendation. Next, we detail the two primary LLM components of our generative process: the ID generator and the base recommender.

2.2 ID Generator
The ID generator is a generative model that produces item IDs using the item’s meta-information. This meta-information encompasses all textual data related to the item, including both relevant and irrelevant aspects for recommendation purposes. Potential elements of this information may comprise the item’s title, category, price, general description, creation time, popularity, location, etc. The specific content largely depends on the platform and dataset. Although the meta-information is typically presented in a key-value dictionary format, we convert it into plain text during processing, such as “name: zeppelin; categories: cocktail bars, restaurants; stars: 4.0; ...” This allows the ID generator to freely learn which pieces of information should be prioritized when generating IDs.

Consider an item whose plain description is a lengthy sequence of tokens $\boldsymbol{w} = [w_1, w_2, \ldots, w_m]$. Token embeddings are generated from $\boldsymbol{w}$ by the model’s parameters and combined with position embeddings before being fed into the language model. The combination with position embeddings is omitted in the rest of the formulations since this process is common in large language models. The output of the ID generator is a concise sequence of ID tokens $\boldsymbol{d} = [d_1, d_2, \ldots, d_n]$, where $n \ll m$. When generating each token of the item ID, the model attends to both the item’s entire plain description and the previously generated ID tokens. Thus, the probability of a generated ID is denoted as:

$$ p(d_1, \cdots, d_n) = \prod_{i=1}^{n} p_\theta(d_i \mid d_{<i}, \boldsymbol{w}) \qquad (1) $$

where $d_{<i} = [d_1, \cdots, d_{i-1}]$, and $\theta$ denotes the parameters of the ID generator. This process is illustrated in Figure 1.
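
As a minimal sketch of the text processing described in Sections 2.1 and 2.2, the following Python snippet flattens a key-value metadata dictionary into plain text and interpolates already-generated textual IDs into the example template. The function names `flatten_metadata` and `build_prompt`, the placeholder syntax, and the sample IDs are hypothetical illustrations and are not taken from the released code.

```python
from typing import Mapping, Sequence, Union

Value = Union[str, float, int, Sequence[str]]

# Example template from Section 2.1, rewritten with named placeholders (assumed format).
TEMPLATE = (
    "User {user_id} has purchased items {item_ids}; "
    "predict the next possible item to be bought by the user."
)


def flatten_metadata(meta: Mapping[str, Value]) -> str:
    """Turn key-value metadata into 'key: value; key: value' plain text."""
    parts = []
    for key, value in meta.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)
        parts.append(f"{key}: {value}")
    return "; ".join(parts)


def build_prompt(user_id: str, item_ids: Sequence[str]) -> str:
    """Fill the template, keeping item IDs in their interaction order."""
    return TEMPLATE.format(user_id=user_id, item_ids=", ".join(item_ids))


# Hypothetical item metadata and hypothetical generated IDs.
meta = {"name": "zeppelin", "categories": ["cocktail bars", "restaurants"], "stars": 4.0}
text = flatten_metadata(meta)
prompt = build_prompt("nightlife dining", ["zeppelin cocktail bar", "rooftop tapas"])
```

In the framework itself, the item and user IDs would come from the ID generator rather than being written by hand, and one of several templates would be sampled per training instance.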

2.3 Diverse ID Generation
The two primary properties of the generated IDs are: 1) the IDs should have a reasonable length, and 2) the IDs should be unique. However, these two properties are somewhat contradictory: a set of IDs constrained to a shorter length is more likely to contain duplicates when generated by the ID generator. Therefore, we propose an algorithm to ensure that the generated IDs are both short and unique. The algorithm is based on diverse beam search (DBS) [32], a method commonly used in sentence generation. DBS is a variation of the standard beam search that is designed to generate a more diverse set of sequences. It partitions the beams into groups and introduces a diversity penalty, denoted as $\lambda$, as a hyperparameter to discourage a group from selecting sequences similar to those of the other groups. A higher value of $\lambda$ promotes more diversity, while a lower value of $\lambda$ gives more weight to the model’s probabilities, potentially leading to less diverse outputs.

The ID generator employs DBS in generating IDs. To ensure the uniqueness of the generated IDs, the algorithm generates $k$ groups of IDs each time and compares each generated ID against a set of already existing IDs. If all generated IDs are duplicates, the algorithm increases the diversity penalty in the beam search. This increase in penalty continues until a unique ID is produced, or until the diversity penalty reaches a pre-set maximum threshold (e.g., 1 in this paper). In cases where the maximum penalty is insufficient to generate a unique ID, the algorithm extends the permissible ID length and repeats this process with the initial diversity penalty. Algorithm 1 elaborates on this process.

Algorithm 1 Diverse ID Generation Algorithm
    Initialize set U to store unique IDs
    Initialize diversity penalty λ to 0.1
    Initialize ID length limit L to 1
    for each item in the dataset do
        Initialize found as False
        while not found do
            Generate k IDs using ID Generator with current L and λ
            for each generated ID id do
                if id not in U then
                    Add id to U and save the item-ID pair
                    Set found to True
                    break
                end if
            end for
            if not found then
                λ ← λ + 0.1
                if λ exceeds predefined limit then
                    Increase L and reset λ to 0.1
                end if
            end if
        end while
    end for

2.4 Base Recommender
After generating item IDs and incorporating them into the prompt template, the system tokenizes the completed prompt and feeds it into the LLM-based recommender. The recommender then generates the tokens of the target recommended item ID in an autoregressive manner. To ensure that the decoded ID corresponds to an actual existing item, we adopt a constrained sequence decoding strategy [5]. More specifically, a prefix tree is used to store all generated candidate IDs. Each newly generated token is constrained by the previously generated tokens, ensuring that the generation process only considers tokens that can potentially form an existing candidate ID in the dataset. Suppose the completed input prompt is denoted as a sequence of tokens $\boldsymbol{x} = [x_1, x_2, \cdots, x_n]$. In this case, the base recommender model aims to generate $\boldsymbol{y} = [y_1, y_2, \cdots, y_n]$, where $\boldsymbol{y}$ is an ID for a real item in the dataset. We define $\mathcal{V}(y_{<i})$ as the subset of valid tokens in the vocabulary, constrained by the prefix tree with previously generated tokens as nodes. The generation of each token by the decoder is defined as:

$$ p(y_i \mid y_{<i}, \boldsymbol{x}) = \begin{cases} p_\phi(y_i \mid y_{<i}, \boldsymbol{x}) & \text{if } y_i \in \mathcal{V}(y_{<i}), \\ 0 & \text{otherwise.} \end{cases} \qquad (2) $$

Therefore, the probability of the recommendation of a target item $[y_1, \cdots, y_n]$ is:

$$ p(y_1, \cdots, y_n) = \prod_{i=1}^{n} p_\phi\big(y_i \mid y_{<i}, \boldsymbol{x}, \mathcal{V}(y_{<i})\big) \qquad (3) $$

2.5 Alternate Training
Training the ID generator and training the base recommender are two separate but interdependent tasks. The ID generator is trained to produce optimal IDs that the base recommender can easily interpret. Concurrently, the base recommender adjusts its parameters to improve the correctness of recommendations for items, each represented by its currently generated ID. Training both components simultaneously may result in an unstable training process. Hence, we propose alternating the training sessions of the ID generator and the base recommender, proceeding through a specified number of iterations. This approach involves asynchronously updating the IDs between the two training phases of the base recommender for better integration and performance.

Both the ID generator and the base recommender are trained by minimizing the negative log-likelihood of the final prediction made by the recommendation pipeline compared to the ground-truth target item ID.

2.5.1 Training Base Recommender. At each round of training the base recommender, we pre-compute item IDs for all items using the ID generator at that point. For each user in the training data, the current IDs of the items in the user’s history are filled into the sampled template to complete the input prompt $\boldsymbol{x}$. Then, the base recommender $\omega$ is trained with a common teacher forcing strategy [33], i.e., the loss of the next token is computed under the ground-truth value of the previous tokens:

$$ \mathcal{L}_{rec} = -\sum_{i=1}^{|\boldsymbol{y}|} \log P_\omega(y_i \mid y_{<i}, \boldsymbol{x}) \qquad (4) $$

2.5.2 Training ID Generator. During this training process, all parameters in the base recommender are fixed, and only the ID generator is updated. The goal is for the ID generator to produce IDs that are suitable for the base recommendation model.
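
Looking back at Section 2.3, the retry logic of Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the released implementation: the `generate` callable is a hypothetical stand-in for the DBS-based ID generator, `toy_generate` is a toy substitute that simply returns longer word prefixes as the penalty grows, the penalty and length limit are assumed to restart for every item, and the `hard_len_cap` guard is an addition that Algorithm 1 does not contain.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical interface: (item_text, max_len, penalty, k) -> k candidate IDs.
Generator = Callable[[str, int, float, int], List[str]]


def assign_unique_ids(
    items: Iterable[str],
    generate: Generator,
    k: int = 5,
    init_penalty: float = 0.1,
    max_penalty: float = 1.0,
    init_len: int = 10,
    len_step: int = 10,
    hard_len_cap: int = 100,
) -> Dict[str, str]:
    """Give every item a candidate ID that no earlier item has taken."""
    used: Set[str] = set()
    item_to_id: Dict[str, str] = {}
    for item in items:
        # Assumption: penalty and length limit restart for each item.
        penalty, max_len = init_penalty, init_len
        while item not in item_to_id:
            for cand in generate(item, max_len, penalty, k):
                if cand not in used:
                    used.add(cand)
                    item_to_id[item] = cand
                    break
            else:  # every candidate was a duplicate
                penalty = round(penalty + 0.1, 10)
                if penalty > max_penalty:
                    max_len, penalty = max_len + len_step, init_penalty
                    if max_len > hard_len_cap:  # safety guard, not in Algorithm 1
                        raise RuntimeError(f"no unique ID found for {item!r}")
    return item_to_id


def toy_generate(text: str, max_len: int, penalty: float, k: int) -> List[str]:
    """Toy stand-in for DBS: a higher penalty yields longer word prefixes."""
    words = text.split()
    start = round(penalty * 10)  # 0.1 -> 1 word, 0.2 -> 2 words, ...
    return [" ".join(words[: min(start + j, max_len)]) for j in range(k)]


ids = assign_unique_ids(
    ["red wool hat", "red wool scarf", "red wool mittens"], toy_generate, k=1
)
```

With a single candidate per call (`k=1`), the second and third items collide with earlier IDs and only succeed after the penalty has been raised, which mirrors the retry path of the algorithm.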
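
The constrained decoding of Section 2.4 can likewise be illustrated with a small prefix tree. The sketch below assumes word-level tokens, a hypothetical end marker `<eos>`, greedy selection instead of beam search, and a fixed toy distribution in place of the recommender; all class and function names are illustrative.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-ID marker


class PrefixTree:
    """Trie over tokenized candidate IDs."""

    def __init__(self, ids: Sequence[Sequence[str]]) -> None:
        self.root: Dict[str, dict] = {}
        for tokens in ids:
            node = self.root
            for tok in list(tokens) + [END]:
                node = node.setdefault(tok, {})

    def valid_next(self, prefix: Sequence[str]) -> List[str]:
        """Tokens that extend `prefix` toward some stored ID, i.e. V(y_<i)."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []
            node = node[tok]
        return sorted(node)


def constrain(probs: Dict[str, float], allowed: Sequence[str]) -> Dict[str, float]:
    """Zero out tokens outside the allowed set, as in Eq. (2)."""
    allowed_set = set(allowed)
    return {tok: (p if tok in allowed_set else 0.0) for tok, p in probs.items()}


def greedy_decode(
    tree: PrefixTree, next_probs: Callable[[List[str]], Dict[str, float]]
) -> List[str]:
    """Pick the most probable valid token at each step until the end marker."""
    out: List[str] = []
    while True:
        allowed = tree.valid_next(out)
        masked = constrain(next_probs(out), allowed)
        tok = max(allowed, key=lambda t: masked.get(t, 0.0))
        if tok == END:
            return out
        out.append(tok)


tree = PrefixTree([
    ["zeppelin", "cocktail", "bar"],
    ["zeppelin", "rock", "cafe"],
    ["rooftop", "tapas"],
])
# Toy distribution that ignores the prefix; a real recommender would condition on it.
toy = {"rock": 0.5, "zeppelin": 0.2, "cocktail": 0.15, "cafe": 0.05, "bar": 0.05, END: 0.05}
decoded = greedy_decode(tree, lambda prefix: toy)
```

Without the constraint, the most probable first token would be "rock", which does not start any stored ID; the prefix tree restricts the first step to "rooftop" or "zeppelin", so the decoded sequence always corresponds to an existing item.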

Since the output of the ID generator is a set of discrete tokens (IDs), it is inherently non-differentiable. This poses a challenge for training the model using gradient-based optimization techniques, as gradients cannot flow back through these discrete outputs. To circumvent this, for each item in the user history, we calculate the output logits of each token by the ID generator across the whole vocabulary, denoted as $\text{Logits}_\phi(\mathcal{V})$, where $\phi$ is the ID generator model and $\mathcal{V}$ is the vocabulary. We then compute the average embedding for each token of the ID through the parameters of the base recommendation model, denoted as $\text{Emb}_\omega(\text{Logits}_\phi(\mathcal{V}))$, where $\omega$ is the base recommender model. This creates a continuous, differentiable representation of the generated IDs. These ID embeddings are then directly interpolated into the prompt template at the related positions at the embedding level. We use $\text{Emb}_{interp}$ to represent this completed input embedding. In this way, the ID generator $\phi$ can be trained, guided by the loss computed from the recommendation output. This process is formulated as:

$$ \mathcal{L}_{id} = -\sum_{i=1}^{|\boldsymbol{y}|} \log P_\omega\big(y_i \mid y_{<i}, \text{Emb}_{interp}\big), \quad \text{where } \text{Emb}_{interp} = \text{Insert}\big(\text{Emb}_\omega(\text{prompt}), \text{Emb}_\omega(\text{Logits}_\phi(\mathcal{V}))\big) \qquad (5) $$

The parameters of the base recommender model (i.e., $\omega$) are fixed, and the loss is only backpropagated to the ID generator (i.e., $\phi$), ensuring that the IDs generated capture the essential characteristics of each item, as determined by their meta-information, in a format that the base recommendation model can effectively interpret.

2.6 Model Initialization
We choose the T5 model [27] as the backbone for both the ID generator and the base recommender for two main reasons: 1) To maintain the model’s simplicity, as this paper does not aim to conduct extensive empirical studies of LLM structures, but rather focuses on the core concept of ID generation; 2) To ensure a fair comparison with previous generative recommendation works, which are also based on T5, thereby demonstrating that the improvement in recommendation ability comes solely from a more elegant ID selection.

For the base recommender, the standard pretrained T5 checkpoint is chosen as the backbone to incorporate pre-learned knowledge into the recommendation task. For the ID generator, given that generating IDs from lengthy texts is non-trivial and highly task-specific, a more dedicated starting point is preferable. Consequently, we select a T5 small model fine-tuned on the article tag generation task¹ as the initial configuration for the ID generator. This model, trained on 190k Medium articles², is adept at generating concise tags from article textual content. This selection is driven by the significant similarity between tag generation for news articles and the summarization of items in a few words.

¹ https://huggingface.co/nandakishormpai/t5-small-machine-articles-tag-generation
² https://www.kaggle.com/datasets/fabiochiusano/medium-articlesdataset

3 EXPERIMENT
Our experiments comprise two components: the first is an evaluation of standard sequential recommendation to compare the basic supervised learning capabilities of our model against widely-used baselines. The second component is a zero-shot evaluation designed to assess the model’s potential as the backbone for a foundational generative recommender system. We begin by introducing the datasets used in the experiments and detailing our model’s training process. Subsequently, we discuss the experimental results for each of the two experimental settings.

3.1 Datasets
For the standard evaluation of sequential recommendation, we selected four widely-used datasets. Three of them, namely Sports, Beauty, and Toys, are from the Amazon review dataset [10, 23], along with another dataset from Yelp³. These datasets are also used in previous papers [8, 14, 42], and we follow the exact data processing steps to filter out users and items with fewer than 5 interactions, thereby allowing for a fair comparison with all the baselines.

³ https://www.yelp.com/dataset

For the zero-shot experiments for foundational model evaluation, we have two groups of datasets: pre-training datasets and testing datasets. The pre-training datasets are all from the Amazon review dataset, containing various domains, and the testing data are selected from both Amazon review datasets (intra-platform) and the Yelp dataset (inter-platform). We propose a detailed data selection rule for deciding which data are used as pre-training datasets and which are used for testing, as follows:

First, we split all 24 Amazon review datasets into different groups according to their densities, as shown in Table 2. Guided by their density range, we include Sports, Beauty, Toys, Music, and Instruments in the test datasets. These datasets cover all the density categories across Amazon review datasets. In addition, including Sports, Beauty, and Toys in the test datasets provides a better view of the foundational model’s ability compared to traditional models, since these three datasets are also used in the standard evaluation. Along with Yelp, these form all the test datasets for zero-shot evaluation.

Table 2: Amazon review datasets categorized by density

Density Range (%)     Amazon Datasets
Den. ≥ 0.5            Instruments, Patio
0.1 ≤ Den. < 0.5      Automotive, Instant, Office, Music, Grocery, Baby
0.05 ≤ Den. < 0.1     Tools, Pet, Toys, Phones, Beauty, Games, Apps
Den. ≤ 0.05           Clothing, Sports, Health, Home, Kindle, CDs, Electronics, Movies, Books

We have 19 Amazon datasets left, and all of them are selected to create the massive recommendation corpus for training the foundational model. Since their sizes vary widely (e.g., Books contains 603,668 users and Automotive only has 2,928 users), we randomly downsample the large datasets to only include 30,000 users, which is around the median number of users of the Amazon datasets. All the selected recommendation records together form a “Fusion” dataset for training the foundation recommendation model. Complete and detailed data statistics can be seen in Table 3.

3.2 Implementation Details
We use SentencePiece [17] with a vocabulary size of 32,128 as the tokenizer. The predefined templates for sequential recommendation

are largely adopted from P5 [8], with the only difference being that the “dataset” information is removed from the templates, as our model is more generalized and possesses cross-dataset capabilities.

In the diverse ID generation algorithm, we set $k = 5$ as the number of groups for DBS and start with $\lambda = 0.1$ as the initial diversity penalty. This penalty is increased by 0.1 each iteration when the algorithm fails to generate a unique item ID. If $\lambda$ reaches 1.0 without producing a valid ID, we then extend the length limit for the item ID. This iterative process of adjusting the diversity penalty and, if necessary, the length of the ID, ensures that the Diverse ID Generation algorithm can successfully produce IDs that are both succinct and distinct. Initially, the token length is set within the range of [1, 10), but if the algorithm is unable to generate a unique ID once the diversity penalty has reached its maximum, we increase the token length limit to the range of [10, 20). In our experiments, even for item IDs in the largest ID set, specifically the Fusion dataset, only 11.20% of the items required a second attempt at ID generation with an increased diversity penalty, and merely 3.72% of the items necessitated an extension in ID length.

In the standard recommendation experiments, where the model is trained on a single dataset, we begin by training the ID generator for 1 epoch, followed by training the base recommender for 10 epochs, for a total of 3 iterations. The learning rate is set to 1e-3 for training the base recommender and 1e-8 for training the ID generator. This approach is applied to all the datasets in the standard recommendation setting.

Regarding the training of the foundational model for zero-shot recommendation, we find that training the model for one epoch achieved the best performance when tested on the test datasets. Further training negatively affected the zero-shot performance. Interestingly, a similar pattern was also observed in NLP research [25, 31], where pre-training LLMs for more than one epoch led to overfitting on the training data. This observation suggests that the proposed framework narrows the gap between recommendation and language generation.

Table 3: Dataset Statistics

Category        Datasets                 # Users    # Items    # Interactions    Density
Std. Eval.      Sports                   35,598     18,357     296,337           0.0453%
Std. Eval.      Beauty                   22,363     12,101     198,502           0.0734%
Std. Eval.      Toys                     19,412     11,924     167,597           0.0724%
Std. Eval.      Yelp                     30,431     20,033     316,354           0.0519%
Pre-training    Fusion                   183,918    233,323    2,875,446         0.0067%
Zero-shot       Sports                   35,598     18,357     296,337           0.0453%
Zero-shot       Beauty                   22,363     12,101     198,502           0.0734%
Zero-shot       Toys                     19,412     11,924     167,597           0.0724%
Zero-shot       Music                    5,541      3,568      64,706            0.3273%
Zero-shot       Instruments              1,429      900        10,261            0.7978%
Zero-shot       Yelp (Cross Platform)    30,431     20,033     316,354           0.0519%

sequential appearance order in the dataset. P5-CID creates item IDs guided by collaborative filtering. P5-SemID uses item category as semantic information to construct item IDs. More details can be seen in [14].

3.4.2 Results. The standard evaluation results are shown in Table 4. Generally speaking, our proposed method significantly outperforms all other baselines. When compared to the second best baseline on each dataset (highlighted with an underline), our method shows improvements of 39.44%, 23.55%, 42.37%, and 36.76% on the Sports, Beauty, Toys, and Yelp datasets, respectively. These improvements are calculated by averaging the increase across four metrics relative to the corresponding best performance from baselines, i.e., the bold value against the underline value of each row for each dataset. Such results underscore the robustness and significance of the improvements brought by our generative recommendation method. In addition, according to comparisons with P5 and its variants, representing items with learned textual IDs indeed better utilizes the semantic understanding abilities of LLMs.
3.4.3 Ablation Studies. We conduct ablation studies regarding
3.3 Evaluation Metrics two components of the model design.
In all experiments, we evaluate the ranking performance of the First, we assess how critical the alternate training strategy is for
recommendation models using Normalized Discounted Cumula- the model, compared to 1) training only the ID generator while
tive Gain (NDCG) at 5 and 10, as well as Hit Ratio (HR) at 5 and the base recommender remains fixed with the default pre-trained
10. For fair comparison with baselines, we adopt a leave-one-out T5 parameters; 2) training only the base recommender with the
strategy for testing. Meanwhile, negative sampling is not used in item IDs directly generated by the T5 model trained on article
the evaluation. Instead, we rank over all items for evaluation. tag generation. The total number of training epochs is the same
as when each is trained in the alternate training setup, i.e., 1 × 3
epochs for the ID generator and 10 × 3 epochs for the base rec-
3.4 Exp1: Standard Evaluation
ommender. The results, shown in Table 5, indicate that alternate
3.4.1 Baselines. The baselines for standard evaluation include training significantly boosts the model’s performance by enabling
two types of recommendation methods: better collaboration between the two components. Moreover, train-
• Traditional sequential recommendation methods, including ing only the ID generator does not yield good results. However,
GRU4Rec [11], Caser [30], HGN [22], SASRec [16], Bert4Rec training only the base recommender with initially generated IDs
[28], FDSA [41], and S3Rec [42]. These models are popular still outperforms all baselines.
and cover various model structures, with GRU4Rec based on Second, we evaluate whether generating optional user IDs en-
RNN, Caser on CNN, and SASRec, HGN, Bert4Rec, FDSA, hances recommendation performance, as shown in Table 6. In sum-
and S3Rec on transformers. mary, while using only item IDs already allows the model to make
• Generative recommendation methods, including P5 [8, 37] excellent recommendations in the standard setting, including user
and its variations [14] with different ID generation strategies. IDs does prove beneficial. Nonetheless, relying solely on user IDs
P5-SID generates numerical item IDs with respect to their does not provide enough information for making recommendations.
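As a rough illustration of the alternate training schedule used in the standard setting and compared in the ablation above (1 epoch for the ID generator followed by 10 epochs for the base recommender, repeated for 3 iterations, with learning rates 1e-8 and 1e-3), the sketch below only enumerates the training phases in order. The `Phase` record and the `alternate_schedule` function are hypothetical names introduced for this example; the actual optimization steps are omitted and nothing here is taken from the authors' code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Phase:
    """One training phase of the alternate schedule (hypothetical record)."""
    component: str        # "id_generator" or "base_recommender"
    epochs: int
    learning_rate: float


def alternate_schedule(iterations: int = 3,
                       id_epochs: int = 1,
                       rec_epochs: int = 10,
                       id_lr: float = 1e-8,
                       rec_lr: float = 1e-3) -> List[Phase]:
    """List the phases in the order stated in the text: in each iteration the
    ID generator is trained first, then the base recommender."""
    schedule: List[Phase] = []
    for _ in range(iterations):
        schedule.append(Phase("id_generator", id_epochs, id_lr))
        schedule.append(Phase("base_recommender", rec_epochs, rec_lr))
    return schedule


if __name__ == "__main__":
    for step, phase in enumerate(alternate_schedule(), start=1):
        print(step, phase.component, phase.epochs, phase.learning_rate)
```

With the default values the schedule contains 1 × 3 ID-generator epochs and 10 × 3 recommender epochs, which is the epoch budget the ID-only and Rec-only ablations are matched against.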
Towards LLM-RecSys Alignment with Textual ID Learning SIGIR’24, July 14–18, 2024, Washington D.C., USA.

Table 4: Standard evaluation for single dataset recommendation. All improvements are significant at 𝑝 < 0.05 compared to the best baseline under the Student's t-test.
Dataset  Metric    GRU4Rec  Caser   HGN     SASRec  Bert4Rec  FDSA    S3Rec   P5-SID  P5-CID  P5-SemID  IDGenRec
Sports   HR@5      0.0129   0.0116  0.0189  0.0233  0.0115    0.0182  0.0251  0.0264  0.0313  0.0274    0.0429
Sports   NDCG@5    0.0086   0.0072  0.0120  0.0154  0.0075    0.0122  0.0161  0.0186  0.0224  0.0193    0.0326
Sports   HR@10     0.0204   0.0194  0.0313  0.0350  0.0191    0.0288  0.0385  0.0358  0.0431  0.0406    0.0574
Sports   NDCG@10   0.0110   0.0097  0.0159  0.0192  0.0099    0.0156  0.0204  0.0216  0.0262  0.0235    0.0372
Beauty   HR@5      0.0164   0.0205  0.0325  0.0387  0.0203    0.0267  0.0387  0.0430  0.0489  0.0433    0.0618
Beauty   NDCG@5    0.0099   0.0131  0.0206  0.0249  0.0124    0.0163  0.0244  0.0288  0.0477  0.0299    0.0486
Beauty   HR@10     0.0283   0.0347  0.0512  0.0605  0.0347    0.0407  0.0647  0.0602  0.0680  0.0652    0.0814
Beauty   NDCG@10   0.0137   0.0176  0.0266  0.0318  0.0170    0.0208  0.0327  0.0368  0.0357  0.0370    0.0541
Toys     HR@5      0.0097   0.0166  0.0321  0.0463  0.0116    0.0228  0.0443  0.0231  0.0215  0.0247    0.0655
Toys     NDCG@5    0.0059   0.0107  0.0221  0.0306  0.0071    0.0140  0.0294  0.0159  0.0133  0.0167    0.0481
Toys     HR@10     0.0176   0.0270  0.0497  0.0675  0.0203    0.0381  0.0700  0.0304  0.0327  0.0376    0.0870
Toys     NDCG@10   0.0084   0.0141  0.0277  0.0374  0.0099    0.0189  0.0376  0.0183  0.0170  0.0209    0.0551
Yelp     HR@5      0.0176   0.0150  0.0186  0.0170  0.0051    0.0158  0.0201  0.0346  0.0261  0.0202    0.0468
Yelp     NDCG@5    0.0110   0.0099  0.0115  0.0110  0.0033    0.0098  0.0123  0.0242  0.0171  0.0131    0.0368
Yelp     HR@10     0.0285   0.0263  0.0326  0.0284  0.0090    0.0276  0.0341  0.0486  0.0428  0.0324    0.0578
Yelp     NDCG@10   0.0145   0.0134  0.0159  0.0147  0.0090    0.0136  0.0168  0.0287  0.0225  0.0170    0.0404
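
The average improvements quoted in the results can be checked against Table 4. The short sketch below takes, for each of the four metrics, the relative gain of IDGenRec over the best baseline in that row and averages the four gains. The function name `mean_relative_improvement` is an assumption made for this example; the numbers are the Sports rows of Table 4, where P5-CID is the best baseline in every row.

```python
from typing import Sequence


def mean_relative_improvement(ours: Sequence[float],
                              best_baseline: Sequence[float]) -> float:
    """Average, in percent, of the per-metric relative gain over the best baseline."""
    gains = [(o - b) / b * 100.0 for o, b in zip(ours, best_baseline)]
    return sum(gains) / len(gains)


# Sports rows of Table 4, ordered HR@5, NDCG@5, HR@10, NDCG@10.
sports_idgenrec = [0.0429, 0.0326, 0.0574, 0.0372]
sports_best_baseline = [0.0313, 0.0224, 0.0431, 0.0262]  # P5-CID in all four rows

if __name__ == "__main__":
    value = mean_relative_improvement(sports_idgenrec, sports_best_baseline)
    print(f"{value:.2f}%")  # prints 39.44%, the figure reported for Sports
```

The same computation on the Toys rows (best baselines 0.0463, 0.0306, 0.0700, 0.0376) gives roughly 42.37%, again in line with the reported value.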

Table 5: Comparison of recommendation accuracy for different training strategies: 1) ID-only: Training only the ID generator. 2) Rec-only: Training only the base recommender. 3) Alternate: Training ID generator and base recommender alternately for 3 iterations.
Dataset  Metric    ID-only  Rec-only  Alternate
Sports   HR@5      0.0102   0.0350    0.0429
Sports   NDCG@5    0.0070   0.0271    0.0326
Sports   HR@10     0.0155   0.0461    0.0574
Sports   NDCG@10   0.0087   0.0307    0.0372
Beauty   HR@5      0.0111   0.0601    0.0618
Beauty   NDCG@5    0.0067   0.0442    0.0486
Beauty   HR@10     0.0192   0.0797    0.0814
Beauty   NDCG@10   0.0093   0.0505    0.0541

3.4.4 Case Studies. As generating item IDs using human vocabulary is the core concept of the proposed method, we conduct an extensive quality study with real examples of the item IDs generated from their plain text information, as shown in Figure 3. Specifically, we select two examples from each dataset: one with lengthy plain text data and one with shorter plain text data. For each example, we present the ID generated by the initial ID generator and the ID generated by the fine-tuned ID generator at the end of alternate training. Initially, the ID generator tends to select the first few words from the item's meta information, regardless of their relevance. After training, the ID generator is capable of selecting words that more accurately represent the items. The learned IDs are generally more informative and representative, containing less extraneous information such as numbers.
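
The retry procedure for diverse ID generation described in the implementation details (𝑘 = 5 groups, 𝜆 starting at 0.1 and raised by 0.1 up to 1.0, then a token length range extended from [1, 10) to [10, 20)) can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: `propose` is a hypothetical callable standing in for diverse beam search over the ID generator, and restarting the penalty at 0.1 after the length range is extended is an assumption, since the text does not say how 𝜆 is handled at that point.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

# Hypothetical interface: (item_text, num_groups, diversity_penalty, length_range)
# -> candidate ID strings, e.g. produced by diverse beam search.
ProposeFn = Callable[[str, int, float, Tuple[int, int]], List[str]]


def generate_unique_id(item_text: str,
                       used_ids: Set[str],
                       propose: ProposeFn,
                       num_groups: int = 5,
                       length_ranges: Sequence[Tuple[int, int]] = ((1, 10), (10, 20)),
                       penalty_step: float = 0.1,
                       max_penalty: float = 1.0) -> Optional[str]:
    """Return the first candidate ID not already in used_ids, or None."""
    num_steps = int(round(max_penalty / penalty_step))
    for length_range in length_ranges:          # extend the length only as a last resort
        for i in range(1, num_steps + 1):       # penalty 0.1, 0.2, ..., 1.0
            penalty = round(i * penalty_step, 10)
            for candidate in propose(item_text, num_groups, penalty, length_range):
                if candidate not in used_ids:
                    used_ids.add(candidate)     # reserve the ID for this item
                    return candidate
    return None                                 # no unique ID within the limits
```

A caller would keep one `used_ids` set per dataset and invoke the function once per item, so that every accepted ID is removed from the pool available to later items.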

Table 6: Recommendation accuracy for using 1) only user IDs, 2) only item IDs, and 3) both user and item IDs.
Dataset  Metric    User ID  Item ID  User & Item ID
Sports   HR@5      0.0177   0.0404   0.0429
Sports   NDCG@5    0.0118   0.0308   0.0326
Sports   HR@10     0.0300   0.0528   0.0574
Sports   NDCG@10   0.0141   0.0348   0.0372
Beauty   HR@5      0.0202   0.0577   0.0618
Beauty   NDCG@5    0.0325   0.0441   0.0486
Beauty   HR@10     0.0138   0.0778   0.0814
Beauty   NDCG@10   0.0177   0.0506   0.0541

3.5 Exp2: Zero-shot Evaluation
We test the potential of our model to serve as a foundation recommendation model. The experiment is conducted by first training the model on the fusion dataset, i.e., an extensive recommendation corpus collected from 19 recommendation datasets from Amazon Review, and then directly testing its recommendation performance on each test dataset in a completely zero-shot setting (none of the users or items appear in the training dataset).

3.5.1 Baselines. UniSRec [12] is the main comparable baseline in the zero-shot setting. It uses an encoder-only structure to learn item representations from their meta text information. We use the inductive version of UniSRec, which has cross-platform capability. In our experiments, we use the same training data for training UniSRec. Although UniSRec has a certain degree of flexibility to learn general information in recommendations, it incorporates some human-inductive knowledge in data preprocessing, e.g., the "title," "category," and "brand" information are carefully selected for Amazon datasets. For the Amazon datasets, we still use the pre-selected categories as in the original paper to showcase its best performance, which gives UniSRec a slight advantage. For Yelp, we use all the plain text information. Since neither model has seen these patterns, this is a fair comparison.

3.5.2 Results. The results are presented in Table 7. In this zero-shot evaluation, our model generally outperforms UniSRec, with the exception of HR@10 on the Music dataset, where UniSRec demonstrates better performance. Notably, for the Yelp dataset, which is cross-platform with respect to the training data, our model significantly surpasses the baseline with a 353.46% improvement. This demonstrates the superior generalizability of our proposed model, making

it more suitable for serving as a foundational model backbone. This is indeed expected, as our method can automatically create representative IDs for the items without requiring any special data processing.

Figure 3: Quality study of the generated item IDs: two examples for each dataset, one with lengthy plain text data and one with shorter plain text data. The blue IDs are generated by the initial ID generator pre-trained on article tag generation, while the green IDs are generated by the ID generator after alternate training.

Moreover, by comparing with the average HR and NDCG scores across the shared datasets presented in the standard evaluations, the foundational model's zero-shot recommendation capability is surprisingly comparable to, or even surpasses, that of some traditional recommendation models based on supervised training. For

example, the zero-shot performance surpasses that of GRU4Rec on all four datasets, Bert4Rec on three out of four datasets, and Caser on two out of four datasets. Impressively, on the Yelp dataset, our model's zero-shot performance outperforms all traditional methods based on supervised training, only falling short of the P5 model. As the model's zero-shot recommendation capability may further improve with larger and more carefully collected training data, this suggests great potential for its future use as a foundational model.

Table 7: Zero-shot Evaluation. Intra-platform datasets are from the same platform as the training data (i.e., Amazon) and inter-platform datasets are from an unseen recommendation platform (i.e., Yelp).
Scenario        Dataset      Metric    UniSRec  IDGenRec
Intra-platform  Sports       HR@5      0.0060   0.0156
Intra-platform  Sports       NDCG@5    0.0034   0.0134
Intra-platform  Sports       HR@10     0.0098   0.0218
Intra-platform  Sports       NDCG@10   0.0046   0.0154
Intra-platform  Beauty       HR@5      0.0118   0.0174
Intra-platform  Beauty       NDCG@5    0.0068   0.0135
Intra-platform  Beauty       HR@10     0.0206   0.0310
Intra-platform  Beauty       NDCG@10   0.0096   0.0177
Intra-platform  Toys         HR@5      0.0097   0.0103
Intra-platform  Toys         NDCG@5    0.0055   0.0079
Intra-platform  Toys         HR@10     0.0175   0.0215
Intra-platform  Toys         NDCG@10   0.0080   0.0114
Intra-platform  Music        HR@5      0.0152   0.0184
Intra-platform  Music        NDCG@5    0.0087   0.0148
Intra-platform  Music        HR@10     0.0294   0.0238
Intra-platform  Music        NDCG@10   0.0133   0.0165
Intra-platform  Instruments  HR@5      0.0154   0.0203
Intra-platform  Instruments  NDCG@5    0.0084   0.0139
Intra-platform  Instruments  HR@10     0.0280   0.0440
Intra-platform  Instruments  NDCG@10   0.0125   0.0215
Inter-platform  Yelp         HR@5      0.0064   0.0300
Inter-platform  Yelp         NDCG@5    0.0051   0.0248
Inter-platform  Yelp         HR@10     0.0081   0.0329
Inter-platform  Yelp         NDCG@10   0.0057   0.0258

4 RELATED WORKS
In recent years, there has been a surge in research on recommendation systems leveraging LLMs. LLM-based recommender systems can be broadly classified into two categories: discriminative and generative [35].
In LLM-based discriminative recommendation, LLMs are primarily used to learn better representations of users and items by leveraging contextual information in the recommendation process [12, 18, 24, 26, 34, 36, 38, 40]. Among them, BERT [6] is commonly used as the backbone. Compared to recommender systems that learn user/item embeddings primarily from user-item associations, these models capture rich textual information in conjunction with collaborative filtering. The core idea of these models is to learn embeddings with LLMs and integrate them into a paradigm based on ranking-score calculation. Since this type of work is not the main focus of this paper, readers may refer to [35] for a more comprehensive study.
Our work belongs to the second category: LLM-based generative recommendation [3, 8, 9, 14]. This approach transitions from traditional ranking-based recommendation to a pure text-to-text method. In comparison with the original score computation and re-ranking framework, this method can directly generate the target item from the complete pool of items [19], and has thus attracted significant attention. Geng et al. [8] introduced one of the first text-to-text generative recommendation models, leveraging a pre-trained T5 as the recommendation backbone. The model was fine-tuned on 5 diverse recommendation tasks with predefined prompts and demonstrated promising results on the downstream tasks. Following [8], [9] proposed a multimodal version with a similar architecture, considering item visual features during recommendation generation. In these works, numerical IDs were assigned to each item, enabling the recommended items to be generated token-by-token. Subsequently, [14] posited that indexing item IDs is the most pivotal aspect of generative recommendation. They conducted an extensive study on indexing methods such as sequential indexing, semantic indexing, and collaborative indexing, and assessed their effects on sequential recommendation performance. Although these methods have shown potential, they share a limitation. A primary advantage of using pre-trained language models lies in their capability to understand human language, which makes it crucial to fully utilize contextual information. However, the rich textual information on recommender platforms is not fully harnessed when numerical IDs are used. Furthermore, the employment of Out-Of-Vocabulary (OOV) tokens for item representation limits the generalizability of these methods, confining the trained model to a single dataset. Alternative approaches, such as using item titles [15] or categories [14] as semantic IDs, may initially seem to offer a more meaningful representation. However, these approaches may encounter duplicated ID issues as the number of items increases, and the IDs may have unintended overlaps. For example, similar titles may represent completely different items. Our method distinguishes itself by proposing an ID generator that derives semantic item IDs from rich contextual information, which both exploits this information and facilitates cross-platform, zero-shot recommendation.
Another branch of generative recommendation research has concentrated on probing the potential of LLMs to directly generate recommendations without the need for training or fine-tuning [4, 7, 13, 20, 21, 29]. The primary focus in these works is the design of prompts. The emergence of this line of research was inspired by the exceptional zero-shot capabilities of ChatGPT. Researchers are keen to ascertain how ChatGPT performs in a recommendation scenario. However, as pointed out by [20], ChatGPT cannot generate recommendations with accuracy competitive with that of traditional recommendation methods.

5 CONCLUSION
In this paper, we address the item encoding problem in generative recommendation systems by introducing a novel framework that incorporates textual ID learning. Specifically, we employ an ID generator to produce unique, concise, and semantically rich textual IDs that are platform-agnostic and based on human vocabulary. We also propose a diverse ID generation algorithm and an alternate training strategy to better align the LLM-based ID generator and the base recommender. The model is shown to outperform existing recommendation baselines in the standard sequential recommendation setting. Furthermore, by training our model on some datasets while testing on other unseen datasets, our model shows

strong performance in the zero-shot recommendation scenario. Our research offers a new perspective on aligning large language models and recommender systems by bridging the two through meticulously learned textual IDs, which may serve as a solid basis for training foundational recommendation models in the future.

REFERENCES
[1] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022).
[2] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
[3] Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-rec: Generative pretrained language models are open-ended recommender systems. arXiv preprint arXiv:2205.08084 (2022).
[4] Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering ChatGPT's Capabilities in Recommender Systems. arXiv preprint arXiv:2305.02182 (2023).
[5] Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2020. Autoregressive entity retrieval. arXiv preprint arXiv:2010.00904 (2020).
[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[7] Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-rec: Towards interactive and explainable llms-augmented recommender system. arXiv preprint arXiv:2303.14524 (2023).
[8] Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems. 299–315.
[9] Shijie Geng, Juntao Tan, Shuchang Liu, Zuohui Fu, and Yongfeng Zhang. 2023. VIP5: Towards Multimodal Foundation Models for Recommendation. arXiv preprint arXiv:2305.14302 (2023).
[10] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web. 507–517.
[11] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[12] Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 585–593.
[13] Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. 2023. Large language models are zero-shot rankers for recommender systems. arXiv preprint arXiv:2305.08845 (2023).
[14] Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023. How to Index Item IDs for Recommendation Foundation Models. SIGIR-AP (2023).
[15] Jianchao Ji, Zelong Li, Shuyuan Xu, Wenyue Hua, Yingqiang Ge, Juntao Tan, and Yongfeng Zhang. 2024. Genrec: Large language model for generative recommendation. In European Conference on Information Retrieval. Springer, 494–502.
[16] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE international conference on data mining (ICDM). IEEE, 197–206.
[17] Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226 (2018).
[18] Jiacheng Li, Ming Wang, Jin Li, Jinmiao Fu, Xin Shen, Jingbo Shang, and Julian McAuley. 2023. Text Is All You Need: Learning Language Representations for Sequential Recommendation. arXiv preprint arXiv:2305.13731 (2023).
[19] Lei Li, Yongfeng Zhang, Dugang Liu, and Li Chen. 2023. Large Language Models for Generative Recommendation: A Survey and Visionary Discussions. arXiv preprint arXiv:2309.01157 (2023).
[20] Junling Liu, Chao Liu, Renjie Lv, Kang Zhou, and Yan Zhang. 2023. Is chatgpt a good recommender? a preliminary study. arXiv preprint arXiv:2304.10149 (2023).
[21] Qijiong Liu, Nuo Chen, Tetsuya Sakai, and Xiao-Ming Wu. 2023. A First Look at LLM-Powered Generative News Recommendation. arXiv preprint arXiv:2305.06566 (2023).
[22] Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical gating networks for sequential recommendation. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 825–833.
[23] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 43–52.
[24] Aashiq Muhamed, Iman Keivanloo, Sujan Perera, James Mracek, Yi Xu, Qingjun Cui, Santosh Rajagopalan, Belinda Zeng, and Trishul Chilimbi. 2021. CTR-BERT: Cost-effective knowledge distillation for billion-parameter teacher models. In NeurIPS Efficient Natural Language and Speech Processing Workshop.
[25] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[26] Zhaopeng Qiu, Xian Wu, Jingyue Gao, and Wei Fan. 2021. U-BERT: Pre-training user representations for improved recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 4320–4327.
[27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
[28] Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
[29] Weiwei Sun, Lingyong Yan, Xinyu Ma, Pengjie Ren, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. arXiv preprint arXiv:2304.09542 (2023).
[30] Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the eleventh ACM international conference on web search and data mining. 565–573.
[31] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[32] Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424 (2016).
[33] Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1, 2 (1989), 270–280.
[34] Chuhan Wu, Fangzhao Wu, Tao Qi, and Yongfeng Huang. 2021. Empowering news recommendation with pre-trained language models. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1652–1656.
[35] Likang Wu, Zhi Zheng, Zhaopeng Qiu, Hao Wang, Hongchao Gu, Tingjia Shen, Chuan Qin, Chen Zhu, Hengshu Zhu, Qi Liu, et al. 2023. A Survey on Large Language Models for Recommendation. arXiv preprint arXiv:2305.19860 (2023).
[36] Shitao Xiao, Zheng Liu, Yingxia Shao, Tao Di, Bhuvan Middha, Fangzhao Wu, and Xing Xie. 2022. Training large-scale news recommenders with pretrained language models in the loop. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4215–4225.
[37] Shuyuan Xu, Wenyue Hua, and Yongfeng Zhang. 2024. OpenP5: An Open-Source Platform for Developing, Training, and Evaluating LLM-based Recommender Systems. SIGIR (2024).
[38] Shaowei Yao, Jiwei Tan, Xi Chen, Juhao Zhang, Xiaoyi Zeng, and Keping Yang. 2022. ReprBERT: Distilling BERT to an Efficient Representation-Based Relevance Model for E-Commerce. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4363–4371.
[39] Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001 (2023).
[40] Song Zhang, Nan Zheng, and Danli Wang. 2022. GBERT: Pre-training User representations for Ephemeral Group Recommendation. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 2631–2639.
[41] Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level Deeper Self-Attention Network for Sequential Recommendation. In IJCAI. 4320–4326.
[42] Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM international conference on information & knowledge management. 1893–1902.
