Retrieval-Augmented Personalization for Multimodal Large Language Models

Hao, Haoran; Han, Jiaming; Li, Changsheng; Li, Yu-Feng; Yue, Xiangyu

arXiv:2410.13360 (cs)

[Submitted on 17 Oct 2024 (v1), last revised 18 Nov 2024 (this version, v2)]

Abstract: The development of large language models (LLMs) has significantly enhanced the capabilities of multimodal LLMs (MLLMs) as general assistants. However, a lack of user-specific knowledge still restricts their application in people's daily lives. In this paper, we introduce the Retrieval Augmented Personalization (RAP) framework for personalizing MLLMs. Starting from a general MLLM, we turn it into a personalized assistant in three steps. (a) Remember: We design a key-value database to store user-related information, e.g., the user's name, avatar and other attributes. (b) Retrieve: When the user initiates a conversation, RAP retrieves relevant information from the database using a multimodal retriever. (c) Generate: The input query and the retrieved concepts' information are fed into the MLLM to generate personalized, knowledge-augmented responses. Unlike previous methods, RAP allows real-time concept editing by updating the external database. To further improve generation quality and alignment with user-specific information, we design a pipeline for data collection and create a specialized dataset for personalized training of MLLMs. Based on this dataset, we train a series of MLLMs as personalized multimodal assistants. By pretraining on a large-scale dataset, RAP-MLLMs can generalize to infinite visual concepts without additional finetuning. Our models demonstrate outstanding flexibility and generation quality across a variety of tasks, such as personalized image captioning, question answering and visual recognition. The code, data and models are available at this https URL.
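
The three steps in the abstract (Remember, Retrieve, Generate) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Concept`, `ConceptDatabase`, `cosine` and `build_prompt` are hypothetical, the three-number embeddings are hand-written stand-ins for the features a multimodal retriever would produce, and the final call to an MLLM is left out. The sketch only assumes that concepts are stored with an embedding as the lookup key and text as the value, and that retrieval ranks entries by cosine similarity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Concept:
    name: str               # e.g. the name the user gave the concept
    info: str               # free-text attributes stored for the concept
    embedding: list[float]  # toy stand-in for a feature of the concept's image


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ConceptDatabase:
    concepts: dict[str, Concept] = field(default_factory=dict)

    def remember(self, concept):
        # Remember: adding or overwriting an entry is the whole "edit" step;
        # no model weights are touched.
        self.concepts[concept.name] = concept

    def retrieve(self, query_embedding, k=1):
        # Retrieve: rank stored concepts by similarity to the query embedding.
        ranked = sorted(
            self.concepts.values(),
            key=lambda c: cosine(c.embedding, query_embedding),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(question, retrieved):
    # Generate (input side only): prepend retrieved concept info to the query.
    # A real system would pass this, plus the images, to an MLLM.
    lines = [f"<{c.name}>: {c.info}" for c in retrieved]
    return "\n".join(lines + [question])


db = ConceptDatabase()
db.remember(Concept("bo", "a corgi owned by the user", [0.9, 0.1, 0.0]))
db.remember(Concept("mug", "the user's blue coffee mug", [0.0, 0.2, 0.9]))

query_embedding = [0.8, 0.2, 0.1]  # assumed embedding of the user's new photo
top = db.retrieve(query_embedding, k=1)
prompt = build_prompt("What is in this photo?", top)

# Concept editing: overwrite the entry, then retrieve again.
db.remember(Concept("bo", "a corgi who just turned three", [0.9, 0.1, 0.0]))
edited_prompt = build_prompt("What is in this photo?", db.retrieve(query_embedding))
```

In this sketch the second prompt reflects the edited entry immediately, which mirrors the abstract's point that concepts can be changed by updating the external database rather than by retraining the model.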

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Cite as:	arXiv:2410.13360 [cs.CV]
	(or arXiv:2410.13360v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2410.13360

Submission history

From: Haoran Hao
[v1] Thu, 17 Oct 2024 09:10:26 UTC (18,944 KB)
[v2] Mon, 18 Nov 2024 15:35:14 UTC (18,942 KB)
