License: CC BY 4.0
arXiv:2403.09072v1 [cs.CV] 14 Mar 2024

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

Sipeng Zheng¹    Bohan Zhou²    Yicheng Feng²    Ye Wang¹    Zongqing Lu¹,²
¹Beijing Academy of Artificial Intelligence (BAAI)    ²School of Computer Science, Peking University
Zongqing Lu is the corresponding author.
spzheng@baai.ac.cn zhoubh@stu.pku.edu.cn yewang@stu.ecnu.edu.cn {fyc813, zongqing.lu}@pku.edu.cn
Abstract

In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals. This innovation addresses a critical limitation in existing MLLMs: their reliance on a text-only codebook, which restricts their ability to generate images and texts in a multimodal context. Towards this end, we propose a language-driven iterative training paradigm, coupled with an in-context pre-training task we term “image decompression”, enabling our model to interpret compressed visual data and generate high-quality images. The unified codebook empowers our model to extend visual instruction tuning to non-linguistic generation tasks. Moreover, UniCode is adaptable to diverse stacked quantization approaches in order to compress visual signals into a more compact token representation. Despite using significantly fewer parameters and less data during training, UniCode demonstrates promising capabilities in visual reconstruction and generation. It also achieves performance comparable to leading MLLMs across a spectrum of VQA benchmarks.

Keywords:
Multimodal Learning · Large Model · Visual Generation

1 Introduction

The rapid development of large language models (LLMs) [39, 50, 51] has spurred growing interest in their multimodal counterparts [1, 29, 33]. Empowered by LLMs, existing foundation models have shown remarkable capabilities in multimodal understanding, spanning from basic image classification [41, 63] and captioning [54, 29], to more intricate tasks such as making strategic high-level plans for open-world agents [67, 17]. As illustrated in Figure 1 (a), these works rely on a lightweight module such as a multimodal projector [33, 29] to seamlessly map visual signals into LLM’s textual space with minimal training cost.

Figure 1: Three paradigms of MLLMs: (a) vis enc+text tok incorporates a lightweight module to align visual signals with the LLM, specifically designed for language generation; (b) vis tok+text tok concatenates the text codebook with quantized visual tokens, significantly increasing the computational cost and complexity; (c) unified tok learns a unified codebook to interpret both visual and text modalities without additional modules. We explore the last option by proposing UniCode in this work.

Although progress has been made, most multimodal large language models (MLLMs) are still limited to language generation. This limitation stems from their reliance on text-only codebooks, which restricts their application across diverse scenarios, such as image generation [27]. Note that images, like text, can be tokenized into a series of discrete codes through Vector Quantization (VQ) [52, 15, 27], which suggests a straightforward strategy to enhance MLLMs: extending the LLM’s codebook to include visual codes [36, 8] as shown in Figure 1 (b). However, this approach introduces new challenges. Firstly, it requires considerable effort to overcome the substantial modality gap between visual and text codes. Secondly, enlarging the codebook leads to an upsurge in model parameters and risks “codebook collapse” [12], where the model overly relies on a limited set of codes, posing significant obstacles in the training of MLLMs.

Instead of expanding the codebook size, we pose a question: “Is it feasible to learn a unified codebook capable of quantizing language, vision, and potentially other modalities?” As illustrated in Figure 1 (c), a unified codebook could seamlessly integrate various data types, thereby equipping MLLMs with the ability to generate non-linguistic content without the need for additional parameters or specialized modules. Recent initiatives have explored mapping visual signals into the text token space of a frozen LLM. Yet, these efforts often yield inferior results compared with those using a learned visual codebook [26, 31], or they require an exponential number of tokens to accurately represent an image [62]. More importantly, relying on a frozen LLM means the model can no longer adapt to follow human instructions through alignment with high-quality, instruction-following data [49, 6]. Hence, being trainable is an indispensable feature for MLLMs.

Considering this, we introduce UniCode, the first attempt to craft a Unified Codebook for MLLMs, by integrating a VAE-style visual tokenizer [52] with the LLM. To achieve this, we alternate the training process between these two modules, iteratively synchronizing the visual tokenizer’s codebook with the LLM’s to maintain consistency. This process, which we term "language-driven iterative training", utilizes a smooth moving average to update the visual codebook. To further enhance the fidelity of images generated by UniCode, we introduce a novel pre-training task: in-context image decompression. This task leverages in-context instructions to transform compressed image data into discrete visual tokens. Furthermore, UniCode is designed to support stacked quantization [27, 62] to optimize visual tokenization efficiency, which compresses images into stacked code maps to reduce the feature resolution. The unified codebook effectively converts visual inputs into language tokens. Building on it, UniCode broadens the scope of visual instruction tuning [31] by reformulating multimodal generation in an instruction-following format.

Our key contributions can be summarized as follows:

  • We propose UniCode, an innovative paradigm for MLLMs, featuring a unified codebook capable of tokenizing both visual and textual inputs. To achieve this, we adopt language-driven iterative training to learn such a codebook without additional parameters for visual-text alignment.

  • We enrich the model’s tuning with non-linguistic data integrated into the existing visual instructional dataset. The tuning process is augmented by a unique in-context image decompression task, designed to improve the model’s ability to interpret and generate complex multimodal content.

  • Experimental analysis shows the effectiveness of UniCode compared to state-of-the-art MLLMs. Notably, this is achieved using a more efficient visual encoder that requires significantly fewer parameters and training samples.

2 Related Work

2.1 Visual Quantization

Vector quantization (VQ) has achieved remarkable success in creating high-resolution images [52, 61, 5] and videos [20, 57, 56]. VQ-VAE [52] first converts images into discrete representations and autoregressively models their distribution. Following this work, Razavi et al.[43] adopt learned hierarchical representations, while Esser et al.[15] introduce perceptual adversarial loss [53] to refine the perceptual quality of reconstructed images. Inspired by residual quantization [24, 35], Lee et al.[27] develop residual quantization (RQ), a technique that encodes images into a stacked map of discrete codes, thereby efficiently reducing the spatial resolution of features. You et al. [59] propose hierarchical vector quantization (HQ), which employs a pyramid scheme with two-level codes for image encoding. Despite these advancements, a limitation of these methods is that their codebooks, being jointly trained with the encoder and decoder, lack direct interpretability in natural language. To address this, recent research has investigated leveraging frozen LLMs for image understanding [26, 31]. Liu et al.[31] innovate with LQAE, replacing the learned codebook with a text vocabulary from the frozen BERT [11]. Despite its novelty, LQAE falls short in the fidelity of image reconstruction, underscoring the challenges of using a frozen LLM for content generation across modalities. Yu et al.[62] aim to solve the challenge by arranging quantized tokens in a multi-layer, coarse-to-fine pyramid.

2.2 Multimodal Instruction Tuning

In the field of natural language processing (NLP), previous studies have made significant strides in enabling LLMs [4, 42, 64, 7] to comprehend and execute natural language instructions, through a process known as instruction tuning [40]. Following this practice, recent efforts have extended its application to the multimodal realm [58, 55]. Among these works, Liu et al.[33] introduce LLaVA, the first model to apply the concept of visual instruction tuning to build a versatile visual assistant. Following this, Li et al.[28] propose Mimic-it, enhancing the model’s capability by incorporating multimodal in-context information directly into instruction data. Zhang et al.[65] and Zhao et al.[66] have furthered research in this area by scaling instructional data and enriching it with text-dense images. In addition to simply increasing data volume, Dai et al.[9] develop InstructBLIP based on BLIP-2 [29], which introduces an advanced visual feature extraction mechanism to bolster performance across vision-language tasks.

While existing foundation models have marked impressive strides in multimodal benchmarks, their capabilities are still limited to text-only generation. Recently, Sun et al.[48] introduce Emu, a model crafted for generative pretraining across multiple modalities. Despite its innovation, Emu necessitates a robust visual encoder with 1 billion parameters and relies on 80 million samples for effective pretraining. Meanwhile, Lu et al.[36] propose Unified-IO 2, an MLLM akin to our approach, which encodes and generates text, vision, audio, and interleaved sequences. Yet, it also incurs significant computational demands, being trained on 1 billion image-text pairs. In contrast, UniCode diverges from the above approaches by focusing on learning a unified codebook. We demonstrate through experiments that our model substantially decreases resource requirements while still achieving competitive results.

3 Unified Codebook Learning

UniCode is built without bells and whistles, so that it can be easily replicated on any transformer-based LLM architecture. Our primary objective is to craft a unified codebook that efficiently tokenizes multimodal information. To achieve this, we first give a brief overview of visual tokenization in Section 3.1. Then in Section 3.2, we propose our language-driven iterative training paradigm, after discussing various alternatives for synchronizing the learning of both visual and linguistic codebooks. In Section 3.3, we further propose a novel image decompression task designed for generation enhancement.

3.1 Visual Tokenization

Visual tokenization [52] is a process that compresses visual signals (e.g., images) into a series of discrete tokens. A visual tokenizer generally consists of an encoder $\mathbb{E}$, a decoder $\mathbb{D}$ and a codebook $\mathbb{C}=\{(k,e(k)) \mid k\in\{1,\cdots,K\}\}$, where $K$ denotes the codebook size. Here, $\mathbb{C}$ is a finite set of pairs, each consisting of a code $k$ and its corresponding $n$-dimensional code embedding $e(k)\in\mathbb{R}^{n}$. Similar to an LLM, for each vector $z\in\mathbb{R}^{n}$, the operation of visual tokenization $Q(z;\mathbb{C})$ is defined to select the code from $\mathbb{C}$ whose embedding is closest to $z$, which is described as:

$$Q(z;\mathbb{C})=\operatorname*{argmin}_{k\in\{1,\cdots,K\}}\|z-e(k)\|_{2}^{2}. \tag{1}$$

Given an image $\mathcal{I}\in\mathbb{R}^{H\times W\times 3}$, the visual tokenizer first uses the encoder $\mathbb{E}$ to derive its feature map $Z_{0}\in\mathbb{R}^{h\times w\times c}$, with $c$ representing the embedding dimension. Subsequently, each vector $z\in Z_{0}$ is assigned to the closest code within the codebook $\mathbb{C}$, yielding a code map $M_{ij}=Q({Z_{0}}_{ij};\mathbb{C})$ and its quantized feature map $Z_{ij}=e(M_{ij})$, where $i\in[1,h]$ and $j\in[1,w]$. The decoder then uses $Z$ to reconstruct the image. In this work, the LLM is designed either to interpret these quantized embeddings as input, or to directly generate discrete tokens that signify visual semantic concepts.
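The nearest-code lookup of Eq. 1, applied to every position of a feature map, can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `tokenize_feature_map`, the toy codebook and the toy feature map are hypothetical, the encoder and decoder are omitted, and codes are 0-indexed here whereas the text indexes them from 1 to $K$.

```python
import numpy as np

def tokenize_feature_map(z0, codebook):
    """Assign every feature vector to its nearest code (Eq. 1).

    z0:       (h, w, c) encoder feature map (stand-in for the encoder output).
    codebook: (K, c) code embeddings e(k), 0-indexed in this sketch.
    Returns the code map M of shape (h, w) and the quantized map Z of shape (h, w, c).
    """
    h, w, c = z0.shape
    flat = z0.reshape(h * w, c)
    # Squared Euclidean distance from each feature to each code: (h*w, K).
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    code_map = d2.argmin(axis=1).reshape(h, w)
    quantized = codebook[code_map]  # Z_ij = e(M_ij)
    return code_map, quantized

# Toy example with K = 3 codes of dimension c = 2 (values chosen by hand).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]])
z0 = np.array([[[0.1, -0.2], [0.9, 1.2]],
               [[3.0, 0.5], [0.2, 0.1]]])
code_map, quantized = tokenize_feature_map(z0, codebook)
print(code_map)
```

For realistic codebook sizes, the dense `(h*w, K, c)` intermediate would be replaced by the expanded form of the squared distance, but the selection rule stays the same.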

Efficient Stacked Quantization. As the resolution $(h\times w)$ of the code map $M$ increases, the computational demand on the LLM grows quadratically. Given the LLM’s inherent constraint of processing only a finite length of token sequences, reducing the resolution of the code map becomes crucial. However, the fidelity of reconstruction is deeply influenced by the tokens’ bit-depth [45]. To strike a balance between efficiency and quality in visual tokenization, we consider stacked quantization as a viable solution [38] to decrease the resolution of $M$. Specifically, stacked quantization preserves the visual information by generating a $D$-layer code map $\hat{M}_{d}\in\mathbb{N}^{\hat{h}_{d}\times\hat{w}_{d}\times D}$, where $d\in[1,D]$ and the dimensions $\hat{h}$, $\hat{w}$ are significantly reduced compared to $h$, $w$.
For each element $(i,j)$ in the code map, its ultimate embedding is an aggregation of $D$ quantized vectors $\hat{z}_{ij}=\mathcal{F}_{d=1}^{D}e(\hat{M}_{i,j,d})$, with $\mathcal{F}$ denoting the aggregation function (e.g., concatenation in HQ [59], cumulative sum in RQ [27]). In our study, HQ serves as a prime example for illustration. Note that UniCode is adaptable to different variants of stacked quantization, making it a fertile area for further research.
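To make the aggregation $\mathcal{F}$ concrete, the sketch below shows the cumulative-sum case in the style of residual quantization (RQ): each layer quantizes what the previous layers left unexplained, and the final embedding is the sum of the $D$ selected code embeddings. Note that this illustrates the RQ variant, not the HQ variant used as the running example in the text. The function name `residual_quantize`, the single codebook shared across layers, and the toy values are assumptions made for illustration.

```python
import numpy as np

def residual_quantize(z, codebook, depth):
    """RQ-style stacked quantization with a cumulative-sum aggregation.

    z:        (N, c) float feature vectors.
    codebook: (K, c) code embeddings, shared across all layers (an assumption).
    Returns codes of shape (N, depth) and the aggregated embeddings (N, c).
    """
    residual = z.astype(float).copy()
    z_hat = np.zeros_like(residual)
    codes = []
    for _ in range(depth):
        # Nearest code to the current residual, as in Eq. 1.
        d2 = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        k = d2.argmin(axis=1)
        selected = codebook[k]
        z_hat += selected        # cumulative sum over layers
        residual -= selected     # what remains to be explained
        codes.append(k)
    return np.stack(codes, axis=1), z_hat

# Toy example: one vector, K = 4 codes, D = 3 layers.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.25, 0.25]])
z = np.array([[1.4, 0.6]])
codes, z_hat = residual_quantize(z, codebook, depth=3)
print(codes, z_hat)
```

In this toy case a single position carries three codes, which is how a stacked code map trades spatial resolution for depth.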

In our approach, these visual tokens directly correspond to entries in the LLM’s codebook, enabling our proposed UniCode to seamlessly interpret these aggregated quantized embeddings. Furthermore, we propose an image decompression task to enhance the LLM’s capability for converting quantized embeddings into language tokens.

3.2 Codebook Learning Paradigm

Before introducing our proposed paradigm, we first discuss two alternatives to obtain a unified codebook:

Frozen LLM Codebook. We start with a straightforward approach as illustrated in Figure 2 (a), where the visual tokenizer’s codebook is initialized from a pretrained LLM and is simply kept frozen during training. While this approach directly links the visual tokenizer with the language vocabulary, it falls short in accurately capturing the semantic nuances in images. Our empirical study further reveals that employing a frozen codebook adversely impacts the quality of reconstruction, especially for stacked quantization methods such as hierarchical quantization (HQ). This can be primarily attributed to two factors: the absence of an explicit mechanism to synchronize the encoder/decoder with the frozen codebook, and the varying scales of multi-layer embeddings that divide the codebook into multiple parts [31]. The pursuit of optimal reconstruction fidelity motivates the following development of dynamic alignment between the codebook and the encoder/decoder of the visual tokenizer.

Figure 2: Illustration of multiple paradigms to obtain a unified codebook. Dotted line indicates the training loop: (a) frozen LLM codebook, which initializes the codebook from a pretrained LLM and freezes it during training; (b) dual alternative training, which jointly trains both the visual tokenizer and the LLM, by alternately updating each one’s codebook using the other’s parameters; (c) language-driven iterative training, which smoothly updates the codebook of the visual tokenizer with the LLM’s through a moving average.

Dual Alternative Training. As shown in Figure 2 (b), this approach dynamically aligns the visual tokenizer and LLM by alternating their training. In each training step of the visual tokenizer, its codebook is directly replaced by that of the LLM, and vice versa. This approach ensures both modules are progressively optimized in a unified direction using a shared codebook. However, a new challenge arises from the disparity in the codebook change rate between the two modules. To be specific, the codebook changes significantly more in the visual tokenizer than in the LLM. This becomes even more severe for stacked quantization due to its multi-layer code map, where each additional layer requires one more update of the codebook. Such disparity ultimately leads to misalignment between the codebook and the LLM, impairing the LLM’s language generation capabilities.

Language-driven Iterative Training. To overcome the above issues and facilitate unified codebook learning, we introduce this paradigm as illustrated in Figure 2 (c). Unlike dual alternative training, this approach does not employ the visual tokenizer to update the LLM’s codebook. Instead, we apply the exponential moving average (EMA) method [27] to ensure the codebook’s alignment with the visual encoder, which dynamically updates the visual tokenizer’s codebook at a certain decay rate $\lambda$:

$$\mathbb{C}^{\prime}=\lambda\mathbb{C}+(1-\lambda)\,\mathbb{I}\cdot Z. \tag{2}$$

$Z\in\mathbb{R}^{hw\times c}$ represents the flattened features of a given image, as generated by the encoder. The indicator map $\mathbb{I}\in\mathbb{R}^{K\times hw}$ summarises the usage of each code of $\mathbb{C}$ in the feature map $Z$. Crucially, at regular intervals, we integrate the codebook $\mathbb{C}_{L}$ in the LLM to replace $\mathbb{I}\cdot Z$ to update $\mathbb{C}$, which can be denoted as:

$$\mathbb{C}^{\prime}=\lambda\mathbb{C}+(1-\lambda)\,\mathbb{C}_{L}. \tag{3}$$

Equation 3 ensures the gradual convergence of the visual tokenizer’s codebook towards $\mathbb{C}_{L}$ during training. Our paradigm not only aids in the efficient acquisition of a unified codebook, but also ensures the training of the LLM remains undisturbed by the updates in the visual tokenizer. Note that our paradigm is adaptable to various tuning approaches of the LLM, including full parameter tuning, LoRA [22], or even freezing the LLM. A significant distinction of our approach compared to other MLLMs is that it does not need additional modules for visual-text alignment. We believe this could be an alternative for unified MLLMs, especially considering the recent breakthrough of visual sequential modeling [3].
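The two moving-average updates can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical; for Eq. 2 we assume the indicator map is normalized per code so that $\mathbb{I}\cdot Z$ is the mean of the features assigned to each code, and that unused codes are left unchanged (neither detail is specified in the text). Eq. 3 is applied here at every step only to show its convergence behaviour, whereas the text applies it at regular intervals.

```python
import numpy as np

def ema_update_from_features(codebook, features, codes, lam):
    """Eq. 2 style update toward the encoder features.

    codebook: (K, c), features: (hw, c), codes: (hw,) nearest-code indices.
    Assumption: I @ Z is normalized by usage counts; unused codes are untouched.
    """
    num_codes, num_feats = codebook.shape[0], features.shape[0]
    indicator = np.zeros((num_codes, num_feats))
    indicator[codes, np.arange(num_feats)] = 1.0
    counts = indicator.sum(axis=1, keepdims=True)
    used = counts[:, 0] > 0
    means = (indicator[used] @ features) / counts[used]
    updated = codebook.copy()
    updated[used] = lam * codebook[used] + (1.0 - lam) * means
    return updated

def sync_with_llm(codebook, llm_codebook, lam):
    """Eq. 3: move the visual codebook toward the LLM codebook C_L."""
    return lam * codebook + (1.0 - lam) * llm_codebook

# Repeated application of Eq. 3 shrinks the gap to C_L by a factor lam per step.
rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 4))
llm = rng.normal(size=(8, 4))
lam, steps = 0.9, 50
current = visual.copy()
for _ in range(steps):
    current = sync_with_llm(current, llm, lam)
print(np.abs(current - llm).max())
```

After $n$ applications of Eq. 3 with a fixed $\mathbb{C}_{L}$, the remaining gap is exactly $\lambda^{n}(\mathbb{C}_{0}-\mathbb{C}_{L})$, which is the gradual convergence described above.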

3.3 In-context Image Decompression

Figure 3: Illustration of the procedure for the in-context image decompression task, which accepts the compressed quantized embeddings $\hat{Z}\in\mathbb{R}^{\hat{h}\times\hat{w}}$ as inputs, and then transforms these embeddings into their flattened codes $\hat{M}\in\mathbb{R}^{\hat{h}\times\hat{w}\times D}$, which are subsequently used for visual decoding.

Since we adopt the stacked quantization introduced in Section 3.1 to represent images with fewer tokens, UniCode encounters a misalignment issue when aggregating word embeddings with the LLM, which can hinder the learning of semantically meaningful tokens. To tackle this issue, we propose an image decompression pre-training task as shown in Figure 3, whose objective is to reconstruct the multi-layer code map $\hat{M}$ by feeding the LLM with the aggregated quantized embeddings $\hat{Z}$. Initially, $\hat{Z}$ is processed into a flattened sequence of length $\hat{h}\times\hat{w}$. We then define the target sequence as $\{u_{1},u_{2},\ldots,u_{\hat{h}\times\hat{w}\times D}\}$, which is derived from $\hat{M}$, where each $u_{l}\in\hat{M}$. Our goal is to maximize the likelihood of generation in an auto-regressive manner:

$$\max_{\Theta}\sum_{l=1}^{\hat{h}\times\hat{w}\times D}\log P_{\Theta}(u_{l}\mid u_{<l};\hat{Z}), \tag{4}$$

where $\Theta$ denotes the trainable parameters of the LLM. Moreover, to enhance our model’s capability in interpreting and generating across various modalities, we adopt a strategy similar to Liu et al.[33]. To be specific, we construct instruction-following pairs of multi-turn, conversation-style data for in-context learning: $\{\mathcal{X}_{m}^{1},\mathcal{X}_{z}^{1},\ldots,\mathcal{X}_{m}^{T},\mathcal{X}_{z}^{T}\}$. Here, an image is segmented into $T$ pieces, with $\mathcal{X}_{z}^{t}$ and $\mathcal{X}_{m}^{t}$ representing the quantized embeddings and their corresponding visual codes for each segment $t$. We organize these segments sequentially and consider each $\mathcal{X}_{m}^{t}$ as the response from the LLM. For a sequence of length $L$, the probability of generating the target codes $\mathcal{X}_{m}$ is computed as:

$$P(\mathcal{X}_{m}^{t}\mid\mathcal{X}_{z}^{t})=\prod_{i=1}^{L}P_{\Theta}(x_{i}\mid\mathcal{X}_{m}^{<i},\mathcal{X}_{z}^{<i}), \tag{5}$$

where $\mathcal{X}_{m}^{<i}$ and $\mathcal{X}_{z}^{<i}$ denote the visual codes and their compressed quantized embeddings in all segments before the current prediction token $x_{i}$, respectively. We incorporate this task into our multimodal instruction tuning data. The task mimics the misalignment between the compressed image embeddings and the LLM, encouraging UniCode to generate images with higher quality.
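The data layout of the decompression task can be sketched as below: the aggregated embeddings are flattened into $\hat{h}\times\hat{w}$ positions, the code map into $\hat{h}\times\hat{w}\times D$ target codes, and both are cut into $T$ aligned segments, with each code segment acting as the response to its embedding segment. The function name `build_decompression_turns`, the row-major flattening order, the contiguous equal-sized segmentation, and the explicit channel dimension on $\hat{Z}$ are all assumptions for illustration; the text does not fix these details.

```python
import numpy as np

def build_decompression_turns(z_hat, m_hat, num_segments):
    """Split one image into (embedding segment, target code segment) turns.

    z_hat: (h, w, c) aggregated quantized embeddings (model input).
    m_hat: (h, w, D) stacked code map (model target).
    Assumption: row-major flattening, with the D codes of a position kept adjacent.
    """
    h, w, c = z_hat.shape
    depth = m_hat.shape[-1]
    z_seq = z_hat.reshape(h * w, c)        # input sequence of length h*w
    m_seq = m_hat.reshape(h * w, depth)    # D target codes per position
    turns = []
    for z_seg, m_seg in zip(np.array_split(z_seq, num_segments),
                            np.array_split(m_seq, num_segments)):
        # The flattened codes of this segment are the response for the turn.
        turns.append((z_seg, m_seg.reshape(-1)))
    return turns

# Toy example: a 2x4 map with c = 3 channels, D = 2 layers, T = 4 segments.
z_hat = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
m_hat = np.arange(2 * 4 * 2).reshape(2, 4, 2)
turns = build_decompression_turns(z_hat, m_hat, num_segments=4)
targets = np.concatenate([codes for _, codes in turns])
print(len(turns), targets.shape)
```

Concatenating the per-turn targets recovers the full target sequence of length $\hat{h}\times\hat{w}\times D$ from Eq. 4, so the multi-turn layout changes how the sequence is presented to the model, not what is predicted.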

4 Training

Following Liu et al.[32], we first leverage pairs of image-text data for multimodal instruction tuning. Specifically, we organize each instructional instance into a sequence of multi-round dialogues, represented as $\{\mathcal{X}_{q}^{1},\mathcal{X}_{a}^{1},\ldots,\mathcal{X}_{q}^{N},\mathcal{X}_{a}^{N}\}$. In this sequence, each pair $\{\mathcal{X}_{q}^{i},\mathcal{X}_{a}^{i}\}$ signifies a question-answer round between a human and the chatbot assistant, with $N$ indicating the total number of dialogue rounds. This structured format is consistently applied throughout our instructional dataset. In addition, with the advancement that allows images to be represented as discrete language tokens, our model is capable of converting text-to-image samples (e.g., CC3M [46]) into instruction-answer pairs. Lastly, we also prepare task-specific data using the same format for in-context image decompression as described in Section 3.3. We combine all the above data for multimodal instruction tuning. To train our model efficiently, we adopt a negative log-likelihood objective over the prediction tokens:

$$\mathcal{L}(\Theta)=-\sum_{j=1}^{L}\log P_{\Theta}(y_{j}\mid\mathcal{I},\hat{y}_{1:j-1}). \tag{6}$$

Here, $y$ and $\hat{y}$ are used to represent the target and input token sequences, respectively, while $\Theta$ denotes the model parameters and $L$ denotes the length of the target sequence. Depending on the specific instruction provided, the input visual content, represented as $\mathcal{I}$, may correspond to an empty image. A notable aspect is the restriction of loss computation exclusively to the answer tokens $\mathcal{X}_a$, which is designed to avoid oversimplifying the training process and to ensure that the model remains focused on generating accurate and coherent responses. We adopt a two-stage instruction-tuning process to train UniCode. It is important to note that our training process does not include the multimodal alignment stage, which is different from Liu et al. [33].
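The answer-only restriction of Eq. (6) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `answer_only_nll`, the toy probabilities, and the 0/1 mask are assumptions made for illustration. The mask marks which target tokens belong to an answer $\mathcal{X}_a$, and only those positions contribute to the loss.

```python
import math

def answer_only_nll(token_log_probs, answer_mask):
    """Sum of -log P(y_j | I, y_hat_{1:j-1}) over answer positions only.

    token_log_probs: log-probability the model assigned to each target token y_j.
    answer_mask:     1 where y_j belongs to an answer X_a, 0 where it belongs
                     to a question X_q (hypothetical encoding for this sketch).
    """
    if len(token_log_probs) != len(answer_mask):
        raise ValueError("log-probs and mask must have the same length")
    return -sum(lp for lp, m in zip(token_log_probs, answer_mask) if m)

# Toy two-round dialogue: positions 0, 1, and 4 are question tokens and are masked out.
probs = [0.9, 0.8, 0.5, 0.25, 0.7, 0.5]
mask = [0, 0, 1, 1, 0, 1]
log_probs = [math.log(p) for p in probs]
loss = answer_only_nll(log_probs, mask)  # -(log 0.5 + log 0.25 + log 0.5) = log 16
```

In a deep-learning framework the same effect is usually obtained by assigning an ignore label to question positions before computing the cross-entropy, so that their gradients vanish.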

Stage I: Unified Codebook Learning. Our goal in this stage is to align the visual tokenizer with the LLM so that they share one codebook. We train the visual tokenizer through the image reconstruction task. There is no limitation on the type of training images, and in practice, more diverse and larger-scale data can improve the visual tokenizer. To strike a balance between performance and efficiency, we only consider a limited scale in this work. The LLM, in turn, requires textual instruction-answer data to enhance its ability to follow instructions [6]. We alternate the training process between these two modules, updating the codebook parameters using our language-driven iterative paradigm. After Stage I, UniCode obtains a unified codebook that can also represent non-linguistic signals, enabling multimodal input/output (I/O).

Stage II: Multimodal Instruction Tuning. In the second stage, we keep the visual encoder and decoder frozen while exclusively fine-tuning the LLM. This stage fully utilizes the comprehensive multimodal instructional dataset, which augments the model’s effectiveness in interpreting and responding to intricate multimodal instructions. The focus of this stage is on the model’s ability to produce multimodal outputs, thereby significantly enriching its multimodal comprehension and response capabilities.

5 Experiments

To thoroughly evaluate the multimodal capabilities of UniCode, we first conduct a series of ablation studies in Section 5.2. We then carry out comparison experiments across several key benchmarks: image generation (Section 5.3), image reconstruction (Section 5.4), and multimodal understanding (Section 5.5). Due to space limitations, more details, visualizations, and experimental results are provided in our appendix.

5.1 Implementation Details

During Stage I, we train the visual tokenizer on the LCS-558K dataset introduced by LLaVA [33]. We directly employ a pretrained LLM [6], opting not to conduct further instruction tuning on text-only data. It is worth mentioning that this stage is designed to be flexible: it can be extended so that the LLM is pretrained on an extensive text corpus with full parameter tuning. In Stage II, we focus on fine-tuning the LLM using a curated combined dataset, which includes Mixed-665K [33], the text-to-image dataset CC3M [46], and our specially tailored data for the in-context image decompression task.

Table 1: Comparisons of different paradigms for MLLMs on VQA and image generation benchmarks. Here, "tok" is used as an abbreviation for "tokenizer." The first five result columns are VQA benchmarks; the last three report image generation FID $\downarrow$.

| paradigm | VQA$^{2}$ | VizWiz | SQA | VQA$^{\rm T}$ | POPE | ImgNet | LSUN-cat | LSUN-church |
|---|---|---|---|---|---|---|---|---|
| vis enc+text tok | 52.3 | 45.4 | 62.2 | 42.1 | 69.7 | - | - | - |
| vis tok+text tok | 49.0 | 44.5 | 56.7 | 37.8 | 65.4 | 9.82 | 10.28 | 10.78 |
| unified tok | 53.1 | 46.2 | 62.9 | 42.5 | 71.8 | 6.72 | 8.07 | 6.96 |

5.2 Ablation Study

Comparison of different paradigms for MLLMs. In Table 1, we compare three MLLM paradigms as depicted in Figure 1. Notably, the use of a unified codebook (Row 3) yields consistent improvements on both VQA and image generation benchmarks compared to the separate use of visual and text tokenizers (Row 2). This can be attributed to the aligned distribution of the shared tokens and the pretrained LLM, which also results in greater resource efficiency during both training and inference. Additionally, the unified codebook exhibits a slight improvement on VQA tasks compared with using “visual encoder+text tokenizer” (Row 1). Note that Row 1 is not directly applicable to image generation, whereas our unified codebook enables LLMs to produce multimodal outputs.

Table 2: Comparisons of different visual encoder setups on VQA benchmarks, where “cc3m imgs” and “GT imgs” refer to using additional images in CC3M [46] and evaluation groundtruth to train the visual encoder, and “w/ ViT$^{*}$” denotes using the pretrained and larger ViT encoder [16] instead of training it from scratch.

| Setup | VQA$^{2}$ | VizWiz | SQA | VQA$^{\rm T}$ | POPE |
|---|---|---|---|---|---|
| UniCode | 53.1 | 46.2 | 62.9 | 42.5 | 71.8 |
| + cc3m imgs | 53.6 | 47.4 | 64.3 | 45.6 | 74.3 |
| ++ GT imgs | 53.7 | 47.4 | 64.8 | 44.9 | 75.1 |
| +++ w/ ViT$^{*}$ | 56.2 | 47.1 | 65.4 | 47.3 | 77.6 |

Comparison of different visual setups. UniCode employs a relatively lightweight visual encoder and is trained on a modest dataset of 558K images [32]. While efficient, this setup restricts its ability to extract comprehensive visual features and hinders its generalization to novel contexts. This limitation becomes more evident when compared to models like CLIP [41], which benefits from training on a vast collection of 400 million image-text pairs. In Table 2, we demonstrate that the limitation can be mitigated. By enriching the dataset with additional images from CC3M (Row 2) and evaluation groundtruth (Row 3), UniCode shows improvements on most VQA benchmarks. Further improvement can be obtained by replacing the visual encoder trained from scratch with a pretrained and more advanced one (Row 4).

Table 3: Comparison of different visual tokenizers. Better tokenizer brings better performance.

| Tokenizer | VQA$^{2}$ | VizWiz | SQA | VQA$^{\rm T}$ | POPE |
|---|---|---|---|---|---|
| VQ-GAN [15] | 49.1 | 42.6 | 60.8 | 41.2 | 65.1 |
| RQ-VAE [27] | 49.8 | 44.0 | 61.5 | 41.6 | 67.5 |
| HQ-VAE [59] | 53.1 | 46.2 | 62.9 | 42.5 | 71.8 |

In addition to the visual encoder, we also verify the influence of different visual tokenizers, as shown in Table 3. It is worth noting again that UniCode is designed to be compatible with a wide range of visual quantization approaches. Furthermore, we observe that as we upgrade the visual tokenizer (from Row 1 to Row 3), the performance of UniCode also improves. These observations indicate that UniCode's overall capabilities can be further enhanced by strengthening the visual setup.

Table 4: Comparisons of different paradigms to learn a unified codebook. The first five result columns are VQA benchmarks; the last three report image generation FID $\downarrow$.

| paradigm | VQA$^{2}$ | VizWiz | SQA | VQA$^{\rm T}$ | POPE | ImgNet | LSUN-cat | LSUN-church |
|---|---|---|---|---|---|---|---|---|
| frozen | 44.2 | 35.1 | 56.8 | 36.3 | 63.9 | 34.45 | 33.84 | 34.26 |
| dual | 9.3 | 5.2 | 11.2 | 8.5 | 13.2 | 8.87 | 9.76 | 9.54 |
| iter | 53.1 | 46.2 | 62.9 | 42.5 | 71.8 | 6.72 | 8.07 | 6.96 |

Comparison of different paradigms to learn the unified codebook. We include the relevant results in Table 4. The dual alternative training (dual) results in a performance collapse, particularly in VQA benchmarks. This issue arises from a disruption in the consistency between the LLM architecture and codebook, as discussed in Section 3.2. In addition, our paradigm (iter) brings representative visual tokens, therefore leading to notable improvements in visual generation compared with the frozen LLM codebook (frozen).

Table 5: Ablation of in-context image decompression task (“ImgDe”) on image generation tasks. We use FID as the metric.

| Method | ImageNet | CC3M | LSUN-Cat |
|---|---|---|---|
| w/o ImgDe | 7.08 | 11.91 | 8.53 |
| w/ ImgDe | 6.72 | 11.54 | 8.07 |

Effect of in-context image decompression task. The results in Table 5 demonstrate that our pretraining task clearly enhances the visual generation quality of our model across various configurations: class-conditioned (ImageNet), text-conditioned (CC3M), and unconditioned (LSUN-Cat). We posit that this enhancement in performance can be attributed to the pretraining task’s ability to prevent premature convergence. It achieves this by escalating the complexity of the training process and enriching the diversity of the training samples.
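For readers less familiar with the metric used in Tables 1, 4, and 5, the paper does not restate FID; the commonly used definition is reproduced here as a reminder. The symbols $\mu_r, \Sigma_r$ (mean and covariance of Inception features of real images) and $\mu_g, \Sigma_g$ (the same statistics for generated images) are notation introduced for this note, not taken from the paper.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the feature statistics of generated images are closer to those of real images, which is why the tables mark FID with $\downarrow$.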

Table 6: Ablation of different code map resolutions on VQA benchmarks.

| Benchmark | 192 | 256 | 320 | 384 | Raw | 320$^{*}$ |
|---|---|---|---|---|---|---|
| VQA$^{\rm 2}$ | 38.2 | 53.1 | 41.3 | 42.6 | 36.4 | 54.5 |
| VizWiz | 42.8 | 46.2 | 43.9 | 41.3 | 39.1 | 47.1 |
| SQA$^{\rm I}$ | 61.1 | 62.9 | 63.7 | 62.0 | 59.2 | 63.8 |

Effect of different code map resolutions. In Table 6, Columns 1-5, UniCode is pretrained with images of resolution 256$\times$256 and reaches its best performance on VQA$^{2}$ and VizWiz when tested at this same resolution. Notably, there is a marked decrease in performance on these benchmarks when the test resolution is increased beyond this point, even though these larger resolutions do not exceed the LLM’s token length capacity. We deduce that this drop in performance stems from a misalignment between training and testing conditions. Specifically, testing with resolutions significantly larger than those used in training creates a disparity in how each element of the code map represents image areas. In Column 6, we verify this hypothesis by pretraining UniCode using 320$\times$320 images (320$^{*}$), and the results improve as expected.

Table 7: Comparison of FIDs ($\downarrow$) for unconditioned image generation on LSUN-{Cat, Bedroom, Church} [60].

| Method | Cat | Bedroom | Church |
|---|---|---|---|
| ImageBART | 15.09 | 4.90 | 7.89 |
| StyleGAN2 [25] | 7.25 | 2.35 | 3.86 |
| VQ-GAN | 17.31 | 6.35 | 7.81 |
| RQ-Transformer | 8.64 | 3.04 | 7.45 |
| HQ-TVAE$^{*}$ | 8.35 | 2.89 | 7.12 |
| HQ-UniCode | 8.07 | 2.65 | 6.96 |

Table 8: Comparison of FIDs and CLIP score for text-conditioned image generation on CC3M validation set.

| Method | Params | FID $\downarrow$ | CLIP-s $\uparrow$ |
|---|---|---|---|
| ImageBART | 2.8B | 22.61 | 0.23 |
| LDM [44] | 645M | 17.01 | 0.24 |
| VQ-GAN | 1.5B | 28.86 | 0.20 |
| RQ-Transformer | 654M | 12.33 | 0.26 |
| HQ-TVAE | 579M | 12.86 | 0.26 |
| HQ-TVAE$^{*}$ | 7B | 12.13 | 0.28 |
| HQ-UniCode | 7B | 11.54 | 0.30 |

5.3 Comparison on Image Generation

Table 9: Comparisons of FIDs and ISs for class-conditioned image generation on ImageNet [10].

| Method | Params | FID $\downarrow$ | IS $\uparrow$ |
|---|---|---|---|
| ADM [13] | 554M | 10.94 | 101.0 |
| ImageBART [14] | 3.5B | 21.19 | 61.6 |
| VQ-Diffusion [19] | 370M | 11.89 | |
| VQ-VAE-2 [43] | 13.5B | $\approx$ 31 | $\approx$ 45 |
| VQ-GAN [15] | 1.4B | 15.78 | 74.3 |
| RQ-Transformer [27] | 3.8B | 7.55 | 134.0 |
| HQ-TVAE [59] | 1.4B | 7.15 | - |
| HQ-TVAE$^{*}$ | 7B | 7.04 | 171.4 |
| HQ-UniCode | 7B | 6.72 | 208.9 |

We first assess the capability of our model in unconditioned image generation in Table 7, utilizing three subsets of the LSUN dataset. Initially, we combine ImageNet with the LCS-558K dataset to pretrain our visual tokenizer, then finetune the model for one additional epoch on the downstream dataset. Given the extensive size of the dataset, we opt for LoRA [23] to finetune the LLM to avoid overfitting. Due to the lack of training details for HQ-TVAE, we have implemented its 7B version (HQ-TVAE$^{*}$) and compare it with our model (HQ-UniCode) for a fair comparison based on the same parameter setup. Our model performs clearly better than HQ-TVAE$^{*}$. Furthermore, we carry out experiments on text-conditioned image generation in Table 8 and class-conditioned generation in Table 9; UniCode obtains similar improvements on these two benchmarks. These results demonstrate the benefit of employing a unified codebook to enhance visual generation, especially considering that our reconstruction quality is suboptimal compared with the original HQ, as discussed in Section 5.4. We attribute this benefit to the alignment between the unified codebook and the LLM’s textual space. Lastly, we present some qualitative examples in Figure 4. More visualization cases can be seen in our appendix.
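As background for the LoRA [23] finetuning mentioned above, the following is a minimal sketch of the low-rank update, not the authors' training code. A frozen weight $W$ is augmented with $(\alpha / r)\, B A$, where only the small matrices $A$ and $B$ are trained. The matrix values, the rank, $\alpha$, and the 4096-dimensional layer size are illustrative assumptions; the paper does not report its LoRA hyperparameters.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    r = len(a)  # A has shape (r, d_in); B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Parameter counts: full fine-tuning of W vs. a rank-r adapter."""
    return d_out * d_in, r * (d_out + d_in)

w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen 2x3 weight (toy values)
a = [[0.5, -1.0, 2.0]]                   # rank r = 1, shape (1, 3)
b_zero = [[0.0], [0.0]]                  # B starts at zero, so the adapter is a no-op
b_trained = [[1.0], [-2.0]]              # hypothetical values after some updates

w_init = lora_effective_weight(w, b_zero, a, alpha=2.0)
w_tuned = lora_effective_weight(w, b_trained, a, alpha=2.0)
full, lora = trainable_params(4096, 4096, 8)  # illustrative layer size and rank
```

Because far fewer parameters are updated than in full fine-tuning (here `lora` versus `full`), the adapter limits how far the LLM can drift from its pretrained weights, which is consistent with the overfitting concern stated in the text.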

5.4 Comparison on Image Reconstruction

Table 10: Comparison of reconstruction quality on ImageNet and LCS-558K datasets, according to their codebook size ($K$), resolution (Res), number of used layers and tokens.

| Tokenizer | Res | Layers: Tokens | $K$ | rFID $\downarrow$ (ImageNet) | rFID $\downarrow$ (LCS-558K) |
|---|---|---|---|---|---|
| VQ-GAN | 64 | 1:64 | 16384 | 17.95 | 23.83 |
| VQ-GAN | 256 | 1:256 | 16384 | 4.9 | 11.26 |
| SPAE [62] | 256 | 5:341 | 16384 | 9.49 | - |
| SPAE | 256 | 6:597 | 16384 | 4.41 | - |
| RQ-VAE | 64 | 4:256 | 32000 | 6.82 | 12.09 |
| HQ-VAE | 256 | 2:320 | 32000 | 2.61 | 8.35 |
| RQ-UniCode | 64 | 8:512 | 32000 | 3.78 | 9.33 |
| HQ-UniCode | 256 | 2:320 | 32000 | 2.83 | 7.91 |
Table 10 validates the reconstruction quality of the visual tokenizer of our model. Such validation is crucial to ensure that the tokenizer preserves essential semantics after visual quantization. In this table, we observe that multi-layer stacking, as a form of stacked quantization structure, substantially boosts the model’s ability to efficiently represent images. However, this benefit comes with a trade-off: an increased number of layers significantly lengthens the sequence, posing a greater challenge for decoding by LLM. In comparison with HQ or RQ, the reconstruction quality of UniCode, when using the unified codebook, is nearly on par, indicating that our learning paradigm for the unified codebook does not significantly damage VAE training. In Figure 5, we present qualitative examples of image reconstruction on LCS-558K [32]. When compared to ImageNet, there is a significant decline in reconstruction quality on LCS-558K, probably attributed to LCS-558K’s more diverse scenes.
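The trade-off between stacking depth and sequence length can be made concrete with a toy residual quantizer in the spirit of RQ-VAE [27]. This is a sketch under stated assumptions, not the tokenizer used in the paper: the 2-D codebook, the input vector, and the helper names are hypothetical, and the hierarchical HQ variant is not modeled. Each additional layer quantizes the residual left by the previous layers, so the approximation error shrinks while the number of tokens per code-map position grows.

```python
# Toy shared codebook (hypothetical values); the zero vector guarantees that
# adding a layer never increases the approximation error.
CODEBOOK = [
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0],
    [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25],
]

def nearest(codebook, v):
    """Index of the codebook entry closest to v in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - x) ** 2 for c, x in zip(codebook[k], v)))

def residual_quantize(v, codebook, depth):
    """Quantize v with `depth` stacked codes; each code quantizes the remaining residual."""
    residual = list(v)
    approx = [0.0] * len(v)
    codes = []
    for _ in range(depth):
        k = nearest(codebook, residual)
        codes.append(k)
        approx = [a + c for a, c in zip(approx, codebook[k])]
        residual = [x - c for x, c in zip(residual, codebook[k])]
    return codes, approx

def sq_error(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sequence_length(positions, depth):
    """Tokens the LLM must decode: one code per layer at every code-map position."""
    return positions * depth

v = [1.25, 0.75]
errors = [sq_error(v, residual_quantize(v, CODEBOOK, d)[1]) for d in (1, 2, 3, 4)]
```

If "Res 64" in Table 10 is read as 64 code-map positions (an assumption about the table's notation), `sequence_length(64, 4)` and `sequence_length(64, 8)` give 256 and 512 tokens, matching the "4:256" and "8:512" entries for RQ-VAE and RQ-UniCode.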

Figure 4: Qualitative examples of text-conditioned image generation on CC3M.
Table 11: Comparison with MLLMs on VQA benchmarks. UniCode outperforms another multimodal generation model Emu. It achieves competitive results against other methods while requiring less data and fewer parameters for its visual tokenizer. Here, “M2T” and “M2M” refer to the model’s capability to generate either text only or multiple modalities. “Vis-P”, “PT” and “IT” represent the number of parameters in the visual encoder, the number of samples for multimodal alignment, and instruction tuning, respectively. Results on more benchmarks are provided in the appendix.

| Method | Type | LLM | Vis-P | PT | IT | VQA$^{\rm v2}$ | VizWiz | SQA$^{\rm I}$ | VQA$^{\rm T}$ | POPE | MMB | MMB$^{\rm CN}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLIP-2 [29] | M2T | Vicuna-13B | 303M | 129M | 0 | 41.0 | 19.6 | 61 | 42.5 | 85.3 | - | - |
| InstructBLIP [9] | M2T | Vicuna-7B | 303M | 129M | 1.2M | - | 34.5 | 60.5 | 50.1 | - | 36 | 23.7 |
| Qwen-VL [2] | M2T | Qwen-7B | 1.8B | 1.4B | 50M | 78.8 | 35.2 | 67.1 | 63.8 | - | 38.2 | |
| Emu [48] | M2M | LLaMA-13B | 1B | 82M | 240K | 52.0 | 34.2 | - | - | - | - | - |
| Emu-I [48] | M2M | LLaMA-13B | 1B | 82M | 240K | 40.0 | 35.4 | - | - | - | - | - |
| LLaVA-1.5 | M2T | Vicuna-7B | 303M | 558K | 665K | 79.1 | 47.8 | 68.4 | 58.2 | 86.4 | 64.3 | 58.3 |
| UniCode | M2M | Vicuna-7B | 104M | 0 | 665K | 53.1 | 46.2 | 62.9 | 42.5 | 71.8 | 33.7 | 25.5 |
| UniCode+ | M2M | Vicuna-7B | 1B | 0 | 665K | 56.2 | 47.1 | 65.4 | 47.3 | 77.6 | 37.2 | 29.1 |

5.5 Comparison on Multimodal Understanding

Figure 5: Qualitative examples of image reconstruction generated by our proposed UniCode. Their raw images can be seen in the appendix.

We first carry out experiments on a diverse set of seven benchmarks in Table 11, including VQA-v2 (VQA$^{\rm v2}$) [18], VizWiz [21], ScienceQA-IMG (SQA$^{\rm I}$) [37], TextVQA (VQA$^{\rm T}$) [47], POPE [30], MMB [34] and MMB$^{\rm CN}$. Experimental results on more benchmarks are provided in our appendix. It is encouraging that our model performs reasonably well even with the smallest scale of training data and fewer parameters. UniCode outperforms several recently proposed MLLMs on a number of benchmarks. More importantly, it obtains consistent improvements on both the VQA$^{\rm v2}$ and VizWiz benchmarks when compared to another multimodal generation model, Emu [48]. Through these experiments, we validate the feasibility of a unified codebook as an alternative paradigm for multimodal generative models. When compared to the current state-of-the-art model LLaVA-1.5, UniCode shows significant performance variations across different benchmarks. For example, UniCode’s performance is competitive with LLaVA-1.5 in the VQA$^{\rm T}$ and SQA$^{\rm I}$ benchmarks. However, it lags significantly (nearly 20%) behind in the POPE [30] benchmark. We speculate that this is likely due to the insufficient training data provided for the visual tokenizer, which limits the tokenizer's generalization.

UniCode initially employs a lightweight visual tokenizer, which, due to limitations in resolution, training data, and the scale of parameters, results in suboptimal performance. To address these shortcomings, we have developed an enhanced variant, referred to as “UniCode+”. UniCode+ incorporates a larger training dataset and integrates a pretrained and larger ViT encoder, as detailed in Table 2. As demonstrated in Table 11, UniCode+ outperforms the original UniCode across all VQA benchmarks. This improvement underscores the potential for elevating model performance through the adoption of a more sophisticated visual encoder.

6 Conclusion

We introduce UniCode, a pioneering effort in the multimodal large language model (MLLM) field to create a unified codebook for both visual and textual tokenization. UniCode innovates with a language-driven iterative training paradigm and an in-context image decompression task, enabling the unified codebook to facilitate multimodal instruction tuning for non-linguistic generation tasks. Our comprehensive experiments in multimodal understanding and generation, coupled with an extensive ablation study, position UniCode as a promising new approach for advancing research within the MLLM community.

References

  • [1] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems 35, 23716–23736 (2022)
  • [2] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966 (2023)
  • [3] Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. arXiv preprint arXiv:2312.00785 (2023)
  • [4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
  • [5] Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: Maskgit: Masked generative image transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11315–11325 (2022)
  • [6] Chiang, W.L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J.E., et al.: Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023) (2023)
  • [7] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)
  • [8] Cui, Y., Yang, Z., Yao, X.: Efficient and effective text encoding for chinese llama and alpaca. arXiv preprint arXiv:2304.08177 (2023)
  • [9] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500 (2023)
  • [10] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)
  • [11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [12] Dhariwal, P., Jun, H., Payne, C., Kim, J.W., Radford, A., Sutskever, I.: Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341 (2020)
  • [13] Dhariwal, P., Nichol, A.: Diffusion models beat gans on image synthesis. Advances in neural information processing systems 34, 8780–8794 (2021)
  • [14] Esser, P., Rombach, R., Blattmann, A., Ommer, B.: Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. Advances in neural information processing systems 34, 3518–3532 (2021)
  • [15] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021)
  • [16] Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., Cao, Y.: Eva: Exploring the limits of masked visual representation learning at scale. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19358–19369 (2023)
  • [17] Feng, Y., Wang, Y., Liu, J., Zheng, S., Lu, Z.: Llama rider: Spurring large language models to explore the open world. arXiv preprint arXiv:2310.08922 (2023)
  • [18] Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017)
  • [19] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., Guo, B.: Vector quantized diffusion model for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10696–10706 (2022)
  • [20] Gupta, A., Tian, S., Zhang, Y., Wu, J., Martín-Martín, R., Fei-Fei, L.: Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894 (2022)
  • [21] Gurari, D., Li, Q., Stangl, A.J., Guo, A., Lin, C., Grauman, K., Luo, J., Bigham, J.P.: Vizwiz grand challenge: Answering visual questions from blind people. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3608–3617 (2018)
  • [22] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [23] Hu, E., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [24] Juang, B.H., Gray, A.: Multiple stage vector quantization for speech coding. In: ICASSP’82. IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 7, pp. 597–600. IEEE (1982)
  • [25] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8110–8119 (2020)
  • [26] Koh, J.Y., Salakhutdinov, R., Fried, D.: Grounding language models to images for multimodal generation. arXiv preprint arXiv:2301.13823 (2023)
  • [27] Lee, D., Kim, C., Kim, S., Cho, M., Han, W.S.: Autoregressive image generation using residual quantization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11523–11532 (2022)
  • [28] Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Yang, J., Li, C., Liu, Z.: Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425 (2023)
  • [29] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [30] Li, Y., Du, Y., Zhou, K., Wang, J., Zhao, W.X., Wen, J.R.: Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355 (2023)
  • [31] Liu, H., Yan, W., Abbeel, P.: Language quantized autoencoders: Towards unsupervised text-image alignment. arXiv preprint arXiv:2302.00902 (2023)
  • [32] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744 (2023)
  • [33] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
  • [34] Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al.: Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281 (2023)
  • [35] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  • [36] Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action. arXiv preprint arXiv:2312.17172 (2023)
  • [37] Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems 35, 2507–2521 (2022)
  • [38] Martinez, J., Hoos, H.H., Little, J.J.: Stacked quantizers for compositional vector compression. arXiv preprint arXiv:1411.2173 (2014)
  • [39] OpenAI: Chatgpt. https://openai.com/blog/chatgpt (2022)
  • [40] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
  • [41] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [42] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21(1), 5485–5551 (2020)
  • [43] Razavi, A., Van den Oord, A., Vinyals, O.: Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32 (2019)
  • [44] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [45] Shannon, C.E., et al.: Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec 4(142-163),  1 (1959)
  • [46] Sharma, P., Ding, N., Goodman, S., Soricut, R.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2556–2565 (2018)
  • [47] Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., Rohrbach, M.: Towards vqa models that can read. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 8317–8326 (2019)
  • [48] Sun, Q., Yu, Q., Cui, Y., Zhang, F., Zhang, X., Wang, Y., Gao, H., Liu, J., Huang, T., Wang, X.: Generative pretraining in multimodality. arXiv preprint arXiv:2307.05222 (2023)
  • [49] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford alpaca: An instruction-following llama model (2023)
  • [50] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
  • [51] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
  • [52] Van Den Oord, A., Vinyals, O., et al.: Neural discrete representation learning. Advances in neural information processing systems 30 (2017)
  • [53] Wang, C., Xu, C., Wang, C., Tao, D.: Perceptual adversarial networks for image-to-image transformation. IEEE Transactions on Image Processing 27(8), 4066–4079 (2018)
  • [54] Wang, J., Yang, Z., Hu, X., Li, L., Lin, K., Gan, Z., Liu, Z., Liu, C., Wang, L.: Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100 (2022)
  • [55] Xu, Z., Shen, Y., Huang, L.: Multiinstruct: Improving multi-modal zero-shot learning via instruction tuning. arXiv preprint arXiv:2212.10773 (2022)
  • [56] Yan, W., Hafner, D., James, S., Abbeel, P.: Temporally consistent transformers for video generation. arXiv preprint arXiv:2210.02396 (2022)
  • [57] Yan, W., Zhang, Y., Abbeel, P., Srinivas, A.: Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157 (2021)
  • [58] Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al.: mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178 (2023)
  • [59] You, T., Kim, S., Kim, C., Lee, D., Han, B.: Locally hierarchical auto-regressive modeling for image generation. Advances in Neural Information Processing Systems 35, 16360–16372 (2022)
  • [60] Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
  • [61] Yu, J., Li, X., Koh, J.Y., Zhang, H., Pang, R., Qin, J., Ku, A., Xu, Y., Baldridge, J., Wu, Y.: Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627 (2021)
  • [62] Yu, L., Cheng, Y., Wang, Z., Kumar, V., Macherey, W., Huang, Y., Ross, D.A., Essa, I., Bisk, Y., Yang, M.H., et al.: Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. arXiv preprint arXiv:2306.17842 (2023)
  • [63] Yuan, L., Chen, D., Chen, Y.L., Codella, N., Dai, X., Gao, J., Hu, H., Huang, X., Li, B., Li, C., et al.: Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432 (2021)
  • [64] Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X.V., et al.: Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 (2022)
  • [65] Zhang, Y., Zhang, R., Gu, J., Zhou, Y., Lipka, N., Yang, D., Sun, T.: Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107 (2023)
  • [66] Zhao, B., Wu, B., Huang, T.: Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087 (2023)
  • [67] Zheng, S., Liu, J., Feng, Y., Lu, Z.: Steve-eye: Equipping llm-based embodied agents with visual perception in open worlds. arXiv preprint arXiv:2310.13255 (2023)