Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Curriculum Learning with Quality-Driven Data Selection

Biao Wu*
Australian Artificial Intelligence Institute
biaowu165534@gmail.com
&Fang Meng*
University of Liverpool
meng.fang@liverpool.ac.uk,
Ling Chen
Australian Artificial Intelligence Institute
Ling.Chen@uts.edu.au
Abstract

The remarkable multimodal capabilities demonstrated by OpenAI’s GPT-4 have sparked significant interest in the development of Multimodal Large Language Models (MLLMs). Visual instruction tuning of MLLMs using machine-generated instruction-following data has been shown to improve zero-shot capabilities on many tasks, but there has been less exploration of controlling the instruction data quality. Current methodologies for data selection in MLLMs often rely on single, unreliable scores or use downstream tasks for selection, which is time-consuming and can lead to potential overfitting on the chosen evaluation datasets. To mitigate these limitations, we propose a novel data selection methodology that utilizes image-text correlation and model perplexity to evaluate and select data of varying quality. This approach leverages the distinct distribution of these two attributes, mapping data quality into a two-dimensional space that allows for the selection of data based on their location within this distribution. By utilizing this space, we can analyze the impact of task type settings, used as prompts, on data quality. Additionally, this space can be used to construct multi-stage subsets of varying quality to facilitate curriculum learning. This multiple training strategy not only utilizes a minimal amount of data but also maintains data quality diversity, significantly enhancing the model’s fine-tuning performance. Our research includes comprehensive experiments conducted on various datasets. The results emphasize substantial enhancements in five commonly assessed capabilities compared to using the complete dataset. Our codes, data, and models are publicly available at: https://anonymous.4open.science/r/EHIT-31B4

1 Introduction

Instruction-following Multimodal Large Language Models (MLLMs) excel in multi-modality tasks [25, 39, 29]. Their effectiveness largely comes from using Large Language Models (LLMs) to generate synthetic data for visual instruction tuning. SELF-FILTER [8] emphasizes that visual instruction tuning is a straightforward alignment process in MLLMs training. It only needs a small amount of tuning data to activate the pre-trained capabilities and align them with the target interaction format. To improve this process, dataset selection tasks have been proposed to choose high-quality instruction-tuning data, enhancing the performance of these models [8].

Despite the central role that datasets play in training large language models, exploring data quality for instruction tuning in vision-and-language models remains challenging. Many existing data selection methods use simple rules based on the characteristics of images and texts separately, such as the length of captions, the use of nouns, the complexity of sentences, the aspect ratio of images, and the minimum size of images [3, 31, 5, 32]. These methods also consider the reliability of the data source [11]. More advanced techniques focus on the alignment between images and texts, using models like CLIP [17] to evaluate how closely the content of an image matches the accompanying text. This is done by measuring the similarity between image and text features [30, 31, 29] or by checking if the image’s main object is mentioned in the caption [32]. However, These approaches focus on high-quality data, with limited exploration of data quality diversity.

To improve the effectiveness of multimodal instruction data selection and the utilization of data diversity, we propose a new data selection method. This method constructs a representation space through two attributes of the data, which allows clear observation of the data in distributional differences for different task type settings. Meanwhile, this effectively categorizes data quality by distribution, allowing us to select different quality subsets for training. Specifically, our new method calculates each sample’s clip score and model loss, using them as two-dimensional coordinates. By dividing key areas, we can obtain subsets of data with varying quality.

We introduce a new training strategy, curriculum learning. Unlike most instruction tuning tasks, our curriculum learning method involves multiple training stages, each using progressively higher-quality data. We begin by training on data randomly sampled from a high-quality sample space. In subsequent stages, we progressively refine the distribution area of this high-quality data, sampling from increasingly focused spaces. This iterative training process mimics the human learning approach, enabling the model to use data of varying quality to maintain diversity. Through extensive experiments on LLaVA-v1.5, we demonstrate that our methods can surpass models trained on the full instruction data using only about 5% of the raw instruction tuning dataset samples. This improvement is consistent across multiple evaluation datasets and benchmarks.

We summarize the main contributions of this paper:

  1. 1.

    We propose a new method for selecting high-quality data by observing its quality distribution and creating a subset. We found a correlation between data quality and its distribution. Additionally, the task type for text generation impacts the quality of visual instruction tuning data.

  2. 2.

    We propose a curriculum learning strategy and demonstrate that the model’s performance can be further improved by using a multi-stage approach to adjust the quality of the training data, requiring only a small amount of data.

  3. 3.

    We evaluated our method on multiple tasks and achieved better performance using only 5% compared to having used the full data.

2 Related Work

Multimodal Instruction Tuning

Multimodal Instruction Tuning is pivotal in advancing the capabilities of models like LLaVA [26], MiniGPT-4 [41], and InstructBLIP [9], which thrive on intricately paired image-text data. This technique refines the models’ performance beyond what is achievable with conventional VQA datasets [12, 15], which often provide limited, short-answer data that can impair model performance. Recognizing this, the MiniGPT-4 [41] team curated a dataset of 3,500 image-text pairs, refined through interactions with ChatGPT, to enhance the models’ ability to generate nuanced responses. Similarly, LLaVA [26] set a benchmark by creating LLaVA-Instruct-150K, a dataset generated by prompting GPT-4 with rich annotations from the COCO dataset [24], including image captions and object details, to produce detailed questions and answers. Expanding the scope, LLaVAR [38] addressed the challenges of interpreting text-rich images by assembling over 422,000 pieces of instruction-following data through OCR technology, supplemented by an additional 16,000 high-quality entries processed by GPT-4. Furthermore, InstructBLIP [9] incorporated a diverse array of 26 public datasets, including LLaVA-Instruct-150K, to create a more comprehensive visual instruction tuning dataset. This effort, however, highlighted the prevalence of brief, perceptually focused content in existing datasets. Meanwhile, M3IT [20] transformed 40 distinct datasets into a unified vision-to-text framework, utilizing ChatGPT to rephrase and enrich the context of the responses, broadening the scope of training data suitable for deep learning models. This collective endeavor to enrich multimodal datasets [18, 22] illustrates a strategic pivot towards generating a larger, more varied corpus of visual instruction data. These datasets now cover an extensive range of tasks from basic visual recognition to complex reasoning and planning, setting a new standard for training sophisticated multimodal systems.

Data Selection

Data selection is a developing field in the instruction-tuning of large language models, focused on identifying high-quality data and removing harmful information that could lead to errors  [7, 4]. In this area,  [7] introduced Alpagasus, a method that automates data selection by assessing instruction quality via queries to ChatGPT, thereby improving training efficiency. [21] suggested using the IFD score as an indicator of data difficulty, while  [4] developed Instruction Mining, which evaluates sample quality through a linear combination of various indicators. Concurrently, [23] proposed assessing data by the one-shot learning performance on specific tasks. Finally,  [36] in their study on InstructionGPT-4, apply a combination of multimodal scores and a regression model trained on predefined tasks for data selection, although their application is confined to MiniGPT-4 [41], which includes just 3,400 instructions.

Curriculum Learning

Curriculum Learning has emerged as an effective strategy in machine learning, allowing models to start with simpler tasks and gradually progress to more complex ones. This method, inspired by the way humans learn, has been applied across various domains such as natural language processing and computer vision  [2, 35]. In this context,  [2] pioneered the concept by showing how a progressive learning schedule can improve performance in neural networks. More recently,  [35] proposed a dynamic curriculum learning approach that adjusts the difficulty of the data based on the model’s performance during training. Additionally,  [28] introduced an automatic curriculum learning framework that utilizes reinforcement learning to dynamically select training samples, optimizing the learning process. Lastly,  [34] explored self-paced learning, a variation where the model self-assesses and chooses the appropriate learning pace, thereby aligning with curriculum learning principles to improve overall training efficacy.

3 Methods

3.1 Data selection

We define our data selection task in the context of instruction fine-tuning. Given an instruction tuning dataset 𝒟={𝐱j}j=1N𝒟superscriptsubscriptsubscript𝐱𝑗𝑗1𝑁\mathcal{D}=\{\mathbf{x}_{j}\}_{j=1}^{N}caligraphic_D = { bold_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, where each 𝐱j=(𝐱ji,𝐱jt)subscript𝐱𝑗superscriptsubscript𝐱𝑗𝑖superscriptsubscript𝐱𝑗𝑡\mathbf{x}_{j}=(\mathbf{x}_{j}^{i},\mathbf{x}_{j}^{t})bold_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = ( bold_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT , bold_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) represents a pair of input image and text, our objective is to select a subset of size m𝑚mitalic_m from 𝒟𝒟\mathcal{D}caligraphic_D. The goal is to prune 𝒟𝒟\mathcal{D}caligraphic_D such that the resulting subset, 𝒟fm𝒟superscriptsubscript𝒟𝑓𝑚𝒟\mathcal{D}_{f}^{m}\subset\mathcal{D}caligraphic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT ⊂ caligraphic_D, enables the pre-trained vision-language model f𝑓fitalic_f to achieve optimal performance on downstream tasks {Ti}i=1tsuperscriptsubscriptsubscript𝑇𝑖𝑖1𝑡\{T_{i}\}_{i=1}^{t}{ italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT. Here, |𝒟fm|=msuperscriptsubscript𝒟𝑓𝑚𝑚|\mathcal{D}_{f}^{m}|=m| caligraphic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT | = italic_m.

3.2 Data Curriculum

We propose to select a subset of the dataset based on 1) clip score for image-text feature similarity and 2) model loss for data perplexity. We use these two data attributes to create a representation space for all data instances. By employing this method, we select the data using the region of the representation space that exhibits higher or lower values for both attributes. The vision-language model f𝑓fitalic_f is pre-trained. We denote its total loss as l𝑙litalic_l and the loss on visual instruction data xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as lisubscript𝑙𝑖l_{i}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Additionally, We denote its clip score as s𝑠sitalic_s and the correlation on visual instruction data xisubscript𝑥𝑖x_{i}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as sisubscript𝑠𝑖s_{i}italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. By maximizing the l𝑙litalic_l and s𝑠sitalic_s, we can obtain a relatively high-quality subset of data 𝒟fmsuperscriptsubscript𝒟𝑓𝑚\mathcal{D}_{f}^{m}caligraphic_D start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT.

Intermediate Data Similarity

To evaluate the similarity between an input image xjisuperscriptsubscript𝑥𝑗𝑖x_{j}^{i}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT and text xjtsuperscriptsubscript𝑥𝑗𝑡x_{j}^{t}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT, we use the CLIP model to extract features from both. Specifically, we apply the image encoder of the CLIP, defined as I()𝐼I(\cdot)italic_I ( ⋅ ), to obtain the feature vector from the image, and the text encoder of the CLIP, defined as T()𝑇T(\cdot)italic_T ( ⋅ ), to derive the feature vector from the text. We then compute the dot product of both features to generate a clip score, which we define as sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT.

sj=I(xji)T(xjt)subscript𝑠𝑗𝐼superscriptsubscript𝑥𝑗𝑖𝑇superscriptsubscript𝑥𝑗𝑡s_{j}=I(x_{j}^{i})\cdot T(x_{j}^{t})italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_I ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) ⋅ italic_T ( italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT )

We partition the data subset by identifying the upper bounds Smaxsubscript𝑆𝑚𝑎𝑥S_{max}italic_S start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT and lower bounds Sminsubscript𝑆𝑚𝑖𝑛S_{min}italic_S start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT of the sjsubscript𝑠𝑗s_{j}italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Using these bounds, we select the sample data djsubscript𝑑𝑗d_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to obtain the corresponding subset, which we refer to as the Data of Intermediate Similarity (DIS):

DIS={djSminsjSmax}𝐷𝐼𝑆conditional-setsubscript𝑑𝑗subscript𝑆𝑚𝑖𝑛subscript𝑠𝑗subscript𝑆𝑚𝑎𝑥DIS=\{d_{j}\mid S_{min}\leq s_{j}\leq S_{max}\}italic_D italic_I italic_S = { italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ italic_S start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT ≤ italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≤ italic_S start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT }

Clip score reflects how well the image features correspond to the text features, allowing us to identify and select high-quality data where the image and text are closely related.

Intermediate Data Loss

The loss produced by the model, which is also a measure of perplexity, reflects the difference between the target text and the model’s internal preferences. A higher loss makes the learning process more challenging for the model. Following a standard LLaVA architecture, the image encoder provides latent encoded features Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. Concurrently, the text decoder is tasked with maximizing the conditional likelihood of the paired text Yjsubscript𝑌𝑗Y_{j}italic_Y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT under the forward autoregressive factorization:

lj=t=1TlogPθ(Yj,tYj,<t,Xj)subscript𝑙𝑗superscriptsubscript𝑡1𝑇subscript𝑃𝜃conditionalsubscript𝑌𝑗𝑡subscript𝑌𝑗absent𝑡subscript𝑋𝑗l_{j}=-\sum_{t=1}^{T}\log P_{\theta}(Y_{j,t}\mid Y_{j,<t},X_{j})italic_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = - ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT roman_log italic_P start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( italic_Y start_POSTSUBSCRIPT italic_j , italic_t end_POSTSUBSCRIPT ∣ italic_Y start_POSTSUBSCRIPT italic_j , < italic_t end_POSTSUBSCRIPT , italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )

We partition the data subset by detecting the upper bounds Lmaxsubscript𝐿𝑚𝑎𝑥L_{max}italic_L start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT and lower bounds Lminsubscript𝐿𝑚𝑖𝑛L_{min}italic_L start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT of the loss. Using these bounds, we select the sample data djsubscript𝑑𝑗d_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT to obtain the corresponding subset, which we refer to as the Data of Intermediate Loss (DIL):

DIL={djLminljLmax}𝐷𝐼𝐿conditional-setsubscript𝑑𝑗subscript𝐿𝑚𝑖𝑛subscript𝑙𝑗subscript𝐿𝑚𝑎𝑥DIL=\{d_{j}\mid L_{min}\leq l_{j}\leq L_{max}\}italic_D italic_I italic_L = { italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ italic_L start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT ≤ italic_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≤ italic_L start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT }

Intermediate Data Quantity

When each piece of data has clip score and loss, we can construct a two-dimensional representation space based on these two attributes. Therefore, we select the sample data djsubscript𝑑𝑗d_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and set both related upper bounds and lower bounds to select the high-quality subset, which we refer to as the Data of Intermediate Quantity (DIQ) :

DIQ={djLminljLmax,SminsjSmax}𝐷𝐼𝑄conditional-setsubscript𝑑𝑗formulae-sequencesubscript𝐿𝑚𝑖𝑛subscript𝑙𝑗subscript𝐿𝑚𝑎𝑥subscript𝑆𝑚𝑖𝑛subscript𝑠𝑗subscript𝑆𝑚𝑎𝑥DIQ=\{d_{j}\mid L_{min}\leq l_{j}\leq L_{max}\ ,S_{min}\leq s_{j}\leq S_{max}\}italic_D italic_I italic_Q = { italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ italic_L start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT ≤ italic_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≤ italic_L start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT ≤ italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ≤ italic_S start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT }

We propose a data curriculum framework that starts training with simpler tasks and progressively advances to more complex ones. Based on our DIQ𝐷𝐼𝑄DIQitalic_D italic_I italic_Q, we divide the region into unified blocks and use ΔLΔ𝐿\Delta Lroman_Δ italic_L and ΔSΔ𝑆\Delta Sroman_Δ italic_S , corresponding to model loss and clip score respectively. By employing data selection methods, we can control the quality of a subset of data by gradually increasing clip score thresholds and loss thresholds. Consequently, we divide the learning process into several phases k𝑘kitalic_k, and we select the sample data in each phase with different quantities :

Ck={djLplj,Spsj}subscript𝐶𝑘conditional-setsubscript𝑑𝑗formulae-sequencesubscript𝐿𝑝subscript𝑙𝑗subscript𝑆𝑝subscript𝑠𝑗C_{k}=\{d_{j}\mid L_{p}\leq l_{j},S_{p}\leq s_{j}\}italic_C start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = { italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∣ italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ≤ italic_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_S start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ≤ italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT }

where

Lp=Lmin+kΔLandSp=Smin+kΔS.subscript𝐿𝑝subscript𝐿𝑚𝑖𝑛𝑘Δ𝐿andsubscript𝑆𝑝subscript𝑆𝑚𝑖𝑛𝑘Δ𝑆L_{p}=L_{min}+k\Delta L~{}\text{and}~{}S_{p}=S_{min}+k\Delta S.italic_L start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = italic_L start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT + italic_k roman_Δ italic_L and italic_S start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT = italic_S start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT + italic_k roman_Δ italic_S .

As k increases, the learning process can be divided into multiple phases: Initialization, Intermediate, and Advanced.

  • Initialization Phase (k=0): The model starts with a distribution of high-quality data, focusing on underlying patterns without being overwhelmed by complexity.

  • Intermediate Phase (k=1): Data quality is improved by increasing the thresholds for clip score and loss, narrowing the candidate region of high-quality data.

  • Advanced Phase (k=2): The model is exposed to the most challenging data, characterized by higher clip score and model loss, testing its ability to handle complex and less consistent relationships.

This phased approach ensures progressive learning, better generalization, reduced overfitting, and enhanced robustness. By systematically organizing and presenting data based on quality metrics, the data curriculum ensures the model develops a solid foundation before tackling more complex data, leading to improved performance on multimodal tasks.

4 Experiments

In this section, we first detail our settings and the chosen base models. Then we introduce the different train scenarios and evaluation benchmarks used in our experiments and the baseline methods. We show that our proposed method achieves better performance on multiple tasks using less data.

4.1 Experimental Setup

VL instruction data.

We use the core set, SVIT-core-157K, as our raw data, totaling 157,712 samples. SVIT [39] extends visual instruction tuning data to present a large-scale dataset containing 4.2 million command adjustment data. These data include dialog Q&A pairs, complex inference Q&A pairs, referring Q&A pairs, and detailed descriptions. More details can be found in the Appendix.

Base models.

We use the LLaVA-v1.5-7B [25] model architecture and its pre-training weights as our base models. The entire LLaVA training process is divided into two stages. For the first stage of pretraining, LLaVA-1.5-558k [26] selected from CC3M data are used, which have been converted into instruction-following data by GPT-4. For the second stage of visual instruction tuning, LLaVA-1.5-mix-665k [25] has been used.

Train setting

We consider LoRA finetuning for the new instruction data. We define the state where LLaVA-1.5-mix-665k [25] has been used for instruction tuning as scenario 1, and the state where this data has not yet been used for instruction tuning as scenario 2. And, to verify the effectiveness of the data selection strategy for LLaVA model training, we mainly consider these two scenarios.

Benchmarks

We assess our methods using a mix of academic-task-oriented benchmarks and new benchmarks tailored for instruction-following LMMs, covering a total of 5 benchmarks. For academic-focused benchmarks, VQA-v2 [12] and GQA [15] test the model’s visual perception abilities with open-ended questions. VizWiz [13] includes 8,000 images to evaluate the model’s zero-shot generalization on visual queries from visually impaired individuals. In line with InstructBLIP [10], we use the image subset of ScienceQA [27] with multiple-choice questions to gauge zero-shot performance in scientific question answering. TextVQA [33] involves text-rich visual question answering.

4.2 Scenario 1: Training from LLaVA

We use the LLaVA-v1.5-7B [25] architecture with model weights fully fine-tuned using LLaVA-1.5-mix-665k data. Subsequently, we fine-tune this model with LoRA [14] during the follow-up experiments. In training, we keep the visual encoder, projector, and LLM weights frozen, and maximize the likelihood of with trainable parameters of LoRA only. We keep the rest of the training protocol the same to allow for a fair comparison. Scenario 1, which only includes LoRA tuning, takes approximately 16 hours on an NVIDIA Tesla A100 GPU with 40GB of memory, using DeepSpeed ZeRO Stage 3. We use the SVIT-core-157K [39] dataset for continuous fine-tuning to establish a baseline. And the same method is applied to fine-tune our data.

Method LLM Res. PT IT VQAv2v2{}^{\text{v2}}start_FLOATSUPERSCRIPT v2 end_FLOATSUPERSCRIPT GQA VisWiz SQAII{}^{\text{I}}start_FLOATSUPERSCRIPT I end_FLOATSUPERSCRIPT VQATT{}^{\text{T}}start_FLOATSUPERSCRIPT T end_FLOATSUPERSCRIPT
BLIP-2[19] Vicuna-13B 224 129M - 41.0 41 19.6 61 42.5
InstructBLIP[9] Vicuna-7B 224 129M 1.2M 49.2 34.5 60.5 50.1
InstructBLIP[9] Vicuna-13B 224 129M 1.2M 49.5 33.4 63.1 50.7
Shikra[6] Vicuna-13B 224 600K 5.5M 77.4
IDEFICS-9B [16] LLaMA-7B 224 353M 1M 50.9 38.4 35.5 25.9
IDEFICS-80B[16] LLaMA-65B 224 353M 1M 60.0 45.2 36.0 30.9
Qwen-VL[1] Qwen-7B 448 1.4B 50M 78.8 59.3 35.2 67.1 63.8
Qwen-VL-Chat[1] Qwen-7B 448 1.4B 50M 78.2 57.5 38.9 68.2 61.5
LLAVA-V1.5[25] Vicuna-7B 336 558K 665K 78.5 62.0 50.0 66.8 58.2
+ SVIT-Core-157K[39] Vicuna-7B 336 558K +157K 75.9 57.1 49.1 69.0 56.3
+ Ours Vicuna-7B 336 558K +7K 77.9 61.8 51.1 69.5 57.3
Table 1: Comparison with SoTA methods on 5 benchmarks. We achieves better performance on all benchmarks than SVIT-Core-157K. Res, PT, and IT indicate input image resolution, and the number of samples in the pretraining and instruction tuning stage, respectively. Benchmark names are abbreviated due to space limits. VQA-v2 [12], GQA [15], VisWiz [13], ScienceQA-IMG [27], TextVQA [33]. More details can be found in the Evaluation Metrics section of the Appendix.

We report our main results in Table 1. Our method, using only 7000 samples of SVIT-core-157K, achieved higher performance across all benchmarks compared to the full data experiment setup. Furthermore, it surpassed the base model on SQA [27] and VisWiz [13], reaching state-of-the-art (SOTA) performance. In the efficient LoRA training setup, our data exceeded SVIT-core-157K[39] by 4.7 points in GQA [15], 2.0 points in VQAV2 [12], 1.0 point in TextVQA [33], 2.0 points in VisWiz [13], and 0.5 points in SQA [27]. The improvements verify the better training effects of our data since less data amount and same model are used.

Effectiveness of DIQ

In Table 2, we use the top-right corner in the left panel of Figure 7 (shown in the appendix) as the top 5% of the DIQ and conducted a comparison experiment, we found that using the 5% selected by DIQ resulted in better performance compared to using the top 5% of DIS and DIL separately.

Strategy Scenario 1
SQA TextVQA GQA
DIS 57.06 56.13 61.06
DIL 68.82 56.30 60.87
DIQ 69.56 56.84 61.16
Result with Data Curriculum
Ours 69.51 57.25 61.80
Table 2: Results across different methods.

We realized that this improvement is due to the subset from DIQ selecting data evenly from the entire region, whereas DIS and DIL focus on regions with high levels of clip score or loss. Based on these insights, we introduced curriculum learning, utilizing multi-stage training that progresses from low-quality to high-quality data. This approach, as demonstrated in the ablation experiment in Table 2, highlights the importance of increasing the diversity of data quality for improving model performance. By employing this method, we found that using curriculum learning with the DIQ method can further enhance model performance.

To further understand the effectiveness of curriculum learning, we observe that it starts with simple examples, which have lower noise and smaller loss. This provides a smoother loss landscape, reducing gradient oscillations and instability for a more stable initial training process. As the model progresses to higher-quality data, it benefits from established initial parameters and a clear learning direction, facilitating easier optimization. By gradually increasing data quality, curriculum learning helps the model adapt and optimize progressively, leading to improved performance as shown in our results.

4.3 Scenario 2: Training from Vicuna + projection.

To check the quality of our selected data and ensure consistency in our experiments, we use the LLaVA-v1.5-7B [25] model architecture and its pre-training weights, only a projector. We utilize this projector, the pre-trained CLIP visual encoder ViT-L/14, and Vicuna-7b to establish the weights of LLAVA that only the alignment task has been completed. This setup helps us observe how different selected datasets activate the model’s ability to engage in dialogue while avoiding interference from other instruction-tuning data on this task. The rest of the model training protocol is kept unchanged for fair comparison. We keep the training setting as same as scenario 1 and only update the LoRA weights of the LLM.

Refer to caption
Figure 1: Comparison of ablation results with data from different DIQ regions in scenario 2.

Effectiveness of DIQ

To verify the effectiveness of DIQ, we analyze the clip score and loss for all the data. In Figure 7, we divided the data into 9 regions and selected 7,000 samples from each region as corresponding data subsets using the DIQ method. The axes in Figure 1 are the loss value and the clip score value. It shows the position of the columns in each region shows the range of the corresponding dual attributes. The color of the columns also reflects the size of the corresponding value, the higher the performance the darker the color of the columns, and vice versa. Combining the performance of SQA, TextVQA, and GQA, we find that data with a higher clip score and loss show better performance on the downstream task, implying that the top-right DIQ subset contains higher quality data.

4.4 Exploring Different Data Selection

Effectiveness of DIS and DIL.

To verify the effectiveness of DIS and DIL separately, we first verified the data selection results of individual methods, in scenario 1 of the LLaVA training program. As shown in Figure 2, both DIS and DIL, using only the top 5% (around 7000 samples) of the selected data, significantly outperform the results using all the data. The model performance gradually decreases as the amount of data increases.

Refer to caption
Figure 2: Comparison of ablation experiment results in scenario 1 with different data select ratios.

As shown in Figure 3, similar to scenario 1, both DIS and DIL, using only the top 5% of the selected data, significantly outperform the results using all the data. This result is consistent with the hypothesis presented in LIMA’s [40] study, which demonstrates that alignment also be a straightforward process in MLLM training. In this process, the model learns the style or format of interacting with users, effectively utilizing the knowledge and capabilities it acquired during pretraining. Meanwhile, a high-quality subset of data is sufficiently informative to help the model adapt well to new user interaction styles in scenario 2, compared to the full data.

Refer to caption
Figure 3: Comparison of ablation experiment results in scenario 2 with different data select ratios.

Effectiveness of Mixed Methods.

Table 3 first compares the performance differences of the top 5% of DIS, DIL, and DIQ in scenarios 1 and 2. We notice that using the 5% selected by DIS and DIL separately outperformed the top 5% of DIQ in scenario 2. We realized that this improvement is due to the DIS and DIL subsets focusing on regions with a higher clip score or loss, where data with both high attributes predominate, resulting in an overall higher data quality compared to DIQ. Based on these insights, we explore the mixed method, We combined the top high-quality subsets obtained from different methods to create a larger, high-quality subset.

Therefore, we observed that for Scenario 1, the model performs best with only 5% of data based on DIQ. Comparing different data subsets from various regions, as well as combining data from different regions, did not improve the model’s performance. That indicates that scenario 1 mainly benefits from smaller, high-quality data. For Scenario 2, the model performs best with 15% of data based on the mix of different region data. In our comparison, we found that when increasing the data size from 5% to 10% with a single strategy, the performance of both DIS and DIL decreased due to a relative drop in data quality. However, when multiple top 5% data subsets were combined, the model’s performance improved, even at the same 10% scale. This demonstrates that in scenario 2, the model relies more on the quantity of high-quality data. Consequently, when we combined the top 5% subsets from all three regions, the model’s performance improved further, confirming this observation.

Effectiveness of Curriculum Learning.

In Table 3, first, we randomly sampled 2,400 examples from all the regions from DIQ for the first training phase, corresponding to C1subscript𝐶1C_{1}italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . In the second phase, we narrowed the range of high-quality data and randomly sampled 2,400 examples to further fine-tune the model trained in the first phase, corresponding to C2subscript𝐶2C_{2}italic_C start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. We repeated this process for the third phase and got C3subscript𝐶3C_{3}italic_C start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT. In total, 7000 samples of data were used, which is consistent with the data size of the DIQ approach. After introducing curriculum learning in scenario 1, the model’s performance improved further. However, even with curriculum learning, the model’s performance declined as the data size increased. This indicates that in scenario 1, in addition to enhancing data quality diversity, it is also crucial to maintain a small scale. For scenario 2, the model’s performance further improved when using 15% of the data. This proves that both curriculum learning and the quantity of high-quality data are important to scenario 2.

Strategy Data Size Scenario 1 Scenario 2
SQA TextVQA GQA AVG SQA TextVQA GQA AVG
Result with scaling the high-quality data with different subset
5% in DIS 7 k 57.06 56.13 61.06 58.08 61.92 11.14 4.96 25.34
5% in DIL 7 k 68.82 56.30 60.87 62.00 59.79 15.68 6.18 27.22
10% in DIS 14 k 69.31 56.30 59.93 61.18 61.18 8.88 1.03 23.03
10% in DIL 14 k 69.46 55.98 60.42 61.95 61.63 13.07 4.76 26.49
5% in DIQ 7 k 69.56 56.84 61.16 62.52 61.53 11.45 3.89 25.62
5% DIS + 5% DIL 14k 69.76 56.41 60.04 62.07 61.78 13.68 5.86 27.11
5% DIS + 5% DIQ 14k 69.96 56.32 60.72 62.33 61.87 13.19 7.16 27.41
5% DIL + 5% DIQ 14k 69.06 56.05 60.75 61.95 60.98 14.71 5.95 27.21
5% DIS + 5% DIL + 5% DIQ 21k 70.25 56.32 60.72 62.43 61.92 13.17 7.20 27.43
Result with Curriculum Learning with different size
C1 + C2 + C3 (Ours) 7 k 69.51 57.25 61.80 62.85 59.93 9.18 1.78 23.63
C1 + C2 + C3 14 k 69.56 56.99 61.73 62.76 61.53 12.93 3.70 26.05
C1 + C2 + C3 21 k 69.51 56.54 61.38 62.48 61.97 15.49 6.39 27.95
Table 3: Comparison of ablation results with different data selection strategies and curriculum sizes. The underlined data is the maximum value considering only scaled high-quality data, and the bolded data is the global maximum value. All the "x%" refers to selecting top x% examples from the corresponding data.

4.5 What Makes Selected Data Quality Different?

Visual instruction data generated via unimodal LLM exhibit different properties on the epistemic evidence space of clip score and loss by forming image-text pairs with their corresponding images. We try to understand what causes this problem with visual instruction data and how to control the distribution and quality of visual instruction data. The existing methods for generating data for visual instruction-tuning primarily use single-mode LLMs to adjust the text format in the data. This approach can lead to inconsistencies between the images and the corresponding text content, causing mismatches or failing to accurately capture the main elements of the images. Additionally, the design of prompts often influences the visual instruction data, altering the generation process to suit different tasks. This variation in text generation methods for different tasks exacerbates the issue of data quality divergence. To better compare the distribution of data for different tasks in space, we visualize the space.

Refer to caption

Figure 4: Data distribution comparison for Referring QA and Detail Description tasks.

Refer to caption

Figure 5: Data distribution comparison for Complex Reasoning and Detail Description tasks.

As shown in Figure 5, there are significant differences in the distributions between Detail Description data and Referring QA data. The Detail Description data are widely distributed in the upper right corner of the space, while the Referring QA data are widely distributed in the lower left corner of the space. This indicates that the task type used as a prompt can significantly influence the attributes of the data and lead to differences in quality.

Refer to caption
Figure 6: Heatmap visualization of statistics on token length.

Meanwhile, as shown in Figure 5, we found that the data quality distribution of Detail Description and Complex Reasoning tasks also differs significantly. In particular, data quality distribution for Complex Reasoning tasks, which are constructed through multi-turn dialogues, is spread over a wider area, highlighting the challenge of maintaining data quality in dialogue-based task types.

Additionally, various factors can influence the data’s attributes and quality besides the task type. We analyzed the token lengths of data in all regions and visualized the distribution using a heatmap. As shown in Figure 6, brighter areas indicate longer text lengths, primarily on the right side, suggesting a consistent correlation between token length and data loss. This indicates that visual instruction data created using LLMs from the same representation space have a stable logical hierarchy and rich information. Therefore, longer visual instruction data can effectively improve data quality by providing more detailed and coherent information.

5 Conclusion

In this paper, we introduce a curriculum learning method that imitates the human learning process. By gradually improving the quality of training data from easy to difficult stages, our method enhances performance while requiring less training data. In addition, we demonstrate the effectiveness of utilizing a dual-attribute representation space in controlling the quality of multimodal training data that divides data subsets based on clip score and model loss. We find that not only do data with higher dual-attribute values lead to better performance, but we also found a correlation between the task type used during visual instruction data creation and the distribution of positions in the dual-attribute space. At the same time, we found that different selection strategies for the subset of high-quality data are needed at different stages of training. When MLLMs have completed instruction fine-tuning tasks, incorporating curriculum learning can significantly improve fine-tuning performance.

Limitation

Regarding the limitations, the scope of our experiments was constrained by computational costs, limiting our focus to a single model. We conducted all related experiments on LLaVA-v1.5[25]. Nevertheless, this approach allowed us to achieve significant results within our computational constraints. To address these limitations, future research could explore more powerful and advanced models as computational resources allow.

References

  • [1] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • [2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  • [3] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
  • [4] Yihan Cao, Yanbin Kang, and Lichao Sun. Instruction mining: High-quality instruction data selection for large language models. CoRR, abs/2307.06290, 2023.
  • [5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
  • [6] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  • [7] Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. Alpagasus: Training A better alpaca with fewer data. CoRR, abs/2307.08701, 2023.
  • [8] Ruibo Chen, Yihan Wu, Lichang Chen, Guodong Liu, Qi He, Tianyi Xiong, Chenxi Liu, Junfeng Guo, and Heng Huang. Your vision-language model itself is a strong filter: Towards high-quality instruction tuning with data selection. arXiv preprint arXiv:2402.12501, 2024.
  • [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [10] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500, 2023.
  • [11] Karan Desai, Gaurav Kaul, Zubin Aysola, and Justin Johnson. Redcaps: Web-curated image-text data created by the people, for the people. arXiv preprint arXiv:2111.11431, 2021.
  • [12] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • [13] Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  • [14] Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • [15] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • [16] IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023.
  • [17] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, July 2021.
  • [18] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  • [19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 19730–19742. PMLR, 23–29 Jul 2023.
  • [20] Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, and Qi Liu. M3it: A large-scale dataset towards multi-modal multilingual instruction tuning. arXiv preprint arXiv:2306.04387, 2023.
  • [21] Ming Li, Yong Zhang, Zhitao Li, Jiuhai Chen, Lichang Chen, Ning Cheng, Jianzong Wang, Tianyi Zhou, and Jing Xiao. From quantity to quality: Boosting LLM performance with self-guided data selection for instruction tuning. CoRR, abs/2308.12032, 2023.
  • [22] Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253, 2023.
  • [23] Yunshui Li, Binyuan Hui, Xiaobo Xia, Jiaxi Yang, Min Yang, Lei Zhang, Shuzheng Si, Junhao Liu, Tongliang Liu, Fei Huang, and Yongbin Li. One shot learning as instruction data prospector for large language models. CoRR, abs/2312.10302, 2023.
  • [24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [27] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
  • [28] Hailan Ma, Daoyi Dong, Steven X Ding, and Chunlin Chen. Curriculum-based deep reinforcement learning for quantum control. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [29] Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, and Ludwig Schmidt. Improving multimodal datasets with image captioning. Advances in Neural Information Processing Systems, 36, 2024.
  • [30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 2021.
  • [31] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
  • [32] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  • [33] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  • [34] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum self-paced learning for cross-domain object detection. Computer Vision and Image Understanding, 204:103166, 2021.
  • [35] Petru Soviany, Radu Tudor Ionescu, Paolo Rota, and Nicu Sebe. Curriculum learning: A survey. International Journal of Computer Vision, 130(6):1526–1565, 2022.
  • [36] Lai Wei, Zihao Jiang, Weiran Huang, and Lichao Sun. Instructiongpt-4: A 200-instruction paradigm for fine-tuning minigpt-4. CoRR, abs/2308.12067, 2023.
  • [37] Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv preprint arXiv:2311.04257, 2023.
  • [38] Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023.
  • [39] Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
  • [40] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36, 2024.
  • [41] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.

Evaluation Metrics

To better validate the activation of the model’s conversational capabilities, the experiments were centered around the evaluation of open questions. The evaluation metrics mainly use accuracy for VQA tasks. We compare a model optimized through continuous instruction tuning with TA-selected-15k against leading MLLMs.: BLIP-2 [19], InstructBLIP [9], Shikra [6], IDEFICS [16], Qwen-VL(-Chat) [1], mPLUG-Owl2 [37] and LLaVA-v1.5 [25]. We evaluate these models on popular benchmarks: VQA-v2 [12], GQA [15], VisWiz [13], ScienceQA-IMG [27], TextVQA [33].

NeurIPS Paper Checklist

  1. 1.

    Claims

  2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

  3. Answer: [Yes]

  4. Justification: see Section 1.

  5. Guidelines:

    • The answer NA means that the abstract and introduction do not include the claims made in the paper.

    • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.

    • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.

    • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

  6. 2.

    Limitations

  7. Question: Does the paper discuss the limitations of the work performed by the authors?

  8. Answer: [Yes]

  9. Justification: see the conclusion.

  10. Guidelines:

    • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.

    • The authors are encouraged to create a separate "Limitations" section in their paper.

    • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.

    • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.

    • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.

    • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.

    • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.

    • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

  11. 3.

    Theory Assumptions and Proofs

  12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

  13. Answer: [Yes]

  14. Justification: see Section 3.

  15. Guidelines:

    • The answer NA means that the paper does not include theoretical results.

    • All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced.

    • All assumptions should be clearly stated or referenced in the statement of any theorems.

    • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.

    • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.

    • Theorems and Lemmas that the proof relies upon should be properly referenced.

  16. 4.

    Experimental Result Reproducibility

  17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

  18. Answer: [Yes]

  19. Justification: see Section 4.

  20. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.

    • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.

    • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.

    • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example

      1. (a)

        If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.

      2. (b)

        If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.

      3. (c)

        If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).

      4. (d)

        We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

  21. 5.

    Open access to data and code

  22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

  23. Answer: [Yes]

  24. Justification: see abstract.

  25. Guidelines:

    • The answer NA means that paper does not include experiments requiring code.

    • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).

    • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.

    • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.

    • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.

    • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).

    • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

  26. 6.

    Experimental Setting/Details

  27. Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

  28. Answer: [Yes]

  29. Justification: see Section 4.

  30. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.

    • The full details can be provided either with the code, in appendix, or as supplemental material.

  31. 7.

    Experiment Statistical Significance

  32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

  33. Answer: [No]

  34. Justification: error bars are not reported because it would be too computationally expensive

  35. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.

    • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).

    • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)

    • The assumptions made should be given (e.g., Normally distributed errors).

    • It should be clear whether the error bar is the standard deviation or the standard error of the mean.

    • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.

    • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).

    • If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

  36. 8.

    Experiments Compute Resources

  37. Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

  38. Answer: [Yes]

  39. Justification: see Section 4.

  40. Guidelines:

    • The answer NA means that the paper does not include experiments.

    • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.

    • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.

    • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).

  41. 9.

    Code Of Ethics

  42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

  43. Answer: [Yes] .

  44. Justification: see Section 4.

  45. Guidelines:

    • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.

    • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.

    • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

  46. 10.

    Broader Impacts

  47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

  48. Answer: [N/A]

  49. Justification: there is no societal impact of the work performed.

  50. Guidelines:

    • The answer NA means that there is no societal impact of the work performed.

    • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.

    • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.

    • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.

    • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.

    • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

  51. 11.

    Safeguards

  52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

  53. Answer: [Yes] .

  54. Justification: see Section 4.

  55. Guidelines:

    • The answer NA means that the paper poses no such risks.

    • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.

    • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.

    • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

  56. 12.

    Licenses for existing assets

  57. Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

  58. Answer: [Yes] .

  59. Justification: see Section 4.

  60. Guidelines:

    • The answer NA means that the paper does not use existing assets.

    • The authors should cite the original paper that produced the code package or dataset.

    • The authors should state which version of the asset is used and, if possible, include a URL.

    • The name of the license (e.g., CC-BY 4.0) should be included for each asset.

    • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.

    • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.

    • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.

    • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.

  61. 13.

    New Assets

  62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

  63. Answer: [N/A]

  64. Justification: the paper does not release new assets.

  65. Guidelines:

    • The answer NA means that the paper does not release new assets.

    • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.

    • The paper should discuss whether and how consent was obtained from people whose asset is used.

    • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

  66. 14.

    Crowdsourcing and Research with Human Subjects

  67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

  68. Answer: [N/A]

  69. Justification: the paper does not involve crowdsourcing nor research with human subjects.

  70. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.

    • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

  71. 15.

    Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

  72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

  73. Answer: [N/A]

  74. Justification: the paper does not involve crowdsourcing nor research with human subjects.

  75. Guidelines:

    • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.

    • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.

    • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.

    • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.