
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Yulin Luo
Peking University
yulin@stu.pku.edu.cn

Ruichuan An
Xi’an Jiaotong University
arctanx@stu.xjtu.edu.cn

Bocheng Zou
University of Wisconsin-Madison
bochengz@cs.wisc.edu

Yiming Tang
Peking University
yimingtangible@163.com

Jiaming Liu
Peking University
jiamingliu@stu.pku.edu.cn

Shanghang Zhang
Peking University
shanghang@pku.edu.cn

National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University (same for all authors from Peking University). Equal contribution. Corresponding author.

Abstract

The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution within datasets provides a comprehensive understanding of the datasets and serves as a powerful tool for various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all the mentioned tasks in a unified way, we introduce the novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs the world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows to address downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery. With the help of SSD-LLM, we can automatically organize datasets at the subpopulation level, achieve an average gain of +2.5 in worst-group accuracy over previous methods on the subpopulation shift benchmarks Waterbirds and Metashift, and identify more consistent slice topics with a higher model error rate of 36.44% on the slice discovery task for ImageNet.

1 Introduction

A subpopulation, defined as a set of data points that share common characteristics, is an important concept in machine learning Yang et al. (2023). Many tasks are subpopulation-related. For example, image clustering conditioned on text criteria Kwon et al. (2023a) partitions an image dataset into different subpopulations based on user-specified criteria; the study of subpopulation shift Yang et al. (2023); Liang & Zou (2022); Zhang et al. (2022) aims to mitigate the negative impact of imbalanced subpopulation distributions in the training set on the model; and slice discovery Eyuboglu et al. (2022); Chen et al. (2023) aims to identify subpopulations on which a model underperforms.

Figure 1: (A) The Workflow of Subpopulation Structure Discovery with Large Language Models (SSD-LLM). SSD-LLM can further support several downstream tasks including: (B) Dataset Subpopulation Organization; (C) Subpopulation Shift; (D) Slice discovery.

After summarizing the commonalities of these tasks, we find that analyzing the subpopulation distribution is the key to solving all these problems. If the subpopulation distribution can be characterized, image clustering results under different criteria are naturally obtained Kwon et al. (2023a), additional images can be supplemented to rare subgroups to balance the whole dataset Dunlap et al. (2023), and slices can be easily discovered by computing the error rate of each subpopulation on the validation set Chen et al. (2023). Despite its importance, existing work Yang et al. (2023) lacks a systematic exploration of subpopulation distribution. To address this issue, we propose, for the first time, the concept of subpopulation structure to represent, analyze, and utilize subpopulation distributions within datasets. By definition, a subpopulation structure is a set of hierarchical relations among several subpopulations determined by certain criteria.

Prior works such as Metashift Liang & Zou (2022) and NICO++ Zhang et al. (2022) have constructed image datasets that include subpopulation information, organizing the images with respect to extra attributes; this organization can be viewed as a "class-attribute-subpopulation" structure. The problem with such a structure is that it ignores the category of attributes (the Dimension), leading to attribute inconsistency and confusion. To solve this issue, we introduce a "class-dimension-attribute-subpopulation" structure. The comparison of the two structures can be seen in Fig. 2. By articulating the classification dimensions, this improved structure provides more nuanced attribute assignments.

Figure 2: Metashift has the same-level attributes Surfboard, Water, and Grass for class Dog, which is irrational due to the possible overlap. As an improvement, we take dimensions into consideration. The class Dog has dimensions including Action, Co-occurrence Object, Location, etc., and in dimension Location, it includes various attributes like Water, Grass, etc, which offers a more appropriate assignment for the samples.
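
To make the class-dimension-attribute-subpopulation structure concrete, the following is a minimal sketch that stores it as a nested mapping. The class Dog, the dimensions, and the attributes Water, Grass, and Surfboard come from Figure 2; placing Surfboard under Co-occurrence Object, the image identifiers, and the helper `attributes_of` are assumptions made purely for illustration.

```python
# class -> dimension -> attribute -> list of image ids (ids are hypothetical)
structure = {
    "Dog": {
        "Location": {"Water": ["img_001", "img_004"], "Grass": ["img_002"]},
        "Co-occurrence Object": {"Surfboard": ["img_001"]},
    }
}


def attributes_of(structure, cls, image_id):
    """Return {dimension: attribute} for one image of the given class."""
    found = {}
    for dim, attrs in structure[cls].items():
        for attr, ids in attrs.items():
            if image_id in ids:
                found[dim] = attr
    return found


# One image gets at most one attribute per dimension, so Water and Surfboard
# no longer compete as same-level attributes.
profile = attributes_of(structure, "Dog", "img_001")
```

Because each dimension is a separate key, an image on a surfboard in the water receives one attribute in each dimension instead of being forced into a single flat attribute.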

Automatically identifying subpopulation structures within image datasets presents a significant challenge. The approach must be capable of extracting key information from images and summarizing essential content from extensive texts. Furthermore, it necessitates comprehensive world knowledge, enabling a broad understanding of various aspects of the datasets, including diverse categories, common attributes, and the relationships between dimensions and attributes.

Recently, Large Language Models (LLMs) Liu et al. (2023b); Kojima et al. (2022); Wei et al. (2022) and Multimodal Large Language Models (MLLMs) Liu et al. (2023a); Gao et al. (2023b) have attracted wide attention due to their superior capacities. LLMs have shown extensive world knowledge and remarkable abilities in summarization, instruction following Liu et al. (2023b), etc. MLLMs extend the capabilities of LLMs to handle visual inputs. Through visual instruction tuning Liu et al. (2023a), an MLLM can verbalize the rich information in images. Motivated by these, we propose a novel framework, Subpopulation Structure Discovery with Large Language Model (SSD-LLM), to automatically uncover the structure. The core idea is to generate informative captions from images with an MLLM, followed by analyzing and summarizing the subpopulation structure of datasets with an LLM. Specifically, we design two elaborate prompt engineering components, Criteria Initialization and Criteria Self-Refinement. The former utilizes a sample-based generate-and-select paradigm to summarize dimensions and attributes sequentially, which is token-efficient when applied to large datasets. The latter employs self-consistency as an indicator to evaluate and refine the criteria. After obtaining complete criteria, each image is assigned to corresponding attributes according to its caption. The final subpopulation structures can be leveraged to finish various downstream tasks with the help of our proposed Task-specific Tuning. In this work, we focus on three application scenarios, i.e., dataset subpopulation organization, subpopulation shift, and slice discovery. We validate the effectiveness of SSD-LLM on these subpopulation-related tasks. For subpopulation shift, we achieve an improvement of +2.5 in worst group accuracy across two datasets compared to SOTA methods, and for slice discovery, we can identify more consistent slice topics with a higher model error rate of 6.61.

Our contributions are summarized as follows:

  • We introduce the concept of subpopulation structure to characterize subpopulation distribution in an interpretable manner for the first time.

  • We propose a class-dimension-attribute-subpopulation structure, solving the attribute confusion of the current class-attribute-subpopulation structure.

  • We propose the Subpopulation Structure Discovery with Large Language Model (SSD-LLM) framework to uncover the underlying subpopulation structure of datasets automatically, with two elaborate prompt engineering components, Criteria Initialization and Criteria Self-Refinement.

  • We provide methods for Task Specific Tuning, enabling the application of the structures across a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

2 Related Works

2.1 Hierarchical Structure of Image Datasets

Recent research has emphasized the need to organize datasets into hierarchical structures that allow for benchmarking various downstream tasks Liang & Zou (2022); Zhang et al. (2022); Wah et al. (2011); Russakovsky et al. (2015). Metashift Liang & Zou (2022) builds a collection of 12,868 sets of images related to 410 main subjects and their contexts. NICO++ Zhang et al. (2022), Waterbirds Wah et al. (2011), and ImageNetBG Russakovsky et al. (2015) also propose methods for constructing various types of hierarchical datasets. However, the construction of these hierarchical datasets often requires manual annotation, hindering automatic construction. Moreover, these approaches focus on a single dimension, such as object context in Metashift and background in Waterbirds and ImageNetBG, while practical scenarios involve multiple dimensions hidden within the comprehensive visual information.

2.2 Extract Information from Image Captions

Recent works such as ALIA Dunlap et al. (2023), VeCAF Zhang et al. (2024), Bias2Text Kim et al. (2023), and ICTC Kwon et al. (2023b) explore utilizing caption models to obtain information from datasets. ALIA provides a method to augment datasets by generating variations of existing images through captioning and text-to-image models. While ALIA Dunlap et al. (2023) supports dataset improvement, it lacks knowledge about attribute types, bias, or subpopulation shift existence. Bias2Text Kim et al. (2023) detects dataset bias by transforming images into descriptive captions and keywords. However, without large language model participation, Bias2Text fails to support classification dimension selection and can only differentiate images with basic keywords. More recently, VeCAF Zhang et al. (2024) extracts the information from image captions with LLM to perform parametric data selection for PVM finetuning. Moreover, ICTC Kwon et al. (2023b) enables conditional image clustering using an LLM in a straightforward manner. Although ICTC clusters images when given the criterion, it requires human-assigned text prompts. Compared to ICTC, our approach supports automatic subpopulation structure dataset organization without human criteria assignment and can generate comprehensive criteria tailored to datasets.

2.3 LLM Prompt Engineering

As the popularity of LLMs has surged, prompt engineering, the process of crafting and refining prompts to guide LLMs towards desired outputs Kojima et al. (2022); Wei et al. (2022), has become increasingly important. Various prompt engineering methods Liu et al. (2023c); Qiao et al. (2022) and principles have emerged, and researchers and engineers have explored their applications in a diverse range of downstream tasks Cui et al. (2023); Qian et al. (2023); Liu et al. (2023b); Park et al. (2023); Fu et al. (2023). In particular, in-context learning has emerged as a pivotal technique, validated both experimentally Dong et al. (2022); Li & Qiu (2023) and theoretically Luo et al. (2023); Dai et al. (2023); Xie et al. (2021). This approach involves providing the LLM with context information relevant to the task at hand, enabling it to generate more accurate and relevant responses. Least-to-most prompting Zhou et al. (2022) breaks down complex tasks into smaller, more manageable steps, enhancing the LLM’s reasoning skill by querying it with simplified sub-questions. Self-consistency Wang et al. (2022) proposes to ensemble multiple responses from the LLM given the same prompt to obtain enhanced results, suggesting that consistent responses serve as an indicator of correct problem solving. Self-refining Madaan et al. (2023) demonstrates that LLMs can refine their own outputs given carefully designed prompts. In this work, we leverage a combination of prompt engineering techniques, including in-context learning, chain-of-thought, self-consistency, and self-refining, to tackle subpopulation structure discovery effectively.

3 Method

In this section, we introduce our proposed method, subpopulation structure discovery with large language models (SSD-LLM). We describe the overall pipeline of SSD-LLM in Section 3.1, outlining how the paradigm automatically discovers the latent subpopulation structures inside the dataset. The process begins with captioning the images in the dataset with an MLLM, detailed in Section 3.2, and proceeds with criteria initialization with an LLM in Section 3.3. The paradigm then refines the initialized criteria through a recursive self-refinement procedure, detailed in Section 3.4. Finally, images are assigned to attributes, completing the subpopulation structure discovery process, as elaborated in Section 3.5. The section concludes with a discussion of how to apply the method to various downstream tasks, as presented in Section 3.6.

Figure 3: Subpopulation Structure Discovery with Large Language Model (SSD-LLM). (Step 1) A Multimodal Large Language Model (MLLM) extracts informative captions from images. (Step 2) The LLM initializes the criteria with a sample-based generate-and-select paradigm. (Step 3) The LLM self-refines the criteria using self-consistency as an indicator. (Step 4) The LLM assigns each caption to specific attributes according to the refined criteria, uncovering the intrinsic subpopulation structures hidden in the dataset. The resulting criteria and subpopulations are used in several downstream tasks.

3.1 Overview

To automatically discover the subpopulation structures, we propose a novel prompt engineering paradigm that effectively leverages the capabilities of both multimodal large language models (MLLMs) and large language models (LLMs). Our proposed method comprises four key steps (see Figure 3). First, we transform the images into information-rich captions that capture the main information in the images using MLLMs. Second, we employ a novel sample-based prompt engineering method to guide an LLM to produce criteria consisting of dimensions and corresponding attributes organizing the dataset. Third, we prompt the LLM to self-refine these generated criteria. Last, we assign all the images in the dataset to specific attributes accordingly, uncovering the intrinsic subpopulation structures in the dataset and paving the way for further analysis of the dataset. Detailed descriptions of each step are provided below. For notation, consistent with various former works Luo et al. (2023); Wei et al. (2022), we denote the operation of getting responses from the language models as $LLM()$ and $MLLM()$, and use $[,]$ to represent the concatenation of two pieces of text.

3.2 Caption Extraction

To begin our approach, we leverage the powerful image captioning capabilities of the MLLM to transform the images into informative and detailed captions. Instead of briefly describing the images, we prompt the MLLM to generate more detailed captions centered around the main subject CLS. This choice stems from the empirical observation that simple descriptions often fail to capture comprehensive visual information and may not support subsequent analysis aimed at uncovering the subpopulation structures. To encourage the MLLM to produce subject-focused captions, we also include subject information in our prompt.

The prompt we used in this step is as follows:

$P_1$ = "Describe the image of the subject CLS in detail."

Algorithm 1: Step 1 Caption Extraction

Input: Dataset: $D_{img}$, MLLM()
Output: Image Captions: $C$
1:  for i in range(NumOfIterations)
2:   $img$ = $D_{img}$.sample()
3:   $c$ = MLLM($img$, $P_1$)
4:   $C$.append($c$)
5:  end for

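
As an illustration of Step 1, here is a minimal Python sketch of the caption extraction loop. The `mllm` argument is a hypothetical callable standing in for whatever multimodal model is used; `fake_mllm`, the file names, and substituting the class name for the CLS placeholder via string replacement are assumptions, not part of the original pipeline description.

```python
import random

P1 = "Describe the image of the subject CLS in detail."


def extract_captions(images, cls_name, mllm, num_iterations, seed=0):
    """Sample images and caption each one with a subject-focused prompt."""
    rng = random.Random(seed)
    prompt = P1.replace("CLS", cls_name)  # assumed way of filling in the subject
    captions = []
    for _ in range(num_iterations):
        img = rng.choice(images)            # D_img.sample()
        captions.append(mllm(img, prompt))  # c = MLLM(img, P1)
    return captions


def fake_mllm(img, prompt):
    """Stand-in for a real MLLM call (hypothetical)."""
    return f"[caption for {img}] {prompt}"


captions = extract_captions(["a.jpg", "b.jpg"], "dog", fake_mllm, num_iterations=3)
```
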
Algorithm 2: Step 2-1 Dimension Generation

Input: Captions: $C$, LLM()
Output: Dimensions: $Dims$
1:  for i in range(NumOfIterations)
2:   $c$ = $C$.sample(NumOfSamples)
3:   $S$.append(LLM([$P_2^1$, $c$]))
4:  end for
5:  $Dims$ = MajorityVote($S$)

3.3 Criteria Initialization

To discover the hidden subpopulation structures within the dataset, we employ an LLM to delve into the information-rich captions generated in the previous step. Our objective is to identify criteria that effectively partition the images into several distinct subgroups. Beyond simply dividing the dataset into subgroups, we articulate the classification dimension for the partition and record all the resulting attributes generated from the classification process. Along with the class information and the resulting subpopulations, these criteria naturally form a four-layer structure: class, dimension, attribute, and subpopulation. Since the criteria encompass multiple dimensions and their corresponding attributes, we adopt a generate-and-select paradigm with the LLM to discover the dimensions and the attributes sequentially.

To determine the dimensions and attributes, we employ an iterative sampling approach, repeatedly prompting the LLM to propose dimensions and attributes based on batches of image captions. In each iteration, the LLM generates candidate dimensions and attributes, which are subsequently combined through an ensemble process. This sample-and-ensemble approach effectively addresses the challenges of processing large datasets. Among naive alternatives, inputting all the image captions into the LLM at once is practically infeasible for large datasets, while processing the captions individually fails to provide the LLM with sufficient context. Furthermore, the sample-based nature of our approach promotes token efficiency. Since the number of dimensions that can differentiate images in a dataset is relatively small, and these dimensions appear in numerous images, our approach effectively identifies relevant dimensions even for large-scale datasets. This efficiency is particularly crucial for applications involving large image collections or limited computational resources.

The prompts we used in this step are as follows, omitting chain-of-thought examples for simplicity.

$P_2^1$ = "Suggest some dimensions that can differentiate the following image captions."

$P_2^2$ = "Suggest a complete criterion to differentiate the following image captions by the given dimension."
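
The sample-and-ensemble idea for dimension generation can be sketched in Python as below. This is only an illustration under stated assumptions: `llm` is a hypothetical callable that returns an already-parsed list of dimension names, and majority voting is read here as "keep a dimension proposed in more than half of the rounds" (the `min_share` threshold is our choice, not a value from the paper).

```python
import random
from collections import Counter

P2_1 = "Suggest some dimensions that can differentiate the following image captions."


def generate_dimensions(captions, llm, num_iterations, num_samples,
                        min_share=0.5, seed=0):
    """Propose dimensions on sampled caption batches, then majority-vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_iterations):
        batch = rng.sample(captions, num_samples)  # only a small batch per call
        proposals = llm(P2_1 + "\n" + "\n".join(batch))
        votes.update(set(proposals))               # one vote per round
    return sorted(d for d, n in votes.items() if n / num_iterations > min_share)


def make_fake_llm():
    """Stand-in LLM (hypothetical): proposes 'Color' only in its first call."""
    calls = {"n": 0}

    def fake_llm(prompt):
        calls["n"] += 1
        return ["Location", "Action"] + (["Color"] if calls["n"] == 1 else [])

    return fake_llm


captions = ["a dog running on grass", "a dog swimming in water",
            "a dog sleeping indoors"]
dims = generate_dimensions(captions, make_fake_llm(),
                           num_iterations=4, num_samples=2)
```

Each call sees only `num_samples` captions, so the token cost grows with the number of iterations rather than with the dataset size, while rarely proposed dimensions such as "Color" in the toy run are voted out.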

Algorithm 3: Step 2-2 Attribute Generation

Input: Dimensions: $Dims$, Captions: $C$, Large language model: LLM
Output: Initialized criteria: $Criteria$
1:  for $dim$ in $Dims$ do
2:   for i in range(NumOfIterations) do
3:    $c$ = $C$.sample(NumOfSamples)
4:    $S$.append(LLM([$P_2^2$, $dim$, $c$]))
5:   end for
6:   $Attributes$ = MajorityVote($S$)  ▷ list of attributes
7:   $Criteria[dim] = Attributes$
8:   $S$.reset()
9:  end for
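
A possible Python rendering of Algorithm 3 is sketched below. As before, `llm` is a hypothetical callable returning a parsed list of attribute names, the prompt layout with "Dimension:" is an assumption, and an attribute is kept when it is proposed in more than half of the rounds.

```python
import random
from collections import Counter

P2_2 = ("Suggest a complete criterion to differentiate the following "
        "image captions by the given dimension.")


def generate_attributes(dims, captions, llm, num_iterations, num_samples, seed=0):
    """For each dimension, collect attribute proposals and majority-vote."""
    rng = random.Random(seed)
    criteria = {}
    for dim in dims:
        votes = Counter()  # fresh tally per dimension (S.reset() in Algorithm 3)
        for _ in range(num_iterations):
            batch = rng.sample(captions, num_samples)
            prompt = f"{P2_2}\nDimension: {dim}\n" + "\n".join(batch)
            votes.update(set(llm(prompt)))
        criteria[dim] = sorted(a for a, n in votes.items()
                               if n > num_iterations / 2)
    return criteria


def fake_llm(prompt):
    """Stand-in LLM (hypothetical) with fixed answers per dimension."""
    if "Dimension: Location" in prompt:
        return ["Grass", "Water"]
    return ["Running", "Swimming"]


captions = ["a dog running on grass", "a dog swimming in water",
            "a dog sleeping indoors"]
criteria = generate_attributes(["Location", "Action"], captions, fake_llm,
                               num_iterations=3, num_samples=2)
```
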

3.4 Criteria Self-Refinement

To further refine the criteria and ensure their effectiveness in classifying image captions across the dataset, we implement a recursive refining process. We identify image captions that require further refinement by using the self-consistency of LLM responses as an indicator Wang et al. (2022). This choice stems from our empirical observation that if an image can be accurately classified according to a particular dimension, it should be classified into the same attribute every time the LLM is queried. Inconsistent responses, in contrast, suggest that the current criteria require further refinement. In such cases, we leverage the LLM’s ability to self-refine text with additional instructions Madaan et al. (2023).

The prompts we used in this step are as follows:

$P_3^1$ = "Classify the caption by the criteria listed below."

$P_3^2$ = "We are unable to classify the following image caption using the provided criteria due to missing certain attributes. Please suggest an additional attribute that would enable accurate classification."

Algorithm 4: Step 3 Criteria Self-Refinement

Input: Criteria: $Criteria$, Captions: $C$, Large language model: LLM
Output: Refined criteria: $Criteria$
1:  for [$dim$, $Attributes$] in $Criteria$ do
2:   for i in range(NumOfTests) do
3:    $c$ = $C$.sample()
4:    $Results$.append(LLM([$P_3^1$, $c$, $dim$, $Attributes$]))
5:   end for
6:   if not Consistent($Results$)
7:    for i in range(NumOfSuggests) do
8:     $S$.append(LLM([$P_3^2$, $c$, $dim$, $Attributes$]))
9:    end for
10:    $newattribute$ = MajorityVote($S$)
11:    $Criteria[dim].append(newattribute)$
12:    $S$.reset()
13:   end if
14:   $Results$.reset()
15:  end for
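
The self-consistency check can be sketched in Python as follows. This is one reading of Algorithm 4 under explicit assumptions: a single sampled caption is classified NumOfTests times so that the answers are comparable, `llm` is a hypothetical callable that returns one attribute name as a string, and the prompt layout is invented for the example.

```python
import random
from collections import Counter

P3_1 = "Classify the caption by the criteria listed below."
P3_2 = ("We are unable to classify the following image caption using the "
        "provided criteria due to missing certain attributes. Please suggest "
        "an additional attribute that would enable accurate classification.")


def refine_criteria(criteria, captions, llm, num_tests, num_suggests, seed=0):
    """Add an attribute to a dimension when repeated classifications disagree."""
    rng = random.Random(seed)
    refined = {dim: list(attrs) for dim, attrs in criteria.items()}
    for dim, attrs in refined.items():
        caption = rng.choice(captions)
        context = f"\nCaption: {caption}\nDimension: {dim}\nAttributes: {attrs}"
        results = [llm(P3_1 + context) for _ in range(num_tests)]
        if len(set(results)) > 1:  # inconsistent answers: criteria are incomplete
            suggestions = Counter(llm(P3_2 + context) for _ in range(num_suggests))
            new_attr = suggestions.most_common(1)[0][0]  # majority vote
            if new_attr not in attrs:
                attrs.append(new_attr)
    return refined


def make_fake_llm():
    """Stand-in LLM (hypothetical): wavers between Grass and Water."""
    state = {"n": 0}

    def fake_llm(prompt):
        if prompt.startswith(P3_2):
            return "Snow"
        state["n"] += 1
        return "Grass" if state["n"] % 2 else "Water"

    return fake_llm


refined = refine_criteria({"Location": ["Grass", "Water"]},
                          ["a dog playing in the snow"], make_fake_llm(),
                          num_tests=4, num_suggests=3)
```

In the toy run, the caption about snow cannot be placed consistently under Grass or Water, so the suggested attribute Snow is appended to the Location dimension.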

3.5 Subpopulation Assignment

Equipped with the comprehensive criteria, we proceed to systematically assign each image to the specific attributes of each dimension. Images assigned to the same attributes across all dimensions form distinct subgroups within the dataset, revealing the intrinsic subpopulation structures hidden within the data. These subpopulation structures can then be leveraged to perform various downstream tasks, completing our overall pipeline for employing an LLM to analyze the dataset by discovering subpopulation structures.

The prompt we used in this step is as follows:

$P_4$ = "Please assign following caption to one attribute of given dimension."

Algorithm 5: Step 4 Subpopulation Assignment

Input: Captions: $C$, Large language model: LLM, Criteria: $Criteria$
Output: Further assignments for each caption $c$.
1:  for $c$ in $C$ do
2:   for [$dim$, $Attributes$] in $Criteria$ do
3:    $c$.assign(LLM([$P_4$, $c$, $dim$, $Attributes$]))
4:   end for
5:  end for
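
The assignment step, together with the grouping of captions that share the same attribute in every dimension, can be sketched as below. The `llm` callable, the prompt layout, and the keyword-matching `fake_llm` are hypothetical stand-ins used only to make the example runnable.

```python
from collections import defaultdict

P4 = "Please assign following caption to one attribute of given dimension."


def assign_subpopulations(captions, criteria, llm):
    """Map each caption to one attribute per dimension and group by the tuple."""
    groups = defaultdict(list)
    for caption in captions:
        key = tuple(
            llm(f"{P4}\nCaption: {caption}\nDimension: {dim}\nAttributes: {attrs}")
            for dim, attrs in criteria.items()
        )
        groups[key].append(caption)  # same attributes in all dimensions
    return dict(groups)


def fake_llm(prompt):
    """Stand-in LLM (hypothetical): picks the attribute named in the caption."""
    caption = prompt.split("Caption: ")[1].split("\n")[0]
    if "Dimension: Location" in prompt:
        options = ["Grass", "Water"]
    else:
        options = ["Running", "Swimming"]
    return next(o for o in options if o.lower() in caption)


criteria = {"Location": ["Grass", "Water"], "Action": ["Running", "Swimming"]}
captions = ["a dog running on grass", "a dog swimming in water",
            "a puppy running on grass"]
subpops = assign_subpopulations(captions, criteria, fake_llm)
```

Each key of `subpops` is one subpopulation, identified by its attribute in every dimension, and the sizes of the value lists give the subpopulation distribution.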

3.6 Task-specific Tuning for Downstream Tasks

Once we have identified the criteria and subpopulations within the dataset, we can leverage this information to tackle several downstream tasks effectively. This includes organizing the subpopulations, which can reveal valuable insights into the data, such as potential subpopulation biases and the presence of long-tail attributes. More importantly, the subpopulation structures can be used to improve model performance on various tasks if combined with extra operations, including handling subpopulation shifts and slice discovery.

Dataset Subpopulation Organization. There are several advantages to organizing subpopulations within a dataset. It provides the number of data points in each subpopulation, reflects dataset biases, and helps identify long-tail subgroups. Recent research suggests that creating hierarchical datasets with additional image attributes is beneficial for building more complex benchmarks Liang & Zou (2022); Zhang et al. (2022). However, organizing subpopulations for such tasks currently requires significant manual effort. Our method, SSD-LLM, automates subpopulation organization, holding the potential to revolutionize the construction of hierarchical datasets.

Subpopulation Shift. Subpopulation shift Yang et al. (2023), a common challenge in machine learning, occurs when the proportions of some subpopulations change between training and deployment. Previous works Yang et al. (2023); Santurkar et al. (2020) have shown that subpopulation shifts can cause significant drops in model performance, motivating methods that can handle them. SSD-LLM, combined with image generation, offers a solution for better handling subpopulation shifts. In our experiments, after applying SSD-LLM to the datasets, we count the number of images contained in each subpopulation and utilize a diffusion model to generate images for underrepresented subpopulations, which helps to achieve balanced classes and attributes. Specifically, we randomly sample attributes from the subpopulation structure for each class and employ an LLM to compose complete sentences from these words, which serve as the input prompts of the diffusion model. SSD-LLM with image generation achieves competitive results on several datasets, showcasing the effectiveness and strength of our method.
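
The counting step can be illustrated with a short Python sketch that derives how many images each underrepresented subpopulation would need. The toy counts and the helper names are invented, and topping every subpopulation up to the size of the largest one is our assumption; the text does not specify the balancing target.

```python
from collections import Counter


def generation_plan(assignments):
    """assignments: one (class_name, attribute_tuple) pair per image."""
    counts = Counter(assignments)
    target = max(counts.values())  # assumption: balance up to the largest group
    return {group: target - n for group, n in counts.items() if n < target}


def prompt_keywords(class_name, attributes):
    """Keywords that an LLM would turn into a full sentence for the diffusion model."""
    return [class_name, *attributes]


# Toy assignment counts (hypothetical numbers).
assignments = (
    [("waterbird", ("water",))] * 5
    + [("waterbird", ("land",))] * 1
    + [("landbird", ("land",))] * 5
    + [("landbird", ("water",))] * 2
)
plan = generation_plan(assignments)
keywords = prompt_keywords("waterbird", ("land",))
```
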

Slice Discovery. Slice discovery aims to uncover subpopulations within a dataset on which a machine learning model consistently performs poorly. These underperforming subpopulations, or slices, provide valuable insights into the model’s limitations and potential biases. By identifying slices, researchers can gain a deeper understanding of where the model underperforms and obtain information about how to improve its performance on these challenging subpopulations.
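
Once every validation image carries a subpopulation label, candidate slices can be ranked by per-subpopulation error rate. The sketch below illustrates this with invented labels and correctness flags; the function name and the `min_size` filter are assumptions for the example.

```python
from collections import defaultdict


def rank_slices(groups, correct, min_size=1):
    """Return (subpopulation, error_rate) pairs, highest error rate first."""
    totals, errors = defaultdict(int), defaultdict(int)
    for g, ok in zip(groups, correct):
        totals[g] += 1
        errors[g] += 0 if ok else 1
    rates = {g: errors[g] / totals[g] for g in totals if totals[g] >= min_size}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)


# Toy validation set: subpopulation of each image and whether the model was right.
groups = ["indoor", "indoor", "snow", "snow", "snow", "grass"]
correct = [True, False, False, False, True, True]
slices = rank_slices(groups, correct)
```

The subpopulations at the top of the ranking are the candidate slices; the `min_size` filter can be raised to ignore groups too small to give a reliable error estimate.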

4 Experiments

We now present experimental results demonstrating the effectiveness of SSD-LLM. In particular, we present the main settings and results in this section and defer extra details, including the precise text prompts used and various visualization results, to the appendix. In our experiments, we mainly use LLama2-accessory Zhang et al. (2023) for the MLLM and GPT-4 Achiam et al. (2023) for the LLM, while ablation studies investigating other choices of foundation models are included in the appendix.

Our approach was benchmarked against the most current baselines in tasks including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery, where it demonstrated competitive performance. This success underscores the method’s efficacy in identifying and analyzing subgroups, further affirming its utility in addressing related challenges. Moreover, it illustrates that the unified paradigm can effectively address a variety of downstream tasks.

Figure 4: A visualization of organised subpopulations in a dataset of cats.

Dataset             Criterion  SCAN     IC|TC    Ours
Stanford 40 Action  Action     0.397    0.774    0.752
Stanford 40 Action  Location   0.359*   0.822*   0.807*
Stanford 40 Action  Mood       0.250*   0.793*   0.682*
Place365            Place      0.332    -        0.523
Table 1: Quantitative results of Dataset Subpopulation Organization. Accuracies labeled with * are evaluated against ground-truth labels provided by a human, following the protocol introduced in Kwon et al. (2023b). We consider IC|TC an upper bound because it involves human judgment throughout the whole process; in particular, their approach manually specifies the dimensions.

4.1 Evaluation on Dataset Subpopulation Organization

Setup. The data assignment process facilitated by SSD-LLM, which leverages the discovered subgroup structure, can be considered a form of clustering. To evaluate the quality of the identified subpopulations, we assess the clustering accuracy by comparing images against secondary labels that reflect subgroup attributes derived via SSD-LLM. Full information on the datasets, text criteria, and model selection can be found in the appendix.

Comparison Methods. SCAN Van Gansbeke et al. (2020) is a two-stage clustering method that decouples feature learning and clustering. IC|TC Kwon et al. (2023b) is a new paradigm for image clustering that supports human interaction. It utilizes the given text criteria to accurately control the quality of the clustering results.

Results and Analysis. In Table 1, we report the accuracy achieved by each method under the predefined textual criteria. SSD-LLM outperforms the SCAN baseline on every criterion, with an average improvement of 35.6 percentage points. This enhancement is attributed to the effective subpopulation structures uncovered by our approach. When compared to IC|TC, SSD-LLM demonstrates comparable performance. It is important to note that IC|TC is considered an experimental upper bound because it incorporates human judgment, manually refining the text criteria for improved clustering results; in contrast, our approach is fully automated. Moreover, SSD-LLM offers greater flexibility by not requiring the pre-specification of cluster centers, thereby eliminating the constraints imposed by the number of ground truth (GT) labels and leveraging the full potential of the LLM for attribute discovery and assignment. We also visualize our organized subpopulations in Figure 4. The images and their assigned dimensions are visibly consistent, indicating the effectiveness and rationality of our mining and assignment process.
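
The reported average improvement can be checked against Table 1. The short calculation below assumes, as our reading of the table rather than a statement from the text, that the figure is the mean absolute accuracy gain of SSD-LLM over SCAN across the four rows.

```python
# Accuracies copied from Table 1 (rows: Action, Location, Mood, Place).
scan = [0.397, 0.359, 0.250, 0.332]
ours = [0.752, 0.807, 0.682, 0.523]

gains = [o - s for o, s in zip(ours, scan)]
mean_gain_points = 100 * sum(gains) / len(gains)  # in percentage points
```

Under this reading the mean gain is about 35.65 percentage points, which matches the reported 35.6 up to rounding.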

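Type | Method | Avg Acc (Waterbirds) | Avg Acc (Metashift) | Avg Acc (Average) | Worst Group Acc (Waterbirds) | Worst Group Acc (Metashift) | Worst Group Acc (Average)
Vanilla | ERM | 84.1 | 91.2 | 87.7 | 69.1 | 82.1 | 75.6
Subgroup Robust Methods | GroupDRO | 86.9 | 91.5 | 89.2 | 73.1 | 83.1 | 78.1
Subgroup Robust Methods | JTT | 88.9 | 91.2 | 90.1 | 71.2 | 82.6 | 76.9
Subgroup Robust Methods | LfF | 86.6 | 80.4 | 83.5 | 75.0 | 72.3 | 73.7
Subgroup Robust Methods | LISA | 89.2 | 91.4 | 90.3 | 77.0 | 79.0 | 78.0
Imbalanced Learning | Resample | 86.2 | 92.2 | 89.2 | 70.0 | 81.0 | 75.5
Imbalanced Learning | Reweight | 86.2 | 91.5 | 88.9 | 71.9 | 83.1 | 77.5
Imbalanced Learning | Focal | 89.3 | 91.6 | 90.5 | 71.6 | 81.0 | 76.3
Imbalanced Learning | CBLoss | 86.8 | 91.4 | 89.1 | 74.4 | 83.1 | 78.8
Imbalanced Learning | BSoftmax | 88.4 | 91.3 | 89.9 | 74.1 | 82.6 | 78.4
Traditional Data Augmentation | Mixup | 89.2 | 91.4 | 90.3 | 77.5 | 79.0 | 78.3
Traditional Data Augmentation | RandAug | 86.3 | 90.9 | 88.6 | 71.4 | 80.9 | 76.2
Diffusion | Class Prompt | 85.9 | 91.5 | 88.7 | 71.3 | 82.7 | 77.0
Diffusion | Class-Attribute Prompt | 89.1 | 91.4 | 90.3 | 73.5 | 83.8 | 78.7
Diffusion | CiP | 88.0 | 91.1 | 89.6 | 73.5 | 82.4 | 77.9
LLM+Diffusion | SSD-LLM (Ours) | 90.1 | 92.3 | 91.2 | 78.3 | 84.2 | 81.3
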
Table 2: Numerical comparison of methods and their performances for image classification with subpopulation shifts.

4.2 Evaluation on Subpopulation Shift

Setup We evaluate subpopulation shift on two commonly used image datasets, Metashift (Cats vs Dogs) and Waterbirds (Landbirds vs Waterbirds). We choose Average Accuracy and Worst Group Accuracy as evaluation metrics. To ensure a fair comparison, following Yang et al. (2023), we conduct a random search of 16 trials over a joint distribution of all hyperparameters. We then use the validation set to select the best hyperparameters for each algorithm, fix them, and rerun the experiments with five different random seeds to report the final average results. To make the evaluation more realistic, we consider the model selection setting in which attributes are unknown in both training and validation.
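The two metrics can be sketched as follows. The example assumes that a group is a (class, attribute) pair, as is customary for Waterbirds-style benchmarks, and that group annotations are available for the evaluation split only; the function names and the toy predictions are illustrative and not taken from the benchmark code.

```python
from collections import defaultdict


def group_accuracies(preds, labels, attrs):
    """Accuracy per (class, attribute) group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, a in zip(preds, labels, attrs):
        total[(y, a)] += 1
        correct[(y, a)] += int(p == y)
    return {g: correct[g] / total[g] for g in total}


def average_and_worst(preds, labels, attrs):
    """Return (average accuracy, worst group accuracy)."""
    per_group = group_accuracies(preds, labels, attrs)
    average = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
    return average, min(per_group.values())


# Toy evaluation set: the single waterbird photographed on land is misclassified.
labels = ["waterbird", "waterbird", "landbird", "landbird", "waterbird", "landbird"]
attrs = ["water", "land", "land", "water", "water", "land"]
preds = ["waterbird", "landbird", "landbird", "landbird", "waterbird", "landbird"]
average, worst = average_and_worst(preds, labels, attrs)
print(average, worst)
```

The toy case shows why both metrics are reported: the average stays high while the minority group fails entirely.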

Comparison Methods Following the recent benchmarking effort of Yang et al. (2023), we compare SSD-LLM with several types of methods: (1) Vanilla: ERM Vapnik (1999), (2) Subgroup Robust Methods: GroupDRO Sagawa et al. (2019), LfF Nam et al. (2020), JTT Liu et al. (2021), LISA Yao et al. (2022), (3) Imbalanced Learning: Resample Japkowicz (2000), Reweight Japkowicz (2000), Focal Lin et al. (2017), CBLoss Cui et al. (2019), BSoftmax Ren et al. (2020), (4) Traditional Data Augmentation: Mixup Zhang et al. (2017), RandAug Cubuk et al. (2020), (5) Diffusion: Class Prompt Shipard et al. (2023), Class-Attribute Prompt Shipard et al. (2023), CiP Lei et al. (2023).

Results and Analysis As shown in Table 2, SSD-LLM outperforms previous methods, improving average accuracy by +0.7 and worst group accuracy by +2.5, averaged across the two datasets. We analyze the results from three perspectives: (1) Although they are built on conventional ERM, data-based approaches achieve competitive overall performance compared with model-based algorithms, demonstrating the potential of this paradigm. (2) Among diffusion-based methods, class-attribute prompts outperform class prompts, which shows that knowing the imbalanced attributes of a dataset is critical for alleviating subpopulation shift via image generation, primarily because the generated images are more targeted. However, this approach requires the imbalanced attributes to be identified and annotated manually in advance, which underscores the value of SSD-LLM: it not only automates the discovery of dataset attributes but also provides more detailed annotations than manual labeling, leading to better performance. (3) The advantage of CiP over Class Prompt highlights the importance of diverse text prompts. The further gain of our method can be attributed to its comprehensive analysis of the imbalanced subpopulation distribution inherent in the dataset. The text diversity in our approach (introduced by attribute sampling and LLM sentence-making) is therefore not arbitrary but strategic, directly targeting the subpopulation shift.
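The reported gains can be reproduced from the two "Average" columns of Table 2. The numbers are consistent with measuring SSD-LLM against the strongest baseline on each metric separately; that reading is an assumption, checked by the short script below, whose variable names are illustrative.

```python
# (average accuracy, worst group accuracy), each averaged over Waterbirds
# and Metashift, copied from the "Average" columns of Table 2.
baselines = {
    "ERM": (87.7, 75.6),
    "GroupDRO": (89.2, 78.1),
    "JTT": (90.1, 76.9),
    "LfF": (83.5, 73.7),
    "LISA": (90.3, 78.0),
    "Resample": (89.2, 75.5),
    "Reweight": (88.9, 77.5),
    "Focal": (90.5, 76.3),
    "CBLoss": (89.1, 78.8),
    "BSoftmax": (89.9, 78.4),
    "Mixup": (90.3, 78.3),
    "RandAug": (88.6, 76.2),
    "Class Prompt": (88.7, 77.0),
    "Class-Attribute Prompt": (90.3, 78.7),
    "CiP": (89.6, 77.9),
}
ssd_llm = (91.2, 81.3)

# Strongest baseline per metric (they are different methods).
best_avg_name = max(baselines, key=lambda m: baselines[m][0])
best_worst_name = max(baselines, key=lambda m: baselines[m][1])

gain_avg = round(ssd_llm[0] - baselines[best_avg_name][0], 1)
gain_worst = round(ssd_llm[1] - baselines[best_worst_name][1], 1)

print(best_avg_name, gain_avg)
print(best_worst_name, gain_worst)
```

Note that the two reference points differ: the best baseline on average accuracy is not the best one on worst group accuracy.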

Method | Boats | Birds | Cars | Cats | Dogs | Trucks | Topic Pass Rate
ImageNet | 4.33 | 0.81 | 11.33 | 11.14 | 0.69 | 11.71 | 6.72
General Prompt | 39.85 | 19.00 | 31.00 | 19.75 | 20.50 | 24.81 | 25.81
Domino (Bert) | 42.34 | 23.21 | 29.87 | 24.56 | 26.86 | 32.19 | 29.83
SSD-LLM (Ours) | 53.31 | 33.67 | 37.34 | 33.69 | 21.08 | 39.57 | 36.44
Table 3: Results of slice discovery on ImageNet-1K with various SDMs. The best accuracy is highlighted in bold, while the second one is underlined.

4.3 Evaluation on Slice Discovery

Setup In contrast to typical slice discovery tasks, we redefine the evaluation pipeline following Gao et al. (2023a) to showcase the advantages of our approach. In this study, we evaluate the bugs found in ImageNet models. Specifically, a classification is deemed incorrect when an image containing the target object is erroneously identified by the model as containing an unrelated object.
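The error criterion above can be sketched as a per-slice error rate. The example assumes that each target superclass comes with a set of classes regarded as related, so that only predictions outside this set count as errors; the function name, the related-class set, and the predictions are hypothetical placeholders rather than the evaluation code used here.

```python
def slice_error_rate(predictions, related_classes):
    """Fraction of images in a slice predicted as an unrelated class."""
    if not predictions:
        return 0.0
    errors = sum(1 for p in predictions if p not in related_classes)
    return errors / len(predictions)


# Hypothetical slice for the superclass "boats": model predictions for four
# images that all contain a boat.
related_to_boats = {"canoe", "speedboat", "yawl"}
slice_predictions = ["canoe", "dock", "yawl", "seashore"]
error_rate = slice_error_rate(slice_predictions, related_to_boats)
print(error_rate)
```

Predicting a different boat class is tolerated under this criterion, while predicting the surrounding scene is counted as a bug.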

Comparison Methods Domino Eyuboglu et al. (2022) represents a state-of-the-art method in slice discovery, which effectively clusters errors identified in the validation set and characterizes them through captions generated automatically.

Results and Analysis We evaluate the effectiveness of our method for slice discovery on 6 representative superclasses of ImageNet with a ResNet50 He et al. (2016) backbone. As shown in Table 3, SSD-LLM outperforms the other SDMs, including Domino Eyuboglu et al. (2022), by a significant margin on average and on five of the six superclasses (Domino is higher on Dogs). We find that the topics given by Domino tend to suffer from two unsatisfactory cases: loss of semantics and missing class information. These cases are also discussed in Gao et al. (2023a), who attribute the phenomenon to the inherent difficulties of automatic SDMs. Our results suggest that SSD-LLM can handle these problems while keeping a high rate of discovered errors and remaining automated. Specifically, SSD-LLM achieves an average error rate of 36.44%, surpassing the general prompt and Domino Eyuboglu et al. (2022) by 10.63 and 6.61 percentage points, respectively. Furthermore, when we trace back to the original dataset, the discovered slices are also highly consistent (detailed visualizations are included in the appendix). Interestingly, the data mining process of SSD-LLM resembles that of human data scientists, who form hypotheses and improve model performance by inspecting batches of poorly performing subpopulations. These experiments show that our method overcomes the inherent difficulties while maintaining automation, paving the way for data-centric methods.

4.4 Ablation Study

Hyperparameters of Criteria Initialization/Refinement The notation N × M | 1/T in Table 4 denotes the number of suggestion rounds (N), the number of samples per suggestion round (M), and the threshold for the majority vote (1/T). Setting A is the main setting of our method. For A → B, the performance drop indicates the effectiveness of Criteria Refinement, which yields a more comprehensive subpopulation structure. For A → C, we double the suggestion rounds for dimensions and the performance is basically unchanged, which suggests that the dimensions of a dataset are relatively stable. For A → D and A → E, the performance decreases slightly, demonstrating the need for suitable hyperparameters in attribute analysis. For A → F, the performance drop indicates the importance of high-quality attribute selection.
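To make the role of the 1/T threshold concrete, the following sketch aggregates suggestions over several rounds and keeps those that reach the vote threshold. It assumes that a candidate receives one vote per round in which it is suggested and is kept when its share of rounds is at least 1/T; the exact voting rule and the LLM calls that produce the suggestions are not shown in this section, so the rounds below are hard-coded placeholders.

```python
from collections import Counter


def majority_vote(rounds, T):
    """Keep candidates suggested in at least a 1/T fraction of the rounds."""
    counts = Counter()
    for suggestions in rounds:
        counts.update(set(suggestions))  # at most one vote per round
    # n / len(rounds) >= 1 / T, written with integers to avoid float issues
    return sorted(c for c, n in counts.items() if n * T >= len(rounds))


# Hypothetical outputs of N = 4 suggestion rounds for dataset dimensions.
rounds = [
    ["background", "pose"],
    ["background", "color"],
    ["background", "pose"],
    ["background"],
]
strict = majority_vote(rounds, T=2)
loose = majority_vote(rounds, T=4)
print(strict)
print(loose)
```

A looser threshold (larger T) admits candidates that appear only once, which is one way to read the performance drop of setting F.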

Methods | Index | Criteria Initialization (Dimension) | Criteria Initialization (Attribute) | Criteria Refinement | Average Accuracy | Worst Group Accuracy
ERM | - | - | - | - | 84.1 | 69.1
SSD-LLM | A | 10 × 50 | 1/4 | 20 × 20 | 1/6 | ✓ | 90.1 | 78.3
SSD-LLM | B | 10 × 50 | 1/4 | 20 × 20 | 1/6 | ✗ | 88.3 | 77.2
SSD-LLM | C | 20 × 50 | 1/4 | 20 × 20 | 1/6 | ✓ | 90.3 | 78.4
SSD-LLM | D | 10 × 50 | 1/4 | 10 × 40 | 1/6 | ✓ | 89.3 | 77.6
SSD-LLM | E | 10 × 50 | 1/4 | 40 × 10 | 1/6 | ✓ | 89.5 | 78.0
SSD-LLM | F | 10 × 50 | 1/4 | 20 × 20 | 1/20 | ✓ | 89.0 | 77.5
Table 4: Ablation study on the number of samples (number of rounds × number of samples per round), majority voting ratio, and criteria refinement (with or without). Within the two Criteria Initialization columns, each entry reads N × M | 1/T.

Diffusion Generation Strategy Table 5 analyzes the key components of using a diffusion model to mitigate subpopulation shift. For A → B, we verify that knowledge of the imbalanced attributes within a dataset is helpful for subpopulation shift tasks. For B → D, we show that a comprehensive subpopulation structure with diverse attributes is more beneficial for these tasks. For B → C and D → E, we demonstrate the effectiveness of the balanced subpopulation sampling strategy, which helps balance the subpopulation distribution of the dataset and improves data quality for model training. For E → F, prompts produced by LLM sentence-making generate more reasonable images and thus achieve better results. For F → G and F → H, we verify the scaling capability of SSD-LLM.
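The difference between the "random" and "weighted" sample modes can be illustrated with a small sketch. It assumes that weighted sampling draws each subpopulation with probability inversely proportional to its current count, and that the "Direct" prompt mode simply concatenates the class name with the sampled attribute; both the weighting rule and the prompt template are assumptions made for illustration, not the exact procedure used in the experiments.

```python
import random


def balanced_weights(counts):
    """Sampling probabilities inversely proportional to subpopulation size.

    Assumes every count is positive.
    """
    inverse = {attr: 1.0 / n for attr, n in counts.items()}
    norm = sum(inverse.values())
    return {attr: w / norm for attr, w in inverse.items()}


def sample_direct_prompts(class_name, counts, k, seed=0):
    """Draw k attributes with balanced weights and build direct prompts."""
    rng = random.Random(seed)
    weights = balanced_weights(counts)
    attrs = list(weights)
    chosen = rng.choices(attrs, weights=[weights[a] for a in attrs], k=k)
    # Hypothetical "Direct" template: class name followed by the attribute.
    return [f"a photo of a {class_name}, {attr}" for attr in chosen]


# Hypothetical counts for the class "waterbird" in an imbalanced training set.
counts = {"on water": 90, "on land": 10}
weights = balanced_weights(counts)
prompts = sample_direct_prompts("waterbird", counts, k=5)
print(weights)
print(prompts)
```

With these counts the rare "on land" subpopulation is drawn about nine times as often as the common one, so generated images fill the gap in the training distribution.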

Methods | Index | Attribute Mode | Sample Mode | SD Prompt Mode | Number | Average Accuracy | Worst Group Accuracy
ERM | - | - | - | - | ×1 | 84.1 | 69.1
SSD-LLM | A | GT Attribute Unknown | random | Direct | ×1 | 85.9 | 71.3
SSD-LLM | B | GT Attribute Known | random | Direct | ×1 | 89.1 | 73.5
SSD-LLM | C | GT Attribute Known | weighted | Direct | ×1 | 89.3 | 73.8
SSD-LLM | D | SSD-LLM Attribute | random | Direct | ×1 | 89.5 | 76.2
SSD-LLM | E | SSD-LLM Attribute | weighted | Direct | ×1 | 89.8 | 77.4
SSD-LLM | F | SSD-LLM Attribute | weighted | LLM Sentence | ×1 | 90.1 | 78.3
SSD-LLM | G | SSD-LLM Attribute | weighted | LLM Sentence | ×2 | 91.1 | 79.2
SSD-LLM | H | SSD-LLM Attribute | weighted | LLM Sentence | ×4 | 91.4 | 79.6
Table 5: Ablation study on the sample mode and SD prompt mode.

5 Conclusion

In this work, we present the first systematic exploration of subpopulation structure discovery. We provide a precise definition of subpopulation structure and introduce fine-grained criteria to determine the structures. We propose a novel method, SSD-LLM, for automatic subpopulation structure discovery that incorporates elaborate prompt engineering techniques. SSD-LLM can be combined with various subsequent operations to tackle several downstream tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery, with minor task-specific tuning. We have conducted extensive experiments to verify the proposed method, demonstrating the effectiveness and generality of SSD-LLM.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Chen et al. (2023) Muxi Chen, Yu Li, and Qiang Xu. Hibug: On human-interpretable model debug. 2023.
  • Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp.  702–703, 2020.
  • Cui et al. (2023) Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092, 2023.
  • Cui et al. (2019) Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  9268–9277, 2019.
  • Dai et al. (2023) Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.
  • Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
  • Dunlap et al. (2023) Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E Gonzalez, and Trevor Darrell. Diversify your vision datasets with automatic diffusion-based augmentation. arXiv preprint arXiv:2305.16289, 2023.
  • Eyuboglu et al. (2022) Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings. arXiv preprint arXiv:2203.14960, 2022.
  • Fu et al. (2023) Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023.
  • Gao et al. (2023a) Irena Gao, Gabriel Ilharco, Scott Lundberg, and Marco Tulio Ribeiro. Adaptive testing of computer vision models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4003–4014, 2023a.
  • Gao et al. (2023b) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023b.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Japkowicz (2000) Nathalie Japkowicz. The class imbalance problem: Significance and strategies. In Proc. of the Int’l Conf. on artificial intelligence, volume 56, pp.  111–117, 2000.
  • Kim et al. (2023) Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, and Jinwoo Shin. Bias-to-text: Debiasing unknown visual biases through language interpretation. arXiv preprint arXiv:2301.11104, 2023.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.
  • Kwon et al. (2023a) Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K Ryu, and Kangwook Lee. Image clustering conditioned on text criteria. 2023a.
  • Kwon et al. (2023b) Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K Ryu, and Kangwook Lee. Image clustering conditioned on text criteria. arXiv preprint arXiv:2310.18297, 2023b.
  • Lei et al. (2023) Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, and Dacheng Tao. Image captions are natural prompts for text-to-image models. arXiv preprint arXiv:2307.08526, 2023.
  • Li & Qiu (2023) Xiaonan Li and Xipeng Qiu. Mot: Pre-thinking and recalling enable chatgpt to self-improve with memory-of-thoughts. arXiv preprint arXiv:2305.05181, 2023.
  • Liang & Zou (2022) Weixin Liang and James Zou. Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. arXiv preprint arXiv:2202.06523, 2022.
  • Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp.  2980–2988, 2017.
  • Liu et al. (2021) Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In International Conference on Machine Learning, pp.  6781–6792. PMLR, 2021.
  • Liu et al. (2023a) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
  • Liu et al. (2023b) Jiaxiang Liu, Tianxiang Hu, Yan Zhang, Xiaotang Gai, Yang Feng, and Zuozhu Liu. A chatgpt aided explainable framework for zero-shot medical image diagnosis. arXiv preprint arXiv:2307.01981, 2023b.
  • Liu et al. (2023c) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023c.
  • Luo et al. (2023) Yifan Luo, Yiming Tang, Chengfeng Shen, Zhennan Zhou, and Bin Dong. Prompt engineering through the lens of optimal control. arXiv preprint arXiv:2310.14201, 2023.
  • Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
  • Nam et al. (2020) Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33:20673–20684, 2020.
  • Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
  • Qian et al. (2023) Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
  • Qiao et al. (2022) Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022.
  • Ren et al. (2020) Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in neural information processing systems, 33:4175–4186, 2020.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • Sagawa et al. (2019) Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
  • Santurkar et al. (2020) Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift, 2020.
  • Shipard et al. (2023) Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, and Clinton Fookes. Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  769–778, 2023.
  • Van Gansbeke et al. (2020) Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In European conference on computer vision, pp.  268–285. Springer, 2020.
  • Vapnik (1999) Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
  • Wah et al. (2011) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Xie et al. (2021) Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
  • Yang et al. (2023) Yuzhe Yang, Haoran Zhang, Dina Katabi, and Marzyeh Ghassemi. Change is hard: A closer look at subpopulation shift, 2023.
  • Yao et al. (2022) Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp.  25407–25437. PMLR, 2022.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhang et al. (2023) Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient finetuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
  • Zhang et al. (2024) Rongyu Zhang, Zefan Cai, Huanrui Yang, Zidong Liu, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Baobao Chang, Yuan Du, et al. Vecaf: Vlm-empowered collaborative active finetuning with training objective awareness. arXiv preprint arXiv:2401.07853, 2024.
  • Zhang et al. (2022) Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. Nico++: Towards better benchmarking for domain generalization, 2022.
  • Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.