LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Yulin Luo
Peking University
yulin@stu.pku.edu.cn

Ruichuan An
Xi’an Jiaotong University
arctanx@stu.xjtu.edu.cn

Bocheng Zou
University of Wisconsin-Madison
bochengz@cs.wisc.edu

Yiming Tang
Peking University
yimingtangible@163.com

Jiaming Liu
Peking University
jiamingliu@stu.pku.edu.cn

Shanghang Zhang
Peking University
shanghang@pku.edu.cn

National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University (same for all authors from Peking University)
Equal contribution
Corresponding author

Abstract

The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing the subpopulation distribution provides a comprehensive understanding of a dataset and is a powerful tool for various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all the mentioned tasks in a unified way, we introduce the novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs the world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows for downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery. With SSD-LLM, we can automatically organize datasets at the subpopulation level, achieve an average gain of +2.5 in worst group accuracy over previous methods on the subpopulation shift benchmarks Waterbirds and Metashift, and identify more consistent slice topics with a higher model error rate of 36.44% on the slice discovery task for ImageNet.

1 Introduction

Subpopulation, defined by a set of data points that share common characteristics, is an important concept in machine learning Yang et al. (2023). Many tasks are subpopulation-related. For example, image clustering conditioned on text criteria Kwon et al. (2023a) partitions an image dataset into different subpopulations based on user-specified criteria; the study of subpopulation shift Yang et al. (2023); Liang & Zou (2022); Zhang et al. (2022) aims to mitigate the negative impact of imbalanced subpopulation distributions in the training set on the model; and slice discovery Eyuboglu et al. (2022); Chen et al. (2023) aims to identify subpopulations on which a model underperforms.

Figure 1: (A) The Workflow of Subpopulation Structure Discovery with Large Language Models (SSD-LLM). SSD-LLM can further support several downstream tasks including: (B) Dataset Subpopulation Organization; (C) Subpopulation Shift; (D) Slice Discovery.

After summarizing the commonalities of these tasks, we find that analyzing the subpopulation distribution is the key to solving all these problems. If the subpopulation distribution can be characterized, image clustering results under different criteria are naturally obtained Kwon et al. (2023a), additional images can be supplemented to rare subgroups to balance the whole dataset Dunlap et al. (2023), and slices can be easily discovered by computing error-rate statistics on the validation set Chen et al. (2023). Despite its importance, existing work Yang et al. (2023) lacks systematic exploration of subpopulation distribution. To address this issue, we propose, for the first time, the concept of subpopulation structure to represent, analyze, and utilize subpopulation distributions within datasets. By definition, a subpopulation structure is a set of hierarchical relations among several subpopulations determined by certain criteria.

Former works such as Metashift Liang & Zou (2022) and NICO++ Zhang et al. (2022) have constructed image datasets that include subpopulation information, organizing the images with respect to some extra attributes; this can be viewed as a "class-attribute-subpopulation" structure. The problem with such a structure is that it ignores the category of each attribute (its dimension), leading to attribute inconsistency and confusion. To solve this issue, we introduce a "class-dimension-attribute-subpopulation" structure. The comparison of the two structures can be seen in Fig. 2. By articulating the classification dimensions, this improved structure provides more nuanced attribute assignments.

Automatically identifying subpopulation structures within image datasets presents a significant challenge. The approach must be capable of extracting key information from images and summarizing essential content from extensive texts. Furthermore, it necessitates comprehensive world knowledge, enabling a broad understanding of various aspects of the datasets, including diverse categories, common attributes, and the relationships between dimensions and attributes.

Recently, Large Language Models (LLMs) Liu et al. (2023b); Kojima et al. (2022); Wei et al. (2022) and Multimodal Large Language Models (MLLMs) Liu et al. (2023a); Gao et al. (2023b) have attracted wide attention due to their superior capacities. LLMs have shown extensive world knowledge and remarkable abilities in summarization, instruction following Liu et al. (2023b), etc. MLLMs extend the capabilities of LLMs to handle visual inputs. Through visual instruction tuning Liu et al. (2023a), an MLLM can verbalize the rich information of images. Motivated by these, we propose a novel framework, Subpopulation Structure Discovery with Large Language Model (SSD-LLM), to automatically uncover the structure. The core idea is to generate informative captions from images with an MLLM, followed by analyzing and summarizing the subpopulation structure of datasets with an LLM. Specifically, we design two elaborate prompt engineering components, Criteria Initialization and Criteria Self-Refinement. The former utilizes a sample-based generate-and-select paradigm to summarize dimensions and attributes sequentially, which is token-efficient when applied to large datasets. The latter employs self-consistency as an indicator to evaluate and refine the criteria. After obtaining complete criteria, each image is assigned to corresponding attributes according to its caption. The final subpopulation structures can be leveraged to complete various downstream tasks with the help of our proposed Task-specific Tuning. In this work, we focus on three application scenarios, i.e., dataset subpopulation organization, subpopulation shift, and slice discovery. We validate the effectiveness of SSD-LLM on these subpopulation-related tasks. For subpopulation shift, we achieve an improvement of +2.5 in worst group accuracy across two datasets compared to SOTA methods, and for slice discovery, we identify more consistent slice topics with a model error rate of 36.44%, which is 6.61 points higher than Domino.

Our contributions are summarized as follows:

• We introduce the concept of subpopulation structure to characterize subpopulation distribution in an interpretable manner for the first time.

• We propose a class-dimension-attribute-subpopulation structure, solving the attribute confusion of the current class-attribute-subpopulation structure.

• We propose the Subpopulation Structure Discovery with Large Language Model (SSD-LLM) framework to uncover the underlying subpopulation structure of datasets automatically, with two elaborate prompt engineering components: Criteria Initialization and Criteria Self-Refinement.

• We provide methods for Task-specific Tuning, enabling the application of the structures across a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

2 Related Works

2.1 Hierarchical Structure of Image Datasets

Recent research has emphasized the need to organize datasets into hierarchical structures allowing for benchmarking various downstream tasks Liang & Zou (2022); Zhang et al. (2022); Wah et al. (2011); Russakovsky et al. (2015). Metashift Liang & Zou (2022) builds a collection of 12,868 sets of images related to 410 main subjects and their contexts. NICO++ Zhang et al. (2022), Waterbirds Wah et al. (2011), and ImageNetBG Russakovsky et al. (2015) also propose methods for constructing various types of hierarchical datasets. However, the construction of these hierarchical datasets often requires manual annotation, hindering automatic construction. These approaches focus on a single dimension, such as object context in Metashift, background in Waterbirds, and ImageNetBG, while practical scenarios involve multiple dimensions hidden within the comprehensive visual information.

2.2 Extract Information from Image Captions

Recent works such as ALIA Dunlap et al. (2023), VeCAF Zhang et al. (2024), Bias2Text Kim et al. (2023), and ICTC Kwon et al. (2023b) explore utilizing caption models to obtain information from datasets. ALIA provides a method to augment datasets by generating variations of existing images through captioning and text-to-image models. While ALIA Dunlap et al. (2023) supports dataset improvement, it lacks knowledge about attribute types, bias, or the existence of subpopulation shift. Bias2Text Kim et al. (2023) detects dataset bias by transforming images into descriptive captions and keywords. However, without the participation of a large language model, Bias2Text fails to support classification dimension selection and can only differentiate images with basic keywords. More recently, VeCAF Zhang et al. (2024) extracts information from image captions with an LLM to perform parametric data selection for PVM finetuning. Moreover, ICTC Kwon et al. (2023b) enables conditional image clustering using an LLM in a straightforward manner. Although ICTC clusters images when given the criterion, it requires human-assigned text prompts. Compared to ICTC, our approach supports automatic organization of a dataset into a subpopulation structure without human criteria assignment and can generate comprehensive criteria tailored to datasets.

2.3 LLM Prompt Engineering

As the popularity of LLMs has surged, prompt engineering, the process of crafting and refining prompts to guide LLMs towards desired outputs Kojima et al. (2022); Wei et al. (2022), has become increasingly important. Various prompt engineering methods Liu et al. (2023c); Qiao et al. (2022) and principles have emerged, and researchers and engineers have explored their applications in a diverse range of downstream tasks Cui et al. (2023); Qian et al. (2023); Liu et al. (2023b); Park et al. (2023); Fu et al. (2023). In particular, in-context learning has emerged as a pivotal technique, validated both experimentally Dong et al. (2022); Li & Qiu (2023) and theoretically Luo et al. (2023); Dai et al. (2023); Xie et al. (2021). This approach involves providing the LLM with context information relevant to the task at hand, enabling it to generate more accurate and relevant responses. Least-to-most prompting Zhou et al. (2022) breaks down complex tasks into smaller, more manageable steps, enhancing the LLM’s reasoning skill by querying it with simplified sub-questions. Self-consistency Wang et al. (2022) proposes to ensemble multiple responses of the LLM to the same prompt to get enhanced results, suggesting that consistent responses are an indicator of correct problem solving. Self-refining Madaan et al. (2023) demonstrates that LLMs can refine their own outputs when prompts are carefully designed. In this work, we leverage a combination of prompt engineering techniques, including in-context learning, chain-of-thought, self-consistency, and self-refining, to tackle subpopulation structure discovery effectively.

3 Method

In this section, we introduce our proposed method, subpopulation structure discovery with large language models (SSD-LLM). We describe the overall pipeline of SSD-LLM in Section 3.1, outlining how the paradigm automatically discovers the latent subpopulation structures inside the dataset. The process begins with captioning the images in the dataset with an MLLM, detailed in Section 3.2, and proceeds with criteria initialization with an LLM in Section 3.3. The paradigm then refines the initialized criteria through a recursive self-refinement procedure, detailed in Section 3.4. Finally, images are assigned to attributes, completing the subpopulation structure discovery process, as elaborated in Section 3.5. The section concludes with a discussion of how to apply the method to various downstream tasks, as presented in Section 3.6.

3.1 Overview

To automatically discover the subpopulation structures, we propose a novel prompt engineering paradigm that effectively leverages the capabilities of both multimodal large language models (MLLMs) and large language models (LLMs). Our proposed method comprises four key steps (see Figure 3). First, we transform the images into information-rich captions that capture the main information in the images using MLLMs. Second, we employ a novel sample-based prompt engineering method to guide an LLM to produce criteria, consisting of dimensions and corresponding attributes, that organize the dataset. Third, we prompt the LLM to self-refine the generated criteria. Last, we assign all the images in the dataset to specific attributes accordingly, uncovering the intrinsic subpopulation structures in the dataset and paving the way for further analysis of the dataset. Detailed descriptions of each step are provided below. For notation, consistent with former works Luo et al. (2023); Wei et al. (2022), we denote the operation of getting responses from the language models as $LLM()$ and $MLLM()$, and use $[,]$ to represent the concatenation of two pieces of text.

3.2 Caption Extraction

To begin our approach, we leverage the powerful image captioning capabilities of the MLLM to transform the images into informative and detailed captions. Instead of briefly describing the images, we prompt the MLLM to generate more detailed captions centered around the main subject CLS. This choice stems from the empirical observation that simple descriptions often fail to capture comprehensive visual information and may not support subsequent analysis aimed at uncovering the subpopulation structures. To encourage the MLLM to produce subject-focused captions, we also include subject information in our prompt.

The prompt we used in this step is as follows:

$P_{1}$ = "Describe the image of the subject CLS in detail."

Algorithm 1: Step 1 Caption Extraction

Input: Dataset: $D_{img}$, MLLM()
Output: Image Captions: $C$
1: for i in range(NumOfIterations)
2:     img = $D_{img}$.sample()
3:     c = MLLM([img, $P_{1}$])
4:     $C$.append(c)
5: end for
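The caption extraction loop can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_captions`, `fake_mllm`, and the `(image, prompt)` call signature are hypothetical, and the stub stands in for a real MLLM call.

```python
import random
from typing import Callable, List, Sequence

# Prompt P_1 from the text; the placeholder CLS is filled with the class name.
P1_TEMPLATE = "Describe the image of the subject {cls} in detail."


def extract_captions(
    images: Sequence[str],
    cls_name: str,
    mllm: Callable[[str, str], str],
    num_iterations: int,
    seed: int = 0,
) -> List[str]:
    """Sample images and caption each one with the subject-focused prompt."""
    rng = random.Random(seed)
    prompt = P1_TEMPLATE.format(cls=cls_name)
    captions = []
    for _ in range(num_iterations):
        img = rng.choice(images)             # img = D_img.sample()
        captions.append(mllm(img, prompt))   # c = MLLM([img, P_1])
    return captions


def fake_mllm(image_path: str, prompt: str) -> str:
    # Hypothetical stand-in for a real multimodal model call.
    return f"caption for {image_path} | {prompt}"


captions = extract_captions(["a.jpg", "b.jpg"], "cat", fake_mllm, num_iterations=3)
```

Passing the model as a callable keeps the loop independent of any particular MLLM API.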

Algorithm 2: Step 2-1 Dimension Generation

Input: Captions: $C$, LLM()
Output: Dimensions: Dims
1: for i in range(NumOfIterations)
2:     c = $C$.sample(NumOfSamples)
3:     S.append(LLM([$P_{2}^{1}$, c]))
4: end for
5: Dims = MajorityVote(S)
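A minimal sketch of the sample-then-vote pattern used for dimension generation is given below. The paper does not spell out `MajorityVote`; here it is assumed to keep every suggestion that appears in at least a `threshold` fraction of the rounds. The LLM is assumed to return an already parsed list of dimension names, and `fake_llm` is a hypothetical stub.

```python
import random
from collections import Counter
from typing import Callable, List

P2_1 = "Suggest some dimensions that can differentiate the following image captions."


def majority_vote(rounds: List[List[str]], threshold: float) -> List[str]:
    """Keep items proposed in at least `threshold` of the rounds (assumed rule)."""
    counts = Counter(item for suggestions in rounds for item in set(suggestions))
    return sorted(item for item, n in counts.items() if n / len(rounds) >= threshold)


def generate_dimensions(
    captions: List[str],
    llm: Callable[[str], List[str]],
    num_iterations: int,
    num_samples: int,
    threshold: float = 0.5,
    seed: int = 0,
) -> List[str]:
    rng = random.Random(seed)
    rounds = []
    for _ in range(num_iterations):
        batch = rng.sample(captions, num_samples)        # c = C.sample(NumOfSamples)
        rounds.append(llm(P2_1 + "\n" + "\n".join(batch)))  # S.append(LLM([P_2^1, c]))
    return majority_vote(rounds, threshold)              # Dims = MajorityVote(S)


_calls = {"n": 0}


def fake_llm(prompt: str) -> List[str]:
    # Hypothetical stub: stable suggestions plus one spurious item on the first call.
    _calls["n"] += 1
    extra = ["lighting"] if _calls["n"] == 1 else []
    return ["color", "background"] + extra


dims = generate_dimensions(
    ["a black cat", "a white cat", "a cat on grass"], fake_llm,
    num_iterations=4, num_samples=2,
)
```

In this toy run the one-off suggestion "lighting" is filtered out because it appears in only one of four rounds.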

3.3 Criteria Initialization

To discover the hidden subpopulation structures within the dataset, we employ an LLM to delve into the information-rich captions generated in the previous step. Our objective is to identify criteria that effectively partition the images into several distinct subgroups. Beyond simply dividing the dataset into subgroups, we articulate the classification dimension for the partition and record all the resulting attributes generated from the classification process. Along with the class information and the resulting subpopulations, these criteria naturally form a four-layer structure: class, dimension, attribute, and subpopulation. Since the criteria encompass multiple dimensions and their corresponding attributes, we adopt a generate-and-select paradigm with the LLM to discover the dimensions and the attributes sequentially.

To determine the dimensions and attributes, we employ an iterative sampling approach, repeatedly prompting the LLM to propose dimensions and attributes based on batches of image captions. In each iteration, the LLM generates candidate dimensions and attributes, which are subsequently combined through an ensemble process. This sample-and-ensemble approach effectively addresses the challenges of processing large datasets. Among the naive alternatives, inputting all the image captions into the LLM at once is practically infeasible for large datasets, while processing the captions individually fails to provide the LLM with sufficient context. Furthermore, the sample-based nature of our approach promotes token efficiency. Since the number of dimensions that can differentiate images in a dataset is relatively small, and these dimensions appear in numerous images, our approach effectively identifies relevant dimensions even for large-scale datasets. This efficiency is particularly crucial for applications involving large image collections or limited computational resources.

The prompts we used in this step are as follows, omitting chain-of-thought examples for simplicity.

$P_{2}^{1}$ = "Suggest some dimensions that can differentiate the following image captions."

$P_{2}^{2}$ = "Suggest a complete criterion to differentiate the following image captions by the given dimension."

Algorithm 3: Step 2-2 Attribute Generation

Input: Dimensions: Dims, Captions: $C$, Large language model: LLM
Output: Initialized criteria: Criteria
1: for dim in Dims
2:     for i in range(NumOfIterations) do
3:         c = $C$.sample(NumOfSamples)
4:         S.append(LLM([$P_{2}^{2}$, dim, c]))
5:     end for
6:     Attributes = MajorityVote(S)    *list of attributes
7:     Criteria[dim] = Attributes
8:     S.reset()
9: end for
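The per-dimension attribute loop can be sketched as below, producing a `Criteria` mapping from each dimension to its attribute list. As before, this is an illustration under assumptions: the voting rule (keep attributes proposed in at least `threshold` of the rounds), the prompt layout, and the keyword-matching `fake_llm` are all hypothetical.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

P2_2 = ("Suggest a complete criterion to differentiate the following "
        "image captions by the given dimension.")


def majority_vote(rounds: List[List[str]], threshold: float) -> List[str]:
    counts = Counter(item for suggestions in rounds for item in set(suggestions))
    return sorted(item for item, n in counts.items() if n / len(rounds) >= threshold)


def generate_attributes(
    dims: List[str],
    captions: List[str],
    llm: Callable[[str], List[str]],
    num_iterations: int,
    num_samples: int,
    threshold: float = 0.5,
    seed: int = 0,
) -> Dict[str, List[str]]:
    rng = random.Random(seed)
    criteria = {}
    for dim in dims:
        rounds = []  # a fresh S per dimension replaces S.reset()
        for _ in range(num_iterations):
            batch = rng.sample(captions, num_samples)
            rounds.append(llm(f"{P2_2}\nDimension: {dim}\n" + "\n".join(batch)))
        criteria[dim] = majority_vote(rounds, threshold)  # Criteria[dim] = Attributes
    return criteria


# Hypothetical stub: "suggests" whichever known attribute words occur in the prompt.
VOCAB = {"color": ["black", "white", "orange"], "background": ["indoor", "outdoor"]}


def fake_llm(prompt: str) -> List[str]:
    for dim, words in VOCAB.items():
        if f"Dimension: {dim}" in prompt:
            return [w for w in words if w in prompt]
    return []


captions = ["a black cat indoor", "a white cat outdoor", "a black cat outdoor"]
criteria = generate_attributes(["color", "background"], captions, fake_llm,
                               num_iterations=3, num_samples=3)
```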

3.4 Criteria Self-Refinement

To further refine the criteria and ensure its effectiveness in classifying image captions across the dataset, we implement a recursive refining process. This approach proposes a novel method for identifying image captions requiring further refinement utilizing the self-consistency of LLM responses as an indicator Wang et al. (2022). This choice stems from our empirical observation that if an image can be accurately classified according to a particular dimension, it should consistently be classified into the same attribute multiple times. Inconsistent responses, however, suggest that the current criteria require further refinement. In such cases, we leverage the LLM’s ability to self-refine text with additional instructions Madaan et al. (2023).

The prompts we used in this step are as follows:

$P_{3}^{1}$ = "Classify the caption by the criteria listed below."

$P_{3}^{2}$ = "We are unable to classify the following image caption using the provided criteria due to missing certain attributes. Please suggest an additional attribute that would enable accurate classification."

Algorithm 4: Step 3 Criteria Self-Refinement

Input: Criteria: Criteria, Captions: $C$, Large language model: LLM
Output: Refined criteria: Criteria
1: for [dim, Attributes] in Criteria do
2:     for i in range(NumOfTests) do
3:         c = $C$.sample()
4:         Results.append(LLM([$P_{3}^{1}$, c, dim, attribute]))
5:     end for
6:     if not Consistent(Results)
7:         for i in range(NumOfSuggests) do
8:             S.append(LLM([$P_{3}^{2}$, c, dim, attribute]))
9:         end for
10:        newattribute = MajorityVote(S)
11:        Criteria[dim].append(newattribute)
12:        S.reset()
13:    end if
14:    Results.reset()
15: end for
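The self-consistency check can be sketched as follows. Following the prose description, the same caption is classified several times, and disagreement among the answers triggers a request for a new attribute. The helper names, the strict "all answers equal" consistency rule, and the two stub functions are assumptions made for illustration.

```python
from collections import Counter
from itertools import count
from typing import Callable, List


def is_consistent(results: List[str]) -> bool:
    """Assumed rule: consistent only if every repeated answer is identical."""
    return len(set(results)) == 1


def refine_attributes(
    dim: str,
    attributes: List[str],
    captions: List[str],
    classify: Callable[[str, str, List[str]], str],
    suggest: Callable[[str, str, List[str]], str],
    num_tests: int = 3,
    num_suggests: int = 3,
) -> List[str]:
    refined = list(attributes)
    for caption in captions:
        # Classify the same caption several times (prompt P_3^1 in the text).
        results = [classify(caption, dim, refined) for _ in range(num_tests)]
        if not is_consistent(results):
            # Ask for a missing attribute (prompt P_3^2) and take the top vote.
            votes = Counter(suggest(caption, dim, refined) for _ in range(num_suggests))
            new_attribute = votes.most_common(1)[0][0]
            if new_attribute not in refined:
                refined.append(new_attribute)
    return refined


_tick = count()


def fake_classify(caption: str, dim: str, attributes: List[str]) -> str:
    # Hypothetical stub: answers stably when an attribute word is present,
    # otherwise alternates between attributes to mimic an unsure model.
    for attribute in attributes:
        if attribute in caption:
            return attribute
    return attributes[next(_tick) % len(attributes)]


def fake_suggest(caption: str, dim: str, attributes: List[str]) -> str:
    return caption.split()[1]  # toy rule: the color is the second word


refined = refine_attributes(
    "color", ["black", "white"],
    ["a black cat", "a gray cat", "a white cat"],
    fake_classify, fake_suggest,
)
```

Here only "a gray cat" produces inconsistent answers, so "gray" is the single attribute added.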

3.5 Subpopulation Assignment

Equipped with the comprehensive criteria, we proceed to systematically assign each image to the specific attributes of each dimension. Images assigned to the same attributes across all dimensions form distinct subgroups within the dataset, revealing the intrinsic subpopulation structures hidden within the data. These subpopulation structures can then be leveraged to perform various downstream tasks, completing our overall pipeline for employing an LLM to analyze the dataset by discovering subpopulation structures.

The prompt we used in this step is as follows:

$P_{4}$ = "Please assign following caption to one attribute of given dimension."

Algorithm 5: Step 4 Subpopulation Assignment

Input: Captions: $C$, Large language model: LLM, Criteria: Criteria
Output: Further assignments for each caption c
1: for c in $C$
2:     for [dim, Attributes] in Criteria
3:         c.assign(LLM([$P_{4}$, c, dim, Attributes]))
4:     end for
5: end for
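The assignment step, together with the grouping of images that share the same attribute on every dimension, can be sketched as below. The prompt layout and the parsing stub `fake_llm` are hypothetical; only the prompt text $P_{4}$ comes from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

P4 = "Please assign following caption to one attribute of given dimension."


def assign_subpopulations(
    captions: List[str],
    criteria: Dict[str, List[str]],
    llm: Callable[[str], str],
) -> Dict[Tuple[str, ...], List[int]]:
    """Map each tuple of attributes (one per dimension) to the caption indices in it."""
    groups = defaultdict(list)
    for idx, caption in enumerate(captions):
        key = tuple(
            llm(f"{P4}\nCaption: {caption}\nDimension: {dim}\n"
                f"Attributes: {', '.join(attrs)}")
            for dim, attrs in criteria.items()
        )
        groups[key].append(idx)
    return dict(groups)


def fake_llm(prompt: str) -> str:
    # Hypothetical stub: picks the first listed attribute found in the caption.
    lines = prompt.split("\n")
    caption = lines[1].split(": ", 1)[1]
    attrs = lines[3].split(": ", 1)[1].split(", ")
    return next((a for a in attrs if a in caption), attrs[0])


criteria = {"color": ["black", "white"], "background": ["indoor", "outdoor"]}
captions = ["a black cat indoor", "a white cat outdoor", "a black cat indoor"]
subpops = assign_subpopulations(captions, criteria, fake_llm)
```

Each key of `subpops` corresponds to one subpopulation in the class-dimension-attribute-subpopulation structure for a single class.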

3.6 Task-specific Tuning for Downstream Tasks

Once we have identified the criteria and subpopulations within the dataset, we can leverage this information to tackle several downstream tasks effectively. This includes organizing the subpopulations, which can reveal valuable insights into the data, such as potential subpopulation biases and the presence of long-tail attributes. More importantly, the subpopulation structures can be used to improve model performance on various tasks if combined with extra operations, including handling subpopulation shifts and slice discovery.

Dataset Subpopulation Organization There are several advantages to organizing subpopulations within a dataset. It provides the number of data points in each subpopulation, reflects dataset biases, and helps identify long-tail subgroups. Recent research suggests that creating hierarchical datasets with additional image attributes is beneficial for building more complex benchmarks Liang & Zou (2022); Zhang et al. (2022). However, organizing subpopulations for such tasks currently requires significant manual effort. Our method, SSD-LLM, automates subpopulation organization, holding the potential to substantially reduce the effort of constructing hierarchical datasets.

Subpopulation Shift Subpopulation shift Yang et al. (2023), a common challenge in machine learning, occurs when the proportions of some subpopulations change between training and deployment. Previous works Yang et al. (2023); Santurkar et al. (2020) have shown that subpopulation shifts can cause significant drops in model performance, creating a need for methods that can handle them. SSD-LLM, combined with image generation, offers a way to better handle subpopulation shift scenarios. In our experiments, after applying SSD-LLM to the datasets, we count the number of images in each subpopulation and use a diffusion model to generate images for underrepresented subpopulations, which helps to balance classes and attributes. Specifically, we randomly sample attributes from the subpopulation structure for each class and employ an LLM to compose complete sentences from these words, which serve as the input prompts of the diffusion model. SSD-LLM with image generation achieves competitive results on several datasets, showcasing the effectiveness of our method.
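The counting and attribute-sampling steps described above can be sketched as follows. This is only an illustration: balancing every observed subpopulation up to the size of the largest one is an assumed policy, the function names are hypothetical, and the LLM sentence composition and diffusion calls are omitted.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def generation_plan(assignments: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Number of images to generate per subpopulation to match the largest one."""
    counts = Counter(assignments)
    target = max(counts.values())
    return {group: target - n for group, n in counts.items() if n < target}


def sample_keywords(
    cls_name: str, criteria: Dict[str, List[str]], rng: random.Random
) -> List[str]:
    """Pick one attribute per dimension; an LLM would turn these into a sentence."""
    return [cls_name] + [rng.choice(attrs) for attrs in criteria.values()]


# Toy (class, background) assignments, as produced by the assignment step.
assignments = (
    [("waterbird", "water")] * 5
    + [("waterbird", "land")]
    + [("landbird", "land")] * 4
)
plan = generation_plan(assignments)

keywords = sample_keywords(
    "waterbird",
    {"background": ["water", "land"], "pose": ["flying", "standing"]},
    random.Random(0),
)
```

Note that this sketch only sees subpopulations with at least one image; combinations that never occur in the data would have to be enumerated from the criteria separately.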

Slice Discovery Slice discovery aims to uncover subpopulations within a dataset on which a machine learning model consistently performs poorly. These underperforming subpopulations, or slices, provide valuable insights into the model’s limitations and potential biases. By identifying slices, researchers can gain a deeper understanding of where the model underperforms and obtain information about how to improve its performance on these challenging subpopulations.
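Once every validation image carries a subpopulation label, candidate slices can be ranked by their error rate. The sketch below illustrates that idea under assumptions: the function name, the `min_size` filter, and the toy data are hypothetical and are not taken from the paper's evaluation pipeline.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def slice_error_rates(
    groups: Sequence[Hashable],
    correct: Sequence[bool],
    min_size: int = 2,
) -> List[Tuple[Hashable, float]]:
    """Rank subpopulations by model error rate, skipping very small groups."""
    stats = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for group, ok in zip(groups, correct):
        stats[group][0] += 0 if ok else 1
        stats[group][1] += 1
    rates = {g: errors / total for g, (errors, total) in stats.items()
             if total >= min_size}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Toy validation set: one subpopulation label and one correctness flag per image.
groups = ["snow", "snow", "grass", "grass", "grass", "night"]
correct = [False, False, True, True, False, False]
ranked = slice_error_rates(groups, correct)
```

The `min_size` filter avoids reporting a slice on the basis of a single misclassified image.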

4 Experiments

We now present experimental results demonstrating the effectiveness of SSD-LLM. In particular, we present the main settings and results in this section and defer extra details, including the precise text prompts used and various visualization results, to the appendix. In our experiments, we mainly use LLaMA2-Accessory Zhang et al. (2023) as the MLLM and GPT-4 Achiam et al. (2023) as the LLM, while ablation studies investigating other choices of foundation models are included in the appendix.

Our approach was benchmarked against the most current baselines in tasks including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery, where it demonstrated competitive performance. This success underscores the method’s efficacy in identifying and analyzing subgroups, further affirming its utility in addressing related challenges. Moreover, it illustrates that the unified paradigm can effectively address a variety of downstream tasks.

Dataset	Criterion	SCAN	IC|TC	Ours
Stanford 40 Action	Action	0.397	0.774	0.752
	Location	0.359*	0.822*	0.807*
	Mood	0.250*	0.793*	0.682*
Place365	Place	0.332	-	0.523

Table 1: Quantitative results of Dataset Subpopulation Organization. Accuracies labeled with * are evaluated against human-provided ground-truth labels, as introduced in Kwon et al. (2023b). We consider IC|TC an upper bound because it involves human judgment throughout the process; in particular, it manually specifies the dimensions.

4.1 Evaluation on Dataset Subpopulation Organization

Setup The data assignment process facilitated by SSD-LLM, which leverages the discovered subgroup structure, can be considered a form of clustering. To evaluate the quality of these identified subpopulations, we assess the clustering accuracy by comparing images against secondary labels that reflect subgroup attributes derived via SSD-LLM. Full information for datasets, text criterion, and model selection can be found in appendix.

Comparison Methods SCAN Van Gansbeke et al. (2020) is a two-stage clustering method that decouples feature learning and clustering. IC|TC Kwon et al. (2023b) is a new paradigm for image clustering that supports human interaction. It utilizes the given text criteria to accurately control the quality of the clustering results.

Results and Analysis In Table 1, we report the accuracy achieved by each method under the predefined textual criteria. SSD-LLM outperforms the SCAN baseline on every criterion, with an average improvement of 35.6 percentage points. This enhancement is attributed to the effective subpopulation structures uncovered by our approach. When compared to IC|TC, SSD-LLM demonstrates comparable performance. It is important to note that IC|TC is considered an experimental upper bound because it incorporates human judgment, manually refining text criteria for improved clustering results; in contrast, our approach is fully automated. Moreover, SSD-LLM offers greater flexibility by not requiring the pre-specification of cluster centers, thereby eliminating the constraints imposed by the number of ground truth (GT) labels and leveraging the full potential of the LLM for attribute discovery and assignment. We also visualize our organized subpopulations, as shown in Figure 4. The images and dimensions are visibly consistent, indicating the effectiveness and soundness of our mining and assignment process.
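As a sanity check, the average gap between our method and SCAN can be recomputed directly from the four rows of Table 1. Reading the reported 35.6% figure as this gap in accuracy points is our interpretation of the table, not something the text states explicitly; the variable names are illustrative.

```python
# Accuracies copied from Table 1, in row order:
# Stanford 40 Action (Action, Location, Mood), then Place365 (Place).
scan = [0.397, 0.359, 0.250, 0.332]
ours = [0.752, 0.807, 0.682, 0.523]
# IC|TC reports no Place365 result, so only the first three rows are comparable.
ictc = [0.774, 0.822, 0.793]

gap_vs_scan = sum(o - s for o, s in zip(ours, scan)) / len(scan)
gap_vs_ictc = sum(o - i for o, i in zip(ours[:3], ictc)) / len(ictc)

print(f"average gap vs SCAN:  {gap_vs_scan:.4f}")
print(f"average gap vs IC|TC: {gap_vs_ictc:.4f}")
```

The gap over SCAN comes out at about 0.3565, in line with the reported figure, while the gap to IC|TC is slightly negative, matching the description of IC|TC as an upper bound.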

Type	Method	Average Accuracy			Worst Group Accuracy
Type	Method	Waterbirds	Metashift	Average	Waterbirds	Metashift	Average
Vanilla	ERM	84.1	91.2	87.7	69.1	82.1	75.6
Subgroup Robust Methods	GroupDRO	86.9	91.5	89.2	73.1	83.1	78.1
	JTT	88.9	91.2	90.1	71.2	82.6	76.9
	LfF	86.6	80.4	83.5	75.0	72.3	73.7
	LISA	89.2	91.4	90.3	77.0	79.0	78.0
Imbalanced Learning	Resample	86.2	92.2	89.2	70.0	81.0	75.5
	Reweight	86.2	91.5	88.9	71.9	83.1	77.5
	Focal	89.3	91.6	90.5	71.6	81.0	76.3
	CBLoss	86.8	91.4	89.1	74.4	83.1	78.8
	BSoftmax	88.4	91.3	89.9	74.1	82.6	78.4
Traditional Data Augmentation	Mixup	89.2	91.4	90.3	77.5	79.0	78.3
Traditional Data Augmentation	RandAug	86.3	90.9	88.6	71.4	80.9	76.2
Diffusion	Class Prompt	85.9	91.5	88.7	71.3	82.7	77.0
	Class-Attribute Prompt	89.1	91.4	90.3	73.5	83.8	78.7
	CiP	88.0	91.1	89.6	73.5	82.4	77.9
LLM+Diffusion	SSD-LLM (Ours)	90.1	92.3	91.2	78.3	84.2	81.3

Table 2: Numerical comparison of methods and their performances for image classification with subpopulation shifts.

4.2 Evaluation on Subpopulation Shift

Setup We evaluate subpopulation shifts on two commonly used image datasets, Metashift (Cats vs Dogs) and Waterbirds (Landbirds vs Waterbirds). We choose Average Accuracy and Worst Group Accuracy as evaluation metrics. To ensure a fair comparison, following Yang et al. (2023), we conduct a random search of 16 trials over a joint distribution of all hyperparameters. We then use the validation set to select the best hyperparameters for each algorithm, fix them, and rerun the experiments under five different random seeds to report the final average results. To make the evaluation more realistic, we adopt the model selection setting in which attributes are unknown in both training and validation.
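The two metrics can be illustrated with a short sketch. Worst group accuracy is the minimum accuracy over groups, where a group is typically a (class, attribute) pair; the helper name and the toy labels below are hypothetical.

```python
from collections import defaultdict
from typing import Hashable, Sequence, Tuple


def average_and_worst_group_accuracy(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    groups: Sequence[Hashable],
) -> Tuple[float, float]:
    """Return (overall accuracy, minimum per-group accuracy)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        hits[g] += int(t == p)
        totals[g] += 1
    average = sum(hits.values()) / sum(totals.values())
    worst = min(hits[g] / totals[g] for g in totals)
    return average, worst


# Toy example with three groups of two samples each.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
groups = ["a", "a", "b", "b", "c", "c"]
average_acc, worst_acc = average_and_worst_group_accuracy(y_true, y_pred, groups)
```

In this toy case the overall accuracy is 4/6 while the worst group reaches only 0.5, which shows how a reasonable average can hide a weak subpopulation.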

Comparison Methods Following recent benchmarking efforts Yang et al. (2023), we compare SSD-LLM with several types of methods: (1) Vanilla: ERM Vapnik (1999), (2) Subgroup Robust Methods: GroupDRO Sagawa et al. (2019), LfF Nam et al. (2020), JTT Liu et al. (2021), LISA Yao et al. (2022), (3) Imbalanced Learning: Resample Japkowicz (2000), Reweight Japkowicz (2000), Focal Lin et al. (2017), CBLoss Cui et al. (2019), BSoftmax Ren et al. (2020), (4) Traditional Data Augmentation: Mixup Zhang et al. (2017), RandAug Cubuk et al. (2020), (5) Diffusion: Class Prompt Shipard et al. (2023), Class-Attribute Prompt Shipard et al. (2023), CiP Lei et al. (2023).

Results and Analysis As shown in Table 2, SSD-LLM achieves better performance than previous methods, with improvements of +0.7 in average accuracy and +2.5 in worst group accuracy, averaged across the two datasets. Further analysis is conducted from three perspectives: (1) Despite being based on conventional ERM, data-based approaches achieve overall competitive performance compared to model-based algorithms, demonstrating the potential of this paradigm. (2) For diffusion-based methods, the better performance of class-attribute prompts over class prompts reveals the critical role of knowing the imbalanced attributes of a dataset when alleviating subpopulation shift via image generation, primarily due to more targeted generated images. However, class-attribute prompts require manually annotated imbalanced attributes to be identified in advance, which underscores the value of SSD-LLM. Our method not only automates the discovery of dataset attributes but also provides more detailed annotations than manual labeling, leading to enhanced performance. (3) The better performance of CiP over Class Prompt highlights the significance of diverse text prompts. The further gain of our method can be attributed to its comprehensive analysis of the imbalanced subpopulation distribution inherent in the dataset. Therefore, the text diversity in our approach (brought by attribute sampling and LLM sentence composition) is not arbitrary but strategic, directly targeting the subpopulation shift.

Method|Categories	Boats	Birds	Cars	Cats	Dogs	Trucks	Topic Pass Rate
ImageNet	4.33	0.81	11.33	11.14	0.69	11.71	6.72
General Prompt	39.85	19.00	31.00	19.75	20.50	24.81	25.81
Domino(Bert)	42.34	23.21	29.87	24.56	26.86	32.19	29.83
SSD-LLM (Ours)	53.31	33.67	37.34	33.69	21.08	39.57	36.44

Table 3: Results of slice discovery on ImageNet-1K with various SDMs. The best accuracy is highlighted in bold, while the second one is underlined.

4.3 Evaluation on Slice Discovery

Setup In contrast to typical slice discovery tasks, we redefine the evaluation pipeline following Gao et al. (2023a), which better showcases the strengths of our approach. In this study, we evaluate the bugs found in ImageNet models. Specifically, a classification is deemed incorrect when an image containing the target object is erroneously identified by the model as containing an unrelated object.

Comparison Methods Domino Eyuboglu et al. (2022) represents a state-of-the-art method in slice discovery, which effectively clusters errors identified in the validation set and characterizes them through captions generated automatically.

Results and Analysis We evaluate the effectiveness of our method in slice discovery on 6 representative superclasses of ImageNet with a ResNet50 He et al. (2016) backbone. As shown in Table 3, SSD-LLM outperforms the other SDMs, including Domino Eyuboglu et al. (2022), by a significant margin on average and on five of the six superclasses (Domino is higher on Dogs). We find that the topics given by Domino tend to suffer from two unsatisfactory cases: loss of semantics and missing class information. These cases are also discussed in Gao et al. (2023a), who attribute them to the inherent difficulties of automatic SDMs. However, our results suggest that SSD-LLM can handle these problems while keeping a high rate of discovered errors and maintaining automation. Specifically, SSD-LLM achieves an average error rate of 36.44%, surpassing the general prompt and Domino Eyuboglu et al. (2022) by 10.63 and 6.61 percentage points, respectively. Furthermore, when we trace the discovered slices back to the original dataset, they are also highly consistent (detailed visualizations are included in the appendix). Interestingly, the data mining process of SSD-LLM resembles that of human data scientists, who form hypotheses and improve model performance by inspecting batches of poorly performing subpopulations. These experiments show that our method overcomes the inherent difficulties while maintaining automation, paving the way for data-centric methods.

4.4 Ablation Study

Hyperparameters of Criteria Initialization/Refinement In Table 4, the notation N $\times$ M | 1/T denotes the number of suggestion rounds (N), the number of samples per suggestion round (M), and the majority-vote threshold (1/T). A is the main setting of our method. For A $\rightarrow$ B, the performance drop indicates the effectiveness of Criteria Refinement, which yields a more comprehensive subpopulation structure. For A $\rightarrow$ C, we double the suggestion rounds for dimensions and performance remains basically unchanged, illustrating that the dimensions of a dataset are relatively stable. For A $\rightarrow$ D and A $\rightarrow$ E, performance decreases slightly, demonstrating the need for suitable hyperparameters in attribute analysis. For A $\rightarrow$ F, the performance drop indicates the importance of high-quality attribute selection.

Methods	Index	Dimension (Criteria Initialization)	Attribute (Criteria Initialization)	Criteria Refinement	Average Accuracy	Worst Group Accuracy
ERM	—	—	—	—	84.1	69.1
SSD-LLM	A	10 $\times$ 50 \| 1/4	20 $\times$ 20 \| 1/6	✓	90.1	78.3
	B	10 $\times$ 50 \| 1/4	20 $\times$ 20 \| 1/6	✗	88.3	77.2
	C	20 $\times$ 50 \| 1/4	20 $\times$ 20 \| 1/6	✓	90.3	78.4
	D	10 $\times$ 50 \| 1/4	10 $\times$ 40 \| 1/6	✓	89.3	77.6
	E	10 $\times$ 50 \| 1/4	40 $\times$ 10 \| 1/6	✓	89.5	78.0
	F	10 $\times$ 50 \| 1/4	20 $\times$ 20 \| 1/20	✓	89.0	77.5

Table 4: Ablation study on the number of suggestion rounds and samples per round (rounds $\times$ samples per round), the majority-voting ratio, and criteria refinement (with or without).
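To illustrate how a majority-vote threshold of the form 1/T can filter suggestions gathered over several rounds, the sketch below keeps a suggestion only if it appears in at least that fraction of the rounds. This counting rule is an assumption made for illustration; the function name and the example suggestions are hypothetical, and in SSD-LLM the suggestions would come from LLM queries over sampled captions rather than from a fixed list.

```python
from collections import Counter


def majority_vote(suggestions_per_round, threshold):
    """Keep suggestions proposed in at least `threshold` fraction of the rounds."""
    n_rounds = len(suggestions_per_round)
    counts = Counter()
    for round_suggestions in suggestions_per_round:
        # Count each suggestion at most once per round.
        counts.update(set(round_suggestions))
    return sorted(s for s, c in counts.items() if c / n_rounds >= threshold)


# Hypothetical dimension suggestions from four rounds.
rounds = [
    ["background", "color"],
    ["background", "pose"],
    ["background", "color", "lighting"],
    ["color", "background"],
]

kept = majority_vote(rounds, threshold=1 / 2)
print(kept)
```

A stricter threshold keeps only suggestions that recur across rounds, while a very loose one (for example 1/20 in setting F) admits suggestions that appear rarely, which is consistent with the drop attributed to lower-quality attribute selection.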

Diffusion Generation Strategy Table 5 analyzes the key components of using a diffusion model to address subpopulation shift. For A $\rightarrow$ B, we verify that knowledge of the imbalanced attributes within a dataset is helpful for subpopulation shift tasks. For B $\rightarrow$ D, we show that a comprehensive subpopulation structure with diverse attributes is more beneficial for these tasks. For B $\rightarrow$ C and D $\rightarrow$ E, we demonstrate the effectiveness of the balanced subpopulation sampling strategy, which balances the subpopulation distribution of the dataset and improves data quality for model training. For E $\rightarrow$ F, prompts produced by LLM sentence-making generate more reasonable images and thus achieve better results. For F $\rightarrow$ G and F $\rightarrow$ H, we verify the scaling capability of SSD-LLM.

Methods	Index	Attribute Mode	Sample Mode	SD Prompt Mode	Number	Average Accuracy	Worst Group Accuracy
ERM	——	——	——	——	×1	84.1	69.1
SSD-LLM	A	GT Attribute Unknown	random	Direct	×1	85.9	71.3
	B	GT Attribute Known	random	Direct	×1	89.1	73.5
	C	GT Attribute Known	weighted	Direct	×1	89.3	73.8
	D	SSD-LLM Attribute	random	Direct	×1	89.5	76.2
	E	SSD-LLM Attribute	weighted	Direct	×1	89.8	77.4
	F	SSD-LLM Attribute	weighted	LLM Sentence	×1	90.1	78.3
	G	SSD-LLM Attribute	weighted	LLM Sentence	×2	91.1	79.2
	H	SSD-LLM Attribute	weighted	LLM Sentence	×4	91.4	79.6

Table 5: Ablation study on the sample mode and SD prompt mode.
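The "weighted" sample mode can be illustrated with a small sketch in which attributes are sampled with probability inversely proportional to their frequency in the dataset, so that rare subpopulations are generated more often. The inverse-frequency rule, the function names, and the prompt template are assumptions made for illustration; the template corresponds loosely to a direct prompt and does not involve LLM sentence-making.

```python
import random
from collections import Counter


def inverse_frequency_weights(assignments):
    """Normalized sampling weights that favor rare attribute values."""
    counts = Counter(assignments)
    raw = {attr: 1.0 / n for attr, n in counts.items()}
    total = sum(raw.values())
    return {attr: w / total for attr, w in raw.items()}


def sample_prompts(class_name, weights, k, seed=0):
    """Draw k attribute values according to `weights` and fill a simple template."""
    rng = random.Random(seed)
    attrs = list(weights)
    chosen = rng.choices(attrs, weights=[weights[a] for a in attrs], k=k)
    # Hypothetical direct-style template; not the prompt used in the paper.
    return [f"a photo of a {class_name}, {attr}" for attr in chosen]


# Hypothetical background attribute assigned to 100 waterbird images.
backgrounds = ["on water"] * 90 + ["on land"] * 10
weights = inverse_frequency_weights(backgrounds)
prompts = sample_prompts("waterbird", weights, k=4)

print(weights)
print(prompts)
```

With a 90/10 split, the rare background receives a sampling weight of about 0.9, so most generated images fall into the underrepresented subpopulation.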

5 Conclusion

In this work, we present the first systematic exploration of subpopulation structure discovery. We provide a precise definition of subpopulation structure and introduce fine-grained criteria to determine the structures. We propose a novel method, SSD-LLM, for automatic subpopulation structure discovery, which incorporates elaborate prompt engineering techniques. SSD-LLM can be combined with various subsequent operations to tackle several downstream tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery, with minor task-specific tuning. We have conducted extensive experiments to verify the proposed method, demonstrating the remarkable effectiveness and generality of SSD-LLM.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Chen et al. (2023) Muxi Chen, Yu Li, and Qiang Xu. Hibug: On human-interpretable model debug. 2023.
Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 702–703, 2020.
Cui et al. (2023) Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. Chatlaw: Open-source legal large language model with integrated external knowledge bases. arXiv preprint arXiv:2306.16092, 2023.
Cui et al. (2019) Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9268–9277, 2019.
Dai et al. (2023) Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. Why can gpt learn in-context? language models implicitly perform gradient descent as meta-optimizers. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023.
Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
Dunlap et al. (2023) Lisa Dunlap, Alyssa Umino, Han Zhang, Jiezhi Yang, Joseph E Gonzalez, and Trevor Darrell. Diversify your vision datasets with automatic diffusion-based augmentation. arXiv preprint arXiv:2305.16289, 2023.
Eyuboglu et al. (2022) Sabri Eyuboglu, Maya Varma, Khaled Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, and Christopher Ré. Domino: Discovering systematic errors with cross-modal embeddings. arXiv preprint arXiv:2203.14960, 2022.
Fu et al. (2023) Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023.
Gao et al. (2023a) Irena Gao, Gabriel Ilharco, Scott Lundberg, and Marco Tulio Ribeiro. Adaptive testing of computer vision models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4003–4014, 2023a.
Gao et al. (2023b) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023b.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
Japkowicz (2000) Nathalie Japkowicz. The class imbalance problem: Significance and strategies. In Proc. of the Int’l Conf. on artificial intelligence, volume 56, pp. 111–117, 2000.
Kim et al. (2023) Younghyun Kim, Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jaeho Lee, and Jinwoo Shin. Bias-to-text: Debiasing unknown visual biases through language interpretation. arXiv preprint arXiv:2301.11104, 2023.
Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS, 35:22199–22213, 2022.
Kwon et al. (2023a) Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K Ryu, and Kangwook Lee. Image clustering conditioned on text criteria. 2023a.
Kwon et al. (2023b) Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K Ryu, and Kangwook Lee. Image clustering conditioned on text criteria. arXiv preprint arXiv:2310.18297, 2023b.
Lei et al. (2023) Shiye Lei, Hao Chen, Sen Zhang, Bo Zhao, and Dacheng Tao. Image captions are natural prompts for text-to-image models. arXiv preprint arXiv:2307.08526, 2023.
Li & Qiu (2023) Xiaonan Li and Xipeng Qiu. Mot: Pre-thinking and recalling enable chatgpt to self-improve with memory-of-thoughts. arXiv preprint arXiv:2305.05181, 2023.
Liang & Zou (2022) Weixin Liang and James Zou. Metashift: A dataset of datasets for evaluating contextual distribution shifts and training conflicts. arXiv preprint arXiv:2202.06523, 2022.
Lin et al. (2017) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988, 2017.
Liu et al. (2021) Evan Z Liu, Behzad Haghgoo, Annie S Chen, Aditi Raghunathan, Pang Wei Koh, Shiori Sagawa, Percy Liang, and Chelsea Finn. Just train twice: Improving group robustness without training group information. In International Conference on Machine Learning, pp. 6781–6792. PMLR, 2021.
Liu et al. (2023a) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023a.
Liu et al. (2023b) Jiaxiang Liu, Tianxiang Hu, Yan Zhang, Xiaotang Gai, Yang Feng, and Zuozhu Liu. A chatgpt aided explainable framework for zero-shot medical image diagnosis. arXiv preprint arXiv:2307.01981, 2023b.
Liu et al. (2023c) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023c.
Luo et al. (2023) Yifan Luo, Yiming Tang, Chengfeng Shen, Zhennan Zhou, and Bin Dong. Prompt engineering through the lens of optimal control. arXiv preprint arXiv:2310.14201, 2023.
Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651, 2023.
Nam et al. (2020) Junhyun Nam, Hyuntak Cha, Sungsoo Ahn, Jaeho Lee, and Jinwoo Shin. Learning from failure: De-biasing classifier from biased classifier. Advances in Neural Information Processing Systems, 33:20673–20684, 2020.
Park et al. (2023) Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
Qian et al. (2023) Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
Qiao et al. (2022) Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597, 2022.
Ren et al. (2020) Jiawei Ren, Cunjun Yu, Xiao Ma, Haiyu Zhao, Shuai Yi, et al. Balanced meta-softmax for long-tailed visual recognition. Advances in neural information processing systems, 33:4175–4186, 2020.
Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
Sagawa et al. (2019) Shiori Sagawa, Pang Wei Koh, Tatsunori B Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. arXiv preprint arXiv:1911.08731, 2019.
Santurkar et al. (2020) Shibani Santurkar, Dimitris Tsipras, and Aleksander Madry. Breeds: Benchmarks for subpopulation shift, 2020.
Shipard et al. (2023) Jordan Shipard, Arnold Wiliem, Kien Nguyen Thanh, Wei Xiang, and Clinton Fookes. Diversity is definitely needed: Improving model-agnostic zero-shot classification via stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 769–778, 2023.
Van Gansbeke et al. (2020) Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, and Luc Van Gool. Scan: Learning to classify images without labels. In European conference on computer vision, pp. 268–285. Springer, 2020.
Vapnik (1999) Vladimir N Vapnik. An overview of statistical learning theory. IEEE transactions on neural networks, 10(5):988–999, 1999.
Wah et al. (2011) Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Xie et al. (2021) Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080, 2021.
Yang et al. (2023) Yuzhe Yang, Haoran Zhang, Dina Katabi, and Marzyeh Ghassemi. Change is hard: A closer look at subpopulation shift, 2023.
Yao et al. (2022) Huaxiu Yao, Yu Wang, Sai Li, Linjun Zhang, Weixin Liang, James Zou, and Chelsea Finn. Improving out-of-distribution robustness via selective augmentation. In International Conference on Machine Learning, pp. 25407–25437. PMLR, 2022.
Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Zhang et al. (2023) Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, and Yu Qiao. Llama-adapter: Efficient finetuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
Zhang et al. (2024) Rongyu Zhang, Zefan Cai, Huanrui Yang, Zidong Liu, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Baobao Chang, Yuan Du, et al. Vecaf: Vlm-empowered collaborative active finetuning with training objective awareness. arXiv preprint arXiv:2401.07853, 2024.
Zhang et al. (2022) Xingxuan Zhang, Yue He, Renzhe Xu, Han Yu, Zheyan Shen, and Peng Cui. Nico++: Towards better benchmarking for domain generalization, 2022.
Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.