Article

Improving Text Classification with Large Language Model-Based Data Augmentation

1 Data Science and Engineering, The University of Tennessee, Knoxville, TN 37996, USA
2 Department of Information Science, The University of North Texas, Denton, TX 76203, USA
3 Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
4 Computational Sciences and Engineering, The University of North Texas, Denton, TX 76203, USA
5 Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(13), 2535; https://doi.org/10.3390/electronics13132535
Submission received: 30 April 2024 / Revised: 17 June 2024 / Accepted: 25 June 2024 / Published: 28 June 2024
(This article belongs to the Special Issue AI Test)

Abstract

Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch. However, it is unclear which method is better without comparing their effectiveness. This study investigates the application of both methods to two datasets: a general-topic dataset (the Reuters news data) and a domain-specific dataset (the Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently enhanced the model’s classification results for both datasets. 2. Generating new data generally outperforms rewriting existing data, though crafting the prompts carefully is crucial to extract the most valuable information from ChatGPT, particularly for domain-specific data. 3. The size of the augmentation data affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples per label. 4. Combining rewritten samples with newly generated samples can potentially further improve the model’s performance.

1. Introduction

The classification of natural language texts is a highly researched topic in the fields of artificial intelligence (AI) and machine learning (ML). Since the emergence of deep learning, applications such as automatic data collection, filtering, and curation have improved significantly. Recent developments in self-attention-based language models [1,2,3] have made significant strides and have had a profound impact on our daily lives. In essence, ML models rely heavily on the corpus of training data, so their ability to produce accurate inferences is limited to the knowledge included in that data. In scenarios where the dataset is imbalanced, with hundreds or thousands of training samples for certain labels and few or zero training samples for others, machine learning models typically generate acceptable predictions for the majority classes but struggle to make accurate predictions for the minority classes. Data Augmentation (DA), which aims to increase the volume, quality, and diversity of training data, has emerged as an effective technique to address this issue, and numerous studies and efforts have been made thus far [4,5,6,7,8]. Earlier DA methods usually obtain augmentation data by manipulating the original training data through techniques such as random deletion, insertion, swapping, synonym replacement [9,10,11,12], and back translation [13]. However, with the advent of recently developed large language models (LLMs) that exhibit advanced language understanding and text generation capabilities, researchers can leverage them to rewrite, rephrase, or summarize the training data or to generate entirely new samples as augmentation data [14]. Sarker et al. [15] instructed ChatGPT to rewrite clinical notes to improve both medication identification and medication event classification. Yuan et al. [16] asked ChatGPT to rewrite clinical notes to improve compatibility between electronic health records (EHRs) and clinical trial descriptions. Cohen et al. [17] combined back translation with GPT-3-rewritten samples to enhance social network hate detection. Dai et al. [18] tasked ChatGPT with rewriting each sentence in the training samples into multiple conceptually similar but semantically different samples, which were then used as augmentation data to aid the classification of the target dataset. Yoo et al. [19] randomly selected samples (sentences with their corresponding labels) from the original training dataset and embedded them into the prompt. Following these prompts, GPT-3 first generated sentences influenced by the samples and then assigned soft labels to the generated sentences. The generated samples were then used as augmentation data for classification tasks.
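As a concrete illustration of these earlier, manipulation-based DA techniques, the following toy Python sketch applies random deletion, random swap, and synonym replacement to a tokenized sentence. It is not the reference EDA implementation [4]; the operations, probabilities, and synonym table are illustrative assumptions.

```python
import random

def manipulate(tokens, synonyms, p=0.1):
    """Toy illustration of manipulation-based DA: random deletion,
    random swap, and synonym replacement applied to one token list."""
    out = [t for t in tokens if random.random() > p] or list(tokens)  # random deletion
    if len(out) > 1:                                                  # random swap
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return [random.choice(synonyms[t]) if t in synonyms and random.random() < p else t
            for t in out]                                             # synonym replacement

print(manipulate("the central bank raised interest rates".split(),
                 {"raised": ["lifted", "increased"]}))
```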
Although there are various ways to instruct an LLM to generate the desired data, these methods fall into two broad categories: rewriting the original training data and generating entirely new data from scratch. Rewritten samples are more similar to the original dataset, while newly generated samples infuse new information (features) into the training dataset. It remains unclear which method benefits the model more; in previous studies, authors usually applied one method without comparing it with the other. To maximize the effectiveness of LLM-based DA, this study conducts experiments with both DA methods. Intuitively, for some domain-specific topics it is hard for an LLM to synthesize samples from scratch. Therefore, we chose one general-topic dataset, the Reuters news data, and one domain-specific dataset, the Mitigation dataset, to perform the experiments.
The primary contribution of this study is an analysis of two main LLM-based DA methods for enhancing the classification of imbalanced datasets. To be more specific:
  • We conduct experiments with the two main LLM-based DA methods, rewriting samples and generating entirely new samples using ChatGPT, on both general and domain-specific datasets.
  • We further investigate the optimal number of newly generated samples for DA.
  • We propose combining new samples with rewritten samples to further improve the classification results for minority classes.
Section 2 provides a detailed overview of the approach, including the data, the classification model, and the performance measures. Comparative experimental results are presented in Section 3, and Section 4 discusses the findings and the potential for further improvement. This paper is a substantially extended version of the IEEE AITest 2023 conference paper “Enhancing Text Classification Models with Generative AI-aided Data Augmentation” [20].

2. Materials and Methods

We conducted an experimental study to evaluate the effectiveness of two main LLM-based DA methods for enhancing a text classification model’s performance on two datasets. This section provides details about the ML model we tested and the experimental design employed in the study.

2.1. Dataset

For the study, we utilized the Reuters news data and the Mitigation dataset.

2.1.1. Reuters News Dataset

The Reuters corpus is provided by the Natural Language Toolkit (NLTK) [21] Python library. The corpus consists of 10,788 news articles with a total of 1.3 million words, and it contains pre-defined “training” and “test” sets with 7769 and 3019 cases, respectively. Note that we randomly held out 10% of the training samples to validate model training. Each news article belongs to one or more of the 90 pre-defined categories, so the corpus forms a multi-label dataset for text classification, with each article carrying between one and fifteen labels. However, this is a long-tail imbalanced dataset, as the number of samples (articles) per label (topic) varies greatly, ranging from 1 to 2877. Figure 1 displays the count of samples for each label as a bar plot. The number of words in each article ranges from 2 to 1316, with an average of 130 words per article. The distribution of article lengths is shown in Figure 2.
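For reference, the corpus and its pre-defined split can be loaded directly from NLTK. The snippet below is a minimal sketch of how the label distribution can be inspected; the exact preprocessing used in our pipeline may differ.

```python
import collections
import nltk
from nltk.corpus import reuters

nltk.download("reuters")  # fetch the corpus if it is not installed locally

# The pre-defined split is encoded in the file ids ("training/..." vs. "test/...").
train_ids = [f for f in reuters.fileids() if f.startswith("training/")]
test_ids = [f for f in reuters.fileids() if f.startswith("test/")]
print(len(train_ids), len(test_ids))  # 7769 training and 3019 test articles

# Each article can carry several of the 90 topics (multi-label, long-tailed).
label_counts = collections.Counter(
    cat for f in train_ids for cat in reuters.categories(f)
)
print(label_counts.most_common(5))  # e.g. 'earn' dominates the distribution
```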

2.1.2. Mitigation Dataset

Environmental mitigation strategies, including water quality standards, fish passage infrastructure, and species conservation, are critical for the ecologically sustainable advancement of hydropower resources. The Federal Energy Regulatory Commission (FERC) requires these mitigations during the licensing procedure for non-federal hydropower projects [22], highlighting the need for unified, countrywide data regarding these mandates. Hydropower licensing documents serve as comprehensive reservoirs of scientific data, encapsulating critical information on environmental conditions key to the sustainable progression of hydropower resources. Each license document extends beyond 15,000 words and involves 135 class labels that must be identified.
Identification and collation of the environmental mitigation data have historically been conducted by human experts possessing profound scientific knowledge in the respective field. Nevertheless, the manual curation of this information poses a significant challenge, given the extensive nature of each licensing document and the large quantity of mitigation labels requiring identification. The implementation of Natural Language Processing (NLP) models can potentially ease the burden of manual labor and also decrease the variability of annotation due to differences between individual observers.
In this study, a trained analyst annotated 1869 segments, each of which corresponds to at least one of 93 of the 135 mitigation categories. The analyst identified sentences and paragraphs related to mitigation terms and mapped them to their respective mitigation IDs. These segments were extracted from mitigation license documents issued between 2014 and 2017. Note that each segment may include specific terminology and phrases that detail the requirements of environmental mitigation plans, which often leads to the allocation of multiple mitigation IDs to a single segment. Consequently, the resulting label distribution is inherently imbalanced, posing additional complexities for the implementation of machine learning models.

2.2. Machine Learning Model for Text Comprehension and Classification

Given that the datasets are multi-label, the model must be able to assign multiple labels to a single sample. The output nodes of the final decision layer are therefore equipped with a sigmoid activation function, and binary cross-entropy is used as the optimization objective during back-propagation. To implement the proposed approach of augmenting the training data for natural language text classification, we utilized BERT. Bidirectional encoder representations from transformers (BERT) [24] is currently among the most successful ML models for NLP and has achieved superb classification accuracy across many applications. BERT applies multiple layers of the self-attention mechanism to identify keywords that characterize documents at scale, and we apply a fully-connected layer on top to make the final inferences. The model was implemented with the PyTorch [23] platform on Python 3.10, using the pre-trained bert_base_uncased model from the HuggingFace [25] library, which is widely recognized for its high performance in NLP tasks.
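A minimal sketch of such a multi-label BERT classifier is shown below. It assumes the HuggingFace transformers and PyTorch APIs; the exact head, hyperparameters, and training details used in our experiments may differ.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

NUM_LABELS = 90  # 90 Reuters topics; the Mitigation dataset uses its own label count

class MultiLabelBert(nn.Module):
    """bert-base-uncased encoder with a fully-connected decision layer.
    Sigmoid + binary cross-entropy turn the task into NUM_LABELS
    independent yes/no decisions, as required for multi-label data."""

    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, NUM_LABELS)

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        return self.classifier(pooled)  # raw logits; sigmoid is applied inside the loss

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = MultiLabelBert()
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy

batch = tokenizer(["Oil prices rose sharply after the announcement."],
                  padding=True, truncation=True, max_length=512, return_tensors="pt")
targets = torch.zeros(1, NUM_LABELS)
targets[0, 5] = 1.0  # illustrative multi-hot label vector
loss = loss_fn(model(batch["input_ids"], batch["attention_mask"]), targets)
```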

2.3. Augmenting Generated Data to the Text Classification

2.3.1. Obtain Augmentation Data from ChatGPT

We utilized two approaches to obtain augmentation datasets from ChatGPT (GPT-3.5): (1) asking ChatGPT to generate new samples from scratch according to the given labels, and (2) asking ChatGPT to rewrite samples from the training data. For the Reuters dataset, we directly instructed ChatGPT to write a new article on the given topic (Appendix A). For the Mitigation dataset, however, we performed prompt engineering to obtain the desired augmentation data. The mitigation classification system consists of six Tier 1 (T1) categories, twenty Tier 2 (T2) categories, and a total of 135 sub-categories (Tier 3) [26]. Each mitigation category is assigned a unique six-digit ID, and our task is to predict the Tier 3 mitigation IDs for given text data. When we instructed ChatGPT to write a sample for a single Tier 3 mitigation ID, the generated samples usually mixed in other mitigation IDs under the same Tier 2 ID, which introduced too much noise into the augmentation data. For example, there are four Tier 3 IDs under the Tier 2 ID “Riparian”: “Riparian habitat monitoring or planning, Establish riparian buffers, Riparian habitat enhancement, Dust control and abatement”. Since these four mitigation requirements concern the same topic, “Riparian”, giving only “Establish riparian buffers” to ChatGPT produced text mixing the four requirements together. However, we found that when all the mitigation requirements under one Tier 2 ID are listed for ChatGPT, it is able to generate samples for each label separately. Here is the prompt we used: “{list of mitigation requirements} is a list of mitigation requirements at hydropower project, write a paragraph for each of them as if they are extracted from the requirement license. The format of the answer should be like A: B where A represents the mitigation requirement, B represents the corresponding generated text. Do not change the mitigation requirement name in the list”. The full prompts are presented in Appendix A.
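The snippet below is a sketch of how such a prompt could be issued programmatically. The augmentation data were obtained from ChatGPT (GPT-3.5); the openai Python client, model name, and response parsing shown here are illustrative assumptions rather than the exact procedure used.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

# Tier 3 requirements under the Tier 2 ID "Riparian" (the example from the text)
riparian_t3 = ["Riparian habitat monitoring or planning", "Establish riparian buffers",
               "Riparian habitat enhancement", "Dust control and abatement"]

prompt = (f"{riparian_t3} is a list of mitigation requirements at hydropower project, "
          "write a paragraph for each of them as if they are extracted from the "
          "requirement license. The format of the answer should be like A: B where A "
          "represents the mitigation requirement, B represents the corresponding "
          "generated text. Do not change the mitigation requirement name in the list")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
answer = response.choices[0].message.content

# Split the "A: B" lines back into (mitigation requirement, generated paragraph) pairs.
pairs = [line.split(":", 1) for line in answer.splitlines() if ":" in line]
```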
To ensure consistency, we applied the same text data pre-processing pipeline, which included tokenization and vectorization, as used in the training corpus from the Reuters dataset.

2.3.2. Integrate the Generated Data to the Model

To incorporate augmentation data into our model, we added an augmentation-data training step after each batch of the original data. During a given training update within each epoch, the following steps are taken (a minimal code sketch is given after the list):
  • The binary cross-entropy loss is calculated from the given minibatch of training samples and backpropagation is performed.
  • A minibatch of augmentation samples is randomly selected, the binary cross-entropy loss is calculated, and backpropagation is performed.
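A minimal PyTorch sketch of this interleaved update is shown below; model, loss_fn, train_loader, and aug_loader are assumed to be defined elsewhere (for example, the multi-label BERT model and BCE loss sketched in Section 2.2), and the optimizer, learning rate, and number of epochs are illustrative.

```python
import itertools
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative settings
num_epochs = 10
aug_iter = itertools.cycle(aug_loader)  # keeps drawing (shuffled) augmentation batches

for epoch in range(num_epochs):
    for batch in train_loader:
        # Step 1: loss and backpropagation on the original minibatch.
        optimizer.zero_grad()
        loss = loss_fn(model(batch["input_ids"], batch["attention_mask"]),
                       batch["labels"])
        loss.backward()
        optimizer.step()

        # Step 2: loss and backpropagation on a randomly drawn augmentation minibatch.
        aug_batch = next(aug_iter)
        optimizer.zero_grad()
        aug_loss = loss_fn(model(aug_batch["input_ids"], aug_batch["attention_mask"]),
                           aug_batch["labels"])
        aug_loss.backward()
        optimizer.step()
```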

2.4. Experimental Design

To conduct a comprehensive study of these two LLM-based DA methods, we experimented with both a general-topic and a domain-specific dataset. We further investigated how the number of newly generated samples affects the effectiveness of the DA method. The following experiments relate to this procedure.

2.4.1. Evaluate the DA Effectiveness of Rewritten Samples and Newly Generated Samples

To evaluate the DA effectiveness of rewritten samples and newly generated samples, we generated 20 samples for each label in the Reuters news data and the Mitigation data using the prompts described in Section 2.3.1. For the number of rewritten samples, we followed the Easy Data Augmentation (EDA) paper [4], since rewriting shares similarities with EDA: both modify the original data to produce augmented data. According to the recommendation in [4], the number of augmented samples should depend on the size of the original training set. We instructed ChatGPT to generate four rewritten samples for each training sample in the Reuters and Mitigation datasets, and we then integrated the augmentation data into the training procedure as described in Section 2.3.2.

2.4.2. Investigate the Optimal Number of Newly Generated Samples for DA

Adequate training samples for a label can significantly impact the classification performance of a model. Generally, more newly generated samples bring more new information, allowing the model to learn more features and produce better results. However, there may be a point beyond which adding more augmentation data no longer improves the results, as all useful features are already covered. Furthermore, for labels that already have sufficient training samples and high accuracy, adding augmentation data may introduce noise and decrease performance. To learn how the augmentation sample size affects DA effectiveness, we experimented with different augmentation sample sizes for both datasets.

2.4.3. Combining Rewritten Data with Newly Generated Data

According to the similarity analysis in Figure 3, ChatGPT-generated data exhibits strong intrinsic similarity but is less similar to the training data. This suggests that ChatGPT-generated data introduces novel information, contributing to improved classification results, but may induce topic drift for minority classes. Including rewritten samples can help maintain feature consistency for minority classes. We therefore hypothesize that combining rewritten data with newly generated data can further improve DA effectiveness.
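A sketch of this similarity computation is shown below; TF-IDF vectors and cosine similarity are assumed purely for illustration, and the exact text representation behind Figure 3 may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

train_texts = ["..."]      # placeholder: training articles for one topic
generated_texts = ["..."]  # placeholder: ChatGPT-generated articles for the same topic

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(train_texts + generated_texts)

# Pairwise cosine similarities; with 5 + 5 documents this yields the 10 x 10 matrix
# visualized in Figure 3 (rows/columns 1-5 training, 6-10 generated).
sim = cosine_similarity(matrix)
print(sim.round(2))
```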

2.5. Performance Measure

Due to the severe class imbalance and multi-label annotations present in our data corpus, it is necessary to calculate both macro- and micro-averaged F1 metrics using a class-wise multi-label confusion matrix. In this context, macro-averaged F1 scores are equally weighted among the class labels, while micro-averaged F1 scores are equally weighted among individual decisions. To calculate these scores, we use the Scikit-Learn [27] Python library.
For each class label $i$, we obtain $a_i$, $b_i$, $c_i$, and $d_i$, where $a_i$ denotes true positives, $b_i$ true negatives, $c_i$ false negatives, and $d_i$ false positives. To calculate the macro-averaged precision, recall, and F1 scores, we computed those scores for each label separately and then averaged them over all labels. In contrast, to compute the micro-averaged precision, recall, and F1 scores, we aggregated $a_i$, $b_i$, $c_i$, and $d_i$ across all labels and computed the corresponding overall scores. Equations (1) and (2) illustrate the calculation of the macro- and micro-averaged precision scores.
$$p_{\text{macro}} = \frac{\sum_{i=1}^{N} p_i}{N}, \qquad p_i = \frac{a_i}{a_i + d_i} \quad (i = 1, \ldots, N) \tag{1}$$
$$p_{\text{micro}} = \frac{\sum_{i=1}^{N} a_i}{\sum_{i=1}^{N} a_i + \sum_{i=1}^{N} d_i} \tag{2}$$
where $N$ is the total number of labels; in our case, $N = 90$ for the Reuters dataset.
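In practice, these scores can be computed with Scikit-Learn's multi-label metrics. The sketch below assumes that y_true and y_pred are binary indicator matrices of shape (n_samples, N), one column per class label.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# y_true and y_pred: binary indicator matrices of shape (n_samples, N)
macro_p = precision_score(y_true, y_pred, average="macro", zero_division=0)
micro_p = precision_score(y_true, y_pred, average="micro", zero_division=0)
macro_r = recall_score(y_true, y_pred, average="macro", zero_division=0)
micro_r = recall_score(y_true, y_pred, average="micro", zero_division=0)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
micro_f1 = f1_score(y_true, y_pred, average="micro", zero_division=0)
```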

3. Results

3.1. Evaluate the DA Effectiveness of Rewritten Samples and Newly Generated Samples

Table 1 shows the BERT model’s performance on the Reuters news data without augmentation data, with ChatGPT-rewritten samples, and with newly generated samples. The table shows that both rewritten and newly generated data lead to improved accuracy on both macro- and micro-averaged metrics: with newly generated data the macro-F1 increased from 49.87 to 65.73, and with rewritten data it increased from 49.87 to 61.70, so newly generated data yields better DA effectiveness. Table 2 shows the corresponding results for the Mitigation data: with newly generated data the macro-F1 increased from 13.32 to 15.42, whereas with rewritten data the macro-F1 decreased. We also observe that the enhancement was more pronounced in macro-averaged scores than in micro-averaged ones, suggesting that the DA methods mainly improve the accuracy of the minority class labels.

3.2. Investigate the Optimal Sample Size of the LLM-Based DA Method

In Table 3, we present classification accuracy scores for the Reuters dataset with 5, 10, 15, and 20 newly generated samples per label. A notable increase in macro-F1 is observed from 5 to 10 samples (63.72, falling outside the 5-sample confidence interval [60.03, 62.29]); however, we observed a plateau from 10 to 20 samples. Table 4 illustrates a similar pattern on the Mitigation dataset, where 10 samples yield results comparable to 20 samples (15.13 vs. 15.42). These findings suggest that when integrating ChatGPT-generated samples as augmentation data, generating 10 new samples for each label suffices, while additional samples provide only marginal enhancements.

3.3. Combining Rewritten Data with Newly Generated Data

Table 5 presents the BERT model’s performance on the Reuters dataset with three different augmentation datasets: ChatGPT-rewritten samples, ChatGPT newly generated samples, and newly generated samples plus rewritten samples. As depicted in Table 5, the combination raised the macro-F1 from 65.73% to 67.14%, a substantial improvement over solely utilizing newly generated samples for augmentation. However, such an increase was not evident for the Mitigation dataset, as demonstrated in Table 6.

3.4. Difference Analysis of the Newly Generated Data and the Rewritten Data

To quantitatively evaluate the differences between the newly generated data and the rewritten data, we performed a vocabulary analysis on the training dataset, the newly generated data, and the rewritten data. For the Reuters dataset, the original training data contains 21,764 unique words, the newly generated data contains 8597 unique words, and the rewritten data contains 18,376 unique words. We then counted the words that appear in the augmentation data but not in the training dataset and plotted them in a heatmap. From Figure 4a, we can see that 1619 words are present in the rewritten data but not in the training data, while 2768 words are present in the newly generated data but not in the training data, indicating that the newly generated data introduced more new information into the training. A similar trend is observed for the Mitigation dataset (Figure 4b), with 536 words in the rewritten data but not in the training data and 964 words in the newly generated data but not in the training data. We also noticed that for the Mitigation dataset the rewritten vocabulary is much smaller than the training vocabulary (1892 vs. 4577); this is consistent with our earlier observation that when rewriting domain-specific data, ChatGPT tends to replace sophisticated terminology with general words, which may harm the model’s performance (as shown in Table 6).
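A sketch of this vocabulary comparison is shown below; a simple lowercase word tokenization is assumed, so the exact counts depend on the tokenizer actually used, and train_texts, generated_texts, and rewritten_texts are placeholder lists of documents.

```python
import re

def vocab(texts):
    """Lowercase word vocabulary of a list of documents (simplified tokenization)."""
    return {w for t in texts for w in re.findall(r"[a-z]+", t.lower())}

train_vocab = vocab(train_texts)
new_vocab = vocab(generated_texts)
rewritten_vocab = vocab(rewritten_texts)

print(len(new_vocab - train_vocab))        # words only in the newly generated data
print(len(rewritten_vocab - train_vocab))  # words only in the rewritten data
```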

3.5. Categorical Analysis

To analyze how the different DA methods impact the prediction results for each class, a categorical analysis was conducted on the Reuters data results, as presented in Table 7. The first column denotes the label, the second column shows the number of samples in the original training data for each label, and the subsequent columns display F1 scores corresponding to the different DA methods. Each F1 score was computed by averaging the results of ten runs. ChatGPT-generated samples, created from scratch, prove highly informative and beneficial. As revealed in Table 7, introducing 20 newly generated samples not only enhanced the F1 score for most minority classes but also improved scores for majority classes, such as ‘money-fx’ (from 78.56% to 83.91%) and ‘grain’ (from 90.31% to 94.20%), even with a much smaller augmentation sample size compared to the original training data.
Additionally, the average influences of DA on majority and minority classes were assessed by calculating the average F1 scores for labels with samples exceeding and falling below a threshold (set at 40), as depicted in Table 8. Without DA, the average F1 score for minority classes is significantly lower than that for majority classes. Introducing DA enhances the average F1 score for minority classes from 0.3026 to 0.6064. Notably, the improvement achieved with rephrased samples plus new samples (0.6064) surpasses that with only new samples (0.5701) and rephrased samples alone (0.5209). For majority classes, all three DA methods contribute to performance improvement (from 0.7627 to 0.8217). However, these three methods exhibit similar levels of improvement.
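The threshold split used in Table 8 can be computed along the following lines; per_label is an assumed mapping from each label to its training sample count and F1 score under one DA setting.

```python
THRESHOLD = 40

# per_label: assumed dict mapping label -> (training sample count, F1 score),
# e.g. {"earn": (2877, 0.9863), "rye": (1, 0.0)}
majority = [f1 for n, f1 in per_label.values() if n > THRESHOLD]
minority = [f1 for n, f1 in per_label.values() if n <= THRESHOLD]

print(sum(majority) / len(majority))  # average F1 of the majority classes
print(sum(minority) / len(minority))  # average F1 of the minority classes
```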

4. Discussion and Conclusions

LLMs have gained popularity since the debut of ChatGPT. With their large numbers of trainable parameters and pre-training on a substantial amount of articles and documents, they have achieved noteworthy performance in chatting, question answering, and information retrieval. Early adoption studies have already shown remarkable results, making it clear that these models have great potential for various NLP tasks. However, it is still too early to expect GPT models to solve complex real-world problems independently. Nonetheless, with proper guidance, we can leverage the vast amounts of information they provide to enhance various NLP tasks, such as data augmentation for text classification. This paper evaluated the effectiveness of two main LLM-based DA methods for natural language text classification: rewriting samples and generating new samples using ChatGPT. Furthermore, we identified a practical sample size for DA when using ChatGPT-generated samples. Finally, we investigated a hybrid data augmentation approach that may further improve the model’s classification results.
The results from Section 3.1 indicate that newly generated data produced better performance than rewritten data, with far fewer augmentation samples. The rewritten data for the Mitigation dataset diminished the model’s performance, which goes against intuition and reveals a drawback of rewriting samples for data augmentation: in text classification tasks, the model may classify the text according to a few critical words in a sentence, especially for domain-specific data, and rewritten samples may replace these critical words with synonyms, losing important information. This issue may not arise for a general-topic dataset such as the Reuters news data. For the Reuters news data, rewriting samples of the minority classes and combining them with newly generated data further boosted the model’s performance, which verified our hypothesis that rewritten data helps maintain feature consistency for minority classes, while newly generated data introduces new information to the entire dataset. The outcomes from Section 3.2 indicate that the sample size does affect the model’s performance; however, the margin of improvement decreases as the sample size increases, and an adequate augmentation size is attained with 10–20 newly generated samples per label. The categorical analysis in Section 3.5 shows that both DA methods substantially enhanced the prediction scores of the Reuters minority classes, aligning with the primary objective of data augmentation.
This study not only underscores the strengths and limitations of two main LLM-based DA methods but also guides optimal strategies for employing LLMs in enhancing text classification models. This study could be particularly useful in text classification tasks that suffer from severe class imbalance issues. The rise of other LLMs trained with domain knowledge provides good resources for DA. For instance, Med-PaLM2 [28] demonstrates impressive capabilities in answering medical questions, suggesting its potential use for generating medical data to enhance the classification and information extraction of clinical and health-related documents.

Author Contributions

Conceptualization, H.-J.Y.; methodology, H.Z., H.C. and H.-J.Y.; software, H.-J.Y.; validation, H.Z.; formal analysis, H.Z. and H.-J.Y.; investigation, H.Z., T.A.R. and H.C.; resources, Y.F., T.A.R. and D.S.; data curation, H.-J.Y., T.A.R. and D.S.; writing—original draft preparation, H.Z. and H.-J.Y.; writing—review and editing, H.-J.Y., H.C. and Y.F.; visualization, H.Z.; supervision, H.-J.Y.; project administration, H.-J.Y. and D.S.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by US Department of Energy’s Water Power Technologies Office.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data sets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgments

The paper is a substantially extended version of the IEEE AITest 2023 conference paper “Enhancing Text Classification Models with Generative AI-aided Data Augmentation” [20]. This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan (accessed on 30 April 2024)).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM  Large Language Model
DA   Data Augmentation
NLP  Natural Language Processing

Appendix A

Prompt 1: “write an article with N words about LABEL in Reuters news format”. Here, LABEL represents the topic for which we aimed to create data, and N represents a designated word count. We applied three specific word counts (50, 150, and 250) in our experiment.
Prompt 2: “You are a technique writer, {} is a list of mitigation requirements at hydropower project, write a paragraph for each of them as if they are extracted from the requirement license. The format of the answer should be like A: B where A represents the mitigation requirement, B represents the corresponding generated text. Do not change the mitigation requirement name in the list”.
Prompt 3: “rewrite the following content: samples from the original training dataset”.

References

  1. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2023, arXiv:1910.10683. [Google Scholar]
  2. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  3. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. Available online: https://api.semanticscholar.org/CorpusID:160025533 (accessed on 30 April 2024).
  4. Wei, J.; Zou, K. Eda: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar]
  5. Akkaradamrongrat, S.; Kachamas, P.; Sinthupinyo, S. Text generation for imbalanced text classification. In Proceedings of the 2019 16th International Joint Conference on Computer Science and Software Engineering (JCSSE), Chonburi, Thailand, 10–12 July 2019; pp. 181–186. [Google Scholar]
  6. Hu, Z.; Tan, B.; Salakhutdinov, R.R.; Mitchell, T.M.; Xing, E.P. Learning data manipulation for augmentation and weighting. Adv. Neural Inf. Process. Syst. 2019, 32, 15764–15775. [Google Scholar]
  7. Xu, B.; Qiu, S.; Zhang, J.; Wang, Y.; Shen, X.; de Melo, G. Data augmentation for multiclass utterance classification—A systematic study. In Proceedings of the 28th International Conference on Computational Linguistics, Online, 8–13 December 2020; pp. 5494–5506. [Google Scholar]
  8. Chen, H.; Pieptea, L.F.; Ding, J. Construction and Evaluation of a High-Quality Corpus for Legal Intelligence Using Semiautomated Approaches. IEEE Trans. Reliab. 2022, 71, 657–673. [Google Scholar] [CrossRef]
  9. Karimi, A.; Rossi, L.; Prati, A. AEDA: An Easier Data Augmentation Technique for Text Classification. arXiv 2021, arXiv:2108.13230. [Google Scholar]
  10. Kolomiyets, O.; Bethard, S.; Moens, M.F. Model-Portability Experiments for Textual Temporal Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011. [Google Scholar]
  11. Xie, Z.; Wang, S.I.; Li, J.; Levy, D.; Nie, A.; Jurafsky, D.; Ng, A.Y. Data Noising as Smoothing in Neural Network Language Models. arXiv 2017, arXiv:1703.02573. [Google Scholar]
  12. Li, Y.; Cohn, T.; Baldwin, T. Robust Training under Linguistic Adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3–7 April 2017. [Google Scholar]
  13. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. arXiv 2016, arXiv:1511.06709. [Google Scholar]
  14. Ye, J.; Gao, J.; Li, Q.; Xu, H.; Feng, J.; Wu, Z.; Yu, T.; Kong, L. ZEROGEN: Efficient Zero-shot Learning via Dataset Generation. arXiv 2022, arXiv:2202.07922. [Google Scholar]
  15. Sarker, S.; Qian, L.; Dong, X. Medical Data Augmentation via ChatGPT: A Case Study on Medication Identification and Medication Event Classification. arXiv 2023, arXiv:2306.07297. [Google Scholar]
  16. Yuan, J.; Tang, R.; Jiang, X.; Hu, X. Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching. arXiv 2023, arXiv:2303.16756. [Google Scholar]
  17. Cohen, S.; Presil, D.; Katz, O.; Arbili, O.; Messica, S.; Rokach, L. Enhancing social network hate detection using back translation and GPT-3 augmentations during training and test-time. Inf. Fusion 2023, 99, 101887. [Google Scholar] [CrossRef]
  18. Dai, H.; Liu, Z.; Liao, W.; Huang, X.; Cao, Y.; Wu, Z.; Zhao, L.; Xu, S.; Liu, W.; Liu, N.; et al. AugGPT: Leveraging ChatGPT for Text Data Augmentation. arXiv 2023, arXiv:2302.13007. [Google Scholar]
  19. Yoo, K.M.; Park, D.; Kang, J.; Lee, S.W.; Park, W. GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual, 16–20 November 2021; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 2225–2239. [Google Scholar] [CrossRef]
  20. Zhao, H.; Chen, H.; Yoon, H.J. Enhancing Text Classification Models with Generative AI-aided Data Augmentation. In Proceedings of the 2023 IEEE International Conference on Artificial Intelligence Testing (AITest), Athens, Greece, 17–20 July 2023; pp. 138–145. [Google Scholar] [CrossRef]
  21. Loper, E.; Bird, S. Nltk: The natural language toolkit. arXiv 2002, arXiv:cs/0205028. [Google Scholar]
  22. Pracheil, B.M.; Levine, A.L.; Curtis, T.L.; Aldrovandi, M.S.; Uría-Martínez, R.; Johnson, M.M.; Welch, T. Influence of project characteristics, regulatory pathways, and environmental complexity on hydropower licensing timelines in the US. Energy Policy 2022, 162, 112801. [Google Scholar] [CrossRef]
  23. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  24. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  25. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv 2019, arXiv:1910.03771. [Google Scholar]
  26. Schramm, M.P.; Bevelhimer, M.S.; DeRolph, C.R. A synthesis of environmental and recreational mitigation requirements at hydropower projects in the United States. Environ. Sci. Policy 2016, 61, 87–96. [Google Scholar] [CrossRef]
  27. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  28. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Hou, L.; Clark, K.; Pfohl, S.; Cole-Lewis, H.; Neal, D.; et al. Towards Expert-Level Medical Question Answering with Large Language Models. arXiv 2023, arXiv:2305.09617. [Google Scholar]
Figure 1. The number of articles for each topic in the Reuters corpus, illustrating that the dataset is severely imbalanced.
Figure 2. The distribution of article lengths in the Reuters corpus.
Figure 3. Selected heatmap examples of cosine similarities between training data and ChatGPT-generated data. Rows 1–5 are training data and rows 6–10 are ChatGPT-generated data. The analysis was performed on the Reuters news data.
Figure 4. Heatmaps showing the non-overlapping words among the training, newly generated, and rewritten data for the Reuters and Mitigation datasets.
Table 1. BERT model’s performance with rewritten and newly generated data—Reuters data.
|                     | Macro Precision      | Macro Recall         | Macro F1             | Micro Precision      | Micro Recall         | Micro F1             |
| without DA          | 57.17 (53.82, 60.53) | 47.25 (44.02, 50.48) | 49.87 (46.70, 53.03) | 91.30 (90.84, 91.75) | 87.69 (86.80, 88.58) | 89.45 (89.15, 89.75) |
| with rewritten data | 68.25 (65.91, 70.58) | 59.32 (56.46, 62.19) | 61.70 (59.27, 64.13) | 90.83 (90.48, 91.17) | 89.16 (88.68, 89.64) | 89.98 (89.67, 90.30) |
| with new data       | 75.23 (74.00, 76.46) | 61.44 (59.65, 63.23) | 65.73 (64.46, 66.99) | 92.50 (91.80, 92.62) | 87.90 (87.74, 89.06) | 90.13 (90.07, 90.45) |
All scores are in %, with 95% confidence intervals in parentheses. The newly generated data contains 20 samples for each label and the rewritten data contains 4 rewritten samples for each training sample. The abbreviation DA represents data augmentation.
Table 2. BERT model’s performance with rewritten and newly generated data—Mitigation data.
|                     | Macro Precision      | Macro Recall         | Macro F1             | Micro Precision      | Micro Recall         | Micro F1             |
| without DA          | 15.12 (14.69, 16.04) | 12.77 (12.34, 13.20) | 13.32 (12.82, 13.82) | 75.14 (73.75, 76.52) | 64.01 (62.11, 65.92) | 69.13 (67.46, 70.80) |
| with rewritten data | 10.95 (9.54, 12.36)  | 8.94 (8.21, 9.67)    | 9.40 (8.55, 10.25)   | 76.05 (74.22, 77.88) | 65.12 (63.84, 66.40) | 70.16 (68.91, 71.41) |
| with new data       | 17.86 (16.34, 19.39) | 14.23 (13.19, 15.28) | 15.42 (13.91, 16.38) | 77.69 (75.46, 79.92) | 64.71 (64.11, 65.31) | 70.60 (69.46, 71.75) |
All scores are in %, with 95% confidence intervals in parentheses. The newly generated data contains 20 samples for each label and the rewritten data contains 4 rewritten samples for each training sample. The abbreviation DA represents data augmentation.
Table 3. Reuters dataset with different augmentation sample sizes.
|            | Macro Precision      | Macro Recall         | Macro F1             | Micro Precision      | Micro Recall         | Micro F1             |
| 5 samples  | 70.10 (68.76, 71.43) | 57.22 (55.92, 58.53) | 61.16 (60.03, 62.29) | 91.43 (91.00, 91.85) | 88.50 (88.01, 89.00) | 89.94 (89.72, 90.16) |
| 10 samples | 73.93 (72.63, 75.22) | 59.41 (57.51, 61.30) | 63.72 (62.22, 65.23) | 92.27 (91.74, 92.80) | 88.11 (87.35, 88.86) | 90.13 (89.1, 90.3)   |
| 15 samples | 74.43 (73.01, 75.85) | 60.53 (59.04, 62.02) | 64.84 (63.52, 66.16) | 91.73 (91.61, 92.24) | 88.52 (88.09, 88.96) | 90.19 (89.99, 90.39) |
| 20 samples | 75.23 (74.00, 76.46) | 61.44 (59.65, 63.23) | 65.73 (64.46, 66.99) | 92.50 (91.80, 92.62) | 87.90 (87.74, 89.06) | 90.13 (90.07, 90.45) |
All scores are in %, with 95% confidence intervals in parentheses; each row shows the result of adding 5, 10, 15, or 20 distinct newly generated samples for each label when running the BERT model.
Table 4. Mitigation dataset with different augmentation sample sizes.
|            | Macro Precision      | Macro Recall         | Macro F1             | Micro Precision      | Micro Recall         | Micro F1             |
| 10 samples | 18.14 (16.51, 19.76) | 14.17 (13.44, 14.90) | 15.13 (14.63, 15.64) | 77.15 (73.46, 80.83) | 64.58 (61.13, 68.04) | 70.29 (67.07, 73.52) |
| 20 samples | 17.86 (16.34, 19.39) | 14.23 (13.19, 15.28) | 15.42 (13.91, 16.38) | 77.69 (75.46, 79.92) | 64.71 (64.11, 65.31) | 70.60 (69.46, 71.75) |
All scores are in %, with 95% confidence intervals in parentheses; the rows show the results of adding 10 and 20 distinct newly generated samples for each label when running the BERT model.
Table 5. Reuters dataset with ChatGPT rewritten data, ChatGPT-generated new data, and the combination of rewritten and new data.
|                   | Macro Precision      | Macro Recall         | Macro F1             | Micro Precision      | Micro Recall         | Micro F1             |
| without DA        | 57.17 (53.82, 60.53) | 47.25 (44.02, 50.48) | 49.87 (46.70, 53.03) | 91.30 (90.84, 91.75) | 87.69 (86.80, 88.58) | 89.45 (89.15, 89.75) |
| rewritten samples | 68.25 (65.91, 70.58) | 59.32 (56.46, 62.19) | 61.70 (59.27, 64.13) | 90.83 (90.48, 91.17) | 89.16 (88.68, 89.64) | 89.98 (89.67, 90.30) |
| new samples       | 75.23 (74.00, 76.46) | 61.44 (59.65, 63.23) | 65.73 (64.46, 66.99) | 92.50 (91.80, 92.62) | 87.90 (87.74, 89.06) | 90.13 (90.07, 90.45) |
| rewritten + new   | 76.05 (73.86, 78.25) | 63.02 (61.04, 65.01) | 67.14 (65.62, 68.66) | 92.31 (91.01, 93.62) | 88.40 (87.44, 89.36) | 90.31 (89.99, 90.62) |
All scores are in %, with 95% confidence intervals in parentheses. The rows correspond to running the BERT model without augmentation data, with ChatGPT rewritten samples, with 90 × 20 ChatGPT-generated new samples, and with the 90 × 20 generated new samples plus selected rewritten samples.
Table 6. Mitigation dataset with ChatGPT rewritten data, ChatGPT-generated new data, and the combination of rewritten and new data.
|                   | Macro Precision      | Macro Recall         | Macro F1             | Micro Precision      | Micro Recall         | Micro F1             |
| without DA        | 15.12 (14.69, 16.04) | 12.77 (12.34, 13.20) | 13.32 (12.82, 13.82) | 75.14 (73.75, 76.52) | 64.01 (62.11, 65.92) | 69.13 (67.46, 70.80) |
| rewritten samples | 10.95 (9.54, 12.36)  | 8.94 (8.21, 9.67)    | 9.40 (8.55, 10.25)   | 76.05 (74.22, 77.88) | 65.12 (63.84, 66.40) | 70.16 (68.91, 71.41) |
| new samples       | 17.86 (16.34, 19.39) | 14.23 (13.19, 15.28) | 15.42 (13.91, 16.38) | 77.69 (75.46, 79.92) | 64.71 (64.11, 65.31) | 70.60 (69.46, 71.75) |
| rewritten + new   | 11.90 (11.45, 12.89) | 9.92 (8.85, 10.69)   | 10.23 (9.45, 11.02)  | 79.18 (73.57, 79.98) | 65.83 (61.90, 66.76) | 71.89 (67.73, 72.24) |
All scores are in %, with 95% confidence intervals in parentheses. The rows correspond to running the BERT model without augmentation data, with ChatGPT rewritten samples, with ChatGPT-generated new samples (20 per label), and with the generated new samples plus selected rewritten samples.
Table 7. Categorical F1 scores of the BERT model with no augmentation, ChatGPT rephrased data, ChatGPT-generated new data, and the combination of rephrased and new data.
| Category | Samples | Avg_Noaug | Avg_Rephrase | Avg_New | Avg_Rephrase_New |
| earn | 2877 | 0.9810 | 0.9854 | 0.9863 | 0.9825 |
| acq | 1650 | 0.9520 | 0.9732 | 0.9760 | 0.9767 |
| money-fx | 538 | 0.7856 | 0.8585 | 0.8391 | 0.8427 |
| grain | 433 | 0.9031 | 0.9420 | 0.9264 | 0.9451 |
| crude | 389 | 0.8723 | 0.9058 | 0.9143 | 0.9068 |
| trade | 368 | 0.7567 | 0.7995 | 0.8157 | 0.8037 |
| interest | 347 | 0.7535 | 0.8280 | 0.8594 | 0.8527 |
| wheat | 212 | 0.8537 | 0.8724 | 0.8584 | 0.8591 |
| ship | 197 | 0.8000 | 0.8854 | 0.8902 | 0.8932 |
| corn | 181 | 0.8761 | 0.8785 | 0.8692 | 0.8777 |
| money-supply | 140 | 0.7822 | 0.7966 | 0.8339 | 0.7831 |
| dlr | 131 | 0.6849 | 0.7733 | 0.7846 | 0.8182 |
| sugar | 126 | 0.8934 | 0.9100 | 0.8768 | 0.9011 |
| oilseed | 124 | 0.6273 | 0.7256 | 0.7231 | 0.7269 |
| coffee | 111 | 0.9524 | 0.9479 | 0.9641 | 0.9487 |
| gnp | 101 | 0.8186 | 0.8588 | 0.8169 | 0.8186 |
| gold | 94 | 0.8577 | 0.9075 | 0.9087 | 0.9325 |
| veg-oil | 87 | 0.6287 | 0.7074 | 0.6799 | 0.6616 |
| soybean | 78 | 0.6122 | 0.7203 | 0.7145 | 0.6813 |
| nat-gas | 75 | 0.6465 | 0.7660 | 0.6925 | 0.7284 |
| livestock | 75 | 0.5350 | 0.6998 | 0.7125 | 0.7188 |
| bop | 75 | 0.6772 | 0.7391 | 0.6831 | 0.6609 |
| cpi | 69 | 0.6121 | 0.7145 | 0.6757 | 0.6607 |
| cocoa | 55 | 0.9916 | 1.0000 | 1.0000 | 1.0000 |
| reserves | 55 | 0.6975 | 0.7945 | 0.8182 | 0.8311 |
| carcass | 50 | 0.5926 | 0.6547 | 0.6169 | 0.6114 |
| copper | 47 | 0.8628 | 0.9149 | 0.9261 | 0.9304 |
| jobs | 46 | 0.6790 | 0.6701 | 0.7224 | 0.7275 |
| yen | 45 | 0.3556 | 0.6460 | 0.6195 | 0.6448 |
| ipi | 41 | 0.8382 | 0.9212 | 0.9042 | 0.9246 |
| iron-steel | 40 | 0.7032 | 0.7926 | 0.8688 | 0.8572 |
| cotton | 39 | 0.7110 | 0.7386 | 0.7489 | 0.7310 |
| gas | 37 | 0.6808 | 0.8646 | 0.8625 | 0.8278 |
| barley | 37 | 0.6652 | 0.7463 | 0.7873 | 0.8225 |
| rubber | 37 | 0.8312 | 0.8775 | 0.8886 | 0.9579 |
| alum | 35 | 0.7176 | 0.9045 | 0.8871 | 0.9006 |
| rice | 35 | 0.7118 | 0.8204 | 0.7259 | 0.7939 |
| meal-feed | 30 | 0.2092 | 0.6469 | 0.4990 | 0.6037 |
| palm-oil | 30 | 0.7022 | 0.8571 | 0.8388 | 0.8487 |
| sorghum | 24 | 0.3567 | 0.5814 | 0.5387 | 0.6040 |
| retail | 23 | 0.1333 | 0.6000 | 0.6429 | 0.6334 |
| silver | 21 | 0.6590 | 0.7663 | 0.7776 | 0.7786 |
| zinc | 21 | 0.8908 | 0.9173 | 0.8854 | 0.9364 |
| pet-chem | 20 | 0.1943 | 0.7401 | 0.4578 | 0.6516 |
| wpi | 19 | 0.7067 | 0.9146 | 0.9140 | 0.9123 |
| tin | 18 | 0.8351 | 0.9565 | 0.9497 | 0.9565 |
| rapeseed | 18 | 0.7157 | 0.7695 | 0.6549 | 0.7629 |
| strategic-metal | 16 | 0.0333 | 0.3770 | 0.5645 | 0.5450 |
| housing | 16 | 0.7131 | 0.8571 | 0.7755 | 0.8190 |
| hog | 16 | 0.5438 | 0.6273 | 0.7567 | 0.7503 |
| orange | 16 | 0.7469 | 0.9124 | 0.9000 | 0.9210 |
| lead | 15 | 0.3336 | 0.8764 | 0.8143 | 0.9513 |
| soy-oil | 14 | 0.0507 | 0.3505 | 0.2468 | 0.3417 |
| heat | 14 | 0.6372 | 0.6616 | 0.7443 | 0.7073 |
| fuel | 13 | 0.3428 | 0.6963 | 0.6613 | 0.6833 |
| soy-meal | 13 | 0.0842 | 0.5265 | 0.6156 | 0.6171 |
| lei | 12 | 0.9457 | 1.0000 | 1.0000 | 0.9714 |
| sunseed | 11 | 0.3076 | 0.5428 | 0.4609 | 0.6367 |
| dmk | 10 | 0.0333 | 0.0800 | 0.0000 | 0.0800 |
| lumber | 10 | 0.2143 | 0.8436 | 0.8468 | 0.8436 |
| tea | 9 | 0.3067 | 0.8857 | 0.9592 | 0.9428 |
| income | 9 | 0.5578 | 0.7485 | 0.7273 | 0.7126 |
| nickel | 8 | 0.1500 | 0.7000 | 1.0000 | 1.0000 |
| oat | 8 | 0.2364 | 0.2794 | 0.4141 | 0.5200 |
| l-cattle | 6 | 0.0900 | 0.4667 | 0.5381 | 0.6600 |
| rape-oil | 5 | 0.0000 | 0.0000 | 0.0714 | 0.1000 |
| sun-oil | 5 | 0.0000 | 0.0000 | 0.1905 | 0.0000 |
| groundnut | 5 | 0.0000 | 0.0800 | 0.4000 | 0.4000 |
| instal-debt | 5 | 0.0000 | 1.0000 | 0.9524 | 1.0000 |
| platinum | 5 | 0.1100 | 0.6255 | 0.6697 | 0.7230 |
| coconut | 4 | 0.1000 | 0.6667 | 0.3572 | 0.7000 |
| coconut-oil | 4 | 0.1300 | 0.3000 | 0.1286 | 0.4000 |
| jet | 4 | 0.0000 | 0.2333 | 0.4048 | 0.7333 |
| propane | 3 | 0.0000 | 0.1000 | 0.6143 | 0.8400 |
| potato | 3 | 0.3600 | 0.7200 | 1.0000 | 1.0000 |
| cpu | 3 | 0.4000 | 1.0000 | 0.8571 | 1.0000 |
| dfl | 2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| nzdlr | 2 | 0.0000 | 0.0000 | 0.0952 | 0.5334 |
| palmkernel | 2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| copra-cake | 2 | 0.0000 | 0.0000 | 0.5714 | 0.0000 |
| palladium | 2 | 0.0000 | 0.4000 | 0.7143 | 0.4000 |
| naphtha | 2 | 0.0000 | 0.0800 | 0.5143 | 0.6667 |
| rand | 2 | 0.0000 | 0.6000 | 1.0000 | 1.0000 |
| castor-oil | 1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| nkr | 1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| sun-meal | 1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| groundnut-oil | 1 | 0.0000 | 0.0000 | 0.1429 | 0.0000 |
| lin-oil | 1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| cotton-oil | 1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| rye | 1 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
Table 8. Average scores of the majority and minority classes. In this context, majority classes pertain to those with more training samples than the specified threshold, while minority classes refer to those with fewer training samples than the threshold.
|                            | Avg_Noaug | Avg_Rephrase | Avg_New | Avg_Rephrase_New |
| majority (threshold = 40)  | 0.7627    | 0.8266       | 0.8203  | 0.8217           |
| minority (threshold = 40)  | 0.3026    | 0.5209       | 0.5701  | 0.6064           |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
