
ChatAug: Leveraging ChatGPT for Text Data Augmentation
Haixing Dai∗, Zhengliang Liu∗, Wenxiong Liao∗, Xiaoke Huang, Zihao Wu, Lin Zhao, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li

• ∗ Co-first authors.
• Haixing Dai, Zhengliang Liu, Zihao Wu, Lin Zhao, Ninghao Liu and Tianming Liu are with the School of Computing, University of Georgia, Athens, GA, USA. (e-mail: {hd54134, zl18864, zw63397, lin.zhao, ninghao.liu, tliu}@uga.edu).
• Wenxiong Liao, Xiaoke Huang, and Hongmin Cai are with the School of Computer Science and Engineering, South China University of Technology, China. (e-mail: cswxliao@mail.scut.edu.cn, csxkhuang@mail.scut.edu.cn, hmcai@scut.edu.cn).
• Wei Liu is with the Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA. (e-mail: liu.wei@mayo.edu).
• Sheng Li is with the School of Data Science, University of Virginia, Charlottesville, VA, USA. (e-mail: shengli@virginia.edu).
• Dajiang Zhu is with the Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, TX, USA. (e-mail: dajiang.zhu@uta.edu).
• Quanzheng Li and Xiang Li are with the Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA. (e-mail: li.quanzheng@mgh.harvard.edu, xiangli.shaun@gmail.com).
• Dinggang Shen is with the School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China. He is also with Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200230, China, and Shanghai Clinical Research and Trial Center, Shanghai 201210, China. (e-mail: Dinggang.Shen@gmail.com).

Abstract—Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning scenario, where the data in the target domain is generally much scarcer and of lower quality. A natural and widely used strategy to mitigate such challenges is to perform data augmentation on the training data to better capture the data invariance and increase the sample size. However, current text data augmentation methods either cannot ensure the correct labeling of the generated data (lacking faithfulness) or cannot ensure sufficient diversity in the generated data (lacking completeness), or both. Inspired by the recent success of large language models, especially the development of ChatGPT, which demonstrated improved language comprehension abilities, in this work we propose a text data augmentation approach based on ChatGPT (named ChatAug). ChatGPT is trained on data with unparalleled linguistic richness and employs a reinforcement training process with large-scale human feedback, which endows the model with an affinity to the naturalness of human language. Our text data augmentation approach ChatAug rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experiment results on few-shot learning text classification tasks show the superior performance of the proposed ChatAug approach over state-of-the-art text data augmentation methods in terms of testing accuracy and distribution of the augmented samples.

Index Terms—Large language model, few-shot learning, natural language processing, data augmentation.

1 INTRODUCTION

THE effectiveness of natural language processing (NLP) heavily relies on the quality and quantity of the training data. With limited training data available, which is a common issue in practice due to privacy concerns or the high cost of human annotation, it can be challenging to train an accurate NLP model that generalizes well to unseen samples. The challenge of training data insufficiency is especially prominent in few-shot learning (FSL) scenarios, where the model trained on the original (source) domain data is expected to generalize from only a few examples in the new (target) domain [1]. Many FSL methods have shown promising results in overcoming this challenge in various tasks [2]. Existing FSL methods mainly focus on improving the learning and generalization capability of the model via better architectural design [3], [4], [5], or by leveraging pre-trained language models as the basis and then fine-tuning them using limited samples [6] with meta-learning [4], [7] or prompt-based methods [8], [9], [10], [11]. However, the performance of these methods is still intrinsically limited by the data quality and quantity in both the source and target domains.

Besides model development, text data augmentation can also overcome the sample size limit and work together with other FSL methods in NLP [12], [13]. Data augmentation is usually model-agnostic and involves no change to the underlying model architecture, which makes this approach particularly practical and applicable to a wide range of tasks. In NLP, there are several types of data augmentation methods. Traditional text-level data augmentation methods rely on direct operations on the existing sample base. Some frequently used techniques include synonym replacement, random deletion, and random insertion [14]. More recent methods utilize language models to generate reliable samples for more effective data augmentation, including back-translation [15] and word vector interpolation in the latent space [16]. However, existing data augmentation methods are limited in the accuracy and diversity of the generated text data, and human annotation is still mandatory in many application scenarios [14], [17], [18].

The advent of (very) large language models (LLMs) such as the GPT family [8], [19] brings new opportunities for generating text samples that resemble human-labeled data, which significantly alleviates the burden of human annotators [20].

LLMs are trained in self-supervised manners, which scale up with the near-infinite amount of text corpus available in the open domains. The large parameter space of LLMs also allows them to store a large amount of knowledge, while large-scale pre-training (e.g., the autoregressive objective in training GPTs) enables LLMs to encode rich factual knowledge for language generation. Furthermore, the training of ChatGPT follows that of InstructGPT [21], which utilizes reinforcement learning with human feedback (RLHF), thus enabling it to produce more informative and impartial responses to input.

Inspired by the success of applying language models in text generation, we propose a new data augmentation method named ChatAug, which leverages ChatGPT to generate auxiliary samples for few-shot text classification. We have tested the performance of ChatAug via experiments on both general domain and medical domain datasets. Performance comparison of the proposed ChatAug approach with existing data augmentation methods shows double-digit improvements in sentence classification accuracy. Further investigation into the faithfulness and completeness of the generated text samples reveals that ChatAug can generate more diversified augmented samples while simultaneously maintaining their accuracy (i.e., semantic similarity to the data labels). We envision that the development of LLMs will lead to human-level annotation performance, thus revolutionizing the field of few-shot learning and many tasks in NLP.

2 RELATED WORK

2.1 Data Augmentation

Data augmentation, the artificial generation of new text through transformations, is widely used to improve model training in text classification. In NLP, existing data augmentation methods work at different granularity levels: characters, words, sentences and documents.

Data augmentation at the character level refers to the method of randomly inserting, exchanging, replacing or deleting some characters in the text [22], which improves the robustness of the NLP model against noise in text data. Another method, optical character recognition (OCR) data augmentation, generates new text by simulating the errors that occur when using OCR tools to recognize text from pictures. Spelling augmentation [23] deliberately misspells some frequently misspelled words. Keyboard augmentation [22] simulates random typo errors by replacing a selected key with another key close to it on the QWERTY layout keyboard.

Data augmentation also works at the word level. Random swap augmentation randomly exchanges two words in the text, and random deletion augmentation randomly deletes some words [24]. Synonym augmentation uses synonym databases such as PPDB [25] to replace randomly selected words [26]. WordNet [27] is also widely used as a reference for synonym augmentation. This method maintains semantic consistency in samples and is suitable for text classification tasks. Wang et al. [28] proposed a data augmentation method based on word embeddings, which replaces selected words with their top-n similar words to create a new sentence. Different pre-trained word embeddings are considered (e.g., GoogleNews Lexical Embeddings [29]). This method is based on the principle that words close to each other in the embedding space often appear in similar contexts, which might help with maintaining grammatical consistency.

However, a serious limitation of word embedding-based methods is that close words in the embedding space are not necessarily semantically similar, yet semantic changes can affect the classification results. For example, "hot" and "cold" usually appear in similar contexts, so their word embeddings are close, but they have exactly opposite semantic meanings. Counter-fitting embedding augmentation [30], [31] solves this problem by using a synonym dictionary and an antonym dictionary to adjust the initial word embeddings. Specifically, the distance between embeddings of synonyms is shortened, and the distance between embeddings of antonyms is enlarged.

Contextual augmentation [32], [33] is another word-level data augmentation method, which uses masked language models (MLMs) such as BERT [34], DistilBERT [35] and RoBERTa [36] to generate new text based on the context. Specifically, these methods insert <mask> tokens at some positions of the text, or replace some words in the text with <mask> tokens, and then let the MLM predict what words should be put in these masked positions. Since MLMs are pre-trained on a large number of texts, contextual augmentation can usually generate meaningful new texts.
Some text data augmentation methods work at the sentence and document level. For example, back translation augmentation [37] uses language translation models for data augmentation. Specifically, the language model first translates the text into another language, and then translates it back to the original language. Due to the randomness of the translation process, the augmented text is different from the original text, but semantic consistency is maintained. At the document level, Gangal et al. [38] proposed a method to paraphrase the entire document to preserve document-level consistency.
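To make the round trip concrete, here is a minimal back-translation sketch, assuming the Hugging Face transformers translation pipeline; the Helsinki-NLP checkpoints are common public English-German models used purely for illustration (the experiments in this paper use Facebook's WMT19 models instead, see section 4.4).

```python
# A minimal back-translation sketch (en -> de -> en), assuming the Hugging
# Face `transformers` translation pipeline; the Helsinki-NLP checkpoints are
# public models used for illustration only.
from transformers import pipeline

to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text: str) -> str:
    german = to_de(text)[0]["translation_text"]
    return to_en(german)[0]["translation_text"]

print(back_translate("The follow-up rate after 5 years was 85%."))
```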
In general, regardless of the granularity level or the text generation backbone (i.e., rule-based methods or language models), the goal of data augmentation is to produce sensible and diverse new samples that maintain semantic consistency.
2.2 Few-shot Learning

Deep learning has achieved remarkable success in various data-intensive applications. However, the performance of deep models could be affected if the dataset size is small in the downstream tasks. Few-shot learning (FSL) is a branch of machine learning that focuses on developing solutions to the challenge of small sample sizes [1], [39]. FSL research aims to leverage prior knowledge to rapidly generalize to new tasks that contain only a few labeled samples. A classic application scenario for few-shot learning is when obtaining supervised examples is difficult or impossible due to privacy, safety, or ethical considerations. The development of few-shot learning enables practitioners to improve the efficiency and accuracy of text classification in various scenarios and deploy practical applications.

Recent advances in few-shot learning have shown promising results in overcoming the challenges of limited training data for text classification.

Fig. 1. The framework of ChatAug. a (top panel): First, we apply ChatGPT for data augmentation. We input samples of all classes into ChatGPT and prompt ChatGPT to generate samples that preserve semantic consistency with the existing labelled instances. b (bottom panel): In the next step, we train a BERT-based sentence classifier on the few-shot samples and the generated data samples, and evaluate the model's classification performance.

For example, a common approach in NLP is to use a pre-trained language model such as BERT [6] as a starting point and then fine-tune it with limited samples. Among the most recent methodological developments [2], [4], [40], approaches that have gained traction include prompt-tuning [8], [9], [10], [11] and meta-learning [4], [7]. In general, existing FSL methods target either architectural design [3], [4], [5], data augmentation [12], [13] or the training process [41].

Despite the recent development of prompt-tuning and meta-learning methods, they suffer from some major limitations. For example, prompt engineering is a cumbersome art that requires extensive experience and manual trial-and-error [42]. Meta-learning, on the other hand, suffers from problems such as training instability [43], [44], [45] and sensitivity to hyper-parameters [43], [44]. In addition, all these FSL pipelines demand deep machine learning expertise and acquaintance with complex model architectures and training strategies, which are not attainable by common practitioners and general developers. As discussed in section 2.1, data augmentation is an effective solution for FSL and can be combined with other FSL models. Thus, the ChatAug method proposed in this paper, which has demonstrated the capability to generate accurate and comprehensive training samples, can overcome the issues of current FSL methods and potentially change the landscape of few-shot learning in NLP.

2.3 Very Large Language Models

Pre-trained language models (PLMs) based on the transformer architecture, such as the BERT [6] and GPT [46] model families, have revolutionized natural language processing. Compared to previous methods, they deliver state-of-the-art performance on a wide range of downstream tasks and contribute to the rising popularity and democratization of language models. In general, there are three classes of pre-trained language models: autoregressive language models (e.g., the decoder-based GPT), masked language models (e.g., the encoder-based BERT) and encoder-decoder models (e.g., BART [47] and T5 [48]). These models typically contain between 100M and 1B parameters [19].

In recent years, NLP communities have witnessed the rise of very large language models such as GPT-3 (175B parameters) [8], PaLM (540B parameters) [49], Bloom (176B parameters) [50], OPT (up to 175B parameters) [51], and the FLAN series (FLAN has 137B parameters) [52]. At their core, these large language models are transformer models inspired by BERT and GPT, albeit at a much larger scale.

Large language models aim to learn accurate latent feature representations of input text. These representations are often context-dependent and domain-dependent. For example, the vector representation of the word "treat" might be vastly different between medical domains and the general domain. For smaller pre-trained language models, it is often necessary to continuously pre-train and fine-tune such models to attain acceptable performance [53]. However, very large language models can potentially eliminate the need for fine-tuning while maintaining competitive performance [8], [54].

TABLE 1
Generated Samples From ChatGPT on PubMed20K

Original Text: The follow-up rate after 5 years was 85%.
Altered Text:
1. The study showed a follow-up rate of 85% after 5 years.
2. The 5-year follow-up rate was found to be significantly high at 85%.
3. The participants had an impressive follow-up rate of 85% after 5 years.
4. The data indicated a follow-up rate of 85% 5 years post-study.
5. The 5-year follow-up rate was recorded as 85%, according to the findings.
6. The results of the study showed a follow-up rate of 85% after a 5-year period.

Original Text: This means that the maturation of calluses was 27% faster in the LIPUS group.
Altered Text:
1. The LIPUS group showed a 27% acceleration in callus maturation.
2. The results revealed that the callus maturation process was 27% faster in the LIPUS group.
3. The study found that the callus maturation rate in the LIPUS group was 27% quicker.
4. Callus maturation was 27% more rapid in the LIPUS group compared to others.
5. The callus maturation process in the LIPUS group was 27% more efficient, according to the results.

Original Text: Test Registration: IRCT.ir IRCT2012071010230N1.
Altered Text:
1. The trial was registered with the IRCT under the identifier IRCT2012071010230N1.
2. The study was officially registered with IRCT.ir, with the registration number IRCT2012071010230N1.
3. The trial was documented with IRCT.ir with the identification code IRCT2012071010230N1.
4. IRCT.ir served as the official registrar for the trial, with the registration number IRCT2012071010230N1.
5. The study was recorded with IRCT under the registration number IRCT2012071010230N1.
6. IRCT.ir recorded the trial with the identifier IRCT2012071010230N1.

Original Text: While behavioral and technological interventions can lead to some modest improvements in glycemic control, these interventions have not performed much better than conventional prevention in achieving glycemic control.
Altered Text:
1. The study found that although behavioral and technological interventions led to some slight improvements in glycemic control, they were not significantly more effective than typical care.
2. Despite the modest improvement in glycemic control through behavioral and technological interventions, they did not perform better than the standard care.
3. The results showed that while behavioral and technological interventions resulted in some minimal gains in glycemic control, they did not surpass the usual care in achieving glycemic control.
4. Although behavioral and technological interventions showed some improvement in glycemic control, they were not found to be significantly superior to the usual care.
5. The study showed that the usual care was not outperformed by behavioral and technological interventions in terms of achieving glycemic control, despite some small improvements.

Existing studies indicate that pre-trained language models can help augment a dataset with new samples with similar semantic meaning [14], [18], which is of significant practical value to real-world applications. In this study, we aim to use ChatGPT, a popular LLM, to conduct data augmentation. ChatGPT is based on GPT-3 [8], which was trained on massive web data with diverse and rich information. Furthermore, ChatGPT was trained through Reinforcement Learning from Human Feedback (RLHF). During RLHF, human feedback is incorporated into the process of generating and selecting the best results. More specifically, a reward model is trained based on human annotators' rankings of generated results. In turn, this reward model rewards model outputs that are most aligned with human preference and human values. We believe these innovations make ChatGPT the best candidate for generating human-level quality data samples.

2.4 ChatGPT: Present and Future

ChatGPT is a game changer in natural language processing. Indeed, for the first time in human history, the power of large language models is accessible to the general public through a user-friendly chatbot interface. In turn, this common accessibility contributes to ChatGPT's unprecedented popularity. Millions of users further unlock the potential of language models, which introduces myriad possibilities for new use cases.

ChatGPT has emerged as a general-purpose problem solver for many NLP applications [55]. Qin et al. [55] evaluated ChatGPT on a comprehensive set of NLP tasks, including common benchmarks in natural language inference, arithmetic reasoning, named entity recognition, sentiment analysis, question answering, dialogue and summarization. They conclude that ChatGPT excels in most tasks, except for tasks that focus on specific details (e.g., sequence tagging).

ChatGPT is also a valuable solution for multilingual tasks. A recent empirical study [56] reports that ChatGPT excels at tasks involving high-resource languages (various European languages and Chinese) and is comparable with Google Translate, DeepL Translate and Tencent TranSmart. Nonetheless, ChatGPT performs poorly on low-resource languages and faces extra challenges handling distant language translation (i.e., English-German translation is considered to be less "distant", compared to English-Hindi translation). A later study [57] confirms that ChatGPT struggles with low-resource languages, although the authors observe that ChatGPT does better in understanding non-Latin scripts than generating them.

In addition, it is also possible to use the purely text-based ChatGPT to interact with multimodal data. A group of researchers [57] use HTML Canvas and Python Turtle graphics as media for text-to-image generation. ChatGPT can faithfully generate HTML and Python code, which can then be used to generate desired images. The authors designed a flag drawing task that required ChatGPT to generate code that can draw country flags.

It was found that ChatGPT could generate better flags when the prompt for code was preceded by a prompt that queries ChatGPT for the flag's description. In other words, descriptive text prompts could improve multimodal task performance.

Beyond computer science, ChatGPT can be readily applied to medical report generation and comprehension [58], [59], education [60], [61], [62], rigorous math research [63] and finance [64]. Overall, ChatGPT is a versatile tool that promotes general AI usage.

However, researchers are also cautious about the possible negative impact of ChatGPT. Some of the more prominent concerns are related to bias [65], [66], ethics [67], [68], plagiarism [69], [70] and job replacement en masse [71], [72]. In response, a commentary published in Nature advocates for urgent attention to accountability, open-source large language models and societal embrace of AI [65].

3 DATASET

In this work, we use clinical natural language processing (clinical NLP) as the task and carry out our experiments on two popular public benchmarks. Data augmentation is particularly in demand in clinical NLP, because the significant burden of expert annotation and stringent privacy regulations make large-scale data labeling infeasible. We describe these datasets in detail in the following sections.

3.1 Symptoms Dataset

This dataset is published on Kaggle¹. It contains over 8 hours of audio recordings of common medical symptom descriptions. We use the text transcripts corresponding to the audio data and perform sample de-duplication. The dataset after preprocessing includes 231 samples of 7 symptom categories.

1. https://www.kaggle.com/datasets/paultimothymooney/medical-speech-transcription-and-intent
3.2 PubMed20k Dataset

PubMed20K is a widely used dataset in natural language processing (NLP) and text mining research. It consists of approximately 20,000 scientific abstracts from the biomedical domain that have been annotated with task-specific labels, such as named entities (e.g., genes, diseases, chemicals), relations between entities, and other semantic roles. The dataset has been used for developing and evaluating machine learning models for various NLP tasks, such as named entity recognition, relation extraction, and text classification.

PubMed20K is constructed based on the PubMed database, which is a large collection of biomedical literature maintained by the US National Library of Medicine. The abstracts in PubMed20K cover a wide range of topics in biomedicine, including genomics, pharmacology, and clinical medicine. Due to its size, diversity, and high-quality annotations, PubMed20K has become a popular benchmark dataset for evaluating the performance of machine learning models in biomedical NLP [73].

4 METHOD

4.1 Overall Framework

Consider a base dataset D_b = {(x_i, y_i)}_{i=1}^{N_b} with a label space y_i ∈ Y_b, and a novel dataset D_n = {(x_j, y_j)}_{j=1}^{N_n} with a label space y_j ∈ Y_n, where Y_b ∩ Y_n = ∅. In the few-shot classification scenario, the base dataset D_b has a relatively larger set of labeled samples, while the novel dataset D_n has only a few labeled samples. The performance of few-shot learning is evaluated on the novel dataset. Our goal is to train a model with both the base and the limited novel datasets, while achieving satisfying generalizability on the novel dataset.

The overall framework of ChatAug is shown in Fig. 1, and the training steps are shown in Algorithm 1. First of all, we fine-tune BERT on D_b. Then, D_n^aug is generated by data augmentation with ChatGPT. Finally, we fine-tune BERT with D′ = D_n ∪ D_n^aug.

Algorithm 1 The framework of ChatAug for few-shot text classification.
Input: base dataset D_b and novel dataset D_n
Initialize: pre-trained BERT model
Definition: D′ is the union of the novel dataset D_n and the augmented dataset D_n^aug; chatGPT_aug is the data augmentation method based on ChatGPT
Parameters: fine-tuning epochs on the base dataset epoch_b; fine-tuning epochs of FSL epoch_f
for epoch in epoch_b do
    train(model, D_b)
end for
D_n^aug = chatGPT_aug(D_n)
D′ = D_n ∪ D_n^aug
for epoch in epoch_f do
    train(model, D′)
end for
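For readers who prefer code, the control flow of Algorithm 1 can be rendered as the following Python skeleton; train_one_epoch and chatgpt_aug are hypothetical callables standing in for the BERT training loop and the ChatGPT-based augmentation described in section 4.2.

```python
# A schematic Python rendering of Algorithm 1. `train_one_epoch` and
# `chatgpt_aug` are hypothetical stand-ins for the paper's actual
# training loop and ChatGPT-based augmentation.
from typing import Callable, List, Tuple

Sample = Tuple[str, int]  # (text, label)

def chataug_pipeline(model,
                     base_dataset: List[Sample],
                     novel_dataset: List[Sample],
                     epochs_base: int,
                     epochs_fsl: int,
                     train_one_epoch: Callable,
                     chatgpt_aug: Callable[[List[Sample]], List[Sample]]):
    # Stage 1: fine-tune the pre-trained BERT model on the base dataset D_b.
    for _ in range(epochs_base):
        train_one_epoch(model, base_dataset)

    # Stage 2: augment the novel dataset with ChatGPT, then fine-tune on
    # D' = D_n ∪ D_n^aug.
    d_prime = novel_dataset + chatgpt_aug(novel_dataset)
    for _ in range(epochs_fsl):
        train_one_epoch(model, d_prime)
    return model
```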
4.2 Data Augmentation with ChatGPT

Similar to GPT [46], GPT-2 [74], and GPT-3 [8], ChatGPT belongs to the family of autoregressive language models and uses transformer decoder blocks [75] as the model backbone.

During pre-training, ChatGPT is regarded as an unsupervised distribution estimation from a set of samples X = {x_1, x_2, ..., x_n}, where a sample x_i composed of m tokens is defined as x_i = (s_1, s_2, ..., s_m). The objective of pre-training is to maximize the following likelihood:

    L(x_i) = \sum_{i=1}^{m} \log P(s_i \mid s_1, \ldots, s_{i-1}; \theta)    (1)

where \theta represents the trainable parameters of ChatGPT. The tokens are represented by token embeddings and position embeddings:

    h_0 = x_i W_e + W_p    (2)

where W_e is the token embedding matrix and W_p is the position embedding matrix. Then N transformer blocks are used to extract the features of the sample:

    h_n = \mathrm{transformer\_blocks}(h_{n-1})    (3)

where n ∈ [1, N]. Finally, the target token is predicted:

    s_i = \mathrm{softmax}(h_N W_e^T)    (4)

where h_N is the output of the top transformer block.

After pre-training, the developers of ChatGPT apply Reinforcement Learning from Human Feedback (RLHF) [21] to fine-tune the pre-trained language model. RLHF aligns language models with user intent on a wide range of tasks by fine-tuning them according to human feedback. The RLHF of ChatGPT contains three steps:

Supervised Fine-tuning (SFT): Unlike GPT, GPT-2, and GPT-3, ChatGPT uses labeled data for further training. AI trainers, acting as both users and AI assistants, write answers to prompts. These prompt-answer pairs serve as supervised data for further training the pre-trained model, which yields the SFT model.

Reward Modeling (RM): Based on the SFT model, a reward model is trained that takes a prompt and a response as input and outputs a scalar reward. Labelers rank the outputs from best to worst to build a ranking dataset. The loss function between two outputs is defined as follows:

    \mathrm{loss}(\theta_r) = -E_{(x, y_w, y_l) \sim D_c} [\log (\sigma (r_{\theta_r}(x, y_w) - r_{\theta_r}(x, y_l)))]    (5)

where \theta_r denotes the parameters of the reward model; x is the prompt; y_w is the preferred completion out of the pair of y_w and y_l; and D_c is the dataset of human comparisons.

Reinforcement Learning (RL): Using the reward model, ChatGPT can be fine-tuned with Proximal Policy Optimization (PPO) [76]. In order to fix the performance regressions on public NLP datasets, RLHF mixes the pretraining gradients into the PPO gradients, which is also known as PPO-ptx:

    \mathrm{objective}(\phi) = \gamma E_{x \sim D_{\mathrm{pretrain}}} [\log \pi_\phi^{RL}(x)] + E_{(x,y) \sim D_{\pi_\phi^{RL}}} [r_{\theta_r}(x, y) - \beta \log (\pi_\phi^{RL}(y \mid x) / \pi^{SFT}(y \mid x))]    (6)

where \pi_\phi^{RL} is the learned RL policy, \pi^{SFT} is the supervised trained model, and D_{\mathrm{pretrain}} is the pretraining distribution. \gamma is the pre-training loss coefficient that controls the strength of the pre-training gradients, and \beta is the KL (Kullback-Leibler) reward coefficient that controls the strength of the KL penalty.

Compared with previous data augmentation methods, ChatGPT is better suited for data augmentation for the following reasons:

• ChatGPT is pre-trained on a large-scale corpus, so it has a broader semantic expression space, which helps enhance the diversity of the augmented data.
• Since the fine-tuning stage of ChatGPT introduces a large number of manually annotated samples, the language generated by ChatGPT is more in line with human expression habits.
• Through reinforcement learning, ChatGPT can compare the advantages and disadvantages of different expressions, which ensures that the augmented data are of higher quality.

Under the BERT framework, we introduce ChatGPT as the data augmentation tool for few-shot text classification. Specifically, ChatGPT is applied to rephrase each input sentence into six additional sentences, thereby augmenting the few-shot samples.
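A minimal sketch of this augmentation step is shown below, assuming the openai Python client; the model name and the prompt wording are illustrative assumptions, since the exact prompt is not reproduced in this section.

```python
# A hedged sketch of the ChatGPT augmentation step: each training sentence
# is rephrased into six alternatives. The model name and prompt wording are
# illustrative assumptions, not the paper's exact prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rephrase_six(sentence: str) -> list:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": ("Rephrase the following sentence into six sentences "
                        "that keep the same meaning, one per line:\n" + sentence),
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    # Keep non-empty lines; in practice the output format should be validated.
    return [ln.strip() for ln in lines if ln.strip()][:6]
```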

4.3 Few-shot Text Classification

We apply BERT [77] to train a few-shot text classification model. The output features of the top layer of BERT can be written as:

    z = [z_c, z_1, z_2, \ldots, z_n]    (7)

where z_c is the representation of the special classification token [CLS]. For text classification, z_c is usually fed into a task-specific classifier head for the final prediction. However, in the FSL scenario, it is difficult to achieve satisfactory performance by simply fine-tuning BERT, because few-shot samples easily lead to over-fitting and a lack of generalization ability.

To effectively address the challenge of few-shot text classification, many approaches have been proposed. Generally, there are four categories of methods for few-shot text classification based on large language models: meta-learning, prompt-tuning, model design, and data augmentation. Meta-learning refers to the process of learning to learn with tasks that update meta-parameters [4], [7]. Prompt-based methods guide large language models to predict correct results by designing templates [8], [9], [10], [11]. Model design methods guide the model to learn from few-shot samples by changing the structure of the model [78]. Data augmentation uses similar characters [22], similar word semantics [30], [31], or knowledge bases [54], [79] to expand the samples. Our method directly performs data augmentation through the language capabilities of large language models, which is a simple and efficient approach.

Objective Function: Our objective function for few-shot learning consists of two parts: a cross-entropy loss and a contrastive learning loss. We feed z_c into a fully connected layer as the classifier for the final prediction:

    \hat{y} = W_c^T z_c + b_c    (8)

where W_c and b_c are trainable parameters, and we take the cross-entropy as one of the objective functions:

    L_{CE} = -\sum_{d \in D'} \sum_{c=1}^{C} y_{dc} \ln \hat{y}_{dc}    (9)

where C is the output dimension, which is equal to the size of the union of the label spaces of the base dataset and the novel dataset, and y_d is the ground truth.

Then, to make full use of the prior knowledge in the base dataset to guide the learning of the novel dataset, we introduce a contrastive loss function to make the sample representations of the same category more compact, and the sample representations of different categories more separated. The contrastive loss between pairs of samples in the same batch is defined as follows:

    L_{CL} = -\log \frac{\sum e^{\cos(v_i, v_i')}}{\sum e^{\cos(v_i, v_i')} + \sum e^{\cos(v_i, v_j)}}    (10)

where v_i and v_i' are the z_c of samples that belong to the same category; v_i and v_j are the z_c of samples that belong to different categories; and cos(·,·) is the cosine similarity.

In the BERT fine-tuning stage on the base dataset, we only use the cross-entropy as the objective function. In the few-shot learning stage, we combine the cross-entropy and the contrastive learning loss as the objective function:

    L = L_{CE} + \lambda L_{CL}    (11)
4.4 Baseline Methods

In the experiment section, we compare our method with other popular data augmentation methods. For these methods, we use the implementations in open-source libraries including nlpaug [80] and textattack [81].

• InsertCharAugmentation. This method inserts random characters at random locations in the text, which improves the generalization ability of the model by injecting noise into the data.
• SubstituteCharAugmentation. This method randomly replaces selected characters with other ones.
• SwapCharAugmentation [22]. This method randomly exchanges two characters.
• DeleteCharAugmentation. This method randomly deletes characters.
• OCRAugmentation. This method simulates possible errors during OCR recognition. For example, an OCR tool may wrongly identify "0" as "o", or wrongly identify "I" as "l".
• SpellingAugmentation [23]. This method creates new text by deliberately misspelling some words. It uses a list of English words that are most likely to be misspelled, provided by the Oxford Dictionary, for example, misspelling "because" as "becouse".
• KeyboardAugmentation [22]. This method simulates typo errors by replacing randomly selected characters with adjacent characters on the QWERTY layout keyboard, for example, replacing 'g' with 'r', 't', 'y', 'f', 'h', 'v', 'b' or 'n'.
• SwapWordAug [24]. This method randomly exchanges words in the text. It is a sub-method of Easy Data Augmentation (EDA) proposed by Wei et al.
• DeleteWordAug. This method randomly deletes words in the text, and is also a sub-method of EDA.
• PPDBSynonymAug [26]. This method replaces words with their synonyms in the PPDB thesaurus. Synonym replacement can ensure semantic consistency and is suitable for classification tasks.
• WordNetSynonymAug. This method replaces words with their synonyms in the WordNet thesaurus.
• SubstituteWordByGoogleNewsEmbeddings [28]. This method replaces words with their top-n similar words in the embedding space. The word embeddings used are pre-trained on the GoogleNews corpus.
• InsertWordByGoogleNewsEmbeddings [80]. This method randomly selects a word from the vocabulary of the GoogleNews corpus and inserts it at a random position in the text.
• CounterFittedEmbeddingAug [30], [31]. This method replaces words with their neighbors in the counter-fitting embedding space. Compared with the GoogleNews word vectors used by SubstituteWordByGoogleNewsEmbeddings, counter-fitting embeddings introduce the constraints of synonyms and antonyms: the embeddings of synonyms are pulled closer, and vice versa.
• ContextualWordAugUsingBert(Insert) [32], [33]. This method uses BERT to insert words based on context, that is, it adds a <mask> token at a random position of the input text and then lets BERT predict the token at that position.
• ContextualWordAugUsingDistilBERT(Insert). This method uses DistilBERT instead of BERT for prediction; the rest is the same as ContextualWordAugUsingBert(Insert).
• ContextualWordAugUsingRoBERTA(Insert). This method uses RoBERTa instead of BERT for prediction; the rest is the same as ContextualWordAugUsingBert(Insert).
• ContextualWordAugUsingBert(Substitute) [32], [33]. This method uses BERT to replace words based on context, that is, it replaces randomly selected words in the text with <mask> tokens and then lets BERT predict the tokens at those positions.
• ContextualWordAugUsingDistilBERT(Substitute). This method uses DistilBERT instead of BERT for prediction; the rest is the same as ContextualWordAugUsingBert(Substitute).
• ContextualWordAugUsingRoBERTA(Substitute). This method uses RoBERTa instead of BERT for prediction; the rest is the same as ContextualWordAugUsingBert(Substitute).
• BackTranslationAug [37]. This method translates the text into German and then back into English, resulting in a new text that is different from the original but has the same semantics. We use the wmt19-en-de and facebook/wmt19-de-en language translation models [82] developed by Facebook for translation.

4.5 Evaluation Metrics

We employed cosine similarity and TransRate [83] as metrics to assess the completeness (i.e., whether features contain sufficient information about a target task) and compactness (i.e., whether features of each class are compact enough for good generalization) of our augmented data.

4.5.1 Embedding Similarity

To evaluate the semantic similarity between the samples generated by data augmentation methods and actual samples, we adopt the embedding similarity between the generated samples and the actual samples of the test dataset. Some of the most common similarity metrics include Euclidean distance, cosine similarity and dot product similarity. In this study, we select cosine similarity to capture the distance relationship in the latent space. The cosine similarity measures the cosine of the angle between two vectors; this value increases when two vectors are more similar, and is bounded between -1 and 1. We input each sample into pre-trained BERT and use the representation of the [CLS] token as the sample embedding. The cosine similarity metric is commonly used in NLP [84] and we follow this convention.

    \cos(\theta) = \frac{A \cdot B}{\|A\|_2 \|B\|_2}    (12)

where A and B denote the two embedding vectors in comparison, respectively.
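The sketch below shows how this metric can be computed, assuming the Hugging Face transformers library: each sample is embedded with the [CLS] representation of pre-trained BERT and compared via Eq. (12).

```python
# A minimal sketch of the embedding-similarity metric in Eq. (12), using the
# [CLS] token of pre-trained BERT as the sample embedding; assumes the
# Hugging Face `transformers` library.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def cls_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[0, 0]  # [CLS] token representation

def cosine_similarity(a: str, b: str) -> float:
    u, v = cls_embedding(a), cls_embedding(b)
    return float(torch.dot(u, v) / (u.norm() * v.norm()))

print(cosine_similarity("The follow-up rate after 5 years was 85%.",
                        "The 5-year follow-up rate was 85%."))
```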

TABLE 2
Data Augmentation Ablation Study on Symptoms

Data Augmentation    BERT    BERT+Contrastive


Raw 0.636 0.606
BackTranslationAug 0.778 0.747
ContextualWordAugUsingBert(Insert) 0.697 0.677
ContextualWordAugUsingBert(Substitute) 0.626 0.667
ContextualWordAugUsingDistilBERT(Insert) 0.707 0.747
ContextualWordAugUsingDistilBERT(Substitute) 0.667 0.646
ContextualWordAugUsingRoBERTA(Insert) 0.758 0.707
ContextualWordAugUsingRoBERTA(Substitute) 0.727 0.667
CounterFittedEmbeddingAug 0.667 0.626
InsertCharAugmentation 0.404 0.475
InsertWordByGoogleNewsEmbeddings 0.636 0.677
KeyboardAugmentation 0.545 0.505
OCRAugmentation 0.768 0.778
PPDBSynonymAug 0.697 0.758
SpellingAugmentation 0.697 0.707
SubstituteCharAugmentation 0.535 0.586
SubstituteWordByGoogleNewsEmbeddings 0.727 0.727
SwapCharAugmentation 0.475 0.485
SwapWordAug 0.687 0.727
WordNetSynonymAug 0.616 0.758
ChatAug 0.889 0.899

TABLE 3
Data Augmentation Ablation Study on PubMed20K

Data Augmentation    BERT    BERT+Contrastive


Raw 0.792 0.798
BackTranslationAug 0.812 0.830
ContextualWordAugUsingBert(Insert) 0.802 0.811
ContextualWordAugUsingBert(Substitute) 0.815 0.830
ContextualWordAugUsingDistilBERT(Insert) 0.796 0.796
ContextualWordAugUsingDistilBERT(Substitute) 0.797 0.800
ContextualWordAugUsingRoBERTA(Insert) 0.815 0.814
ContextualWordAugUsingRoBERTA(Substitute) 0.782 0.782
CounterFittedEmbeddingAug 0.805 0.805
InsertCharAugmentation 0.826 0.831
InsertWordByGoogleNewsEmbeddings 0.786 0.784
KeyboardAugmentation 0.809 0.815
OCRAugmentation 0.789 0.789
PPDBSynonymAug 0.795 0.829
SpellingAugmentation 0.808 0.811
SubstituteCharAugmentation 0.816 0.821
SubstituteWordByGoogleNewsEmbeddings 0.807 0.822
SwapCharAugmentation 0.797 0.801
SwapWordAug 0.798 0.794
WordNetSynonymAug 0.761 0.757
ChatAug 0.835 0.835

4.5.2 TransRate

TransRate is a metric that quantifies transferability based on the mutual information between the features extracted by a pre-trained model and their labels, with a single pass through the target data. The metric achieves a minimum value when the data covariance matrices of all classes are identical, making it impossible to distinguish between the data from different classes and preventing any classifier from achieving better than random guessing. Thus, a higher TransRate could indicate better learnability of the data. More specifically, knowledge transfer from a source task T_s to a target task T_t is measured as shown below:

    \mathrm{TrR}_{T_s \to T_t}(g) = H(Z) - H(Z \mid Y)    (13)

where Y represents the labels of the augmented examples, Z denotes the latent embedding features extracted by the pre-trained feature extractor g, TrR is the TransRate value, and H(·) denotes the Shannon entropy [85].
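In practice, the entropies in Eq. (13) are approximated rather than computed exactly. The numpy sketch below follows the coding-rate approximation proposed in the TransRate paper [83]; the distortion parameter eps and the feature-centering step are implementation details assumed here.

```python
# A hedged numpy sketch of TransRate via the coding-rate approximation of the
# entropies used in the TransRate paper [83]; `eps` and the centering step
# are assumed implementation details, not taken from this paper.
import numpy as np

def coding_rate(z: np.ndarray, eps: float = 1e-4) -> float:
    """Rate-distortion proxy for H(Z); z has shape [n_samples, dim]."""
    n, d = z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps ** 2)) * z.T @ z)
    return 0.5 * logdet

def transrate(z: np.ndarray, y: np.ndarray, eps: float = 1e-4) -> float:
    z = z - z.mean(axis=0, keepdims=True)       # center the features
    h_z = coding_rate(z, eps)                   # proxy for H(Z)
    h_z_given_y = sum(coding_rate(z[y == c], eps) * np.mean(y == c)
                      for c in np.unique(y))    # proxy for H(Z|Y)
    return h_z - h_z_given_y                    # Eq. (13)
```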
5 EXPERIMENT RESULTS

In our experiments, we use BERT as the base model. First, we train our model on the base dataset to get the pretrained model. Then we fine-tune the model with the few-shot samples, where we employ different data augmentation methods to generate the augmented samples. We feed those samples into the BERT model to fine-tune the pretrained models. To evaluate the effectiveness of different data augmentation methods, we apply two different settings. The first one is the bare BERT model. In the second setting, we add the contrastive loss during training. In our experiments on the Symptoms dataset, we use a batch size of 8 for 150 epochs, set the maximum sequence length to 25 and λ to 1, and use a learning rate of 4e-5.

(a) Symptoms

(b) PubMed20K

Fig. 2. We employed two evaluation metrics to assess the completeness and compactness of our newly augmented data. The top left plot shows the cosine similarity metric and the final accuracy of all data augmentation methods on the Symptoms dataset; the top right plot shows the TransRate metric and the final accuracy of all data augmentation methods on the Symptoms dataset. The bottom panel plots the cosine similarity and TransRate values of all data augmentation methods on the PubMed20K dataset. The legend on the right lists all augmentation methods with distinct colors and shapes.

Similarly, in our experiments on the PubMed20K dataset, we adopt the same training configuration, with the maximum sequence length set to 40.

5.1 Classification Performance Comparison

Table 2 and Table 3 show that ChatAug achieves the highest accuracy on both the Symptoms and PubMed20K datasets. On the PubMed20K dataset, ChatAug achieves an accuracy of 83.5% for both BERT and BERT with the contrastive loss, whereas without data augmentation the accuracy is only 79.2% and 79.8%, respectively. On the Symptoms dataset, the accuracy of BERT without data augmentation is only 63.6%, and 60.6% with the contrastive loss. Our ChatAug approach significantly improves the accuracy to 88.9% and 89.9%, respectively. These results suggest that data augmentation using ChatGPT is more effective for enhancing the performance of machine learning models in various applications.

5.2 Evaluation of Augmented Datasets

In this section, we evaluate the performance of our augmented data in the latent space and visualize the results in Fig. 2. Latent embeddings are evaluated using the cosine similarity and the TransRate metric (see section 4.5 for more details). The horizontal axis represents the cosine similarity values and TransRate values, and the vertical axis describes the classification accuracy. Since embedding similarity measures the similarity between the augmented data and the test dataset, higher similarity means that the augmented data better match the real data, with higher completeness and compactness. As a higher TransRate could indicate better learnability of the data, a higher TransRate also means that the augmented data are of higher quality. The most ideal candidate method should be positioned at the top-right of the visualization. As shown in Fig. 2, ChatAug produces high-quality samples in terms of both completeness and compactness on both the Symptoms dataset and the PubMed20K dataset.

6 CONCLUSION AND DISCUSSION

In this paper, we proposed a novel data augmentation approach for few-shot classification. Unlike other methods, our model expands the limited data at the semantic level to enhance data consistency and robustness, which results in a better-performing trained model.

Although ChatAug has shown promising results in data augmentation, it has certain limitations. For example, in recognizing and expanding medical texts, it may produce incorrect augmented data due to the lack of domain knowledge. In future research, we may fine-tune the original model first and then perform data augmentation to address this issue.

The proposed ChatAug method has shown promising results in text classification. A promising direction for future research is to investigate the effectiveness of ChatAug on a wider range of downstream tasks. For example, given the strong ability of ChatGPT to extract key points and understand sentences, we can foresee potential promising results in text summarization.

Specifically, ChatGPT might be valuable for domain-specific science paper summarization [86] and clinical report summarization [87]. Publicly available domain-specific science paper summarization datasets and clinical report datasets are rare and are often provided at small scales due to privacy concerns and the need for expert knowledge to generate annotated summaries. However, ChatGPT could address this challenge by generating diverse augmented summarization samples in different representation styles. The data generated from ChatGPT are typically concise, which can be valuable for further enhancing the generalization capabilities of the trained model.

The dramatic rise of generative image models such as DALLE2 [88] and Stable Diffusion [89] provides opportunities for applying ChatAug to few-shot learning tasks in computer vision. For example, accurate language descriptions may be used to guide the generative model to generate images from text, or to generate new images based on existing images, as a data augmentation method for few-shot learning tasks, especially when combined with efficient fine-tuning methods [90], [91] such as LoRA for Stable Diffusion. Thus, prior knowledge from a large language model can facilitate faster domain adaptation and better few-shot learning of generative models in computer vision.

Recent research shows that large language models (LLMs), such as GPT-3 and ChatGPT, are capable of solving Theory of Mind (ToM) tasks, which were previously thought to be unique to humans [92]. While the ToM-like capabilities of LLMs may be an unintended byproduct of improved performance, the underlying connection between cognitive science and the human brain is an area ripe for exploration. Advancements in cognitive and brain science can also be used to inspire and optimize the design of LLMs. For example, it has been suggested that the activation patterns of the neurons in the BERT model and those in the human brain networks may share similarities and could be coupled together [93]. This presents a promising new direction for developing LLMs by utilizing prior knowledge from brain science. As researchers continue to investigate the connections between LLMs and the human brain, we may discover new means to enhance the performance and capabilities of AI systems, leading to exciting breakthroughs in the field.

REFERENCES

[1] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, "Generalizing from a few examples: A survey on few-shot learning," ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
[2] M. Yang, "A survey on few-shot learning in natural language processing," in 2021 International Conference on Artificial Intelligence and Electromechanical Automation (AIEA). IEEE, 2021, pp. 294–297.
[3] S. Sun, Q. Sun, K. Zhou, and T. Lv, "Hierarchical attention prototypical networks for few-shot text classification," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 476–485.
[4] W. Yin, "Meta-learning for few-shot natural language processing: A survey," arXiv preprint arXiv:2007.09604, 2020.
[5] C. Wang, J. Wang, M. Qiu, J. Huang, and M. Gao, "TransPrompt: Towards an automatic transferable prompting framework for few-shot text classification," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 2792–2802.
[6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[7] H.-y. Lee, S.-W. Li, and N. T. Vu, "Meta learning for natural language processing: A survey," arXiv preprint arXiv:2205.01500, 2022.
[8] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[9] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3045–3059.
[10] X. Han, W. Zhao, N. Ding, Z. Liu, and M. Sun, "PTR: Prompt tuning with rules for text classification," AI Open, vol. 3, pp. 182–192, 2022.
[11] J. Wang, C. Wang, F. Luo, C. Tan, M. Qiu, F. Yang, Q. Shi, S. Huang, and M. Gao, "Towards unified prompt tuning for few-shot text classification," arXiv preprint arXiv:2205.05313, 2022.
[12] J. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," arXiv preprint arXiv:1901.11196, 2019.
[13] V. Kumar, H. Glaude, C. de Lichy, and W. Campbell, "A closer look at feature space data augmentation for few-shot intent classification," in Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), 2019, pp. 1–10.
[14] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy, "A survey of data augmentation approaches for NLP," arXiv preprint arXiv:2105.03075, 2021.
[15] R. Sennrich, B. Haddow, and A. Birch, "Improving neural machine translation models with monolingual data," arXiv preprint arXiv:1511.06709, 2015.
[16] A. Jindal, A. G. Chowdhury, A. Didolkar, D. Jin, R. Sawhney, and R. Shah, "Augmenting NLP models using latent feature interpolations," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 6931–6936.
[17] C. Shorten, T. M. Khoshgoftaar, and B. Furht, "Text data augmentation for deep learning," Journal of Big Data, vol. 8, pp. 1–34, 2021.
[18] M. Bayer, M.-A. Kaufhold, and C. Reuter, "A survey on data augmentation for text classification," ACM Computing Surveys, vol. 55, no. 7, pp. 1–39, 2022.
[19] B. Min, H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heinz, and D. Roth, "Recent advances in natural language processing via large pre-trained language models: A survey," arXiv preprint arXiv:2111.01243, 2021.
[20] Z. Liu, M. He, Z. Jiang, Z. Wu, H. Dai, L. Zhang, S. Luo, T. Han, X. Li, X. Jiang et al., "Survey on natural language processing in medical image analysis," Zhong nan da xue xue bao. Yi xue ban = Journal of Central South University. Medical Sciences, vol. 47, no. 8, pp. 981–993, 2022.
[21] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Gray et al., "Training language models to follow instructions with human feedback," in Advances in Neural Information Processing Systems, 2022.
[22] Y. Belinkov and Y. Bisk, "Synthetic and natural noise both break neural machine translation," arXiv preprint arXiv:1711.02173, 2017.
[23] C. Coulombe, "Text data augmentation made simple by leveraging NLP cloud APIs," Dec. 2018. [Online]. Available: http://arxiv.org/abs/1812.04718
[24] J. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, Nov. 2019, pp. 6382–6388. [Online]. Available: https://aclanthology.org/D19-1670
[25] E. Pavlick, P. Rastogi, J. Ganitkevitch, B. Van Durme, and C. Callison-Burch, "PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification," in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2015, pp. 425–430.
[26] T. Niu and M. Bansal, "Adversarial over-sensitivity and over-stability strategies for dialogue models," in Proceedings of the 22nd Conference on Computational Natural Language Learning. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 486–496. [Online]. Available: http://aclweb.org/anthology/K18-1047

[27] G. A. Miller, "WordNet: A lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[28] W. Y. Wang and D. Yang, "That's so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2557–2563.
[29] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, vol. 26, 2013.
[30] N. Mrkšić, D. Ó Séaghdha, B. Thomson, M. Gašić, L. M. Rojas-Barahona, P.-H. Su, D. Vandyke, T.-H. Wen, and S. Young, "Counter-fitting word vectors to linguistic constraints," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. San Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 142–148. [Online]. Available: https://aclanthology.org/N16-1018
[31] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, and K.-W. Chang, "Generating natural language adversarial examples," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 2890–2896. [Online]. Available: http://aclweb.org/anthology/D18-1316
[32] S. Kobayashi, "Contextual augmentation: Data augmentation by words with paradigmatic relations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). New Orleans, Louisiana: Association for Computational Linguistics, Jun. 2018, pp. 452–457. [Online]. Available: https://aclanthology.org/N18-2072
[33] V. Kumar, A. Choudhary, and E. Cho, "Data augmentation using pre-trained transformer models," arXiv preprint arXiv:2003.02245, 2020.
[34] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," May 2019. [Online]. Available: http://arxiv.org/abs/1810.04805
[35] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," arXiv, vol. abs/1910.01108, 2019.
[36] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[37] R. Sennrich, B. Haddow, and A. Birch, "Improving neural machine translation models with monolingual data," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 86–96. [Online]. Available: https://aclanthology.org/P16-1009
[38] V. Gangal, S. Y. Feng, M. Alikhani, T. Mitamura, and E. Hovy, "NAREOR: The narrative reordering problem," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 10645–10653.
[39] L. Fei-Fei, R. Fergus, and P. Perona, "One-shot learning of object categories," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594–611, 2006.
[40] Y. Ge, Y. Guo, Y.-C. Yang, M. A. Al-Garadi, and A. Sarker, "Few-shot learning for medical text: A systematic review," arXiv preprint arXiv:2204.14081, 2022.
[41] J. Wei, C. Huang, S. Vosoughi, Y. Cheng, and S. Xu, "Few-shot
[44] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," in International Conference on Machine Learning. PMLR, 2017, pp. 1126–1135.
[45] X. Yao, J. Zhu, G. Huo, N. Xu, X. Liu, and C. Zhang, "Model-agnostic multi-stage loss optimization meta learning," International Journal of Machine Learning and Cybernetics, vol. 12, no. 8, pp. 2349–2363, 2021.
[46] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., "Improving language understanding by generative pre-training," 2018.
[47] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," arXiv preprint arXiv:1910.13461, 2019.
[48] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
[49] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., "PaLM: Scaling language modeling with pathways," arXiv preprint arXiv:2204.02311, 2022.
[50] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., "BLOOM: A 176B-parameter open-access multilingual language model," arXiv preprint arXiv:2211.05100, 2022.
[51] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin et al., "OPT: Open pre-trained transformer language models," arXiv preprint arXiv:2205.01068, 2022.
[52] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei et al., "The Flan collection: Designing data and methods for effective instruction tuning," arXiv preprint arXiv:2301.13688, 2023.
[53] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, "Domain-specific language model pretraining for biomedical natural language processing," ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.
[54] S. Rezayi, H. Dai, Z. Liu, Z. Wu, A. Hebbar, A. H. Burns, L. Zhao, D. Zhu, Q. Li, W. Liu et al., "ClinicalRadioBERT: Knowledge-infused few shot learning for clinical notes named entity recognition," in Machine Learning in Medical Imaging: 13th International Workshop, MLMI 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18, 2022, Proceedings. Springer, 2022, pp. 269–278.
[55] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, "Is ChatGPT a general-purpose natural language processing task solver?" arXiv preprint arXiv:2302.06476, 2023.
[56] W. Jiao, W. Wang, J.-t. Huang, X. Wang, and Z. Tu, "Is ChatGPT a good translator? A preliminary study," arXiv preprint arXiv:2301.08745, 2023.
[57] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., "A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity," arXiv preprint arXiv:2302.04023, 2023.
[58] Y. Shen, L. Heacock, J. Elias, K. D. Hentel, B. Reig, G. Shih, and L. Moy, "ChatGPT and other large language models are double-edged swords," p. 230163, 2023.
[59] F. Antaki, S. Touma, D. Milad, J. El-Khoury, and R. Duval, "Evaluating the performance of ChatGPT in ophthalmology: An analysis of its successes and shortcomings," medRxiv, pp. 2023–01, 2023.
[60] T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. De Leon, C. Elepaño, M. Madriaga, R. Aggabao, G. Diaz-Candido, J. Maningo et al., "Performance of ChatGPT on USMLE: Potential for
text classification with triplet networks, data augmentation, and ai-assisted medical education using large language models,” PLOS
curriculum learning,” in Proceedings of the 2021 Conference of the Digital Health, vol. 2, no. 2, p. e0000198, 2023.
North American Chapter of the Association for Computational Linguis- [61] J. V. Pavlik, “Collaborating with chatgpt: Considering the im-
tics: Human Language Technologies, 2021, pp. 5493–5500. plications of generative artificial intelligence for journalism and
[42] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language media education,” Journalism & Mass Communication Educator, p.
models better few-shot learners,” in Proceedings of the 59th Annual 10776958221149577, 2023.
Meeting of the Association for Computational Linguistics and the 11th [62] D. Baidoo-Anu and L. Owusu Ansah, “Education in the era of
International Joint Conference on Natural Language Processing (Volume generative artificial intelligence (ai): Understanding the potential
1: Long Papers), 2021, pp. 3816–3830. benefits of chatgpt in promoting teaching and learning,” Available
[43] A. Antoniou, H. Edwards, and A. Storkey, “How to train your at SSRN 4337484, 2023.
maml,” arXiv preprint arXiv:1810.09502, 2018. [63] S. Frieder, L. Pinchetti, R.-R. Griffiths, T. Salvatori, T. Lukasiewicz,