
Continual Learning of Large Language Models: A Comprehensive Survey

Haizhou Shi (haizhou.shi@rutgers.edu), Zihao Xu, Hengyi Wang, Weiyi Qin (Rutgers University); Wenyuan Wang (WHU); Yibin Wang (HUST, Wuhan); Zifeng Wang, Sayna Ebrahimi (Google Cloud AI Research, Mountain View, California); and Hao Wang (hw488@cs.rutgers.edu, Rutgers University)
Abstract.

The challenge of effectively and efficiently adapting statically pre-trained Large Language Models (LLMs) to ever-evolving data distributions remains predominant. When tailored for specific needs, pre-trained LLMs often experience significant performance degradation in previous knowledge domains – a phenomenon known as “catastrophic forgetting”. While extensively studied in the continual learning (CL) community, this problem presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview and detailed discussion of the current research progress on LLMs within the context of CL. Besides the introduction of the preliminary knowledge, this survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). Following vertical continuity, we summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). This survey sheds light on the relatively understudied domain of continually pre-training, adapting, and fine-tuning large language models, suggesting the necessity for greater attention from the community. Key areas requiring immediate focus include the development of practical and accessible evaluation benchmarks, along with methodologies specifically designed to counter forgetting and enable knowledge transfer within the evolving landscape of LLM learning paradigms. The full list of papers examined in this survey is available at https://github.com/Wang-ML-Lab/llm-continual-learning-survey.

Large Language Models, Continual Learning.
copyright: none; journal: CSUR; ccs: Computing methodologies → Lifelong machine learning; ccs: Computing methodologies → Natural language processing; ccs: Computing methodologies → Neural networks

1. Introduction

Recent advances in large language models (LLMs) have demonstrated considerable potential for achieving artificial general intelligence (AGI) (radford2019language, ; brown2020language, ; achiam2022chatgpt, ; achiam2023gpt, ; chowdhery2023palm, ; anil2023palm, ; touvron2023llama, ; touvron2023llama2, ; jiang2024empowering, ). Researchers have observed that complex abilities such as multi-step reasoning, few-shot in-context learning, and instruction following improve as the scale of parameter size increases (wei2022chain, ; wei2022emergent, ; yao2024tree, ; wei2021finetuned, ; min2022rethinking, ). The development of LLMs is impactful and revolutionary, prompting machine learning practitioners to reconsider traditional computational paradigms for once-challenging human-level tasks such as question answering, machine translation, and dialogue systems (kwok2001scaling, ; bahdanau2014neural, ; deng2023survey, ). However, LLMs are typically trained on static, pre-collected datasets encompassing general domains, leading to gradual performance degradation over time (loureiro2022timelms, ; jang2022towards, ; jin2022lifelong, ; jang2022temporalwiki, ; amba2021dynamic, ; dhingra2022time, ) and across different content domains (gupta2023continual, ; jin2022lifelong, ; ke2022continual-train, ; sun2020ernie, ; cossu2022continual, ; gururangan2022demix, ; qin2023recyclable, ; chen2023lifelong, ; qin2022elle, ). Additionally, a single pre-trained large model cannot meet every user need and requires further fine-tuning (weyssow2023usage, ; winata2023overcoming, ; zheng2023learn, ; winata2023overcoming, ; biderman2023pythia, ; zheng2023learn, ; bai2023enhancing, ; ke2021achieve, ; wei2022circle, ; qin2021lfpt5, ; chen2024parameterizing, ). While one potential solution is re-collecting pre-training data and re-training models with additional specific needs, this approach is prohibitively expensive and impractical in real-world scenarios.

To efficiently adapt LLMs to downstream tasks while minimizing performance degradation on previous knowledge domains, researchers employ the methodology of continual learning, also known as lifelong learning or incremental learning (pentina2016theoretical, ; chen2018lifelong, ; van2022three, ; wang2024comprehensive, ). Continual learning, inspired by the incremental learning pattern observed in human brains (mcclelland1995there, ; kandel2000principles, ; pallier2003brain, ; yang2009stably, ; constantinescu2016organizing, ; olafsdottir2018role, ; liu2019human, ; mccaffary2021towards, ), involves training machine learning models sequentially on a series of tasks with the expectation of maintaining performance across all tasks (kirkpatrick2017overcoming, ; li2017learning, ; zenke2017continual, ; riemer2018learning, ; buzzega2020dark, ; garg2023in, ; ebrahimi2020adversarial, ; ebrahimi2019uncertainty, ). Throughout training, models have limited or no access to previous data, posing a challenge in retaining past knowledge as optimization constraints from unseen previous data are absent during current-task learning (li2017learning, ; smith2023closer, ; hayes2020lifelong, ; lomonaco2020rehearsalfree, ; chaudhry2019tiny, ; riemer2018learning, ; buzzega2020dark, ; shi2024unified, ). This challenge, known as catastrophic forgetting (mccloskey1989catastrophic, ), has been a central focus in continual learning research since its inception. Over the years, researchers have explored various techniques to mitigate forgetting in machine learning models. These include replay-based methods (chaudhry2019tiny, ; schwarz2018progress, ; riemer2018learning, ; buzzega2020dark, ; shi2024unified, ), parameter regularization (kirkpatrick2017overcoming, ; ritter2018online, ; aljundi2018memory, ; sprechmann2018memory, ), and model architecture expansion (ramesh2021model, ; wang2022coscl, ). Together, these techniques have significantly advanced the goal of achieving zero forgetting in continual learning across diverse tasks, model architectures, and learning paradigms.

In the context of training and adapting LLMs sequentially, the significance of CL is undergoing semantic shifts of its own as well. To better highlight this ongoing shift, in this survey, we provide a comprehensive overview and detailed discussion of the current research progress on LLMs within the context of CL. For the general picture of continually learning LLMs, we for the first time divide it into two directions of continuity that need to be addressed by practitioners (Section 3):

  • Vertical continuity (or vertical continual learning), which refers to the ongoing adaptation of LLMs as they transition from large-scale general domains to smaller-scale specific domains, involving shifts in learning objectives and entities of execution. For example, healthcare institutions may develop LLMs tailored to the medical domain while retaining their general reasoning and question answering capabilities for users.

  • Horizontal continuity (or horizontal continual learning), which refers to continual adaptation across time and domains, often entails multiple training stages and increased vulnerability to forgetting. For example, social media platforms continuously update LLMs to reflect recent trends, ensuring accurate targeting of downstream services like advertising and recommendations without compromised experience for existing users.

The explicit separation of vertical and horizontal CL extends beyond a trivial modification of existing CL types, such as domain-incremental learning, which might be considered analogous to horizontal continuity. It offers a robust conceptual framework for analyzing and describing complex learning paradigms in continual LLMs. For example, Recyclable Tuning aims to preserve both vertical and horizontal continuity simultaneously (qin2023recyclable, ), and future designs could include zigzag CL, alternating between horizontal and vertical CL.

In Fig. 1, following vertical continuity, we delineate three key stages of LLM learning within modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). In CPT, existing research primarily investigates three types of distributional shift: temporal, content-level, and language-level, each presenting distinct focuses and challenges. In DAP, although it is primarily seen as a procedure for preparing LLMs for downstream tasks, CL evaluation protocols and techniques are frequently utilized; however, there is a noticeable lack of diversity in these techniques, considering the maturity of the conventional CL community. In CFT, our focus is on the emerging field of learning LLMs, covering topics such as Continual Instruction Tuning (CIT), Continual Model Refinement (CMR), Continual Model Alignment (CMA), and Continual Multimodal LLMs (CMLLMs). Next, we present a compilation of publicly available evaluation protocols and benchmarks (Section 5). We conclude our survey with a discussion covering emergent properties of continual LLMs, changes in the roles of conventional CL types and memory constraints within the context of continual LLMs, and prospective research directions for this subject (Section 6).

In summary, this paper provides a comprehensive and detailed view of existing continual learning studies for LLMs, which significantly distinguishes it from existing literature on related topics (biesialska2020continual, ; ke2023continual, ; wang2024comprehensive, ; wu2024continual, ; yang2024recent, ). Our survey highlights the underexplored research area of continually developing LLMs, especially in the fields of continual pre-training (CPT) and domain-adaptive pre-training (DAP). We emphasize the need for increased attention from the community, with urgent needs including the development of practical, accessible, and widely acknowledged evaluation benchmarks. Additionally, methodologies need to be tailored to address forgetting in emerging large language model learning paradigms. We hope this survey can provide a systematic and novel view of continual learning in the rapidly changing field of LLMs and can help the continual learning community contribute to the challenging goal of developing LLMs in a more efficient, reliable, and sustainable manner (jang2022temporalwiki, ; su2023efficient, ; xie2023efficient, ; Cao2023InstructMol, ; attanasio2023worth, ).

Organization. The rest of this paper is organized as follows. We first introduce the background and preliminaries of large language models and continual learning in Section 2. We then present an overview of continual learning in the modern era of large language models in Section 3. Vertically, it can be roughly divided into three stages of continually training LLMs, and we survey each stage in turn in Section 4. In Section 4.3, the unique aspects of continually fine-tuning LLMs are introduced, including continual instruction tuning (Section 4.3.3), continual model refinement (Section 4.3.4), continual model alignment (Section 4.3.5), and continual multimodal large language models (Section 4.3.6). In Section 5, we give an inclusive introduction to the publicly available evaluation protocols and benchmarks of continual learning for LLMs. Finally, in Section 6, we present a series of discussions on the role of continual learning in the era of large language models, including emergent abilities in large-scale continual LLMs (Section 6.1), the three types of continual learning (Section 6.2), the roles of memory in continual learning of LLMs (Section 6.3), and prospective future directions (Section 6.4).

2. Preliminaries

In this section, we provide an overview of the fundamental concepts of large language models (LLMs) and continual learning (CL). We begin by introducing the notation used in this paper. Subsequently, we discuss the pre-training and downstream adaptation of LLMs, as well as mainstream LLM families (Section 2.1), followed by an introduction to basic continual learning techniques studied by the community (Section 2.2).

Notation. We denote scalars with lowercase letters, vectors with lowercase boldface letters, and matrices with uppercase boldface letters. The $l_2$-norm of a vector and the Frobenius norm of a matrix are both represented by $\|\cdot\|_2$. For a vector ${\bm{v}}=[v_1,v_2,\cdots,v_n]^{\top}$, $\|{\bm{v}}\|_2=(\sum_{i=1}^{n}v_i^2)^{1/2}$; for a matrix ${\bm{A}}\in\mathbb{R}^{m\times n}$, $\|{\bm{A}}\|_2=(\sum_{ij}A_{ij}^2)^{1/2}$. We use $\epsilon_{{\mathcal{D}}}$ and ${\mathcal{L}}_{{\mathcal{D}}}$ to denote the error function and the loss function deployed for training, respectively, where the subscript indicates that the error/loss is measured in expectation over the data distribution ${\mathcal{D}}$. We further use $\widehat{{\mathcal{L}}}_{S}$ to represent the empirical evaluation of the loss function ${\mathcal{L}}$ over the set of examples $S$. Probability and expectation are denoted by $P$ and $\mathbb{E}$, respectively. We use $[m]$ to denote the set of positive integers up to $m$, i.e., $\{1,\cdots,m\}$.

2.1. Large Language Models

In the past two decades, neural language modeling has emerged as a dominant field of deep learning, marked by significant and rapid advancements. Primarily built on the transformer architecture, pre-trained language models (PLMs) like BERT have established a universal hidden embedding space through extensive pre-training on large-scale unlabeled text corpora. Following the pre-training and fine-tuning paradigm, PLMs exhibit promising performance across various natural language processing tasks after being fine-tuned on small amounts of task-specific data (devlin2018bert, ; liu2019roberta, ; raffel2020exploring, ). Research on scaling laws indicates that increasing model size enhances the capacity of language models (kaplan2020scaling, ; hoffmann2022training, ). By scaling parameters to billions or even hundreds of billions and training on massive text datasets, PLMs not only demonstrate superior language understanding and generation capabilities but also manifest emergent abilities such as in-context learning, instruction following, and multi-step reasoning, which are absent in small-scale language models like BERT (wei2022chain, ; wei2022emergent, ; yao2024tree, ; wei2021finetuned, ; min2022rethinking, ). These larger models are commonly referred to as Large Language Models (LLMs).

2.1.1. Pre-Training of LLMs

Pre-training is essential for language models to acquire broad language representations. Decoder-only models typically employ the language modeling (LM) task during pre-training; LM in this context specifically refers to auto-regressive LM. Given a sequence of tokens ${\bm{x}}=[x_1,x_2,\cdots,x_N]$, LM predicts the next token $x_t$ autoregressively based on all preceding tokens ${\bm{x}}_{<t}=[x_1,x_2,\cdots,x_{t-1}]$, and trains the entire network by minimizing the negative log-likelihood:

(1) $\mathcal{L}_{\rm LM}({\bm{x}}) \triangleq -\sum_{t=1}^{N}\log P(x_t\,|\,{\bm{x}}_{<t}),$

where $P(x_1|{\bm{x}}_{<1})\triangleq P(x_1)$ is the unconditional probability estimate of the first token. The three most popular families of decoder-only models are GPT, PaLM, and LLaMA. The GPT family, developed by OpenAI, includes models such as GPT-2 (radford2019language, ), GPT-3 (brown2020language, ), ChatGPT (achiam2022chatgpt, ), and GPT-4 (achiam2023gpt, ). Notably, GPT-3 was the first LLM to exhibit emergent abilities not found in smaller PLMs. Another notable family, Gemini, developed by Google, is comparable to the GPT family (team2023gemini, ; reid2024gemini, ). While both the GPT and Gemini families are closed-source, LLaMA, released by Meta, is currently the most popular open-source family of LLMs (touvron2023llama, ; touvron2023llama2, ). The weights of these models are made available to the research community under non-commercial licenses.
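To make the objective in Eq. (1) concrete, the following is a minimal PyTorch sketch of the auto-regressive LM loss. It assumes a decoder-only model has already produced `logits` for an `input_ids` tensor, and it averages the per-token negative log-likelihood instead of summing it; all names are illustrative rather than tied to any specific implementation.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. (1) for a decoder-only model.

    logits:    (batch, seq_len, vocab_size) next-token scores.
    input_ids: (batch, seq_len) token sequence x = [x_1, ..., x_N].
    """
    # Predict token t from positions < t: drop the last logit, shift targets left.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    # Cross-entropy equals -log P(x_t | x_<t); here it is averaged over tokens.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```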

The masked language modeling (MLM) task serves as a common pre-training objective for encoder-only models like BERT (devlin2018bert, ; liu2019roberta, ). In MLM, a subset of tokens $m({\bm{x}})$ of the input sequence ${\bm{x}}$ is masked and replaced with the special [MASK] token. The pre-training goal is to utilize the unmasked part ${\bm{x}}_{\backslash m({\bm{x}})}$ to predict the masked portion $m({\bm{x}})$. In summary, the overarching goal of MLM is to minimize the negative log-likelihood:

(2) $\mathcal{L}_{\rm MLM}({\bm{x}}) \triangleq -\sum_{\widehat{x}\in m({\bm{x}})}\log P(\widehat{x}\,|\,{\bm{x}}_{\backslash m({\bm{x}})}).$

Some encoder-decoder models, such as T5 (raffel2020exploring, ), also utilize a sequence-to-sequence MLM task as the pre-training objective: they take masked sentences as encoder inputs and use the decoder to sequentially predict the masked tokens.
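As a rough illustration of Eq. (2), the sketch below corrupts a batch of token ids, scores only the masked positions, and averages the resulting negative log-likelihood. The `model`, `mask_token_id`, and masking rate are placeholder assumptions rather than any specific implementation's interface.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, input_ids: torch.Tensor, mask_token_id: int,
             mask_prob: float = 0.15) -> torch.Tensor:
    """Negative log-likelihood of Eq. (2) for an encoder-only model.

    `model` is assumed to map token ids of shape (batch, seq_len) to logits of
    shape (batch, seq_len, vocab_size), e.g. a BERT-style encoder.
    """
    labels = input_ids.clone()
    # Sample the masked subset m(x).
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~mask] = -100                 # ignore unmasked positions in the loss
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id      # replace m(x) with [MASK]
    logits = model(corrupted)
    # -sum over m(x) of log P(x_hat | x_\m(x)), averaged over masked tokens.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```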

2.1.2. Adaptation of LLMs

After pre-training, LLMs need to be effectively adapted to better serve downstream tasks, and a series of adaptation methods have been proposed for specific objectives. Because LLMs are primarily trained to generate linguistically coherent text during pre-training, their behavior may not align with the actual needs of human users or conform to human values, preferences, and principles. Additionally, because pre-training data become outdated over time, LLMs suffer from knowledge cutoffs and may produce factual errors. Therefore, instruction tuning, model refinement, and model alignment have been proposed to address these issues (zhang2024instruction, ; ouyang2022rlhf, ; rafailov2024dpo, ; de2021editing, ). Below are the formal definitions of the three adaptation tasks for LLMs.

Definition 2.1 (Instruction Tuning, IT).

Let $h({\bm{x}})$ be a language model that takes as input data ${\bm{x}}$, typically consisting of natural language instructions or queries. Instruction Tuning (IT) is a specialized training approach designed to enhance the model's ability to accurately and effectively respond to specific instructions. The objective of IT is to refine $h$ by adjusting its parameters using a designated set of training examples ${\mathcal{I}}=\{({\bm{x}}_i,\widehat{{\bm{y}}}_i)\}_{i=1}^{N}$ drawn from the IT data distribution ${\mathcal{D}}_{{\mathcal{I}}}$, where $\widehat{{\bm{y}}}_i$ represents the desired output for ${\bm{x}}_i$. This set is curated to target specific tasks or functionalities that require improved performance. Formally, IT seeks to find an optimal refined hypothesis $h^{*}$ that satisfies:

(3) $h^{*} \triangleq \arg\min_{h'}\,\mathbb{E}_{({\bm{x}},\widehat{{\bm{y}}})\sim{\mathcal{D}}_{{\mathcal{I}}}}\left[-\log P(\widehat{{\bm{y}}}\,|\,{\bm{x}},h')\right] \approx \arg\min_{h'}\,\sum_{i=1}^{N}-\log P(\widehat{{\bm{y}}}_i\,|\,{\bm{x}}_i,h').$
Remark.

The task of Model Alignment (MA) is usually formulated with the same problem definition as IT, using an alignment dataset of size $M$, ${\mathcal{A}}=\{({\bm{x}}_a,{\bm{y}}_a,\widehat{{\bm{y}}}_a)\}_{a=1}^{M}$, where ${\bm{y}}_a$ represents the model's original decision for input ${\bm{x}}_a$, and $\widehat{{\bm{y}}}_a$ denotes the aligned decision that adheres to specified ethical guidelines or desired outcomes.
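For illustration, the following sketch performs one optimization step on the empirical form of Eq. (3). It assumes a HuggingFace-style causal LM whose `labels` mask the instruction tokens with -100 so that only the desired output $\widehat{{\bm{y}}}_i$ is scored; the batch fields and model interface are assumptions, not a prescribed API.

```python
import torch

def instruction_tuning_step(model, optimizer, batch) -> float:
    """One gradient step on the empirical IT objective in Eq. (3).

    `batch["input_ids"]` holds instruction + response tokens; `batch["labels"]`
    holds the same ids with instruction positions set to -100, so the loss is
    -log P(y_hat | x, h') accumulated over response tokens only.
    """
    model.train()
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch.get("attention_mask"),
                    labels=batch["labels"])
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```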

Definition 2.2 (Model Refinement, MR).

Suppose we have a model $h({\bm{x}})$ that takes data ${\bm{x}}$ (e.g., natural language queries) as input. Consider a size-$N$ editing set ${\mathcal{E}}=\{({\bm{x}}_e,{\bm{y}}_e,\widehat{{\bm{y}}}_e)\}_{e=1}^{N}$, where $\widehat{{\bm{y}}}_e$ denotes the true label of ${\bm{x}}_e$ but the model incorrectly outputs ${\bm{y}}_e$ for ${\bm{x}}_e$. Model Refinement (MR) aims to efficiently update the model from $h$ to $h'$ such that it correctly predicts the editing set ${\mathcal{E}}$ while preserving the original outputs outside ${\mathcal{E}}$. Formally, we aim to find $h'$ satisfying:

(4) $h'({\bm{x}}_0)=\begin{cases}\widehat{{\bm{y}}}_0 & \text{if }({\bm{x}}_0,\widehat{{\bm{y}}}_0)\in{\mathcal{E}},\\ h({\bm{x}}_0) & \text{otherwise}.\end{cases}$
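A simple way to read Eq. (4) is as two evaluation criteria for a refined model: success on the editing set and locality outside it. The sketch below computes both, treating the models as black-box callables whose predictions are comparable values (e.g., strings or class ids); the function and dataset names are purely illustrative.

```python
def refinement_metrics(h_old, h_new, edit_set, control_set):
    """Score a refined model h' against the two requirements of Eq. (4).

    edit_set:    list of (x_e, y_e, y_hat_e) triples the edit should fix.
    control_set: list of inputs outside E where behavior must not change.
    """
    # Edit success: h'(x_e) should now equal the corrected label y_hat_e.
    edit_success = sum(h_new(x) == y_hat for x, _, y_hat in edit_set) / len(edit_set)
    # Locality: outside E, h' should reproduce h's original outputs.
    locality = sum(h_new(x) == h_old(x) for x in control_set) / len(control_set)
    return edit_success, locality
```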

2.2. Continual Learning

Humans gradually accumulate knowledge and skills across tasks without significant performance decline on previous tasks (mcclelland1995there, ; kandel2000principles, ; pallier2003brain, ; yang2009stably, ; constantinescu2016organizing, ; olafsdottir2018role, ; liu2019human, ; mccaffary2021towards, ). In contrast, machine learning models are usually data-centric: minimizing the training loss on subsequent tasks causes the model to fail on the old ones, a phenomenon known as “catastrophic forgetting”. Addressing this challenge is a focal point of continual learning research. The problem of efficiently adapting models to a sequence of tasks without forgetting is extensively studied in the continual learning community (pentina2016theoretical, ; chen2018lifelong, ; van2022three, ; wang2024comprehensive, ). These studies are typically conducted under the following memory constraint of CL.

Definition 2.3 (Memory Constraint of Continual Learning).

Suppose $T$ sets of observations $\{S_t\sim{\mathcal{T}}_t\}_{t=1}^{T}$ come in as a sequence, where $\{{\mathcal{T}}_t\}_{t=1}^{T}$ denotes the $T$ task distributions. At learning stage $t>1$, the sets of observations $\{S_i\}_{i=1}^{t-1}$ are not accessible (strong constraint) or only partially accessible (relaxed constraint).

Remark.

In the early stages of CL research, works mostly focused on the strong memory constraint (kirkpatrick2017overcoming, ; li2017learning, ; aljundi2018memory, ; lomonaco2020rehearsalfree, ); as the field progressed, more attention was put on relaxing the memory constraint to allow a small replay buffer (rebuffi2017icarl, ; chaudhry2019tiny, ; buzzega2020dark, ; shi2024unified, ); some modern CL works completely discard the memory constraint and instead focus on the computational budget (cai2021online, ; prabhu2023online, ; verwimp2024continual, ).

2.2.1. Three Types of Continual Learning

There are three prominent types of continual learning scenarios: task-incremental learning (TIL), domain-incremental learning (DIL), and class-incremental learning (CIL). To establish a groundwork for subsequent discussions (as illustrated in Table 3 and Section 6.2), we adhere to the conceptual framework proposed by (van2022three, ; kim2022theoretical, ; wang2024comprehensive, ) and offer formal definitions for these three continual learning scenarios.

Definition 2.4 (Task-Incremental Learning, TIL).

Suppose $T$ task distributions $\{{\mathcal{T}}_t\}_{t=1}^{T}$ come in as a sequence, where ${\mathcal{T}}_t$ denotes the joint distribution over the $t$-th task's input space and label space $({\mathcal{X}}_t,{\mathcal{Y}}_t)$. Denote ${\mathcal{X}}\triangleq\bigcup_{t=1}^{T}{\mathcal{X}}_t$ and ${\mathcal{Y}}\triangleq\bigcup_{t=1}^{T}{\mathcal{Y}}_t$ as the unions of the input and label spaces, respectively. Under the memory constraint defined in Definition 2.3, Task-Incremental Learning (TIL) aims to find the optimal hypothesis $h^{*}:{\mathcal{X}}\times[T]\rightarrow{\mathcal{Y}}$ that satisfies:

(5) $h^{*}=\arg\min_{h}\sum_{t=1}^{T}\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{T}}_t}\left[\mathbbm{1}_{h({\bm{x}},t)\neq y}\right].$
Definition 2.5 (Domain-Incremental Learning, DIL).

Suppose $T$ domain distributions $\{{\mathcal{D}}_t\}_{t=1}^{T}$ come in as a sequence, where ${\mathcal{D}}_t$ denotes the $t$-th joint distribution over the shared input space and label space $({\mathcal{X}},{\mathcal{Y}})$. Under the memory constraint defined in Definition 2.3, Domain-Incremental Learning (DIL) aims to find the optimal hypothesis $h^{*}:{\mathcal{X}}\rightarrow{\mathcal{Y}}$ that satisfies:

(6) $h^{*}=\arg\min_{h}\sum_{t=1}^{T}\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{D}}_t}\left[\mathbbm{1}_{h({\bm{x}})\neq y}\right].$
Definition 2.6 (Class-Incremental Learning, CIL).

Suppose $T$ task distributions $\{{\mathcal{T}}_t\}_{t=1}^{T}$ come in as a sequence, where ${\mathcal{T}}_t$ denotes the joint distribution over the $t$-th task's input space and label space $({\mathcal{X}}_t,{\mathcal{Y}}_t)$. Denote ${\mathcal{X}}\triangleq\bigcup_{t=1}^{T}{\mathcal{X}}_t$ and ${\mathcal{Y}}\triangleq\bigcup_{t=1}^{T}{\mathcal{Y}}_t$ as the unions of the input and label spaces, respectively. Under the memory constraint defined in Definition 2.3, Class-Incremental Learning (CIL) aims to find the optimal hypothesis $h^{*}:{\mathcal{X}}\rightarrow[T]\times{\mathcal{Y}}$ that satisfies:

(7) $h^{*}=\arg\min_{h}\sum_{t=1}^{T}\mathbb{E}_{({\bm{x}},y)\sim{\mathcal{T}}_t}\left[\mathbbm{1}_{h({\bm{x}})\neq(t,y)}\right].$
Remark.

In TIL, it is common to have a shared input space ${\mathcal{X}}={\mathcal{X}}_t,\ \forall t\in[T]$, while the label spaces ${\mathcal{Y}}_t$ can be disjoint (${\mathcal{Y}}_i\cap{\mathcal{Y}}_j=\emptyset,\ \forall i\neq j$), partially shared (${\mathcal{Y}}_i\cap{\mathcal{Y}}_j\neq\emptyset,\ \exists i\neq j$), or fully shared across tasks (${\mathcal{Y}}={\mathcal{Y}}_t,\ \forall t\in[T]$). In DIL, all tasks are defined in the same format, i.e., they share the same input space ${\mathcal{X}}$ and the same output space ${\mathcal{Y}}$. During inference, no task IDs are provided to the hypothesis, which means the continual learning model needs to capture the relationship between domain-invariant features and the labels; DIL is therefore commonly perceived as more difficult than TIL. CIL is commonly viewed as the most challenging continual learning scenario, as the model needs to infer the label and the task ID at the same time. Another possible formulation of CIL is to represent it as DIL with disjoint output label spaces, ${\mathcal{Y}}_i\cap{\mathcal{Y}}_j=\emptyset,\ \forall i\neq j$.
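The practical difference between the three scenarios is what information the hypothesis receives at test time. The toy sketch below makes this explicit with a shared backbone and a list of per-task classification heads; the setup (equal-sized label spaces, the head layout, explicit scenario flags) is an illustrative assumption rather than a standard formulation.

```python
import torch

def continual_inference(backbone, heads, x, scenario: str, task_id: int = None):
    """Contrast TIL / DIL / CIL at test time on a toy multi-head classifier.

    `backbone` maps inputs to features; `heads` is a list of nn.Linear heads,
    one per task (a single shared head for DIL), all with equal out_features.
    """
    feats = backbone(x)
    if scenario == "TIL":
        # Task ID is observed: h(x, t) scores only task t's label space.
        return task_id, heads[task_id](feats).argmax(dim=-1)
    if scenario == "DIL":
        # One shared label space, no task ID needed: h(x).
        return None, heads[0](feats).argmax(dim=-1)
    # CIL: no task ID; score the union of label spaces and infer (t, y) jointly.
    logits = torch.cat([head(feats) for head in heads], dim=-1)
    flat = logits.argmax(dim=-1)
    num_classes = heads[0].out_features
    return flat // num_classes, flat % num_classes
```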

2.2.2. Techniques of Continual Learning

The objective of CL is to find a hypothesis that minimizes the risk across all tasks/domains. Taking DIL as an example (shi2024unified, ), at the $t$-th learning stage, the ideal training objective ${\mathcal{L}}(h)$ is defined as

(8) ${\mathcal{L}}(h)\triangleq\underbrace{\sum_{i=1}^{t-1}{\mathcal{L}}_{{\mathcal{D}}_i}(h)}_{\text{past domains}}+\underbrace{{\mathcal{L}}_{{\mathcal{D}}_t}(h)}_{\text{current domain}}.$

The objectives for past domains are often challenging to measure or optimize due to the memory constraint (Definition 2.3). Therefore, the core of designing CL algorithms lies in identifying a proxy learning objective for the first term without violating the memory constraint. Existing CL techniques can be roughly categorized into five groups: (i) replay-based, (ii) regularization-based, (iii) architecture-based, (iv) optimization-based, and (v) representation-based (de2021continual, ; wang2024comprehensive, ). Here, we provide a concise yet comprehensive introduction to the first three categories of continual learning techniques, as they find extensive application in continually learning large language models.

Replay-Based Methods. Replay-based methods adopt the relaxed memory constraint by keeping a small buffer of observed data $\{M_i\}_{i=1}^{t-1}$, one buffer $M_i$ for each past task ${\mathcal{T}}_i$. Formally, they seek to optimize the following empirical training objective:

(9) $\widehat{{\mathcal{L}}}_{\text{replay}}(h)\triangleq\underbrace{\sum_{i=1}^{t-1}\widehat{{\mathcal{L}}}_{M_i}(h)}_{\text{proxy for past domains}}+\underbrace{\widehat{{\mathcal{L}}}_{S_t}(h)}_{\text{current domain}},$

where $\widehat{{\mathcal{L}}}_{S}$ denotes the empirical loss evaluated on the set of examples $S$. Often regarded as a simplistic solution to CL, replay-based methods may theoretically lead to loose generalization bounds (shi2024unified, ). Despite this, they are valued for their simplicity, stability, and high performance, even with a small episodic memory (chaudhry2019tiny, ; riemer2018learning, ). For instance, DER++ (buzzega2020dark, ) demonstrates consistent performance improvements by replaying a small set of past examples along with their logits (known as dark experience replay). ESM-ER (sarfraz2023error, ) introduces error sensitivity modulation (ESM) to mitigate the abrupt representational drift caused by high-error new examples. A significant focus in replay-based CL is enhancing sample efficiency for buffer maintenance. For instance, iCaRL (rebuffi2017icarl, ) prioritizes exemplar selection based on herding to accurately model the class mean throughout class-incremental learning; (zhao2022memory, ) propose storing low-fidelity examples to achieve memory-efficient exemplar set maintenance; and Rainbow Memory (RM) (bang2021rainbow, ) introduces diversity-aware memory updates based on per-sample uncertainty estimation and data augmentation for class-incremental learning.
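The sketch below shows one training step on the replay objective of Eq. (9): the current-batch loss plus an empirical loss on examples sampled from a small episodic memory. The flat list-of-tensors buffer, uniform sampling, and generic `loss_fn` are simplifying assumptions; methods such as DER++ additionally store and match logits.

```python
import random
import torch

def replay_training_step(model, optimizer, current_batch, buffer, loss_fn,
                         replay_batch_size: int = 32) -> float:
    """One step on Eq. (9): current-task loss + replayed past-task loss.

    `buffer` is a plain list of (input, target) tensor pairs collected from
    past tasks; `current_batch` is an (inputs, targets) pair drawn from S_t.
    """
    inputs, targets = current_batch
    loss = loss_fn(model(inputs), targets)                        # current domain
    if buffer:
        replayed = random.sample(buffer, min(replay_batch_size, len(buffer)))
        past_inputs = torch.stack([x for x, _ in replayed])
        past_targets = torch.stack([y for _, y in replayed])
        loss = loss + loss_fn(model(past_inputs), past_targets)   # proxy term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```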

Regularization-Based Methods. Suppose $h_{{\bm{\theta}}_{t-1}}$ is the hypothesis obtained after the $(t-1)$-th stage of training, parameterized by ${\bm{\theta}}_{t-1}$. Regularization-based methods utilize a regularization term as a proxy for the past-domain losses, determined by a distance in the parameter space:

(10) $\widehat{{\mathcal{L}}}_{\text{reg}}(h_{{\bm{\theta}}})\triangleq\underbrace{\lambda\cdot\|{\bm{\theta}}-{\bm{\theta}}_{t-1}\|_{{\bm{\Sigma}}}}_{\text{proxy for past domains}}+\underbrace{\widehat{{\mathcal{L}}}_{S_t}(h_{{\bm{\theta}}})}_{\text{current domain}},$

where $\|{\bm{v}}\|_{{\bm{\Sigma}}}={\bm{v}}^{\top}{\bm{\Sigma}}{\bm{v}}$ is the vector norm induced by a positive semi-definite matrix ${\bm{\Sigma}}$, and $\lambda$ is the regularization coefficient, a hyper-parameter introduced to balance retaining past knowledge and learning current knowledge. The matrix ${\bm{\Sigma}}$ measures the importance of each parameter, and the correlations among parameters, in retaining past knowledge. In practice, to reduce computational overhead, ${\bm{\Sigma}}$ is often restricted to a diagonal matrix that encodes only per-parameter importance. For example, Elastic Weight Consolidation (EWC) (kirkpatrick2017overcoming, ) adopts a Bayesian perspective, using the diagonal of the Fisher Information Matrix (FIM) as an approximation of the Hessian of the loss with respect to the parameters; this forms a sequential maximum a posteriori (MAP) optimization for continual learning. Memory Aware Synapses (MAS) (aljundi2018memory, ) computes parameter importance in an online and unsupervised manner, defining importance by the accumulated absolute gradient during training. It is also worth noting that when ${\bm{\Sigma}}={\bm{I}}$ degenerates to the identity matrix, the regularization term simplifies to a basic $l_2$-penalty that penalizes all parameters equally, which can be surprisingly effective in some cases of continual LLMs (rongali2021continual, ).
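Below is a minimal sketch of Eq. (10) with a diagonal ${\bm{\Sigma}}$, i.e., an EWC/MAS-style quadratic penalty added to the current-task loss. The dictionaries `prev_params` and `importance` (holding ${\bm{\theta}}_{t-1}$ and the per-parameter diagonal of ${\bm{\Sigma}}$) are assumed to have been computed after the previous stage; their names are illustrative.

```python
import torch

def regularized_loss(model, current_loss: torch.Tensor,
                     prev_params: dict, importance: dict,
                     lam: float = 1.0) -> torch.Tensor:
    """Eq. (10) with a diagonal Sigma: current-task loss + weighted L2 penalty.

    prev_params: {name: theta_{t-1} tensor}, a snapshot of the previous model.
    importance:  {name: diagonal of Sigma}, e.g. Fisher information (EWC) or
                 accumulated absolute gradients (MAS).
    """
    penalty = torch.zeros((), device=current_loss.device)
    for name, p in model.named_parameters():
        if name in importance:
            penalty = penalty + (importance[name] * (p - prev_params[name]) ** 2).sum()
    return current_loss + lam * penalty
```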

Architecture-Based Methods. Expanding the network architecture dynamically to assimilate new knowledge is deemed the most efficient form of CL (wang2022learning, ; wang2022dualprompt, ). This approach primarily tackles adaptation challenges and can achieve zero forgetting when task IDs are available during inference or can be correctly inferred (gururangan2022demix, ; wistuba2023, ). However, due to the difficulty of task ID inference, architecture expansion is predominantly utilized in TIL and is scarcely explored in DIL or CIL. Progressive Neural Networks (PNN) (rusu2016progressive, ) propose learning laterally connected neurons as new tasks arrive, ensuring non-forgetting and enabling transfer of previously learned neurons to future tasks. In conjunction with pre-trained backbone models like ViT (dosovitskiy2020image, ), CoLoR (wistuba2023, ) trains separate low-rank adaptation (LoRA) (hu2021lora, ) modules for different tasks; it estimates and stores prototypes for each task and, during testing, utilizes the natural clustering ability of the pre-trained model to infer task IDs, selecting the corresponding LoRA component for prediction. In the domain of continual LLMs, architecture expansion has resurged in popularity following the rise of parameter-efficient fine-tuning (PEFT) (shazeer2017outrageously, ; aljundi2017expert, ; hu2021lora, ; dettmers2023qlora, ; lester2021power, ; li2021prefix, ), a topic we will delve into shortly (yang2024moral, ; wang2023orthogonal, ; li2024examining, ; jang2022towards, ; jin2022lifelong, ; paul2024ircoder, ; yan2023af, ; wu2024llama, ).
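As a rough sketch of this expansion pattern, the class below freezes a pre-trained linear layer and attaches one low-rank adapter per task, in the spirit of per-task LoRA approaches such as CoLoR. Routing by an explicit `task_id` (rather than prototype-based task inference) and the class interface are illustrative simplifications, not any specific method's API.

```python
import torch
import torch.nn as nn

class PerTaskLoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus one LoRA adapter per task."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep pre-trained weights fixed
        self.adapters = nn.ModuleList()      # grows as new tasks arrive

    def add_task(self, rank: int = 8):
        """Expand the architecture with a fresh low-rank adapter."""
        down = nn.Linear(self.base.in_features, rank, bias=False)
        up = nn.Linear(rank, self.base.out_features, bias=False)
        nn.init.zeros_(up.weight)            # new adapter starts as a no-op
        self.adapters.append(nn.Sequential(down, up))

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # Only the adapter of the given (or inferred) task is activated,
        # so parameters serving earlier tasks are never overwritten.
        return self.base(x) + self.adapters[task_id](x)
```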

2.2.3. Evaluation Metrics of Continual Learning

In the realm of conventional continual learning, where task streams take the form of classification, many metrics rely on the concept of an Accuracy Matrix (lopez2017gradient, ; shi2024unified, ). Extending this notion to the context of continually learning LLMs, we introduce the Performance Matrix ${\bm{P}}\in\mathbb{R}^{T\times T}$, where $T$ represents the total number of training stages. Each entry of ${\bm{P}}$ corresponds to a performance metric evaluated on the model, such as perplexity on pre-training data (jin2022lifelong, ; chen2023lifelong, ; gupta2023continual, ), zero-shot/few-shot evaluation metrics on downstream data without fine-tuning (colombo2024saullm7b, ; wu2023pmc, ; Azerbayev2023LLEMMA, ; deng2023learning, ; nijkamp2022codegen, ; rozière2024code, ), fine-tuned accuracies on downstream tasks (amba2021dynamic, ; qin2023recyclable, ; chen2023lifelong, ; jang2022towards, ), and probing accuracies from fine-tuned add-on components evaluated on downstream tasks (tao2022can, ; luo2023investigating, ; zheng2023learn, ). In ${\bm{P}}$, $P_{i,j}$ denotes the model's performance evaluated on task $j$ after training on task $i$. With this definition of the Performance Matrix, we introduce the primary evaluation protocols widely adopted.

Overall Performance (OP). The Overall Performance (OP) (ke2021achieve, ; zhang2022continual, ; zhang2023copf, ) is a natural extension of the concept of Average Accuracy (lopez2017gradient, ; shi2024unified, ). The OP measured up to training stage $t$, denoted $\operatorname{OP}_t$, is the average performance of the model obtained right after stage $t$:

(11) $\operatorname{OP}_t\triangleq\tfrac{1}{t}\sum_{i=1}^{t}P_{t,i}.$

As noted in (shi2024unified, ), the OP corresponds to the primary optimization objective defined in Definitions 2.4, 2.5, and 2.6. In much of the continual learning literature, once all $T$ tasks are completed, the final OP ($\operatorname{OP}_T$) is reported, with the subscript $T$ often omitted for brevity. In some works, OP is weighted by the importance of tasks, $\widetilde{\operatorname{OP}}\triangleq\tfrac{1}{T}\sum_{i=1}^{T}w_i P_{T,i}$, where $w_i=N_i/\sum_{j=1}^{T}N_j$ represents the ratio of data in task $i$.

Forgetting (F). Define $F_t$ as the forgetting up to task $t$, i.e., the largest performance drop on each previous task observed throughout the training process, averaged over the previous $t-1$ training stages:

(12) $F_t\triangleq\tfrac{1}{t-1}\sum_{j=1}^{t-1}\left[\max_{l\in[t-1]}\{P_{l,j}-P_{t,j}\}\right].$

Typically, researchers report the average forgetting $F=F_T$ at the end of the entire training process. Forgetting quantifies the impact of learning new tasks on previously acquired knowledge. Ideally, a robust continual learning framework should achieve Backward Transfer (BWT), where learning new tasks enhances performance on prior tasks. This enhancement is typically measured as the negation of forgetting, so that negative forgetting indicates improved performance on earlier tasks.

Forward Transfer (FWT). Forward Transfer measures the generalization ability of continual learning algorithms. Formally, the forward transfer $\operatorname{FWT}_t$ up to training stage $t$ is defined as

(13) $\operatorname{FWT}_t\triangleq\tfrac{1}{t-1}\sum_{i=2}^{t}\left(P_{i-1,i}-b_i\right),$

where $b_i$ is the baseline performance of the model on task $i$ before undergoing continual learning. Strictly speaking, this definition of $b_i$ differs from that in previous work (lopez2017gradient, ; shi2024unified, ), where $b_i$ denotes the performance of a randomly initialized model.
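To make the three metrics concrete, below is a minimal NumPy sketch that computes OP, F, and FWT from a performance matrix; the matrix values, and the convention that OP is the unweighted mean of the final row, are illustrative assumptions rather than results from any cited study.

```python
import numpy as np

# P[t, i]: performance on task i after finishing training stage t (0-indexed here,
# whereas the text uses 1-indexing). Values below are made up for illustration.
P = np.array([
    [0.80, 0.10, 0.05],
    [0.72, 0.85, 0.12],
    [0.70, 0.78, 0.90],
])
# b[i]: performance on task i before any continual learning (used by FWT).
b = np.array([0.05, 0.08, 0.10])
T = P.shape[0]

# Overall Performance after the final stage (unweighted version).
OP_T = P[T - 1].mean()

# Forgetting (Eq. 12): largest drop from the best past score, averaged over old tasks.
F_T = np.mean([P[: T - 1, j].max() - P[T - 1, j] for j in range(T - 1)])

# Forward Transfer (Eq. 13): gain on task i evaluated just before training on it.
FWT_T = np.mean([P[i - 1, i] - b[i] for i in range(1, T)])

print(f"OP = {OP_T:.3f}, F = {F_T:.3f}, FWT = {FWT_T:.3f}")
```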

3. Continual Learning Meets Large Language Models: An Overview

Large language models (LLMs) are extensive in various dimensions, including the size of model parameters, pre-training datasets, computational resources, project teams, and development cycles (radford2019language, ; brown2020language, ; achiam2022chatgpt, ; achiam2023gpt, ; chowdhery2023palm, ; anil2023palm, ; touvron2023llama, ; touvron2023llama2, ). The substantial scale of LLMs presents notable challenges for development teams, particularly in keeping them updated amidst rapid environmental changes (amba2021dynamic, ; jin2022lifelong, ; dhingra2022time, ; jang2022towards, ; jang2022temporalwiki, ). To illustrate, in 2023 the average daily influx of new tweets exceeded 500 million (source: https://www.omnicoreagency.com/twitter-statistics), and training on even a subset of this large volume of data is unaffordable. Recyclable Tuning (qin2023recyclable, ) is the first work to explicitly outline the supplier-consumer structure in the modern LLM production pipeline. This structure allows us to dissect the challenges of continual LLMs from the perspectives of the various roles involved. On the supplier side, the model is continually pre-trained over a sequence of large-scale unlabeled datasets. After every release of the pre-trained model, the consumer needs to utilize the stronger and more up-to-date upstream model for downstream tasks. Compared to the upstream supplier, downstream users often lack the capacity to collect and store large-scale data, maintain large-scale hardware systems, and train LLMs themselves. Therefore, Recyclable Tuning mainly focuses on continually and efficiently adapting updated pre-trained LLMs to downstream tasks. In this survey, we further present a comprehensive framework for a modern production pipeline encompassing various studies on continual LLM pre-training, adaptation, and deployment (Fig. 1). What sets our framework apart from existing studies (wu2024continual, ) is the incorporation of two directions of continuity: Vertical Continuity and Horizontal Continuity.

3.1. Vertical Continuity (Vertical Continual Learning)

Definition. Vertical continuity (or vertical continual learning) has long been studied, either implicitly or explicitly, in existing literature. Vertical continuity is characterized by a hierarchical structure encompassing data inclusiveness, task scope, and computational resources. Specifically, the training task transitions gradually from general pre-training to downstream tasks, typically undertaken by distinct entities within the production pipeline (qin2023recyclable, ; gururangan2022demix, ; rongali2021continual, ; guo2023continuous, ; yan2023af, ; xie2023efficient, ). Fig. 1 shows a typical pipeline for vertical continuity in LLMs, i.e., “pre-training” \rightarrow “domain-adaptive training” \rightarrow “downstream fine-tuning” (luo2023biomedgpt, ; li2023cfgpt, ; deng2023learning, ; han2021econet, ; zhou2020pre, ; guo2023continuous, ; gururangan2020dont, ; colombo2024saullm7b, ; wu2023pmc, ; wu2024llama, ; yan2023af, ; rongali2021continual, ; ma2023ecomgptct, ; huang2023lawyer, ):

  • Pre-training. During the pre-training stage, a substantial amount of data from diverse domains is required to develop a general-purpose LLM. This phase demands a sizable research and development team dedicated to training and benchmarking the model, along with considerable computational resources.

  • Domain-Adaptive Pre-training. Subsequently, downstream institutions may opt for domain-adaptive pre-training to tailor the model for specific tasks using domain-specific data unavailable to the upstream supplier.

  • Fine-tuning. Finally, the LLM undergoes fine-tuning on annotated data for downstream tasks before deployment.

Figure 1. A high-level overview of the modern pipeline for continually pre-training and fine-tuning LLMs, where two dimensions of continuity are described. Vertical Continuity (or Vertical Continual Learning): LLM training can be vertically divided into three stages: (i) Continual Pre-Training (CPT), (ii) Domain-Adaptive Pre-training (DAP), and (iii) Continual Fine-Tuning (CFT). The main focus is the retention of the LLM’s general knowledge (prevention of vertical forgetting). Horizontal Continuity (or Horizontal Continual Learning): After the LLMs are deployed, the models are continually updated when a new set of data samples becomes available. The primary goal is to prevent horizontal forgetting in a long sequence of tasks.

Throughout the process, the unlabeled domain-specific dataset is smaller in scale than the upstream pre-training corpus but larger than the data used in the final downstream fine-tuning phase; this pattern extends to computational resources, team size, and other factors. It is important to note that vertical continuity can involve more than three stages (nijkamp2022codegen, ; lin2023geogalactica, ; rozière2024code, ; huang2023lawyer, ). In real-world applications, during domain-adaptive pre-training, additional layers can be added to accommodate multiple entities, such as various departments with distinct objectives but operating within the same domain.

Vertical Forgetting. We term the performance degradation on general knowledge of a model undergoing vertical continual learning “vertical forgetting”. As shown in Fig. 2, in vertical continual learning the data distribution of the upstream tasks partially covers that of the downstream tasks, meaning the model often starts the subsequent stage of training from a decent initialization. Two significant challenges must be addressed to prevent vertical forgetting:


Figure 2. A diagram showing two different directions of continual learning of LLMs. (a) Vertical Continual Learning of LLMs: in this case, the upstream data distribution usually partially covers the subsequent tasks’ data distribution. (b) Horizontal Continual Learning of LLMs: no constraints on the data distributions are imposed in horizontal continual learning, and continual LLMs need to handle abrupt distributional shifts and longer training sequences.

3.2. Horizontal Continuity (Horizontal Continual Learning)

Definition. Horizontal continuity (or horizontal continual learning) refers to continual adaptation across time and domains, a topic extensively explored within the continual learning community. The primary rationale for preserving horizontal continuity lies in the dynamic nature of data distributions over time. To stay current with these content shifts, an LLM must incrementally learn newly emerged data; otherwise, re-training becomes prohibitively expensive and impractical (chaudhry2019efficient, ; amba2021dynamic, ; su2023efficient, ; xie2023efficient, ). Empirical evidence has consistently shown that despite their impressive capabilities, LLMs struggle to generalize effectively to future unseen data, particularly in the face of temporal or domain shifts (amba2021dynamic, ; jang2022towards, ; jang2022temporalwiki, ; dhingra2022time, ). Additionally, they struggle to retain complete knowledge of past experiences when adapting to new temporal domains, although they do demonstrate a relatively high level of robustness against catastrophic forgetting (tao2022can, ; luo2023investigating, ; zheng2023learn, ; mehta2023empirical, ). The necessity of employing complex CL algorithms to address these challenges in LLMs remains an open question. For instance, during large-scale continual pre-training, large institutions can typically afford the storage costs of retaining all historical data, making strict memory constraints less relevant. Several studies have demonstrated that with full access to historical data, simple sparse replay techniques can effectively mitigate forgetting (thengane2022clip, ; tao2022can, ; scialom2022fine, ; prabhu2023online, ; garg2024tic, ). In contrast, numerous continual learning studies have showcased performance superior to naive solutions, suggesting the importance of continual learning techniques in LLM training (jang2022temporalwiki, ; jin2022lifelong, ; qin2022elle, ; chen2023lifelong, ).

Horizontal Forgetting. We informally define “horizontal forgetting” as the performance degradation on previous tasks while the model undergoes horizontal continual learning. As illustrated in Fig. 2, horizontal continual learning typically involves training stages of similar scales, with potential distributional overlap among their data. In summary, two main challenges need to be addressed for horizontal continual learning of LLMs:

  • Long Task Sequence. Horizontal continual learning ideally involves numerous incremental phases, particularly to accommodate temporal shifts in data distribution. A longer task sequence entails more update steps of the model, leading to inevitable forgetting of previously learned tasks. To address this challenge, researchers employ established continual learning techniques with stronger constraints, such as continual model ensemble (ramesh2021model, ).

  • Abrupt Distributional Shift. In contrast to vertical continuity, where distributional shifts are often predictable, horizontal continual learning does not impose constraints on task properties. Evidence suggests that abrupt changes in task distributions can result in significant horizontal forgetting of the model (caccia2021new, ; sarfraz2023error, ).

4. Learning Stages of Continual Large Language Models

Fig. 1 provides an overview of continually learning LLMs. Along the axis of vertical continuity, three main layers of modern continual learning emerge. The top layer, Continual Pre-Training (CPT), involves continuous pre-training of LLMs by the supplier on newly-collected data alongside existing data (Section 4.1). The middle layer, Domain-Adaptive Pre-training (DAP), prepares LLMs for domain-specific applications through additional pre-training on domain-specific unlabeled data (Section 4.2). The bottom layer, Continual Fine-Tuning (CFT), targets models for final downstream tasks on the consumer side (Section 4.3), where the model needs to be updated after deployment for the specified task.

4.1. Continual Pre-Training (CPT)

4.1.1. CPT: Effectiveness and Efficiency

Before delving into the details of continual pre-training (CPT), it is important to address two fundamental questions. The first concerns effectiveness: can CPT enhance performance on downstream tasks beyond that of the initial training on a wide range of data domains? Extensive studies have not only demonstrated the necessity of CPT for improved downstream performance (qin2022elle, ; gururangan2022demix, ; jang2022towards, ; jang2022temporalwiki, ; jin2022lifelong, ; chen2023lifelong, ), but also shown that when distributional shifts are gradual (jang2022temporalwiki, ; yildiz2024investigating, ) or somewhat correlated (gururangan2022demix, ), CPT can effectively help models generalize to unseen data. The second question concerns efficiency: given the large scale of LLM parameters and of the data, both old and new, can we achieve adaptation and knowledge retention in a computationally efficient way? Most studies focus on techniques for efficient knowledge retention (jin2022lifelong, ; jang2022towards, ; jang2022temporalwiki, ; li2024examining, ), which significantly overlap with the CL literature addressing catastrophic forgetting (schwarz2018progress, ; riemer2018learning, ; buzzega2020dark, ; shi2024unified, ; rebuffi2017icarl, ; ritter2018online, ; aljundi2018memory, ; rusu2016progressive, ; ramesh2021model, ; wang2022coscl, ). In contrast to prior approaches that fully utilize emergent data, some studies recognize the impracticality of this approach in real production environments and instead concentrate on further improving the efficiency of adaptation. For instance, ELLE (qin2022elle, ) employs function-preserved model expansion to facilitate efficient knowledge growth; (amba2021dynamic, ) and (xie2023efficient, ) sub-sample training data based on novelty and diversity to enhance training efficiency, achieving performance superior to full-data training. Though currently underexplored, efficient adaptation in continual pre-training is poised to become significant, given recent findings emphasizing data quality over quantity for LLM generalization (du2022glam, ; li2023quality, ; xie2024data, ; soldaini2024dolma, ).
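As one illustration of how such efficiency-oriented sub-sampling could look, the sketch below scores candidate documents by their loss under the current model (a simple proxy for novelty) and keeps only the highest-scoring fraction. The scoring rule and the proxy itself are our assumptions for exposition, not the exact procedures of (amba2021dynamic, ) or (xie2023efficient, ); a Hugging Face-style causal LM interface is assumed.

```python
# A minimal PyTorch sketch of novelty-based sub-sampling for continual pre-training.
import torch
import torch.nn.functional as F

@torch.no_grad()
def novelty_score(model, input_ids):
    """Average next-token loss of one tokenized document under the current model:
    documents the model already predicts well (low loss) are considered less novel."""
    logits = model(input_ids.unsqueeze(0)).logits       # (1, seq_len, vocab)
    return F.cross_entropy(logits[0, :-1], input_ids[1:]).item()

def select_novel_subset(model, documents, keep_fraction=0.1):
    """Keep only the most novel fraction of the newly collected corpus."""
    ranked = sorted(documents, key=lambda ids: novelty_score(model, ids), reverse=True)
    return ranked[: max(1, int(keep_fraction * len(ranked)))]
```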

4.1.2. General Observations on CPT

Table 1 summarizes the existing studies on continual pre-training (CPT), and here are some key observations we make about CPT.

4.1.3. Distributional Shifts in CPT

This survey categorizes the distributional shifts of CPT into three main types: (i) Language Shift: LLMs sequentially learn different language corpora, e.g., English \rightarrow Chinese (gogoulou2024continual, ; li2024examining, ). (ii) Content Shift: LLMs sequentially learn corpora from different fields, e.g., chemistry \rightarrow biology (gururangan2022demix, ; cossu2022continual, ; jin2022lifelong, ; qin2023recyclable, ; chen2023lifelong, ; gupta2023continual, ). (iii) Temporal Shift: distributional shifts occur over time, e.g., news in 2021 \rightarrow news in 2022, with a major focus on timestamp-sensitive knowledge retention and update (amba2021dynamic, ; jin2022lifelong, ; dhingra2022time, ; jang2022towards, ; jang2022temporalwiki, ). Some works continually pre-train LLMs on datasets constructed from the model's errors (zhao2024large, ) or on re-weighted samples (chen2024take, ) or tokens (lin2024rho, ); since they cannot be properly categorized into the above types, we label them “Other” in Table 1.

Language Shift. (gogoulou2024continual, ) focuses on assessing LLMs’ natural ability to learn new languages sequentially (English, Norwegian, and Icelandic). With no explicit CL techniques employed to prevent horizontal forgetting, the study observes consistent positive forward transfer of knowledge, facilitating new language acquisition regardless of the learning order. Forgetting, on the other hand, emerges as a significant challenge that cannot be mitigated by increasing LLM size. In (li2024examining, ), the degree of forgetting of previously learned languages when adapting LLMs to a new language is investigated. Various CL techniques, including parameter freezing, LoRA (hu2021lora, ), and (IA)3 (liu2022few, ), are evaluated across multiple dimensions, including output language, general knowledge retention, and reliability. Preliminary experimental results highlight the non-trivial nature of addressing horizontal forgetting for CPT under language shift. We argue that research on CPT under language shifts is in its preliminary stages for two main reasons: firstly, the scale of the datasets, including the number of languages and total token count, remains small; secondly, methods specifically targeting language shifts have yet to be proposed, and only basic combinations of existing continual learning techniques have been evaluated.
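As a concrete reference point, the sketch below shows how one of these baseline techniques, LoRA with the base parameters frozen, is typically set up for continual adaptation using the Hugging Face transformers and peft libraries. The checkpoint name and hyperparameters are placeholders, not the settings used in (li2024examining, ).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)

# Low-rank adapters on the attention projections; all original weights stay frozen,
# which combines the parameter-freezing and LoRA baselines discussed above.
config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, config)
model.print_trainable_parameters()
# ... continue pre-training on the new-language corpus with a standard LM loss ...
```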

Table 1. Summary of existing studies on Continual Pre-training of LLMs. The papers are organized based on their relation to CL: (i) no CL techniques are studied, (ii) CL techniques are studied solely as baselines, and (iii) new CL approaches are proposed. In the table, Dist. Shift denotes the type(s) of distributional shifts a particular study considers and aims to solve. Under Continual Learning Tech., we categorize three types of continual learning techniques studied in the papers: rehearsal (Rehearsal), parameter regularization (Param. Reg.), and architecture expansion (Arch. Exp.). We use “✓”, “✗”, and “♣” to denote “deployed in the proposed method”, “not studied in the paper”, and “studied as a baseline method”, respectively. It is noteworthy that we do not include naive sequential fine-tuning in this table, as it is universally studied as an important baseline method in all of the papers listed. Papers with only “♣” (jin2022lifelong, ; jang2022temporalwiki, ; jang2022towards, ) study only existing CL techniques rather than proposing new ones, and papers with only “✗” (gupta2023continual, ; gogoulou2024continual, ) study special aspects of fine-tuning without CL techniques.
Method | Dist. Shift | #Domains | Continual Learning Tech. (Rehearsal / Param. Reg. / Arch. Exp.) | LLM Arch. | Evaluation (Pre-Training / Downstream)
TimeLMs (loureiro2022timelms, ) | Temporal | 8 | | RoBERTa |
(yildiz2024investigating, ) | Content | 159 | | RoBERTa, GPT-2 |
(gupta2023continual, ) | Content | 1 | | Pythia |
(gogoulou2024continual, ) | Language | 3 | | GPT |
RHO-1 (lin2024rho, ) | Other | 1 | | TinyLlama, Mistral |
(li2024examining, ) | Language | 1 | P-Freeze (♣), Adapter (♣), LoRA (♣) | Llama2 |
CKL (jang2022towards, ) | Temporal | 1 | Mix-Review (♣), P-Freeze (♣), RecAdam (♣), LoRA (♣), K-Adapter (♣) | T5 |
LLPT (jin2022lifelong, ) | Temporal, Content | 4, 8 | ER (♣), Logit-KD (♣), Rep-KD (♣), Contrast-KD (♣), SEED-KD (♣), oEWC (♣), Adapter (♣), Layer Exp. (♣) | RoBERTa |
TemporalWiki (jang2022temporalwiki, ) | Temporal | 5 | Mix-Review (♣), P-Freeze (♣), RecAdam (♣), LoRA (♣), K-Adapter (♣) | GPT-2 |
CPT (ke2022continual-train, ) | Content | 4 | DER++ (♣), KD (♣), CPT (✓), EWC (♣), HAT (♣), Adapter (♣), DEMix (♣) | RoBERTa |
ERNIE 2.0 (sun2020ernie, ) | Content | 4 | ER (✓♣) | ERNIE |
(amba2021dynamic, ) | Temporal | 7 | P-Freeze (✓), Vocab. Exp. (✓) | BERT |
(cossu2022continual, ) | Content | 5 | Vocab. Exp. (✓) | BERT, RoBERTa |
DEMix (gururangan2022demix, ) | Content | 8 | MoE (✓) | GPT-3 |
TempoT5 (dhingra2022time, ) | Temporal | 1 | Vocab. Exp. (✓), Prompt (✓) | T5 |
RecTuning (qin2023recyclable, ) | Content | 4 | ER (✓), KD (✓), Adapter (✓) | RoBERTa |
Lifelong-MoE (chen2023lifelong, ) | Content | 3 | ER (♣), KD (✓), P-Freeze (✓), L2 (♣), MoE (✓) | GLaM |
ELLE (qin2022elle, ) | Content | 5 | ER (✓♣), KD (♣), P-Freeze (✓), Prompt (✓), Layer Exp. (✓), Adapter (♣) | BERT, GPT |
(ibrahim2024simple, ) | Content, Language | 2 | ER (✓) | GPT-NeoX |
CEM (zhao2024large, ) | Other | 1 | ER (✓) | CuteGPT, ChatGLM, Qwen-Chat |
IR-DRO (chen2024take, ) | Other | 1 | ER (✓) | OPT |

Content Shift. Without using complex CL techniques, (yildiz2024investigating, ) explores large-scale CPT over 159 content domains and makes key observations about the domain structures and model properties. It shows that CPT on various domains can effectively improve models’ adaptation ability compared to DAP on a single domain. Similarly, (gupta2023continual, ) continues the pre-training phase of Pythia (biderman2023pythia, ) with no complex CL techniques. The study focuses on improving CPT with a simple learning-rate (re-)warm-up, discovering that re-warming consistently yields improvements over models trained from scratch. Building upon this simple observation, (ibrahim2024simple, ) further shows that a proper combination of learning-rate re-warming and re-decay, together with replay of the previous data, is sufficient to achieve performance comparable to full re-training.
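A minimal sketch of such a learning-rate re-warming and re-decay schedule is given below; the peak and minimum rates and the warmup length are arbitrary placeholders rather than the values used in (gupta2023continual, ) or (ibrahim2024simple, ).

```python
import math

def rewarm_redecay_lr(step, stage_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=1000):
    """Learning rate for one continual pre-training stage: linearly re-warm from zero
    to the peak, then cosine re-decay to the minimum over the rest of the stage."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, stage_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Example: query the schedule at a few points of a 100k-step stage.
for s in [0, 500, 1000, 50_000, 100_000]:
    print(s, round(rewarm_redecay_lr(s, 100_000), 6))
```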

Another pioneering work, LLPT (jin2022lifelong, ), establishes a comprehensive training and evaluation protocol for a series of content-level distributional shifts, referred to as “domain-incremental data streams”. They assess multiple CL methods and, similar to the findings in (gogoulou2024continual, ), observe consistent forward knowledge transfer, yet horizontal forgetting remains significant. Moreover, contrary to the common understanding that experience replay (ER) (chaudhry2019tiny, ) is the most efficient approach to preventing forgetting, the authors find it ineffective in the case of CPT; they speculate that ER’s inefficiency may stem from overfitting issues (yuan2020revisiting, ; jin2022lifelong, ). Recyclable Tuning (qin2023recyclable, ) is the first study to consider both upstream LLM suppliers and downstream consumers at the same time. It shows that if the upstream supplier continually pre-trains LLMs, with or without replay, consumer-side efficiency can be boosted by recycling previously learned update components. Two CL techniques, initialization from outdated components and knowledge distillation, complement each other in improving adaptation efficiency in this context.

Other approaches involve training additional domain-specific experts for new content domains. DEMix (gururangan2022demix, ) addresses CPT by incrementally training and integrating new experts (DEMix layers) for new domains. To ensure reasonable inference performance when no domain information is available at test time, DEMix proposes a parameter-free probabilistic approach that dynamically estimates a weighted mixture of domains. Introducing a domain variable $D_t$ alongside each word $x_t$, the authors estimate the next-word probability $p(x_t|{\bm{x}}_{<t})$ by marginalizing over all experts (in the marginalization step of the original paper, the posterior is written as $p(D_t=j|{\bm{x}}_{t})$ rather than $p(D_t=j|{\bm{x}}_{<t})$; we believe this is a minor typo and use the corrected version here):

$p(x_t\,|\,{\bm{x}}_{<t}) = \sum_{j=1}^{n} p(x_t\,|\,{\bm{x}}_{<t}, D_t=j)\cdot p(D_t=j\,|\,{\bm{x}}_{<t}) = \sum_{j=1}^{n} p(x_t\,|\,{\bm{x}}_{<t}, D_t=j)\cdot\left[\frac{p({\bm{x}}_{<t}\,|\,D_t=j)\cdot p(D_t=j)}{\sum_{j'=1}^{n} p({\bm{x}}_{<t}\,|\,D_t=j')\cdot p(D_t=j')}\right],$

where each conditional probability term $p(\cdot\,|\,\cdot, D_t=j)$ is computed using the corresponding domain expert. The DEMix framework’s modularization has been shown to facilitate efficient domain-adaptive pre-training, promote relevant knowledge during inference, and allow for removable components. Lifelong-MoE (chen2023lifelong, ), similar to DEMix (gururangan2022demix, ), incrementally trains domain experts for new domains. However, Lifelong-MoE differs from DEMix in utilizing a token-level gating function to activate multiple experts for intermediate embedding calculation. During training, previous experts’ parameters and gating functions remain frozen, and a knowledge distillation loss is employed to regulate parameter updates, thereby making Lifelong-MoE robust against horizontal forgetting.
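The marginalization above can be implemented directly once each expert exposes a next-token distribution and a log-likelihood of the prefix; the NumPy sketch below is our own minimal rendering of that computation, with a uniform domain prior assumed.

```python
import numpy as np

def mixture_next_token(expert_token_probs, expert_prefix_loglik, domain_prior=None):
    """expert_token_probs: (n_experts, vocab_size), rows are p(x_t | x_<t, D_t = j).
    expert_prefix_loglik: (n_experts,), entries are log p(x_<t | D_t = j).
    Returns p(x_t | x_<t) by weighting experts with the posterior p(D_t = j | x_<t)."""
    n = expert_token_probs.shape[0]
    prior = np.full(n, 1.0 / n) if domain_prior is None else np.asarray(domain_prior)
    log_post = expert_prefix_loglik + np.log(prior)
    log_post -= np.logaddexp.reduce(log_post)      # normalize in log space (Bayes rule)
    return np.exp(log_post) @ expert_token_probs

# Tiny example with two experts over a vocabulary of three tokens.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(mixture_next_token(probs, expert_prefix_loglik=np.array([-5.0, -9.0])))
```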

It is noteworthy that some papers draw almost opposite conclusions regarding the severity of forgetting in CPT under content shifts. For instance, (cossu2022continual, ) continually pre-trains BERT-based models (devlin2018bert, ; liu2019roberta, ) on five scientific domains and evaluates performance on downstream sentiment analysis. They observe that even trivial sequential pre-training does not exhibit severe forgetting, raising reasonable questions about the necessity of dedicated CL techniques for CPT.

Temporal Shift. In the context of CPT amid content shifts, Multi-Task Learning (MTL) is often regarded as the upper bound achievable (pentina2016theoretical, ; wang2024comprehensive, ; shi2024unified, ). However, this belief does not fully hold when considering CL under temporal shifts (jang2022towards, ; jang2022temporalwiki, ; dhingra2022time, ), as temporal shifts can introduce conflicting information, posing challenges for LLMs. For instance, the statement “Lionel Messi plays for team Barcelona” remains accurate from 2004 to 2021 but becomes false by 2024, as “Lionel Messi plays for team Inter Miami” becomes the correct statement.

Hence, as advocated by CKL (jang2022towards, ) and TemporalWiki (jang2022temporalwiki, ), LLMs undergoing continual adaptation to temporal shifts must simultaneously achieve three objectives: (i) retention of old knowledge, (ii) acquisition of new knowledge, and (iii) update of outdated knowledge. The two works evaluate the same set of continual learning baseline methods (chen2020recall, ; he2021analyzing, ; hu2022lora, ; wang2021kadapter, ), each highlighting distinct aspects of their impact. CKL (jang2022towards, ) observes that parameter expansion consistently exhibits robust performance across all experimental conditions; in contrast, replay-based methods struggle to efficiently acquire new knowledge and update outdated knowledge, leading to rapid forgetting of newly learned information during training. TemporalWiki (jang2022temporalwiki, ) constructs a series of temporal corpora and their differential sets from sequential snapshots of Wikipedia, revealing that updating LLMs on these differential sets substantially enhances new knowledge acquisition and updates while requiring significantly fewer computational resources; various CL techniques also prove effective in mitigating horizontal forgetting during this process. LLPT (jin2022lifelong, ) introduces temporal generalization evaluation for LLMs pre-trained on sequential corpora. Through experiments on a large-scale chronologically-ordered Tweet stream, the authors demonstrate the superiority of CPT combined with CL techniques over task-specific LMs, in terms of both knowledge acquisition and temporal generalization. Nonetheless, these preliminary experiments do not conclusively determine which specific CL method is preferable to the others.
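The core idea of training on a differential set can be illustrated with the rough sketch below, which keeps only articles that are new or whose text changed between two snapshots; the real TemporalWiki pipeline operates at a finer granularity, so this is only an approximation of the construction.

```python
def build_diff_corpus(old_snapshot: dict, new_snapshot: dict) -> list:
    """old_snapshot / new_snapshot map article titles to their text at two time steps.
    Returns the texts to continually pre-train on: new articles plus changed ones."""
    diff = []
    for title, text in new_snapshot.items():
        if title not in old_snapshot or old_snapshot[title] != text:
            diff.append(text)
    return diff

old = {"Lionel Messi": "... plays for FC Barcelona ...", "Pi": "3.14159..."}
new = {"Lionel Messi": "... plays for Inter Miami ...", "Pi": "3.14159...",
       "New Article": "..."}
print(len(build_diff_corpus(old, new)))  # 2: the updated entity and the new article
```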

Another line of work, Temporal Language Models (TLMs), takes a different approach to address knowledge retention, acquisition, and update under temporal shifts by integrating temporal information into the model (rosin2022time, ; dhingra2022time, ; su2023efficient, ). During training, they inject temporal information into training examples as prefixes of prompts, using special tokens (rosin2022time, ), explicit year information (dhingra2022time, ), or syntax-guided structural information (su2023efficient, ). In sequential training experiments conducted by TempoT5 (dhingra2022time, ), comparison between continually and jointly pre-trained LMs demonstrates that CPT better balances adaptation and forgetting when the replay rate of past data is appropriately set.
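A minimal sketch of this kind of time conditioning, assuming the simple explicit-year prefix format, is shown below; the exact prefix wording differs across (rosin2022time, ; dhingra2022time, ; su2023efficient, ).

```python
def with_temporal_prefix(text: str, year: int) -> str:
    """Prepend explicit time information so the LM can condition on when a fact held."""
    return f"year: {year} text: {text}"

print(with_temporal_prefix("Lionel Messi plays for team Barcelona.", 2019))
print(with_temporal_prefix("Lionel Messi plays for team Inter Miami.", 2024))
```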

Others. CPT, as a technique for progressively acquiring novel knowledge, can also be used to refine LLMs’ behavior. CEM (zhao2024large, ) collects examples on which the model’s responses are incorrect and continually trains the model on them, along with a supplemental dataset. RHO-1 (lin2024rho, ) proposes Selective Language Modeling (SLM), which employs a reference model to evaluate the perplexity of each token in the training corpus and continually pre-trains the model on high-perplexity tokens. Similarly, IR-DRO (chen2024take, ) re-trains the model on re-weighted examples from the original pre-training dataset, focusing more on higher-loss sequences.
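A hedged PyTorch sketch of token-level selection in this spirit is shown below: tokens are scored by how much worse the training model does than a frozen reference model, and only the top-scoring fraction contributes to the loss. The scoring rule and the keep ratio are our simplification of the SLM idea, not RHO-1’s exact recipe.

```python
import torch
import torch.nn.functional as F

def selective_lm_loss(train_logits, ref_logits, labels, keep_ratio=0.6):
    """train_logits / ref_logits: (batch, seq, vocab); labels: (batch, seq)."""
    vocab = train_logits.size(-1)
    ce_train = F.cross_entropy(train_logits.reshape(-1, vocab), labels.reshape(-1),
                               reduction="none")
    with torch.no_grad():
        ce_ref = F.cross_entropy(ref_logits.reshape(-1, vocab), labels.reshape(-1),
                                 reduction="none")
        # Score each token by its excess loss over the reference model.
        score = ce_train.detach() - ce_ref
        k = max(1, int(keep_ratio * score.numel()))
        keep = torch.topk(score, k).indices
    # Back-propagate only through the selected ("worth learning") tokens.
    return ce_train[keep].mean()
```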

The significance of addressing temporal shifts through CPT is underscored by several industrial studies. For instance, (amba2021dynamic, ) employs a dynamic vocabulary expansion algorithm and an efficient sub-sampling procedure to conduct CPT on large-scale emerging tweet data. Conversely, (loureiro2022timelms, ) adopts CPT without explicit measures to constrain model updates, releasing a series of BERT-based LMs incrementally trained on new tweet data every three months. Preliminary experimental results demonstrate substantial improvements of continually pre-trained LMs over the base BERT model across downstream tasks. While some studies question, for environmental reasons such as reducing CO2 emissions (attanasio2023worth, ), the necessity of continually adapting LLMs along the temporal axis, the community commonly embraces CPT as a more efficient learning paradigm than the traditional “combine-and-retrain” approach.

4.2. Domain-Adaptive Pre-training (DAP)

Background of DAP. Institutions, regardless of size, often possess significant amounts of unlabeled, domain-specific data. This data bridges the gap between general-purpose LLMs trained on diverse corpora and fine-tuned LLMs designed for specific downstream tasks, and leveraging it as a preparatory stage can facilitate effective adaptation of LLMs to downstream tasks. Such a process, variously called “continued/continual/continuous pre-training” (yan2023af, ; guo2023continuous, ; ma2023ecomgptct, ; han2021econet, ; xie2023efficient, ; xie2023quert, ; huang2023lawyer, ; Lu2023BBTFin, ; Xie2023PIXIU, ; Azerbayev2023LLEMMA, ; yue2023mammoth, ; colombo2024saullm7b, ; Zhang2024SciGLM, ; shen2024tag, ), “further pre-training” (song2024code, ; lin2023geogalactica, ; deng2023learning, ; Rubungo2023LLM-Prop, ; agarwal2024structured, ), “domain tuning” (rongali2021continual, ), “knowledge enhancement pre-training” (Lu2023BBTFin, ), and “knowledge injection training” (wu2023pmc, ), is unified and termed “Domain-Adaptive Pre-training (DAP)” (gururangan2020dont, ) for clarity and consistency throughout this survey. In the pioneering work on domain-adaptive pre-training (DAPT) (gururangan2020dont, ), the authors continue pre-training language models on a larger domain-specific dataset before fine-tuning them on downstream tasks, resulting in universally improved performance across various tasks. As this observation has been validated on multiple domains in parallel, including BioMed, CS, News, and Reviews (gururangan2020dont, ), practitioners commonly accept that employing DAP on additional unlabeled domain-specific data benefits downstream tasks. Consequently, this technique has become widely deployed in many modern LLMs.

Summary of LLMs with DAP. We provide a summary of existing studies utilizing DAP for LLMs in Table 2. Each entry is characterized by three main features: (i) training process specifications, encompassing the vertical domain for which LLMs are trained, the training pipeline preceding release, and the LLM architecture employed; (ii) adopted continual learning techniques, including rehearsal, parameter regularization, and architecture expansion; and (iii) evaluation metrics for CL, such as backward transfer (forgetting) and forward transfer (adaptation to downstream data).

4.2.1. General Observation on DAP

Several key observations emerge regarding the research landscape of DAP (Table 2).

4.2.2. Different Domains of DAP

Legal Domain. Given the legal industry’s demand for managing ever-growing volumes of legal documents, there is a burgeoning need to harness LLMs to aid legal professionals in navigating, interpreting, and generating high-quality legal materials (xiao2021lawformer, ; savelka2023explaining, ; yue2023disc, ). In Lawyer Llama (huang2023lawyer, ), the authors gathered publicly available legal texts from China Courts websites, totaling approximately 10 billion tokens as noted in a GitHub issue (lawyerllam-git, ). In SaulLM (colombo2024saullm7b, ), the authors collected the DAP corpus from jurisdictions in different countries, resulting in a corpus of 30 billion tokens covering diverse aspects of legal texts; combined with previously available datasets (gao2020pile, ; koehn2005europarl, ), the total number of tokens used for legal-domain DAP reaches 94 billion. The substantial volume of DAP data, while offering valuable insights into the specific domain, increases the risk of vertical forgetting of general knowledge due to the large number of update steps involved. To mitigate this issue, SaulLM incorporates general data from Wikipedia, StackExchange, and GitHub into the DAP data, constituting about 2% of the final dataset (colombo2024saullm7b, ). Similarly, Lawyer Llama replays general-domain data during DAP, but the replay rate is not disclosed (huang2023lawyer, ). (takahashi2024pretraining, ) also replays older (non-latest) business documents during DAP when building a Japanese business-specific LLM.
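The fixed-rate mixing used by these works can be sketched as below, where a general-domain corpus is blended into the domain corpus so that it makes up a target fraction of the final DAP dataset; the 2% rate mirrors SaulLM’s reported setting, while the sampling scheme itself is our simplification.

```python
import random

def mix_general_replay(domain_docs, general_docs, general_fraction=0.02, seed=0):
    """Build a DAP corpus in which roughly `general_fraction` of documents come from
    the general-domain corpus, as a lightweight guard against vertical forgetting."""
    rng = random.Random(seed)
    n_general = int(general_fraction / (1.0 - general_fraction) * len(domain_docs))
    n_general = min(n_general, len(general_docs))
    mixed = list(domain_docs) + rng.sample(general_docs, n_general)
    rng.shuffle(mixed)
    return mixed
```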

Medical Domain. The development of LLMs holds promise for revolutionary changes in the medical industry, offering potential improvements in efficiency and quality across medical communication, disease diagnosis, and decision-making for doctors (he2024foundation, ; li2023chatdoctor, ; singhal2023large, ; jeblick2022chatgpt, ; chen2023utility, ). Efforts have been made to develop medical specialists by either training an LLM from scratch (Han2023MedAlpaca, ; Singhal2023MedPaLM2, ; gu2021domain, ; luo2022biogpt, ) or fine-tuning publicly-available LLMs to meet specific medical needs (luo2023biomedgpt, ; wu2023pmc, ; Chen2023HuatuoGPTII, ; xiong2023doctorglm, ; bao2023discmedllm, ; zhang2023alpacare, ). Among these approaches, DAP techniques have been extensively utilized to preserve the communication and instruction-following abilities of a general LLM, preparing it for subsequent medical applications (luo2023biomedgpt, ; wu2023pmc, ; Chen2023HuatuoGPTII, ).

BioMedGPT (luo2023biomedgpt, ) is a multi-modal biomedical language model that integrates representations of human language and the language of life (molecules, proteins, cells, genes, etc.). Prior to the final multi-modal supervised fine-tuning, the authors initialize the model from Llama2-Chat (touvron2023llama2, ) and conduct DAP using extensive biomedical documents from S2ORC (lo2020s2orc, ), without considering any CL techniques or evaluations. In (guo2023continuous, ), DAP is performed using Chinese medical encyclopedias and online expert articles, with next-token prediction as the training objective. During DAP, performance on general-domain datasets gradually deteriorates as the number of training steps increases, but improves on the downstream medical examination tasks (hendryckstest2021, ; liu2023benchmarking, ; li2023cmmlu, ). PMC-LLaMA (wu2023pmc, ) gathers biomedical papers from S2ORC (lo2020s2orc, ) and medical textbooks for “knowledge injection training”. During this phase, a general language corpus from RedPajama-Data (together2023redpajama, ) is replayed at a 5% rate within each training batch. However, the paper does not analyze the effectiveness of mixing in general-domain data during DAP.

To mitigate vertical forgetting, AF Adapter (yan2023af, ) proposes an adapter structure that extends the width of the attention layers and FFNs to acquire domain knowledge; only the adapters are tuned during DAP. Similarly, Hippocrates (acikgoz2024hippocrates, ) deploys LoRA during DAP to inject medical knowledge while preserving general ability. Me-Llama (xie2024me, ) mixes in about 25% general-domain data during DAP on clinical notes and biomedical articles, which even achieves positive backward transfer on MMLU (hendryckstest2021, ). HuatuoGPT-II (Chen2023HuatuoGPTII, ) proposes to fuse DAP into the final SFT, turning the two-stage development into one unified process. The challenge of such a process mainly comes from the data heterogeneity of DAP’s unlabeled corpus. The authors address this challenge by reformulating paragraphs of data into (instruction, output) format using existing large language models. They further employ a priority sampling strategy to avoid compromising downstream ability, a pitfall observed with the fixed-rate data mixing strategy (touvron2023llama2, ). The paper empirically demonstrates the superiority of unified one-stage SFT over two-stage training, questioning the rationale of the current DAP practice. On medical-domain data, (rongali2021continual, ) finds that LMs constrained by CL techniques on source domains exhibit greater robustness to future domain shifts; specifically, they identify that parameter regularization techniques like EWC (kirkpatrick2017overcoming, ), despite a slightly higher cost, can facilitate positive forward and backward transfer.
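For reference, the EWC-style parameter regularization mentioned above can be written as the following PyTorch sketch, where a diagonal Fisher estimate and a parameter snapshot from the previous (general or source-domain) stage anchor the weights during DAP; variable names and the scaling are illustrative.

```python
import torch

def ewc_penalty(model, fisher, anchor_params, lam=1.0):
    """fisher / anchor_params: dicts from parameter name to tensors saved after the
    previous training stage (diagonal Fisher estimate and the old parameter values)."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - anchor_params[name]).pow(2)).sum()
    return 0.5 * lam * penalty

# During DAP:  loss = lm_loss + ewc_penalty(model, fisher, anchor_params, lam=10.0)
```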

Table 2. Summary of the existing studies that leverage Domain-Adaptive Pre-training of LLMs, where the papers are organized into four main categories based on whether they (i) adopt continual learning techniques and (ii) perform the evaluation for backward transfer (forgetting). In the column of Train Proc. (Training Process), we omit the phase of general Pre-Training. DAP represents Domain-Adaptive Pre-Training; SFT represents Supervised Fine-Tuning; IT represents Instruction Tuning. The prefixes G- and D- represent General and Domain-Specific training processes (lin2023geogalactica, ; huang2023lawyer, ), and the prefix U- represents a Unified process (wu2024llama, ; Chen2023HuatuoGPTII, ). The prefixes MM- and LC- represent Multi-Modal and Long-Context training phases (luo2023biomedgpt, ; Zheng2023MarineGPT, ; rozière2024code, ). In the column of Continual Learning Eval., we consider two criteria: (i) Backward Transfer, i.e., performance degradation on the previous tasks, also known as catastrophic forgetting, and (ii) Forward Transfer, i.e., the performance gained by DAP when transferring the LLMs to the downstream tasks. We use L and Perp. to denote Loss and Perplexity, FT to denote Fine-Tuning, ZS and FS to denote Zero-Shot and Few-Shot Accuracy, and HE and LLM to denote Human Evaluation and LLM Evaluation for generative tasks.
Domain | Method | Train Proc. | LLM Arch. | Continual Learning Tech. (Rehearsal / Param. Reg. / Arch. Exp.) | Backward Transfer | Forward Transfer
Medical | BioMedGPT (luo2023biomedgpt, ) | DAP → MM-SFT | Llama2 | | | FT
Financial | BBT-Fin (Lu2023BBTFin, ) | DAP | T5 | | | FT
Financial | CFGPT (li2023cfgpt, ) | DAP → SFT | InternLM | Q-LoRA (SFT) | | HE
Scientific | AstroLlama (Nguyen2023AstroLLaMA, ) | DAP | Llama2 | | | Perp.
Scientific | OceanGPT (Bi2023OCEANGPT, ) | DAP → IT | Vicuna, Llama2-chat, ChatGLM2 | LoRA (IT) | | HE
Scientific | K2 (deng2023learning, ) | DAP → SFT | Llama | LoRA (SFT) | | Perp. / ZS / LLM
Scientific | MarineGPT (Zheng2023MarineGPT, ) | MM-DAP → MM-IT | Llama | | | HE
Code | CodeGen (nijkamp2022codegen, ) | DAP → DAP | CodeGen | | | Perp. / ZS
Code | Comment-Aug (song2024code, ) | IT → DAP | Llama2, Code Llama, InternLM2 | | | ZS
EventTemporal | ECONET (han2021econet, ) | DAP → FT | BERT, RoBERTa | | | FT
CommonSense | CALM (zhou2020pre, ) | DAP → FT | T5 | | | FT
Multi-Domain | BLADE (li2024blade, ) | DAP → IT | BLOOMZ | | | ZS
Scientific | ClimateGPT (thulke2024climategpt, ) | DAP → IT → RAG | Llama2 | | | FS / Ret.
Medical | (guo2023continuous, ) | DAP → FT | Llama2 | | FS / FT | FS / FT
Financial | (xie2023efficient, ) | DAP | Pythia | | L / FS | L / FS
Scientific | GeoGalactica (lin2023geogalactica, ) | DAP → G-SFT → D-SFT | GAL | | ZS | Perp. / ZS / LLM
Code | StarCoder (li2023starcoder, ) | DAP | StarCoder | | Perp. / ZS / FS | Perp. / ZS / FS
Code | DeepSeek-Coder (guo2024deepseekcoder, ) | DAP | DeepSeek-LLM | | ZS / FS | ZS
Multi-Domain | DAPT (gururangan2020dont, ) | DAP → FT | RoBERTa | | Loss | L / FT
Financial | WeaverBird (Xue2023WeaverBird, ) | DAP | GLM2 | LoRA | | HE
Code | IRCoder (paul2024ircoder, ) | DAP | StarCoder, DeepSeek-Coder, Code Llama | LoRA | | ZS
Code | Code Llama (rozière2024code, ) | DAP → LC-FT → IT; DAP → DAP → LC-FT | Llama2 | Replay | | Perp. / ZS
Legal | SaulLM (colombo2024saullm7b, ) | DAP → U-IT | Mistral | Replay | | Perp. / ZS
Medical | PMC-LLaMA (wu2023pmc, ) | DAP → IT | Llama | Replay | | ZS / FT
Scientific | Llemma (Azerbayev2023LLEMMA, ) | DAP | Code Llama | Replay | | Perp. / FS
Multi-Domain | DAS (ke2022continual-pre, ) | [DAP]^n | RoBERTa | DER++ (♣), EWC (♣), HAT (♣), Soft-Masking, Adapter (♣), DEMix (♣) | | FT
Medical | Hippocrates (acikgoz2024hippocrates, ) | DAP → IT → MA | Llama2, Mistral | LoRA | | ZS / FS
Language | Sailor (dou2024sailor, ) | DAP | Qwen1.5 | Replay | | ZS
Code & Math | Llama Pro (wu2024llama, ) | DAP → U-SFT | Llama2 | Block Exp., LoRA (♣) | ZS / FS | Perp. / ZS / FS
Medical | AF Adapter (yan2023af, ) | DAP → FT | RoBERTa | Layer Exp., LoRA (♣) | Acc. | L / FT
Medical | (rongali2021continual, ) | DAP → FT | BERT, RoBERTa, DistilBERT | Replay (♣), GEM (♣), L2 Reg. (♣), EWC (♣) | L / FT | L / FT
Medical | HuatuoGPT-II (Chen2023HuatuoGPTII, ) | DAP + U-SFT | Baichuan2 | Replay | ZS | ZS / HE
Financial | XuanYuan 2.0 (Zhang2023xuanyuan, ) | DAP + SFT | BLOOM | Replay | HE | HE
Scientific | PLlama (Yang2023PLLaMa, ) | DAP → IT | GAL | Replay | L | L / ZS
E-Commerce | EcomGPT-CT (ma2023ecomgptct, ) | DAP → SFT | BLOOM | Replay | ZS / FS | ZS / FS
Legal | Lawyer Llama (huang2023lawyer, ) | DAP → G-IT → D-IT | Llama | Replay | ZS | ZS
Multi-Domain | AdaptLLM (cheng2024adapting, ) | DAP | Llama | Replay | ZS | ZS / FT
Language | Swallow (fujii2024continual, ) | DAP | Llama2 | Replay | FS | FS
Financial | (takahashi2024pretraining, ) | DAP | Llama2 | Replay | Loss / ZS | Loss / ZS / FS / RAG
Medical | Me-Llama (xie2024me, ) | DAP → IT | Llama2 | Replay | ZS / FS | ZS / FS / FT
Language | Aurora-M (nakamura2024aurora, ) | DAP → IT | StarCoder | Replay | ZS | ZS / FS / HE

Financial Domain. Similar to the medical domain, LLMs hold immense potential for enhancing financial communication, decision-making processes, and risk assessment for both traders and ordinary individuals (shah2023zero, ; Yang2023InvestLM, ; Wang2023FinGPT, ; li2023cfgpt, ). Despite advancements, a gap persists between general-purpose LLMs and existing domain-specific smaller-scale LLMs (araci2019finbert, ; Wu2023BloombergGPT, ), underscoring the urgent need for more powerful financial-domain experts through the integration of LLMs. Notably, DAP techniques have emerged as crucial tools for tailoring LLMs to the intricacies of the financial domain while mitigating the negative effects of abrupt domain shifts from general to finance (Lu2023BBTFin, ; li2023cfgpt, ; xie2023efficient, ; Xue2023WeaverBird, ; Zhang2023xuanyuan, ).

BBT-Fin (Lu2023BBTFin, ) collects a Chinese financial DAP dataset comprising 80 billion tokens sourced from corporate reports, analyst reports, social media, and financial news. In addition to the conventional masked language modeling (MLM) training objective, BBT-Fin further incorporates triplet masking and span masking techniques during DAP. CFGPT (li2023cfgpt, ) creates CFData, a financial dataset for DAP and SFT, comprising 141 billion tokens. During DAP, CFGPT does not employ CL techniques but utilizes QLoRA (dettmers2023qlora, ) for preventing overfitting to downstream data and balancing general response ability and domain-specific ability during SFT. These two methods are typical domain-specific LLMs focusing solely on adaptation to target domains without explicit CL measures or evaluation of vertical forgetting.

In (xie2023efficient, ), the authors aim to enhance the data efficiency of DAP. When the downstream tasks’ data distribution ${\mathcal{T}}$ is known, based on the generalization bound (ben2010theory, ; ganin2016domain, ; shi2024unified, ), the authors propose to sample a subset of the DAP data whose distribution ${\mathcal{D}}$ is similar to the downstream data, i.e., whose divergence $d_{{\mathcal{H}}\Delta{\mathcal{H}}}({\mathcal{D}},{\mathcal{T}})$ is low. When the downstream data distribution is unknown, the authors instead ensure novelty and diversity in the sampled corpus for DAP. This approach significantly enhances DAP efficiency: it utilizes only 10% of the originally collected data yet outperforms models trained on the entire DAP dataset, underscoring the importance of data quality over quantity. WeaverBird (Xue2023WeaverBird, ) introduces an intelligent finance dialogue system, whose encoder is trained on Chinese and English financial documents, alongside expert-annotated financial query-response pairs, using LoRA (hu2022lora, ). XuanYuan 2.0 (Zhang2023xuanyuan, ), akin to HuatuoGPT-II (Chen2023HuatuoGPTII, ), proposes hybrid-tuning, which fuses the DAP and SFT stages into one and mixes general-domain and financial-domain data together. Notably, the data distribution in hybrid-tuning is unconventional: financial DAP data comprises only a small portion (13%). This prompts a pertinent question, in line with the investigation of efficient DAP in (xie2023efficient, ): is a large DAP dataset necessary for developing a domain-specific LLM?
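Since the divergence-based criterion of (xie2023efficient, ) requires a dedicated estimator, the sketch below substitutes a crude embedding-similarity proxy: DAP documents closest to the centroid of the downstream data are kept. This is only meant to convey the selection idea, not the paper’s actual procedure.

```python
import numpy as np

def select_dap_subset(dap_embeddings, downstream_embeddings, budget):
    """dap_embeddings: (N, d) document embeddings of the candidate DAP corpus.
    downstream_embeddings: (M, d) embeddings of downstream task data.
    Returns indices of the `budget` DAP documents most similar to the downstream data."""
    target = downstream_embeddings.mean(axis=0)
    target = target / np.linalg.norm(target)
    dap_norm = dap_embeddings / np.linalg.norm(dap_embeddings, axis=1, keepdims=True)
    cosine = dap_norm @ target
    return np.argsort(-cosine)[:budget]
```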

Scientific Domain. Vertical scientific LLMs (Taylor2022Galactica, ; Yin2023FORGE, ; Xie2023DARWIN, ; Zhang2024SciGLM, ) span many subjects, including astronomy (Nguyen2023AstroLLaMA, ; Perkowski2024AstroLLaMAChat, ), mathematics (Azerbayev2023LLEMMA, ; Yu2023Outcome, ; luo2023wizardmath, ; yue2023mammoth, ; Gou2023ToRA, ), geology (Gao2023GLLAVA, ; roberts2023gpt4geo, ; lin2023geogalactica, ; wang2023nearrealtime, ), chemistry and physics (Rubungo2023LLM-Prop, ), biology (Yang2023PLLaMa, ; Zhao2023GIMLET, ; Cao2023InstructMol, ; Abdine2023Prot2Text, ; xTrimoPGLM2024Chen, ; Bi2023OCEANGPT, ; Zheng2023MarineGPT, ). However, among all the studies listed above, only a small fraction of them adopt the technique of DAP.

OceanGPT (Bi2023OCEANGPT, ) is the first LLM tailored specifically for the ocean domain. It performs DAP on a raw corpus of ocean science literature, prioritizing recent research and historically significant works. K2 (deng2023learning, ) pioneers the development of a foundational language model tailored specifically for geoscience. It aggregates geoscience open-access literature and Earth-science-related Wikipedia pages for DAP. Following this, it undergoes multi-task instruction tuning utilizing LoRA (hu2022lora, ) on both a general instruction tuning dataset and the GeoSignal benchmark introduced within the K2 framework. AstroLlama (Nguyen2023AstroLLaMA, ) gathers abstracts solely from astronomy papers on arXiv and continues pre-training on them; it observes improved perplexity in the domain of scholarly astronomy, without providing further quantitative evaluation. MarineGPT (Zheng2023MarineGPT, ) is a multi-modal LLM designed specifically for the marine domain. During DAP, MarineGPT incorporates 5 million marine image-text pairs to imbue domain knowledge; this involves training the parameters of a Q-Former (li2023blip2, ) between the frozen visual encoder (dosovitskiy2020image, ) and the text decoder (touvron2023llama, ).

Another branch of methods proactively integrates replay of general-domain data to mitigate vertical forgetting. GeoGalactica (lin2023geogalactica, ) introduces a series of LLMs tailored for geoscience. In the DAP phase, besides the 52-billion-token geoscience corpus, arXiv papers and code data are incorporated at a mixing ratio of 8:1:1; the authors believe that including the code data during pre-training can significantly boost the reasoning ability of the LLMs. Although GeoGalactica pinpoints challenges of DAP, including overfitting, catastrophic forgetting, training stability, and convergence speed, it neither provides empirical evidence supporting the inclusion of the code data nor deploys specific measures to address these challenges. Llemma (Azerbayev2023LLEMMA, ) focuses on mathematics: initialized from Code Llama (rozière2024code, ), it undergoes DAP on a blend of a 55-billion-token mathematical pre-training dataset and general-domain data at a ratio of 19:1. In contrast, PLlama (Yang2023PLLaMa, ), designed for plant science, mixes domain-specific and general-domain data at a ratio of 9:1.

Code Domain. The development of LLMs for automatic code filling, debugging, and generation holds significant practical importance (moradidakhel2023github, ; sun2024survey, ). These advancements cover various frameworks, including encoder-only (moradidakhel2023github, ), encoder-decoder (wang2021codet5, ; wang2023codet5plus, ; chai2023erniecode, ), and decoder-only (nijkamp2022codegen, ; nijkamp2023codegen2, ; zheng2023codegeex, ; chen2021evaluating, ; li2023starcoder, ; lozhkov2024starcoder, ; guo2024deepseekcoder, ). In the era of LLMs, there’s a growing trend towards decoder-only architectures (sun2024survey, ), leveraging models pre-trained on general natural language like Llama (touvron2023llama, ; touvron2023llama2, ). Consequently, there’s a shift in the training objective from utilizing code structures to simpler tasks like next token prediction and infilling.

From the perspective of CL, the code domain presents unique advantages and challenges for DAP compared to the other domains discussed so far. On one hand, its hierarchical structure (general-domain corpus \rightarrow multi-language code \rightarrow specific programming language) provides an ideal training pipeline for DAP (rozière2024code, ), offering potential for more efficient training strategies. On the other hand, programming languages adhere to strict grammars, unlike fuzzy and context-dependent natural language. Consequently, language models should ideally leverage these structures through tailored designs, and adopting the same training objectives as for natural languages may yield sub-optimal results. Therefore, many existing studies omit DAP (wang2021codet5, ; wang2023codet5plus, ; luo2023wizardcoder, ; muennighoff2024octopack, ; jiang2023selfevolve, ; wei2023magicoder, ; zhuo2024astraios, ; di2023codefuse, ; li2024instructcoder, ). In what follows, we introduce existing code LLMs that employ DAP before the final downstream tasks, discussing both their common attributes and unique characteristics.

Representing a series of notable works that focus solely on adaptation to target domains, CodeGen (nijkamp2022codegen, ) comprises a suite of LLMs designed for natural language (CodeGen-NL), multi-lingual programming languages (CodeGen-Multi), and mono-lingual programming languages (CodeGen-Mono). These models are trained sequentially, with each subsequent model initialized from the previous one trained on more general-domain data. Comment-Aug (song2024code, ) addresses the challenge of aligning programming languages with natural languages (PL-NL alignment) by performing DAP on the code augmented with generated additional comments. StarCoder (li2023starcoder, ) introduces two models: StarCoderBase and StarCoder. StarCoderBase is initially trained on a mixed dataset comprising various programming languages without significant reweighting on the data. Subsequently, StarCoderBase undergoes further fine-tuning on an additional 35 billion tokens of Python code, resulting in the development of StarCoder. DeepSeek-Coder-v1.5 (guo2024deepseekcoder, ) originates from DeepSeek-LLM (deepseekai2024deepseek, ) and undergoes pre-training on 2 trillion tokens, comprising 87% source code, 10% English code-related natural language, and 3% Chinese natural language corpus. Initialization from a general-domain LLM results in improved performance across various tasks, including natural language and mathematical reasoning, with minimal performance degradation on coding tasks, which underscores the efficacy of DAP.

As the only work investigated so far that utilizes general-data replay to mitigate vertical forgetting in the code domain, Code Llama (rozière2024code, ) introduces a sophisticated training framework tailored for various coding tasks and model sizes. Initialized from Llama 2 weights, these models undergo DAP on a dataset composed of deduplicated public code, discussions about code, and a subset of natural language data. This mix of natural language data serves as a form of pseudo-replay to maintain the models’ proficiency in understanding natural language. Besides replay, architecture expansion has proven effective in acquiring robust coding abilities and preventing vertical forgetting simultaneously. IRCoder (paul2024ircoder, ) utilizes compiler intermediate representations to enhance the multilingual transferability of Code LLMs. By conducting DAP on code grounded in intermediate representations with LoRA (hu2021lora, ), IRCoder achieves superior multilingual programming instruction following, enhanced multilingual code understanding, and increased robustness to prompt perturbations. Llama Pro (wu2024llama, ) undergoes DAP on a combination of code and math data. It expands the original Llama 2 architecture by dynamically adding multiple identity copies of the transformer blocks. These added blocks initially preserve the original functionality and are subsequently tuned during DAP. The proposed expansion method is shown to be more resilient against vertical forgetting than other parameter-efficient tuning methods such as LoRA.
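To make the block-expansion idea more concrete, below is a minimal PyTorch sketch of inserting identity-initialized transformer blocks in the spirit of Llama Pro. It is not the authors’ released implementation; the module names o_proj and down_proj are assumptions for a Llama-style block, and the identity behavior relies on the block’s residual connections.

```python
import copy
import torch.nn as nn

def expand_with_identity_blocks(blocks: nn.ModuleList, every: int = 4) -> nn.ModuleList:
    """Insert a trainable copy after every `every` pre-trained blocks.

    The copy's residual-branch output projections are zero-initialized, so together
    with the residual connections it initially acts as an identity mapping. Original
    blocks are frozen; only the inserted copies are tuned during DAP.
    """
    expanded = []
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = False  # keep pre-trained blocks intact
        expanded.append(block)
        if (i + 1) % every == 0:
            new_block = copy.deepcopy(block)
            for name, module in new_block.named_modules():
                # Assumed names for a Llama-style block's output projections.
                if isinstance(module, nn.Linear) and name.endswith(("o_proj", "down_proj")):
                    nn.init.zeros_(module.weight)
                    if module.bias is not None:
                        nn.init.zeros_(module.bias)
            for p in new_block.parameters():
                p.requires_grad = True  # only the new copies are updated
            expanded.append(new_block)
    return nn.ModuleList(expanded)
```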

The three aforementioned works highlight the importance of DAP for code LLMs. However, it is crucial to note that the problem definition and conventional architectures of existing code LLMs may present compatibility challenges for deploying DAP, which need to be addressed in future work.

Other Domains. ECONET (han2021econet, ) enhances the model’s ability to reason about event temporal relations through a dedicated DAP phase: temporal and event indicators are masked out, and a contrastive loss is applied to the recovered masked tokens. Results demonstrate that incorporating this DAP stage significantly improves performance on final tasks compared to direct fine-tuning. Concept-Aware Language Model (CALM) (zhou2020pre, ) introduces a data-efficient DAP approach for enhancing the concept-centric commonsense reasoning ability of LLMs. It incorporates both generative and discriminative objectives specifically tailored for concept-centric reasoning. Consequently, even a small number of DAP examples can lead to notable improvements on downstream tasks.

Aurora-M (nakamura2024aurora, ) and Swallow (fujii2024continual, ) adopt a simple replay strategy that mixes a small portion of general data into DAP to preserve their multi-lingual ability. Furthermore, Sailor (dou2024sailor, ) studies the optimal data-mixing strategy for DAP, balancing general knowledge against the capacity for different languages. EcomGPT-CT (ma2023ecomgptct, ) employs a data-mixing strategy for DAP that transforms semi-structured E-commerce data into a set of nodes and edges, samples a cluster of nodes, and then extracts and concatenates them into a training example. It combines the general-domain corpus with E-commerce data at a ratio of 2:1, a significantly lower domain-data proportion than the common settings adopted by other works.

Notably, some papers study other effective formulations of DAP. AdaptLLM (cheng2024adapting, ) transforms raw corpora into a (raw text, question, answer) format, creating intrinsic reading comprehension tasks. AdaptLLM demonstrates superior domain-specific knowledge adaptation and minimal vertical forgetting, thereby challenging the data efficiency of conventional DAP. Tag-LLM (shen2024tag, ) re-purposes a general-domain LLM into a domain-specific one through multi-stage training of domain tags and function tags; since the base LLM’s weights remain unmodified, forgetting is mitigated.

4.3. Continual Fine-Tuning (CFT)

Background of Continual Fine-Tuning (CFT). Continual Fine-Tuning (CFT) lies at the bottom layer of vertical continuity, where models are trained on successive homogeneous tasks drawn from an evolving data distribution. As the service-oriented layer of LLM learning, it does not require consideration of further adaptation to downstream tasks, which greatly simplifies the optimization objectives: better adaptation and less forgetting (we direct interested readers to additional survey literature on the topic of general CFT (biesialska2020continual, ; ke2023continual, )). In the era of LLMs, new computational paradigms in CFT have emerged and attracted significant attention within the research community. These topics include:

  • Continual Instruction Tuning (CIT), where models must generalize to new tasks encoded in instructions, requiring semantic understanding (zhang2023citb, ) (Section 4.3.3).

  • Continual Model Refinement (CMR), where fine-grained, possibly example-level solutions are required, differing from task-level approaches (hartvigsen2023aging, ) (Section 4.3.4).

  • Continual Model Alignment (CMA), which aligns models with evolving human preferences, challenging due to subjective nature and lack of clear task boundaries (lin2024mitigating, ; zhangcppo, ) (Section 4.3.5).

  • Continual Learning for Multimodal Language Models (CMLLMs), where addressing the composite architectural design and preventing catastrophic forgetting are key challenges (he2023continual, ; ni2023continual, ) (Section 4.3.6).

We summarize existing studies on CFT in Table 3, categorizing them into the sub-categories listed above. The table includes details on incremental learning types (X-IL), LLM architecture, and the employed CL techniques and evaluation metrics. After discussing general observations on CFT in Section 4.3.1, we delve into each sub-category in detail.

Table 3. Summary of the existing studies on continual fine-tuning of LLMs. The papers are organized into five main categories, shown in the CFT Type column, based on the downstream tasks they are designed to tackle: (i) General Continual Fine-Tuning (CFT); (ii) Continual Instruction Tuning (CIT); (iii) Continual Model Refinement (CMR); (iv) Continual Model Alignment (CMA); and (v) Continual Multimodal LLMs (CMLLMs). The X-IL column shows which continual learning paradigm each study follows (van2022three, ): TIL denotes task-incremental learning, where the task ID/information is provided during inference; DIL denotes domain-incremental learning, where the tasks share the same format and no task ID/information is available during inference; CIL denotes class-incremental learning, where the task ID must additionally be inferred at test time.
CFT Type Method X-IL LLM Arch. Continual Learning Tech. Continual Learning Eval.
Rehearsal Param. Reg. Arch. Exp. Others Avg. Acc. Bwd. Trans. Fwd. Trans.
General CTR (ke2021achieve, ) DIL | CIL BERT Adapter
(tao2022can, ) TIL BERT S-Replay
CIRCLE (wei2022circle, ) DIL T5 Replay EWC Prompt
ConPET (song2023conpet, ) DIL Llama Replay LoRA
(bai2023enhancing, ) DIL | CIL BERT G-Prompt
(luo2023investigating, ) TIL DistilBERT ALBERT | RoBERTa ER | DER | LwF
SEQ (zheng2023learn, ) TIL | CIL Pythia | BERT | GPT2 P-Freeze Tricks for Classifiers
LFPT5 (qin2021lfpt5, ) DIL T5 P-Replay
(weyssow2023usage, ) DIL RoBERTa | GPT2 Replay EWC | SI | RWalk
LR ADJUST (winata2023overcoming, ) DIL XLM-R LR Scheduling
C3 (chen2024parameterizing, ) TIL T5 KD Prompt Tuning
CT0 (scialom2022fine, ) TIL T0 S-Replay
RCL (wang2023trace, ) TIL LLaMA Vicuna | Baichuan Replay
DynaInst (mok2023large, ) TIL BART Replay
CITB (zhang2023citb, ) TIL T5 Replay | AGEM L2 | EWC AdapterCL
SSR (huang2024mitigating, ) TIL LLaMA | Alpaca RandSel | KMeansSel
KPIG (he2024dont, ) DIL | TIL LLaMA | Baichuan DynaInst | PCLL | DCL L2 EWC DARE LM-Cocktail KPIG
ConTinTin (yin2022contintin, ) TIL BART Replay InstructionSpeak
O-LoRA (wang2023orthogonal, ) TIL LLaMA | Alpaca O-LoRA
CIT SAPT (zhao2024sapt, ) TIL T5 | LLaMA SAPT
InsCL (wang2024inscl, ) TIL LLaMA Replay InsCL
CMR (lin2022continual, ) DIL BART ER | MIR | MLR L2 | EWC
GRACE (hartvigsen2023aging, ) DIL T5 | BERT | GPT2 Adapter
WilKE (hu2024wilke, ) DIL GPT2 | GPT-J Adaptor
Larimar (das2024larimar, ) DIL BERT | GPT-J Kanerva Memory
MELO (yu2023melo, ) DIL BERT | GPT2 | T5 LoRA
CME (li2023continual, ) DIL BERT Replay Inner-Prod. Reg.
CMR WISE (wang2024wise, ) DIL GPT-J | Llama2 | Mistral Side Memory
COPF (zhang2023copf, ) TIL | DIL Llama Replay Function Reg. Prompt
AMA (lin2024mitigating, ) DIL OpenLLaMA | Mistral Replay L1 | L2 LoRA Adaptive Model Avg.
CMA CPPO (zhangcppo, ) TIL GPT2 Weighting Prompt
EProj (he2023continual, ) TIL InstructBLIP TSIR Projector Exp.
Fwd-Prompt (zheng2024antiforgetting, ) TIL InstructBLIP | BLIP2 Projector Exp.
CoIN (chen2024coin, ) TIL LLaVA MoE | LoRA
Model Tailor (zhu2024model, ) TIL InstructBLIP | LLaVA Model Tailor
CMLLMs RebQ (zhao2024reconstruct, ) TIL ViLT Prompt Tuning

4.3.1. General Observations on CFT

Examining the landscape of continual learning in the context of LLMs, combined with the results shown in Table 3, we make several key observations about CFT.

  • OBS-1: There has been a noticeable transition in focus from CIL to TIL and DIL. It has long been common wisdom in the CL community that CIL, as it requires the model to predict the context label and the within-context label at the same time (van2022three, ; wang2024comprehensive, ; kim2022theoretical, ), is the most challenging CL scenario and hence receives most of the community’s attention. However, among all 35 papers presented in Table 3, only 3 study CFT in the CIL setting. This shift of research focus demonstrates the importance of TIL and DIL in real-world applications of continual LLMs. A more detailed discussion of this transition is included in Section 6.2.

  • OBS-2: In CFT, CL techniques enjoy broader adoption and more explicit exploration than in CPT and DAP. In Table 3, all 35 papers explicitly deploy CL techniques, and 50% of them develop new techniques that cannot be easily interpreted as trivial combinations of existing classic CL techniques, e.g., the shared attentive learning framework in SAPT (zhao2024sapt, ), the external memory deployed in Larimar (das2024larimar, ), and the adaptive model averaging method for achieving Pareto-optimality in AMA (lin2024mitigating, ). This underscores the recognition of continual learning as a pivotal component in the development of resilient and adaptive LLMs.

4.3.2. General Continual Fine-Tuning (General CFT)

Researchers have long investigated the phenomenon of forgetting resilience in pre-trained LLMs when fine-tuned for downstream tasks (ke2021achieve, ; tao2022can, ; luo2023investigating, ; zheng2023learn, ; mehta2023empirical, ), although some report the opposite (luo2023investigating, ). Although the pre-trained weights initially position the model in a flat loss basin, aiding adaptation to future tasks without severely impacting previous ones (hao2019visualizing, ; neyshabur2020being, ; mirzadeh2022wide, ; mehta2023empirical, ), zero or near-zero forgetting is only observed at the representation level. This implies that while the model retains its ability to distinguish between task-specific representations, it may still forget specific task details (wu2021pretrained, ; tao2022can, ; luo2023investigating, ; zheng2023learn, ). Therefore, additional measures are necessary when deploying these models in real-world applications (ke2021achieve, ; wei2022circle, ; bai2023enhancing, ; qin2021lfpt5, ; weyssow2023usage, ; chen2024parameterizing, ).

Many studies advance beyond naive sequential fine-tuning, leveraging the inherent anti-forgetting nature of LLMs while avoiding overly complex CL techniques (winata2023overcoming, ; zheng2023learn, ). For instance, LR ADJUST (winata2023overcoming, ) proposes a straightforward yet effective method of dynamically adjusting the learning rate to mitigate the overwriting of knowledge from old languages by new ones. Building on the innate anti-forgetting ability of large language models like Pythia (biderman2023pythia, ), SEQ (zheng2023learn, ) introduces several strategies for fine-tuning LLMs on a sequence of downstream classification tasks, such as freezing the LLM and old classifiers’ parameters after warm-up and pre-allocating future classifiers.
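A rough sketch of the SEQ-style recipe is given below: the backbone and previously trained classifiers are frozen after warm-up, and classifiers for future tasks are pre-allocated. The class interface and the assumption that the backbone returns pooled features are ours, not details from the original paper.

```python
import torch.nn as nn

class SeqClassifierPool(nn.Module):
    """Hypothetical sketch of the SEQ-style recipe: frozen LLM backbone,
    classifiers pre-allocated for all (known) future tasks, and per-task
    freezing of old classifiers once their warm-up is done."""

    def __init__(self, backbone: nn.Module, hidden: int, classes_per_task: list):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the LLM after warm-up
        # Pre-allocate one classification head per (future) task.
        self.heads = nn.ModuleList(nn.Linear(hidden, c) for c in classes_per_task)

    def start_task(self, task_id: int):
        # Only the current task's head remains trainable; old heads are frozen.
        for t, head in enumerate(self.heads):
            for p in head.parameters():
                p.requires_grad = (t == task_id)

    def forward(self, x, task_id: int):
        feats = self.backbone(x)  # assumed to return pooled sentence features
        return self.heads[task_id](feats)
```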

Given the minimal forgetting observed at the representation level in CL, some studies aim to tackle the misalignment between the representation space and the decision-making layers by introducing representation-level constraints during CFT. NeiAttn (bai2023enhancing, ) exemplifies this approach by formulating classification tasks as masked language modeling and proposing a neighboring attention mechanism to counteract negative representation drift.

Another line of approaches refines the input/output format and network architectures of pre-trained LLMs to better suit CFT. For instance, CTR (ke2021achieve, ) incorporates two CL-plugin modules: a task-specific module (TSM) for acquiring task-specific knowledge and a knowledge-sharing module (KSM) for selectively transferring previously learned similar knowledge. CIRCLE (wei2022circle, ) manually designs diverse prompt templates for various types of buggy code, unifying them as a cloze task, and employs difficulty-based replay to enhance continual program repair. LFPT5 (qin2021lfpt5, ) addresses lifelong few-shot language learning by consolidating sequence labeling, text classification, and text generation into a single text-to-text generation task; it undergoes prompt tuning on generated pseudo-examples from previous domains when adapting to new tasks. In (zhang2022continual, ), the authors propose a method for adaptively adding compositional adapters during continual sequence generation: before training on a new domain, a decision stage determines which trained module can be reused, and during training this module also regenerates examples of the past for replay. C3 (chen2024parameterizing, ) merges PEFT and in-context learning (ICL) in a teacher-student framework: the teacher model undergoes in-context tuning focused solely on the current domain, while the student model, together with tunable prompts, simultaneously minimizes the KL-divergence between its output distribution and those of the ground truth and the teacher model (see the sketch below).
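As a concrete illustration of the last point, the following is a minimal sketch of a C3-flavored training objective that combines cross-entropy on the ground truth with a temperature-scaled KL-divergence toward the in-context-tuned teacher. The interpolation weight alpha and temperature tau are assumed hyper-parameters, not values from the paper.

```python
import torch
import torch.nn.functional as F

def c3_style_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Cross-entropy on the ground truth plus temperature-scaled KL-divergence
    toward the teacher's output distribution (a C3-flavored objective sketch)."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau
    return (1 - alpha) * ce + alpha * kd

# Toy usage with random logits over 10 classes.
s, t = torch.randn(4, 10), torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
print(c3_style_loss(s, t, y).item())
```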

4.3.3. Continual Instruction Tuning (CIT)

While LLMs are typically pre-trained on extensive and diverse corpora, they may struggle with specific tasks such as instruction following despite their general knowledge. Numerous studies have shown that Instruction Tuning (IT) can notably improve LLMs’ ability to follow textual instructions (zhang2024instruction, ; wei2021finetuned, ; jiang2024instructiontuned, ; sanh2022multitask, ; ouyang2022rlhf, ), leveraging the pre-existing knowledge within LLMs to bridge the gap between general and task-specific performance (wei2022finetuned, ). Recent works like WizardLM (xu2023wizardlm, ) and CodecLM (wang2024codeclm, ) further tailor synthetic data to steer LLMs’ behavior through IT. Additionally, IT enhances the interaction between humans and LLMs, providing a more natural interface and aligning LLM outputs more closely with human expectations and preferences (luo2023empirical, ).

When instruction-tuning data arrives as a stream, forgetting of previously learned instructions must be addressed. CT0 (scialom2022fine, ) represents the inaugural study on Continual Instruction Tuning (CIT) of LLMs, applying the replay method on the base T0 model throughout the process. Many subsequent studies focus on enhancing the replay method used during CIT. For instance, (he2024dont, ) improves replay efficiency by computing Key-part Information Gain (KPIG) on masked parts to dynamically select replay data, addressing the “half-listening” issue in instruction following. Similarly, SSR (huang2024mitigating, ) uses the LLM itself to generate synthetic instances for replay, achieving superior or comparable performance to traditional methods at a lower cost.
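A minimal sketch of the self-synthesized replay idea follows. The llm_generate callable is a hypothetical stand-in for prompting the LLM to produce pseudo-instances of earlier tasks; the mixing policy shown here is ours and not the exact pipeline of SSR.

```python
import random

def build_replay_mix(llm_generate, old_task_prompts, new_task_data,
                     replay_ratio=0.1, seed=0):
    """Mix LLM-synthesized pseudo-instances of earlier tasks into new-task data.

    `llm_generate` is a hypothetical callable that prompts the LLM (e.g., few-shot
    with old-task instructions) and returns one synthetic training instance.
    """
    rng = random.Random(seed)
    n_replay = int(replay_ratio * len(new_task_data))
    synthetic = [llm_generate(rng.choice(old_task_prompts)) for _ in range(n_replay)]
    mixed = list(new_task_data) + synthetic
    rng.shuffle(mixed)
    return mixed

# Toy usage with a stand-in generator.
mixed = build_replay_mix(lambda p: f"synthetic answer to: {p}",
                         old_task_prompts=["old instruction"],
                         new_task_data=[f"new example {i}" for i in range(20)])
print(len(mixed))  # 22 = 20 new examples + 2 synthetic replay instances
```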

Other approaches introduce multiple CL techniques during CIT. DynaInst (mok2023large, ) merges parameter regularization with dynamic replay, selectively storing and replaying instances and tasks to enhance outcomes. InstructionSpeak (yin2022contintin, ) employs negative training and replays instructions to improve both forward and backward transfer. Some methods incorporate PEFT: Orthogonal Low-Rank Adaptation (O-LoRA) (wang2023orthogonal, ) learns new tasks within an orthogonal subspace while preserving the LoRA parameters of previous tasks, minimizing interference among tasks; the Shared Attention Framework (SAPT) (zhao2024sapt, ) combines a PET block with a selection module via a Shared Attentive Learning & Selection module, tackling catastrophic forgetting and knowledge transfer concurrently. While regularization-based and architecture-based methods require additional parameter storage and GPU memory, they, together with replay-based methods, remain popular for CIT due to their simplicity and effectiveness (wang2024inscl, ).
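For intuition, the following sketch shows an O-LoRA-style orthogonality regularizer that penalizes overlap between the current task’s LoRA subspace and those of earlier tasks; the exact penalty form (sum of absolute inner products between A matrices) is an assumption made for illustration, not a transcription of the paper’s loss.

```python
import torch

def orthogonality_penalty(A_new, A_old_list):
    """Penalize overlap between the current task's LoRA matrix A_new (r x d)
    and the frozen A matrices of previous tasks, encouraging the new task's
    low-rank update to live in a subspace orthogonal to earlier ones."""
    penalty = A_new.new_zeros(())
    for A_old in A_old_list:
        penalty = penalty + (A_old @ A_new.T).abs().sum()
    return penalty

# Toy usage: hidden size 16, rank-4 adapters, two previous tasks.
A_prev = [torch.randn(4, 16) for _ in range(2)]
A_cur = torch.randn(4, 16, requires_grad=True)
reg = orthogonality_penalty(A_cur, A_prev)
reg.backward()  # gradients flow only into the current task's adapter
```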

4.3.4. Continual Model Refinement (CMR)

Like humans, LLMs are prone to errors, such as inaccurate translations or outdated information (de2021editing, ). Directly fine-tuning the model to correct these mistakes may disrupt its performance on previously learned tasks. To overcome these challenges, Model Refinement (also known as Model Editing) is proposed, which aims to rectify the model’s errors while preserving its performance on other inputs, with only moderate computing resources (sinitsin2020editable, ; de2021editing, ; fast_edit, ; hase2021language, ; huang2023transformer, ; mitchell2022memory, ; hartvigsen2023aging, ). The concept of model editing was initially explored in (sinitsin2020editable, ), which introduced a “reliability-locality-efficiency” principle and proposed a gradient descent editor to address it efficiently. Subsequent research, such as (de2021editing, ) and (fast_edit, ), extended this principle to edit factual knowledge in BERT-based language models and larger models like GPT-J-6B (gpt-j, ) and T5-XXL (raffel2020exploring, ), respectively, using gradient decomposition. These approaches typically update a subset of model parameters to alter the labels of specific inputs. Additionally, memory-based models, as discussed in (mitchell2022memory, ) and (hartvigsen2023aging, ), incorporate editing through retrieval mechanisms.

The concept of Continual Model Refinement (CMR) extends model refinement horizontally, presenting updated sample pairs $\{({\bm{x}}_{e},y_{e},\widehat{y}_{e})\}_{e=1}^{N}$ sequentially as a stream. (lin2022continual, ) initially introduces this idea, evaluating various CL methods with a dynamic sampling algorithm. Many CMR methods employ a retrieval mechanism. For instance, (hartvigsen2023aging, ) uses hidden activations of the language model as a “key” to activate updated parameters only when the input $x_{0}$ resembles previously updated sample pairs; (yu2023melo, ) improves this approach’s efficiency by integrating LoRA (hu2021lora, ); (das2024larimar, ) augments the LLM with an external episodic memory, modeling CMR as an ongoing memory refresh. Meanwhile, some methods focus solely on updating a subset of model parameters. For example, (hu2024wilke, ) addresses the issue of “toxicity buildup and flash” in single-editing methods like ROME (meng2022locating, ), adapting it to the CL context with a knowledge-aware layer selection algorithm. WISE (wang2024wise, ) addresses the “impossible triangle” of reliability, locality, and generalization in existing lifelong model refinement methods. It introduces a side memory system that enables knowledge sharding and merging, successfully achieving all three objectives simultaneously.
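The retrieval mechanism described above can be pictured with the following minimal sketch of a GRACE-style key-value edit layer (not the official implementation): each edit stores a key (the hidden activation of the edited input), a value, and a deferral radius, and the stored value is used only when a query activation falls within that radius.

```python
import torch

class KeyValueEditLayer:
    """Store edits as (key, value, radius); replace a hidden activation with the
    stored value only when it falls within some key's deferral radius, otherwise
    pass the activation through unchanged (i.e., defer to the base model)."""

    def __init__(self, init_radius: float = 1.0):
        self.keys, self.values, self.radii = [], [], []
        self.init_radius = init_radius

    def add_edit(self, key: torch.Tensor, value: torch.Tensor):
        self.keys.append(key)
        self.values.append(value)
        self.radii.append(self.init_radius)

    def __call__(self, h: torch.Tensor) -> torch.Tensor:
        if not self.keys:
            return h  # no edits yet: behave exactly like the base model
        dists = torch.stack([torch.norm(h - k) for k in self.keys])
        i = int(torch.argmin(dists))
        return self.values[i] if dists[i] <= self.radii[i] else h

# Toy usage on 8-dimensional activations.
layer = KeyValueEditLayer(init_radius=0.5)
key = torch.randn(8)
layer.add_edit(key, value=torch.zeros(8))
print(layer(key + 0.01))      # close to the key: the edited value is returned
print(layer(torch.randn(8)))  # far from any key: the activation passes through
```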

While all these works pioneer research in CMR, the exploration of CMR for LLMs remains open. (hase2023does, ) highlights a potential problem: the location where a fact is stored may not coincide with the best place for editing it. This challenges the classical “locate and edit” paradigm used by several existing methods (meng2022locating, ; meng2022mass, ) and could become a significant concern for CMR (hu2024wilke, ). Other questions, including whether such a problem setting fits LLMs and whether more memory- and computation-efficient CMR methods can be developed for LLMs, are yet to be answered.

4.3.5. Continual Model Alignment (CMA)

Model Alignment (MA) ensures that AI systems’ actions and outputs align with human values, ethics, and preferences (ouyang2022rlhf, ; rafailov2024dpo, ). MA can be broadly categorized into two types: Reinforcement Learning-based (RL-based) and Supervised Learning-based (SL-based). RL-based approaches (wu2022survey, ; ouyang2022rlhf, ; schulman2017proximal, ) are trained to make decisions reinforced by human feedback, using a reward system to guide them towards desirable outcomes. Conversely, SL-based approaches (hendrycks2023aligning, ; rafailov2024dpo, ; ji2024ai, ) directly train models on datasets of human preferences, aligning their outputs with demonstrated human values. Both approaches leverage a combination of algorithmic learning techniques and human feedback to progressively refine model behavior. When LLMs undergo the phase of MA, vertical forgetting of previous knowledge usually occurs. In (lin2024mitigating, ), the authors refer to this phenomenon of catastrophic forgetting induced by MA as the “Alignment Tax”. Notably, even a single stage of MA can diminish the model’s capabilities, as it restricts the model’s responses to a narrower subset of the training distribution.

Continual Model Alignment (CMA) aims to continuously refine LLMs to align with evolving human values, ethics, and data. The static nature of LLM training on historical datasets can lead to discrepancies between the models’ outputs and current factual accuracies, societal norms, and standards, making CMA a crucial process for maintaining their adaptability and alignment with contemporary contexts (taori2023alpaca, ). Likewise, there are two types of CMA frameworks: RL-based and SL-based. In the realm of RL-based CMA, two significant contributions have been noted. (lin2024mitigating, ) identifies conflicts between existing CL techniques and RLHF, and proposes Adaptive Model Averaging (AMA), which adaptively finds appropriate ratios for combining model layers to gain maximal reward with minimal tax; Continual Proximal Policy Optimization (CPPO) (zhangcppo, ) proposes a weighting strategy that decides whether each example is used for policy enhancement or knowledge retention, mitigating the alignment tax over time. For SL-based CMA, Continual Optimal Policy Fitting (COPF) (zhang2023copf, ) presents a solution adapted from Direct Preference Optimization (DPO) (rafailov2024direct, ), addressing its potential risks of sub-optimal policy fitting and over-optimization in the context of CMA.
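To illustrate the model-averaging flavor of RL-based CMA, below is a minimal sketch that interpolates, parameter by parameter, between pre-alignment and post-alignment weights. In AMA the per-layer ratios are searched adaptively to trade reward against alignment tax; here they are simply passed in, and the default ratio is an arbitrary assumption.

```python
import torch

def layerwise_average(sft_state, rlhf_state, alphas, default_alpha=0.5):
    """Interpolate, parameter by parameter, between pre-alignment (sft_state)
    and post-alignment (rlhf_state) weights; `alphas` maps parameter names to
    mixing ratios in [0, 1]."""
    merged = {}
    for name, w_sft in sft_state.items():
        a = alphas.get(name, default_alpha)
        merged[name] = (1 - a) * w_sft + a * rlhf_state[name]
    return merged

# Toy usage on a fake two-layer state dict.
sft = {"layer1.weight": torch.zeros(2, 2), "layer2.weight": torch.zeros(2, 2)}
rlhf = {"layer1.weight": torch.ones(2, 2), "layer2.weight": torch.ones(2, 2)}
merged = layerwise_average(sft, rlhf, alphas={"layer1.weight": 0.9, "layer2.weight": 0.2})
print(merged["layer1.weight"][0, 0].item(), merged["layer2.weight"][0, 0].item())  # 0.9 0.2
```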

4.3.6. Continual Multimodal Large Language Models (CMLLMs)

Multi-modal LLMs (MLLMs) integrate data of multiple modalities, such as text, images, and videos, to enhance real-world information comprehension (peng2023kosmos2, ; li2023otter, ). Typically, MLLMs consist of modality-specific sub-modules such as pre-trained vision encoders, large language models, and projectors for cross-modal alignment. This alignment is essential for MLLMs to fuse the diverse data types and support their comprehension.

Continually training multi-modal models like CLIP (radford2021learning, ) has long been studied (zheng2023preventing, ; ni2023continual, ; wu2024building, ; khan2021personalizing, ; liu2023class, ; garg2024tic, ; li2024coleclip, ; jha2024clap4clip, ; yu2024select, ; yu2024boosting, ), while the problem of continually training MLLMs remains underexplored. Several existing studies have investigated the causes of catastrophic forgetting when continually training MLLMs. (zheng2024antiforgetting, ) performs singular value decomposition on input embeddings, revealing a significant disparity among different input embeddings. This discrepancy causes the model to learn information irrelevant to previously trained tasks, resulting in catastrophic forgetting and negative forward transfer. (zhai2023investigating, ) observes that minority collapse may lead to catastrophic forgetting when the imbalance ratio between majority and minority classes approaches infinity during fine-tuning; it further identifies hallucination as a contributing factor to performance degradation in MLLMs.

Continual Fine-Tuning MLLMs. In contrast to traditional continual learning methods that involve full-model fine-tuning for new tasks, continual fine-tuning for MLLMs focuses on refining specific layers when adapting to new tasks (zhai2023investigating, ; he2023continual, ; zheng2024antiforgetting, ; chen2024coin, ; zhu2024model, ). Given the strong capabilities of pre-trained models, training specific layers suffices and simultaneously reduces computational demands. (zhao2024reconstruct, ) additionally considers a continual learning scenario, Continual Missing Modality Learning (CMML), where different modalities emerge throughout the incremental learning stages. All the aforementioned studies collectively indicate that MLLMs still suffer from catastrophic forgetting, which manifests in two ways: along the direction of vertical continuity, a performance decline on pre-trained tasks following fine-tuning for downstream tasks; and along the axis of horizontal continuity, a performance degradation on previously fine-tuned tasks after fine-tuning for new tasks. (zheng2024antiforgetting, ) also observes negative forward transfer, where the performance on unseen tasks degrades when learning new tasks, indicating a decline in model generalization capability.

While traditional CL methods are applicable, some may not yield optimal results, as evidenced by various experiments (he2023continual, ; zheng2024antiforgetting, ). For instance, (he2023continual, ) observes consistent efficacy of replay-based and model-expansion strategies across diverse scenarios of continual fine-tuning of MLLMs, but regularization-based methods only perform well on models that have been jointly instruction-tuned on multiple tasks. Other works develop ad-hoc solutions for continually learning MLLMs. (he2023continual, ) proposes EProj, which expands the projection layer in MLLMs for each new task and utilizes task-similarity-informed regularization (TIR) to enhance performance. (zheng2024antiforgetting, ) introduces Fwd-Prompt, a prompt-tuning method that projects the prompt gradient to the residual space to minimize interference between tasks and to the pre-trained subspace to reuse pre-trained knowledge, fostering positive forward transfer without relying on previous samples. (zhu2024model, ) focuses on the forgetting of pre-trained MLLMs after fine-tuning on specific tasks and proposes Model Tailor, which compensates a selected subset of parameters critical for target-task performance. (zhao2024reconstruct, ) presents a novel method named Reconstruct before Query (RebQ), leveraging the multi-modal knowledge of a pre-trained model to reconstruct the absent information of the missing modality. Recently, the Mixture-of-Experts (MoE) framework, which resembles architecture-based methods in CL, has gained attention; it enables the model to learn different intentions with distinct experts. For example, (chen2024coin, ) first introduces MoELoRA to fine-tune LLaVA, effectively mitigating catastrophic forgetting of MLLMs on the CoIN benchmark.

5. Evaluation Protocols and Datasets

In this section, we introduce the evaluation protocols and datasets for continual LLMs. In Section 5.1, we discuss common continual learning evaluation metrics adapted for this context, along with metrics designed specifically for continual LLMs. Then, in Section 5.2, we outline the datasets available for each discussed topic.

5.1. Evaluation Protocols

5.1.1. Interpreting Basic Continual Learning Evaluation Metrics in the Continual LLMs

In some literature, $\widetilde{\operatorname{OP}}$ is referred to as “example accuracy” (chen2024parameterizing, ), “whole accuracy” (song2023conpet, ), or “edit success rate” in CMR (hartvigsen2023aging, ). The concepts of Forgetting and Backward Transfer underpin various evaluation metrics, such as knowledge retention (jin2022lifelong, ), performance on unchanged knowledge (jang2022temporalwiki, ), average increased perplexity (AP$^{+}$) (qin2022elle, ), and test and edit retention rates in CMR (hartvigsen2023aging, ). We extend the notion of forward transfer in the vertical direction to represent the performance improvement on downstream tasks resulting from domain-adaptive pre-training (see Table 2). Forward Transfer is alternatively referred to as temporal generalization (jin2022lifelong, ) or knowledge transfer (lazaridou2021mind, ) in some literature.

5.1.2. Evaluation Metrics of Continual LLMs

LAnguage Model Analysis (LAMA). LAnguage Model Analysis (LAMA) is an evaluation framework designed to probe the world knowledge embedded in language models (petroni2019language, ). It converts each world fact into a cloze statement, which is then fed to the language model to predict the correct answer. LAMA has been extended for continual pre-training, particularly under temporal shifts (jang2022temporalwiki, ; jang2022towards, ). In CKL, three LAMA benchmarks are constructed for different dimensions: InvariantLAMA assesses knowledge retention on time-invariant facts, UpdatedLAMA focuses on knowledge updates, and NewLAMA evaluates knowledge acquisition (jang2022towards, ).

Forgotten / (Updated + Acquired) Ratio (FUAR). As the performance of a pre-trained LLM is decomposed into a fine-grained set of abilities in CKL (jang2022towards, ), OP becomes too coarse a metric and cannot accurately reflect the balance and trade-offs of the model’s behavior. To address this issue, CKL proposes a joint evaluation metric, FUAR (Forgotten / (Updated + Acquired) Ratio), for continual pre-training. A FUAR value of 1 represents an equal trade-off between knowledge forgetting and knowledge learning: for each piece of updated or acquired knowledge, one piece of time-invariant knowledge is forgotten on average. A FUAR less than 1 suggests high learning efficacy, where more than one piece of knowledge is acquired or updated at the expense of forgetting one piece of time-invariant knowledge.
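The arithmetic of FUAR is straightforward; the toy sketch below assumes simple counts of forgotten, updated, and acquired knowledge probes, whereas the original metric in CKL is computed from per-benchmark performance gaps rather than raw counts.

```python
def fuar(n_forgotten: int, n_updated: int, n_acquired: int) -> float:
    """Forgotten / (Updated + Acquired) Ratio: how many pieces of time-invariant
    knowledge are lost per piece of knowledge updated or newly acquired."""
    gained = n_updated + n_acquired
    if gained == 0:
        return float("inf")  # nothing gained: forgetting dominates by definition
    return n_forgotten / gained

print(fuar(n_forgotten=12, n_updated=20, n_acquired=16))  # 12 / 36 ≈ 0.33 (< 1: learning outweighs forgetting)
```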

X-Delta. In TRACE (wang2023trace, ), the authors propose a set of “X-Delta” metrics for continual instruction tuning, quantifying the forward transfer on specific abilities of LLMs. Denote by $\{X_{1},X_{2},\cdots,X_{M}\}$ a set of $M$ datasets for a given ability $X$. The baseline performances of the pre-trained LLM evaluated on these tasks are denoted as $\{b_{1}^{X},\cdots,b_{M}^{X}\}$. The model undergoes continual fine-tuning on a different set of tasks, distinct from those used for evaluation. Throughout the sequential training process, the performance of the model after learning task $t$ on evaluation task $X_{i}$ is $R_{t,i}^{X}$. The X-Delta $\Delta R_{t}^{X}$ after learning task $t$ is defined as:

(14) $\Delta R_{t}^{X}\triangleq\frac{1}{M}\sum_{i=1}^{M}\left(R_{t,i}^{X}-b_{i}^{X}\right).$

In the public TRACE benchmark, the authors construct three sets of evaluation tasks to benchmark the ability of LLMs, including general ability, instruction following, and safety (wang2023trace, ).
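Eq. (14) amounts to averaging per-dataset performance changes against the pre-trained baselines, as in the following toy sketch (the numbers in the usage example are made up for illustration).

```python
def x_delta(R_t, b):
    """Average performance change on the M evaluation datasets of ability X,
    measured after learning task t, relative to the pre-trained baselines b."""
    assert len(R_t) == len(b)
    return sum(r - base for r, base in zip(R_t, b)) / len(b)

# Toy usage: performance after task t vs. pre-trained baselines on M = 3 datasets.
print(x_delta(R_t=[0.61, 0.48, 0.70], b=[0.65, 0.50, 0.68]))  # ≈ -0.013 (slight negative transfer)
```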

NLG Score. In continual model alignment, three prominent metrics are used to evaluate different aspects of natural language generation (NLG): BLEU-4 (papineni2002bleu, ), METEOR (banerjee2005meteor, ), and ROUGE-L (lin2004rouge, ). BLEU-4 (papineni2002bleu, ), designed for machine translation (MT), evaluates the precision of n-grams between the machine-generated and reference texts, focusing especially on four-word sequences to gauge fluency and adequacy. METEOR (banerjee2005meteor, ) also targets MT but aims to improve correlation with human judgment by considering synonyms and stemming, thus providing a more nuanced assessment of translation quality. ROUGE-L (lin2004rouge, ), on the other hand, is commonly applied to summarization tasks, assessing the longest common subsequence between the generated summary and a set of reference summaries, effectively measuring the recall of essential content. Each metric has its strengths and is tailored to specific kinds of language processing tasks, reflecting different dimensions of text generation quality.
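For reference, the snippet below computes BLEU-4 and ROUGE-L on a toy sentence pair using the commonly used nltk and rouge-score packages; this is an illustrative choice of implementations, not the exact evaluation scripts of the works cited above (METEOR is omitted here).

```python
# Requires the third-party packages `nltk` and `rouge-score`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# BLEU-4: precision of 1- to 4-grams with uniform weights, smoothed for short texts.
bleu4 = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L: F-measure based on the longest common subsequence.
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU-4 = {bleu4:.3f}, ROUGE-L = {rougeL:.3f}")
```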

5.2. Datasets

In this section, we provide a comprehensive review of the datasets available for benchmarking continual LLMs, as illustrated in Table 4. We intentionally exclude datasets used for domain-adaptive pre-training LLMs in vertical domains such as legal, medical, and financial, unless they are specifically designed for continual domain-adaptive pre-training. Furthermore, we omit datasets used in general continual fine-tuning, as they have already been extensively studied in existing works (biesialska2020continual, ; ke2023continual, ).

Datasets for Continual Pre-Training (CPT) and Domain Adaptive Pre-Training (DAP). Current research lacks a widely recognized benchmark for evaluating continual pre-training LLMs under temporal shifts. TimeLMs utilizes a series of Twitter corpora collected until 2022, sequentially pre-training RoBERTa models quarterly (loureiro2022timelms, ). CC-RecentNews, adopted as unlabeled pre-training data for LMs in CKL (jang2022towards, ), consists of recent news and serves as a single-stage dataset. Additionally, CKL introduces InvariantLAMA, NewLAMA, and UpdatedLAMA to assess the principles of continual knowledge learning. TWiki, a dataset derived from the articles of Wikipedia between August and December 2021, is curated and cleaned in TemporalWiki (jang2022temporalwiki, ). This dataset facilitates the exploration of incremental learning by providing the Diffsets between neighboring snapshots. For works that study the content-level distributional shifts in CPT and DAP, researchers often resort to a similar set of publicly available datasets (lo2020s2orc, ; xu2019bert, ; ni2019justifying, ) to construct their own test beds for continual learning algorithms. The DAPT dataset, developed by (gururangan2020dont, ), comprises four domains: BioMed and Computer Science from S2ORC (lo2020s2orc, ), News from (zellers2019defending, ), and Reviews from (he2016ups, ). In DAPT’s original study, each domain undergoes individual domain adaptive pre-training stages to demonstrate the universality of DAP’s effectiveness. Subsequent works, such as ELLE (qin2022elle, ) and Recyclable Tuning (qin2023recyclable, ), follow suit by employing these domains for multi-stage CPT. DEMix (gururangan2022demix, ) presents another large-scale dataset, featuring eight semantic domains with over 73.8 billion tokens. Alongside the training set, it includes eight additional datasets for validating the generalization ability of LLMs. On a smaller scale, the CPT (ke2022continual-train, ) and DAS (ke2022continual-pre, ) datasets consist of four and six domains, with approximately 3.12 million examples and 4.16GB of data, respectively. These datasets are constructed similarly to the aforementioned ones.

Table 4. Summary of the existing benchmarks publicly available for continual learning of LLMs. In the Name column, a superscript marker denotes that the dataset lacks an official name, in which case the name shown is that of the original paper. In this table, we deliberately omit the datasets used for domain-adaptive pre-training of vertical LLMs, as their main focus of development is not continual learning. We also omit the datasets used for general continual fine-tuning, as they are extensively discussed in other existing surveys (biesialska2020continual, ; ke2023continual, ).
Name Type Shift Domain #Stages Scale Sources Applications Comment
TimeLMs (loureiro2022timelms, ) CPT Temporal Social Media 8 #Examples: 123.86M Tweets (loureiro2022timelms, ) code
CC-RecentNews (jang2022towards, ) CPT Temporal News 1 #Tokens: ∼168M Web (jang2022towards, ) code
TWiki (jang2022temporalwiki, ) CPT Temporal General Knowledge 5 #Tokens: 4.7B Wikipedia (jang2022temporalwiki, ) code
DAPT (gururangan2020dont, ) CPT DAP Content Multi-Domain 4 Size: 160GB BioMed (lo2020s2orc, ), CS (lo2020s2orc, ), News (zellers2019defending, ), Reviews (he2016ups, ) (gururangan2020dont, ) (qin2023recyclable, ) (qin2022elle, ) code
CPT (ke2022continual-train, ) CPT Content Multi-Domain 4 #Examples: 3.12M Yelp (xu2019bert, ), S2ORC (lo2020s2orc, ), AG-News (zhang2015character, ) (ke2022continual-train, ) code
DEMix (gururangan2022demix, ) CPT Content Multi-Domain 8 #Tokens: 73.8B 1B (chelba2014billion, ), CS (lo2020s2orc, ), Legal (caselaw2018, ), Med (lo2020s2orc, ) WebText (gokaslan2019OpenWeb, ), RealNews (zellers2019defending, ), Reddit (baumgartner2020pushshift, ), Reviews (ni2019justifying, ) (gururangan2022demix, ) code
DAS (ke2022continual-pre, ) CPT DAP Content Multi-Domain 6 Size: 4.16GB Yelp (xu2019bert, ), Reviews (ni2019justifying, ), Papers (lo2020s2orc, ), PubMed (ke2022continual-pre, ) code
SuperNI (wang2022supernaturalinstructions, ) CIT Content Multi-Domain 16 #Tasks: 1616 #Examples: ∼5M GitHub (zhang2023citb, ; wang2024inscl, ) code
CITB (zhang2023citb, ) CIT Content Multi-Domain 19 #Tasks: 38 SuperNI (wang2022supernaturalinstructions, ) (zhang2023citb, ) code
CoIN (chen2024coin, ) CIT Content Multi-Domain 8 #Examples: ∼1.14M RefCOCO (kazemzadeh-etal-2014-referitgame, ), RefCOCO+ (mao2016generation, ), RefCOCOg (mao2016generation, ), ImageNet (imagenet_cvpr09, ), VQAv2 (goyal2017making, ), ScienceQA (lu2022learn, ), TextVQA (singh2019vqa, ), GQA (hudson2019gqa, ), VizWiz (gurari2018vizwiz, ), OCR-VQA (mishraICDAR19, ) (chen2024coin, ) code
TRACE (wang2023trace, ) CIT Content Multi-Domain 8 #Examples: 56,000 ScienceQA (lu2022learn, ), FOMC (shah2023trillion, ), MeetingBank (hu2023meetingbank, ), C-STANCE (zhao-etal-2023-c, ), 20Minuten (kew-etal-2023-20, ), CodeXGLUE (lu2021codexglue, ), NumGLUE (mishra2022numglue, ) (wang2023trace, ) code
NATURAL-INSTRUCTION (mishra2021natural, ) CIT Content Multi-Domain 6 #Examples: 193k CosmosQA (huang2019cosmos, ), DROP (dua2019drop, ), Essential-Terms (khashabi-etal-2017-learning, ), MCTACO (zhou2019goingvacationtakeslonger, ), MultiRC (khashabi-etal-2018-looking, ), QASC (khot2020qasc, ), Quoref (dasigi-etal-2019-quoref, ), ROPES (lin2019reasoning, ), Winogrande (sakaguchi2019winogrande, ) (mishra2021natural, ) code
IMDB (maas2011learning, ) CMA Content Social Media 1 Size: 217.35 MB IMDB (zhang2023copf, ) code
HH-RLHF (bai2022training, ) CMA Content General Knowledge 1 Size: 28.1 MB Human Feedback (zhang2023copf, ) code
Reddit TL;DR (volske2017tl, ) CMA Content Social Media 2 Size: 19.6 GB Reddit (zhang2023copf, ; zhangcppo, ) code
Common Sense QA (lin2024mitigating, ) Reading Comprehension (lin2024mitigating, ) Translation (lin2024mitigating, ) CMA Content Multi-Domain 6 #Examples: ∼41.16M ARC Easy and Challenge (clark2018think, ), Race (lai2017race, ), PIQA (bisk2020piqa, ), SQuAD (rajpurkar2018know, ), DROP (dua2019drop, ), WMT 2014 French to English (bojar2014findings, ) (lin2024mitigating, ) see sources
FEVER (fever, ) CMR Content General Knowledge 1 #Examples: 420k Wikipedia (de2021editing, ; hase2021language, ) code
VitaminC (vitaminC, ) CMR Content General Knowledge 1 #Examples: 450k Wikipedia (mitchell2022memory, ) code
zsRE (zsRE, ) CMR Content General Knowledge 1 #Examples: 120M Wikireading (Wikireading, ) (hase2021language, ; meng2022locating, ; meng2022mass, ; hase2023does, ; hartvigsen2023aging, ; das2024larimar, ) -
T-rex (T-rex, ) CMR Content General Knowledge 1 #Examples: 11M Dbpedia abstracts (Dbpedia, ) (li2022large, ; dong2022calibrating, ) code
NQ (nq, ) CMR Content General Knowledge 1 #Examples: 320k Google queries, Wikipedia (hartvigsen2023aging, ) code
CounterFact (meng2022locating, ) CMR Content General Knowledge 1 #Examples: 22k zsRE (zsRE, ) (meng2022locating, ; yu2023melo, ; hu2024wilke, ; das2024larimar, ) code
SCOTUS (scotus, ) CMR Temporal Law 1 #Examples: 9.2k Supreme Court Database (hartvigsen2023aging, ) code

Datasets for Continual Instruction Tuning. Measuring the effectiveness of CIT is crucial, particularly because traditional evaluation metrics may not be suitable for LLMs: many of them are overly simplistic and fail to comprehensively assess the model’s ability to learn continually. New benchmarks and metrics are required to evaluate both the retention of old knowledge and the integration of new instructions. TRACE (wang2023trace, ) is a continual learning benchmark designed specifically for LLMs, encompassing diverse tasks such as multilingual capabilities, code generation, and mathematical reasoning. CITB (zhang2023citb, ) represents another benchmark for CIT, incorporating both learning and evaluation protocols; in addition, it demonstrates that replay generally yields the best performance across all methods. CoIN (chen2024coin, ) extends the benchmark to MLLMs, incorporating a balanced and diverse set of instructions from vision-language datasets.

Datasets for Continual Model Refinement. Most datasets for continual model refinement can be categorized into two types (mazzia2023survey, ): fact checking and question answering. For fact checking, models are asked to verify the truthfulness of certain claims, typically modeled as a classification task. Key datasets include FEVER (fever, ) (used by (de2021editing, ; hase2021language, )) and VitaminC (vitaminC, ) (used by (mitchell2022memory, )), both sourced from Wikipedia. For question answering, models are tasked with providing specific answers instead of choices. Zero-shot Relation Extraction (zsRE) (zsRE, ) is the most widely employed dataset for this purpose (hase2021language, ; meng2022locating, ; meng2022mass, ; hase2023does, ; hartvigsen2023aging, ; das2024larimar, ), alongside Natural Questions (NQ) (nq, ) and T-rex (T-rex, ). (meng2022locating, ) adapts zsRE with additional counterfactuals to create the more challenging CounterFact dataset, used by (yu2023melo, ; hu2024wilke, ; das2024larimar, ). Beyond these two categories, SCOTUS (scotus, ) is also utilized (hartvigsen2023aging, ) to assess continual model refinement through a document classification task that categorizes U.S. Supreme Court cases into 11 topics.

Datasets for Continual Model Alignment. In the domain of reinforcement learning with human feedback (RLHF), several datasets are commonly employed across different studies to evaluate the adaptation and effectiveness of models under varying scenarios and continuous learning conditions. The IMDB (maas2011learning, ) and HH-RLHF (bai2022training, ) datasets, adopted in (zhang2023copf, ) in their study on continual learning through optimal policy fitting, are leveraged to model human preferences dynamically. Similarly, the Reddit TL;DR dataset (volske2017tl, ), used by (zhangcppo, ; zhang2023copf, ), focuses on text summarization, providing a robust platform for testing the longevity and adaptability of learning algorithms under evolving conditions. Lastly, Common Sense QA (clark2018think, ; lai2017race, ; bisk2020piqa, ), Reading Comprehension (rajpurkar2018know, ; dua2019drop, ), and Translation (bojar2014findings, ), utilized in (lin2024mitigating, ), are selected to assess the challenge of aligning RL agents with human expectations without incurring significant performance penalties. Each of these datasets is pivotal in advancing the understanding of continual learning and the interplay between human feedback and machine learning adaptation.

Datasets for Continual Multimodal Large Language Models. Following LLaVA (liu2023visual, ), many MLLMs adopt instruction tuning, which makes it possible to assess both alignment with human intention and the knowledge preserved for reasoning. Traditional tasks like image classification, which are otherwise challenging to assess with conventional protocols, can thus be transformed into VQA tasks to evaluate the abilities of MLLMs. Several benchmarks have been proposed to evaluate CL methods for MLLMs. MCIT (he2023continual, ) proposes the first continual instruction tuning benchmarks, Benchmark1 and Benchmark2; the difference between them is that Benchmark2 includes multi-task joint instruction tuning, exploring whether such joint tuning improves the model’s continual learning ability. (zhai2023investigating, ) proposes EMT, the first classification evaluation framework for investigating catastrophic forgetting in MLLMs. (chen2024coin, ) presents a comprehensive benchmark, CoIN, spanning 8 task categories and evaluating MLLMs from two perspectives: Instruction Following and General Knowledge, which assess alignment with human intention and the knowledge preserved for reasoning, respectively. (zhao2024reconstruct, ) constructs two datasets, UPMC-Food101-CMML and MM-IMDb-CMML, to benchmark the novel CMML task, in which the data of certain modalities is missing during continual fine-tuning. UPMC-Food101-CMML contains 101 food categories with 61,142 training, 6,846 validation, and 22,716 test image-text pairs; MM-IMDb-CMML is a multi-label classification dataset across 27 distinct movie genres, consisting of 15,552 training, 2,608 validation, and 7,799 test image-text pairs.

6. Discussion

In this section, we delve into the intersection of conventional computational patterns in continual learning and the training and deployment of large language models (LLMs). We begin by examining intriguing properties that arise during continual learning with LLMs. Next, we explore the evolving roles of three types of incremental learning within the context of LLMs. Following this, we contrast the roles of memory in continual LLMs with those in traditional continual learning. Finally, we conclude with a concise overview of promising directions for future research in this area.

6.1. Intriguing Properties Emergent in Continual LLMs

Beyond the well-established resilience of pre-trained large language models (LLMs) against catastrophic forgetting compared to downstream-specific models (ke2021achieve, ; tao2022can, ; luo2023investigating, ; zheng2023learn, ; mehta2023empirical, ), there is a notable lack of exploration into other intriguing properties of LLMs when trained continually. While investigations into the emergent capabilities of continually trained LLMs have attracted some attention from the community, they remain relatively limited. For instance, (yang2024reawakening, ) observes that when fine-tuned sequentially and cyclically on a series of documents, large models exhibit a phenomenon known as “anticipatory recovery”: the LLMs recover forgotten information about documents even before encountering them again. This suggests that LLMs may possess the capability of sequential memorization, which could pave the way for research into memory replay and more complex structured learning environments as model parameters scale up.

6.2. Conventional Types of Incremental Learning

As mentioned in Section 2.2, three types of incremental learning are prevalent (van2022three, ). Among them, class-incremental learning (CIL) has historically attracted significant attention from the community (rebuffi2017icarl, ; wu2019large, ). However, in the context of continually pre-training and adapting large language models (LLMs), we observe a decreased interest in CIL but an increased focus on task-incremental learning (TIL) and domain-incremental learning (DIL). Given that language models are inherently designed for content generation and are pre-trained with the pretext generative task of next-word prediction, it is natural to emphasize the patterns of generative tasks and integrate the traditional CIL paradigm into the broader framework of language modeling, discarding the incremental classification head (shao2023class, ; dalessandro2023multimodal, ; cao2024generative, ). For instance, in Vocabulary-Aware Label Generation (VAG), CIL is redefined as the task of continual label generation. This approach utilizes a pre-trained encoder-decoder language model to generate class labels (shao2023class, ). Meanwhile, in the Generative Multi-modal Model (GMM) for CIL (cao2024generative, ), image patches and prompts are concatenated and fed into the language model to generate classification results.

However, the declining attention to the conventional CIL paradigm does not suggest that these techniques have no impact on continual learning for LLMs. On the contrary, many current research endeavors unwittingly employ such techniques, indicating their widespread adoption in various applications. For example, techniques such as vocabulary expansion (amba2021dynamic, ; cossu2022continual, ) can be seen as an extension of expanding the classification head in CIL. These CIL techniques can be further integrated into systems like Lifelong-MoE (chen2023lifelong, ), where adding a new expert to the transformer blocks requires updating the gating function to include the routing of the newly added expert. The EProj framework, described in (he2023continual, ), employs a similar architecture, incorporating linear projectors for new domains alongside a selector module trained to route among these projectors. Since the aforementioned sub-modules operate on the principles of CIL, previously validated techniques can be directly applied.

The importance of domain-incremental learning (DIL) is self-evident, given the shared task definition and input-output format in continual pre-training (CPT) and domain-adaptive pre-training (DAP). As dynamically expanding token vocabularies can pose additional challenges, it is natural to focus on understanding distributional shifts within the input corpus while keeping the vocabulary fixed. On another front, task-incremental learning (TIL) attracts significant interest due to its potential for personalizing LLM services. For instance, users may desire options for selecting domain-specific experts, thereby making task IDs available throughout inference time (huang2023lorahub, ; wistuba2023continual, ). Additionally, TIL plays a crucial role in instruction tuning, where instructions can be seen as natural-language-encoded task information (scialom2022fine, ; huang2024mitigating, ; mok2023large, ; he2024dont, ; yin2022contintin, ; wang2023orthogonal, ; zhao2024sapt, ; wang2024inscl, ). It is worth noting that the boundary between TIL and DIL becomes somewhat blurred in continual instruction tuning. Language models demonstrate the capability to infer domain information for unseen instructions, suggesting a convergence of TIL and DIL in certain contexts.

6.3. Roles of Memory in Continual LLMs

Previous continual learning research, drawing inspiration from human learning patterns, primarily emphasizes the storage efficiency of past data. The setting of continual learning with limited memory size has garnered significant attention from the community. However, this focus may no longer hold true in the context of continual LLMs. In the direction of relaxing memory constraints, institutions with access to training data may opt to retain full access without restricting memory size, given that the cost of memory storage is more than affordable. In such scenarios, as highlighted in (verwimp2024continual, ), the challenge shifts from storage efficiency to computational efficiency. To achieve continual learning goals, models must efficiently adapt to new data (efficient adaptation) and select key experiences for replay (efficient replay) (xie2023efficient, ; jin2024model, ). Therefore, it is essential to reassess the existing memory constraint and prioritize optimizing computational efficiency for continual learning of LLMs by restricting the number of updates and the number of FLOPs (prabhu2023computationally, ; wang2022sparcl, ).

On the other end of the spectrum, studies with tightened memory constraints remain vital in modern continual learning of LLMs. As shown in Fig. 1, upstream suppliers of LLMs typically do not provide training data with the released model weights. Consequently, consumers must adapt these models to downstream data without access to the actual replay data. Various rehearsal-free continual strategies are applied in this scenario, such as collecting data examples from alternate sources (rozière2024code, ; colombo2024saullm7b, ; wu2023pmc, ; Azerbayev2023LLEMMA, ), leveraging the generative capabilities of LLMs to produce pseudo-examples for replay (qin2021lfpt5, ), and implementing regularization techniques in the parameter space (ke2022continual-pre, ; rongali2021continual, ). Continual learning under the strict memory constraint is also driven by data privacy concerns, where preserving data on the server side is prohibited. In these scenarios, researchers must rely on online continual learning methods (cai2021online, ; mai2022online, ; prabhu2023online, ), where data examples are only utilized for training as they arrive in a stream, and numerous efforts are already underway to develop LLMs capable of operating under these constraints (yang2022continual, ; wang2022online, ; bornschein2024transformers, ).

6.4. Prospective Directions

Theories of Continual LLMs. It is widely recognized that the continual learning community tends to prioritize empirical research over theoretical exploration. Nevertheless, there are efforts to establish theoretical foundations for CL. In (wang2024comprehensive, ), the authors utilize second-order Taylor expansions around optimal parameters to derive an inter-task generalization error bound based on the maximum eigenvalue and the $\ell_{2}$-norm of parameter differences. Another line of approaches leverages task/domain discrepancies to construct a multi-task generalization bound. For instance, Unified Domain Incremental Learning (UDIL) in (shi2024unified, ) proposes upper bounds for intra-domain and cross-domain distillation losses, unifying various replay-based DIL techniques under a single adaptive generalization bound. However, applying these existing theories directly to continual LLMs can be imprudent, given their pre-trained, large-scale nature. Consequently, there is a notable gap in research focusing on continually learning LLMs with robust theoretical guarantees and understanding the forgetting behaviors of LLMs from a theoretical perspective.

Efficient Replay for Knowledge Retention in Continual LLMs. Computational resources for training large-scale LLMs are often limited. While the storage budget can theoretically be infinite (Section 6.3), replaying past experiences without careful design can lead to inefficient updates in current-domain learning and slow convergence. Beyond sparse replay solutions that control data mixture ratios (lin2023geogalactica, ; rozière2024code, ; Yang2023PLLaMa, ), there is ongoing exploration of efficient replay for continual LLMs. For example, the method in (he2024dont, ) enhances replay efficiency by computing Key-Part Information Gain (KPIG) on masked segments, enabling the dynamic selection of replay data. The pioneering work in (jin2024model, ) introduces a forgetting-forecasting mechanism based on output changes during adaptation, later used for selective replay in continual model refinement (CMR); it verifies that filtering replay samples by their tendency to be forgotten significantly improves knowledge retention for continual LLMs. Nonetheless, more sophisticated and accurate data-mixing strategies and efficient replay-sample selection mechanisms are still needed, e.g., a data mixing ratio that adapts dynamically throughout training. Hence we mark this practical direction of efficient replay for LLMs as a significant future research focus.
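
The selective-replay idea can be illustrated with a simple drift-based heuristic: score candidate replay examples by how much their loss increased after adaptation and replay the most affected ones. This is a simplified stand-in written for illustration, not the actual KPIG or CMR procedure; all names below are ours.

```python
import torch

@torch.no_grad()
def select_replay_by_drift(model_before, model_after, candidates, loss_fn, k=256):
    """Pick the k past examples whose loss increased most after adaptation.

    Examples with the largest loss increase are the ones the updated model is most
    likely to have forgotten, so they are prioritized for replay.
    """
    drift = []
    for example in candidates:
        before = loss_fn(model_before, example).item()
        after = loss_fn(model_after, example).item()
        drift.append(after - before)
    ranked = sorted(range(len(candidates)), key=lambda i: drift[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```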

Continual LLMs with Controllable Memory. The long-term memory embedded in the full set of LLM parameters often lacks interpretability and explicit manipulability, both of which are crucial in certain application areas. For instance, consider a scenario where a supplier collects data from customers with their consent and continually uses this data to update LLMs. If some users later revoke their consent, the knowledge the trained model acquired from that portion of data must also be revoked. With a continually pre-trained large-scale LLM, the only recourse is to roll back to a model version predating the inclusion of these users’ data and retrain from that point onward. This example of “machine unlearning” (bourtoule2020machine, ; nguyen2022survey, ) vividly illustrates the benefit of equipping LLMs with an external, controllable memory. As part of continual model refinement (CMR), memory systems for continual learning have been explored in several studies. Larimar (das2024larimar, ) proposes integrating the Kanerva Machine (wu2018kanerva, ) as an episodic memory for multi-fact model editing. This memory system supports basic operations such as writing, reading, and generating, as well as advanced operations such as sequential writing and forgetting, and it enables one-shot knowledge updates without costly retraining or fine-tuning. Other memory systems, such as Hopfield Networks (ramsauer2021hopfield, ; pourcel2022online, ), also hold promise for future investigation.
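
To illustrate the kind of explicit controllability such an external memory affords, below is a toy key-value episodic memory with write, read, and per-owner forget operations. It only mimics the interface discussed above and is not Larimar’s or the Kanerva Machine’s actual mechanism; the class and method names are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

class EpisodicMemory:
    """Toy key-value memory supporting write, read, and selective forgetting."""

    def __init__(self, dim):
        self.dim = dim
        self.keys = torch.empty(0, dim)     # one row per stored fact
        self.values = torch.empty(0, dim)
        self.owners = []                    # e.g., user IDs, enabling per-user revocation

    def write(self, key, value, owner):
        self.keys = torch.cat([self.keys, key.view(1, -1)])
        self.values = torch.cat([self.values, value.view(1, -1)])
        self.owners.append(owner)

    def read(self, query):
        if len(self.owners) == 0:
            return torch.zeros(self.dim)
        attn = F.softmax(self.keys @ query / self.dim ** 0.5, dim=0)
        return attn @ self.values           # soft lookup over stored facts

    def forget(self, owner):
        """Revoke everything written by `owner` without touching model weights."""
        keep = [i for i, o in enumerate(self.owners) if o != owner]
        self.keys = self.keys[keep]
        self.values = self.values[keep]
        self.owners = [self.owners[i] for i in keep]
```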

Continual LLMs with Custom Preferences. Accommodating custom user preferences is critical for LLMs, especially in service-oriented contexts. Users often require different trade-offs among domain expertise, ethics, values, and tone of expression. Efficiently building customized LLMs for individual users and offering flexible adjustment options remains challenging. Early attempts in this direction include Imprecise Bayesian Continual Learning (IBCL), which, under certain assumptions, guarantees the generation of Pareto-optimal models matching user preferences by combining two model posteriors in the parameter space (lu2023ibcl, ). While its empirical validation is limited in scale, this approach paves the way for future research in this area.
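
As a deliberately crude caricature of combining models in the parameter space according to a user preference, the sketch below linearly interpolates two specialized checkpoints with a preference weight alpha. IBCL itself combines Bayesian posteriors and comes with Pareto-optimality guarantees that this naive mix does not provide; the function below is purely illustrative.

```python
import copy

def preference_weighted_model(model_a, model_b, alpha):
    """Return a model whose weights interpolate between two specialized checkpoints.

    alpha in [0, 1] encodes the user's preferred trade-off (e.g., domain expertise
    vs. a more conservative tone). This is a plain parameter-space mix, not IBCL's
    posterior combination.
    """
    merged = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    merged.load_state_dict({
        name: (alpha * tensor_a + (1.0 - alpha) * state_b[name])
              if tensor_a.is_floating_point() else tensor_a
        for name, tensor_a in state_a.items()
    })
    return merged
```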

7. Conclusion

In this work, we offer a comprehensive survey on continual LLMs, summarizing recent advancements in their training and deployment from a continual learning standpoint. We categorize the problems and tasks based on their positions within our proposed broader framework of modern stratified continual learning of LLMs. While there is a widespread and growing interest in this area across the community, we also note several missing cornerstones, including algorithmic diversity and a fundamental understanding of large models’ behaviors such as knowledge forgetting, transfer, and acquisition. With a holistic yet detailed approach, we aim for this survey to inspire more practitioners to explore continual learning techniques, ultimately contributing to the development of robust and self-evolving AI systems.

References

  • [1] H. Abdine, M. Chatzianastasis, C. Bouyioukos, and M. Vazirgiannis. Prot2text: Multimodal protein’s function generation with GNNs and transformers. In Deep Generative Models for Health Workshop NeurIPS 2023, 2023.
  • [2] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • [3] E. C. Acikgoz, O. B. İnce, R. Bench, A. A. Boz, İ. Kesen, A. Erdem, and E. Erdem. Hippocrates: An open-source framework for advancing large language models in healthcare. arXiv preprint arXiv:2404.16621, 2024.
  • [4] M. Agarwal, Y. Shen, B. Wang, Y. Kim, and J. Chen. Structured code representations enable data-efficient adaptation of code language models, 2024.
  • [5] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
  • [6] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3366–3375, 2017.
  • [7] S. Amba Hombaiah, T. Chen, M. Zhang, M. Bendersky, and M. Najork. Dynamic language models for continuously evolving content. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2514–2524, 2021.
  • [8] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • [9] D. Araci. Finbert: Financial sentiment analysis with pre-trained language models, 2019.
  • [10] G. Attanasio, D. Nozza, F. Bianchi, and D. Hovy. Is it worth the (environmental) cost? limited evidence for temporal adaptation via continuous training, 2023.
  • [11] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos, S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and S. Welleck. Llemma: An open language model for mathematics. CoRR, abs/2310.10631, 2023.
  • [12] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [13] X. Bai, J. Shang, Y. Sun, and N. Balasubramanian. Enhancing continual learning with global prototypes: Counteracting negative representation drift, 2023.
  • [14] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • [15] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
  • [16] J. Bang, H. Kim, Y. Yoo, J.-W. Ha, and J. Choi. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8218–8227, June 2021.
  • [17] Z. Bao, W. Chen, S. Xiao, K. Ren, J. Wu, C. Zhong, J. Peng, X. Huang, and Z. Wei. Disc-medllm: Bridging general large language models and real-world medical consultation, 2023.
  • [18] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn. The pushshift reddit dataset, 2020.
  • [19] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine learning, 79:151–175, 2010.
  • [20] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng, and H. Chen. Oceangpt: A large language model for ocean science tasks. CoRR, abs/2310.02031, 2023.
  • [21] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
  • [22] M. Biesialska, K. Biesialska, and M. R. Costa-jussà. Continual lifelong learning in natural language processing: A survey. In D. Scott, N. Bel, and C. Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6523–6541, Barcelona, Spain (Online), Dec. 2020. International Committee on Computational Linguistics.
  • [23] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020.
  • [24] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pages 12–58, 2014.
  • [25] J. Bornschein, Y. Li, and A. Rannen-Triki. Transformers for supervised online continual learning, 2024.
  • [26] L. Bourtoule, V. Chandrasekaran, C. A. Choquette-Choo, H. Jia, A. Travers, B. Zhang, D. Lie, and N. Papernot. Machine unlearning, 2020.
  • [27] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [28] M. Brümmer, M. Dojchinovski, and S. Hellmann. Dbpedia abstracts: A large-scale, open, multilingual nlp training corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 3339–3343, 2016.
  • [29] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33:15920–15930, 2020.
  • [30] L. Caccia, R. Aljundi, N. Asadi, T. Tuytelaars, J. Pineau, and E. Belilovsky. New insights on reducing abrupt representation change in online continual learning. arXiv preprint arXiv:2104.05025, 2021.
  • [31] Z. Cai, O. Sener, and V. Koltun. Online continual learning with natural distribution shifts: An empirical study with visual data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8281–8290, 2021.
  • [32] H. Cao, Z. Liu, X. Lu, Y. Yao, and Y. Li. Instructmol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. CoRR, abs/2311.16208, 2023.
  • [33] X. Cao, H. Lu, L. Huang, X. Liu, and M.-M. Cheng. Generative multi-modal models are good class incremental learners. IEEE Computer Vision and Pattern Recognition (CVPR), 2024.
  • [34] Caselaw Access Project. Caselaw access project, 2018.
  • [35] Y. Chai, S. Wang, C. Pang, Y. Sun, H. Tian, and H. Wu. ERNIE-code: Beyond English-centric cross-lingual pretraining for programming languages. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 10628–10650, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [36] I. Chalkidis, T. Pasini, S. Zhang, L. Tomada, S. F. Schwemer, and A. Søgaard. Fairlex: A multilingual benchmark for evaluating fairness in legal text processing. arXiv preprint arXiv:2203.07228, 2022.
  • [37] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. In ICLR, 2019.
  • [38] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019.
  • [39] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling, 2014.
  • [40] B. Chen, X. Cheng, P. Li, Y. Geng, J. Gong, S. Li, Z. Bei, X. Tan, B. Wang, X. Zeng, C. Liu, A. Zeng, Y. Dong, J. Tang, and L. Song. xtrimopglm: Unified 100b-scale pre-trained transformer for deciphering the language of protein. CoRR, abs/2401.06199, 2024.
  • [41] C. Chen, J. Zhu, X. Luo, H. Shen, L. Gao, and J. Song. Coin: A benchmark of continual instruction tuning for multimodel large language model, 2024.
  • [42] J. Chen, X. Wang, A. Gao, F. Jiang, S. Chen, H. Zhang, D. Song, W. Xie, C. Kong, J. Li, X. Wan, H. Li, and B. Wang. Huatuogpt-ii, one-stage training for medical adaption of llms. CoRR, abs/2311.09774, 2023.
  • [43] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code, 2021.
  • [44] S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. In B. Webber, T. Cohn, Y. He, and Y. Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7870–7881, Online, Nov. 2020. Association for Computational Linguistics.
  • [45] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts, G. K. Savova, R. H. Mak, and D. S. Bitterman. The utility of chatgpt for cancer treatment information. medRxiv, 2023.
  • [46] W. Chen, Y. Zhou, N. Du, Y. Huang, J. Laudon, Z. Chen, and C. Cui. Lifelong language pretraining with distribution-specialized experts. In International Conference on Machine Learning, pages 5383–5395. PMLR, 2023.
  • [47] X. Chen, Z. Wang, D. Sow, J. Yang, T. Chen, Y. Liang, M. Zhou, and Z. Wang. Take the bull by the horns: Hard sample-reweighted continual training improves llm generalization. arXiv preprint arXiv:2402.14270, 2024.
  • [48] Y. Chen, S. Zhang, G. Qi, and X. Guo. Parameterizing context: Unleashing the power of parameter-efficient fine-tuning and in-context tuning for continual table semantic parsing. Advances in Neural Information Processing Systems, 36, 2024.
  • [49] Z. Chen and B. Liu. Lifelong machine learning, volume 1. Springer.
  • [50] D. Cheng, S. Huang, and F. Wei. Adapting large language models via reading comprehension, 2024.
  • [51] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  • [52] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  • [53] P. Colombo, T. P. Pires, M. Boudiaf, D. Culver, R. Melo, C. Corro, A. F. T. Martins, F. Esposito, V. L. Raposo, S. Morgado, and M. Desa. Saullm-7b: A pioneering large language model for law, 2024.
  • [54] Together Computer. Redpajama: an open dataset for training large language models, 2023.
  • [55] A. O. Constantinescu, J. X. O’Reilly, and T. E. Behrens. Organizing conceptual knowledge in humans with a gridlike code. Science, 352(6292):1464–1468, 2016.
  • [56] A. Cossu, T. Tuytelaars, A. Carta, L. Passaro, V. Lomonaco, and D. Bacciu. Continual pre-training mitigates forgetting in language and vision, 2022.
  • [57] M. D’Alessandro, A. Alonso, E. Calabrés, and M. Galar. Multimodal parameter-efficient few-shot class incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 3393–3403, October 2023.
  • [58] P. Das, S. Chaudhury, E. Nelson, I. Melnyk, S. Swaminathan, S. Dai, A. Lozano, G. Kollias, V. Chenthamarakshan, S. Dan, et al. Larimar: Large language models with episodic memory control. arXiv preprint arXiv:2403.11901, 2024.
  • [59] P. Dasigi, N. F. Liu, A. Marasović, N. A. Smith, and M. Gardner. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5925–5932, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
  • [60] N. De Cao, W. Aziz, and I. Titov. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164, 2021.
  • [61] M. De Lange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366–3385, 2021.
  • [62] C. Deng, T. Zhang, Z. He, Y. Xu, Q. Chen, Y. Shi, L. Fu, W. Zhang, X. Wang, C. Zhou, Z. Lin, and J. He. K2: A foundation language model for geoscience knowledge understanding and utilization, 2023.
  • [63] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [64] Y. Deng, W. Lei, W. Lam, and T.-S. Chua. A survey on proactive dialogue systems: Problems, methods, and prospects. arXiv preprint arXiv:2305.02750, 2023.
  • [65] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  • [66] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [67] B. Dhingra, J. R. Cole, J. M. Eisenschlos, D. Gillick, J. Eisenstein, and W. W. Cohen. Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10:257–273, 2022.
  • [68] P. Di, J. Li, H. Yu, W. Jiang, W. Cai, Y. Cao, C. Chen, D. Chen, H. Chen, L. Chen, et al. Codefuse-13b: A pretrained multi-lingual code large language model. arXiv preprint arXiv:2310.06266, 2023.
  • [69] Q. Dong, D. Dai, Y. Song, J. Xu, Z. Sui, and L. Li. Calibrating factual knowledge in pretrained language models. arXiv preprint arXiv:2210.03329, 2022.
  • [70] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [71] L. Dou, Q. Liu, G. Zeng, J. Guo, J. Zhou, W. Lu, and M. Lin. Sailor: Open language models for south-east asia. arXiv preprint arXiv:2404.03608, 2024.
  • [72] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. Le, Y. Wu, Z. Chen, and C. Cui. GLaM: Efficient scaling of language models with mixture-of-experts. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5547–5569. PMLR, 17–23 Jul 2022.
  • [73] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161, 2019.
  • [74] S. Ebrahimi, M. Elhoseiny, T. Darrell, and M. Rohrbach. Uncertainty-guided continual learning with bayesian neural networks. arXiv preprint arXiv:1906.02425, 2019.
  • [75] S. Ebrahimi, F. Meier, R. Calandra, T. Darrell, and M. Rohrbach. Adversarial continual learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 386–402. Springer, 2020.
  • [76] H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl. T-rex: A large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
  • [77] K. Fujii, T. Nakamura, M. Loem, H. Iida, M. Ohi, K. Hattori, H. Shota, S. Mizuki, R. Yokota, and N. Okazaki. Continual pre-training for cross-lingual llm adaptation: Enhancing japanese language capabilities. arXiv preprint arXiv:2404.17790, 2024.
  • [78] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky. Domain-adversarial training of neural networks. The journal of machine learning research, 17(1):2096–2030, 2016.
  • [79] J. Gao, R. Pi, J. Zhang, J. Ye, W. Zhong, Y. Wang, L. Hong, J. Han, H. Xu, Z. Li, and L. Kong. G-llava: Solving geometric problem with multi-modal large language model. CoRR, abs/2312.11370, 2023.
  • [80] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  • [81] S. Garg, S. Dutta, M. Dalirrooyfard, A. Schneider, and Y. Nevmyvaka. In- or out-of-distribution detection via dual divergence estimation. In R. J. Evans and I. Shpitser, editors, Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216 of Proceedings of Machine Learning Research, pages 635–646. PMLR, 31 Jul–04 Aug 2023.
  • [82] S. Garg, M. Farajtabar, H. Pouransari, R. Vemulapalli, S. Mehta, O. Tuzel, V. Shankar, and F. Faghri. Tic-clip: Continual training of clip models. In The Twelfth International Conference on Learning Representations (ICLR), 2024.
  • [83] E. Gogoulou, T. Lesort, M. Boman, and J. Nivre. Continual learning under language shift, 2024.
  • [84] A. Gokaslan and V. Cohen. Openwebtext corpus, 2019.
  • [85] Z. Gou, Z. Shao, Y. Gong, yelong shen, Y. Yang, M. Huang, N. Duan, and W. Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In The Twelfth International Conference on Learning Representations, 2024.
  • [86] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering, 2017.
  • [87] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare, 3(1):1–23, Oct. 2021.
  • [88] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. K. Li, F. Luo, Y. Xiong, and W. Liang. Deepseek-coder: When the large language model meets programming – the rise of code intelligence, 2024.
  • [89] Z. Guo and Y. Hua. Continuous training and fine-tuning for domain-specific language models in medical question answering, 2023.
  • [90] K. Gupta, B. Thérien, A. Ibrahim, M. L. Richter, Q. Anthony, E. Belilovsky, I. Rish, and T. Lesort. Continual pre-training of large language models: How to (re)warm your model?, 2023.
  • [91] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. Vizwiz grand challenge: Answering visual questions from blind people, 2018.
  • [92] S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettlemoyer. DEMix layers: Disentangling domains for modular language modeling. In M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5557–5576, Seattle, United States, July 2022. Association for Computational Linguistics.
  • [93] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith. Don’t stop pretraining: Adapt language models to domains and tasks. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online, July 2020. Association for Computational Linguistics.
  • [94] R. Han, X. Ren, and N. Peng. ECONET: Effective continual pretraining of language models for event temporal reasoning. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5367–5380, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.
  • [95] T. Han, L. C. Adams, J. Papaioannou, P. Grundmann, T. Oberhauser, A. Löser, D. Truhn, and K. K. Bressem. Medalpaca - an open-source collection of medical conversational AI models and training data. CoRR, abs/2304.08247, 2023.
  • [96] Y. Hao, L. Dong, F. Wei, and K. Xu. Visualizing and understanding the effectiveness of BERT. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4143–4152, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
  • [97] T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y. Kim, and M. Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. In Advances in Neural Information Processing Systems, 2023.
  • [98] P. Hase, M. Bansal, B. Kim, and A. Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models, 2023.
  • [99] P. Hase, M. Diab, A. Celikyilmaz, X. Li, Z. Kozareva, V. Stoyanov, M. Bansal, and S. Iyer. Do language models have beliefs? methods for detecting, updating, and visualizing model beliefs. arXiv preprint arXiv:2111.13654, 2021.
  • [100] T. L. Hayes and C. Kanan. Lifelong machine learning with deep streaming linear discriminant analysis, 2020.
  • [101] J. He, H. Guo, M. Tang, and J. Wang. Continual instruction tuning for large multimodal models, 2023.
  • [102] R. He and J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, page 507–517, Republic and Canton of Geneva, CHE, 2016. International World Wide Web Conferences Steering Committee.
  • [103] T. He, J. Liu, K. Cho, M. Ott, B. Liu, J. Glass, and F. Peng. Analyzing the forgetting problem in pretrain-finetuning of open-domain dialogue response models. In P. Merlo, J. Tiedemann, and R. Tsarfaty, editors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1121–1133, Online, Apr. 2021. Association for Computational Linguistics.
  • [104] Y. He, F. Huang, X. Jiang, Y. Nie, M. Wang, J. Wang, and H. Chen. Foundation model for advancing healthcare: Challenges, opportunities, and future directions. arXiv preprint arXiv:2404.03264, 2024.
  • [105] Y. He, X. Huang, M. Tang, L. Meng, X. Li, W. Lin, W. Zhang, and Y. Gao. Don’t half-listen: Capturing key-part information in continual instruction tuning, 2024.
  • [106] D. Hendrycks, C. Burns, S. Basart, A. Critch, J. Li, D. Song, and J. Steinhardt. Aligning ai with shared human values, 2023.
  • [107] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  • [108] D. Hewlett, A. Lacoste, L. Jones, I. Polosukhin, A. Fandrianto, J. Han, M. Kelcey, and D. Berthelot. Wikireading: A novel large-scale language understanding task over wikipedia. arXiv preprint arXiv:1608.03542, 2016.
  • [109] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  • [110] C. Hu, P. Cao, Y. Chen, K. Liu, and J. Zhao. Wilke: Wise-layer knowledge editor for lifelong knowledge editing, 2024.
  • [111] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • [112] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  • [113] Y. Hu, T. Ganter, H. Deilamsalehy, F. Dernoncourt, H. Foroosh, and F. Liu. Meetingbank: A benchmark dataset for meeting summarization, 2023.
  • [114] C. Huang, Q. Liu, B. Y. Lin, T. Pang, C. Du, and M. Lin. Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269, 2023.
  • [115] J. Huang, L. Cui, A. Wang, C. Yang, X. Liao, L. Song, J. Yao, and J. Su. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal, 2024.
  • [116] L. Huang, R. L. Bras, C. Bhagavatula, and Y. Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning, 2019.
  • [117] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, and Y. Feng. Lawyer llama technical report. arXiv preprint arXiv:2305.15062, 2023.
  • [118] Q. Huang, M. Tao, C. Zhang, Z. An, C. Jiang, Z. Chen, Z. Wu, and Y. Feng. Lawyer llama. https://github.com/AndrewZhe/lawyer-llama, 2023.
  • [119] Z. Huang, Y. Shen, X. Zhang, J. Zhou, W. Rong, and Z. Xiong. Transformer-patcher: One mistake worth one neuron. arXiv preprint arXiv:2301.09785, 2023.
  • [120] D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019.
  • [121] A. Ibrahim, B. Thérien, K. Gupta, M. L. Richter, Q. Anthony, T. Lesort, E. Belilovsky, and I. Rish. Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763, 2024.
  • [122] J. Jang, S. Ye, C. Lee, S. Yang, J. Shin, J. Han, G. Kim, and M. Seo. Temporalwiki: A lifelong benchmark for training and evaluating ever-evolving language models. 2022.
  • [123] J. Jang, S. Ye, S. Yang, J. Shin, J. Han, G. Kim, S. J. Choi, and M. Seo. Towards continual knowledge learning of language models. In ICLR, 2022.
  • [124] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. Sabel, J. Ricke, and M. Ingrisch. Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports, 2022.
  • [125] S. Jha, D. Gong, and L. Yao. Clap4clip: Continual learning with probabilistic finetuning for vision-language models. arXiv preprint arXiv:2403.19137, 2024.
  • [126] J. Ji, T. Qiu, B. Chen, B. Zhang, H. Lou, K. Wang, Y. Duan, Z. He, J. Zhou, Z. Zhang, F. Zeng, K. Y. Ng, J. Dai, X. Pan, A. O’Gara, Y. Lei, H. Xu, B. Tse, J. Fu, S. McAleer, Y. Yang, Y. Wang, S.-C. Zhu, Y. Guo, and W. Gao. Ai alignment: A comprehensive survey, 2024.
  • [127] S. Jiang, Y. Wang, and Y. Wang. Selfevolve: A code evolution framework via large language models, 2023.
  • [128] Y. Jiang, Z. Pan, X. Zhang, S. Garg, A. Schneider, Y. Nevmyvaka, and D. Song. Empowering time series analysis with large language models: A survey, 2024.
  • [129] Z. Jiang, Z. Sun, W. Shi, P. Rodriguez, C. Zhou, G. Neubig, X. V. Lin, W. tau Yih, and S. Iyer. Instruction-tuned language models are better knowledge learners, 2024.
  • [130] X. Jin and X. Ren. What will my model forget? forecasting forgotten examples in language model refinement, 2024.
  • [131] X. Jin, D. Zhang, H. Zhu, W. Xiao, S.-W. Li, X. Wei, A. Arnold, and X. Ren. Lifelong pretraining: Continually adapting language models to emerging corpora. In A. Fan, S. Ilic, T. Wolf, and M. Gallé, editors, Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models, pages 1–16, virtual+Dublin, May 2022. Association for Computational Linguistics.
  • [132] E. R. Kandel, J. H. Schwartz, T. M. Jessell, S. Siegelbaum, A. J. Hudspeth, S. Mack, et al. Principles of neural science, volume 4. McGraw-Hill, New York, 2000.
  • [133] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • [134] S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. ReferItGame: Referring to objects in photographs of natural scenes. In A. Moschitti, B. Pang, and W. Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, Oct. 2014. Association for Computational Linguistics.
  • [135] Z. Ke, H. Lin, Y. Shao, H. Xu, L. Shu, and B. Liu. Continual training of language models for few-shot learning. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10205–10216, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.
  • [136] Z. Ke and B. Liu. Continual learning of natural language processing tasks: A survey, 2023.
  • [137] Z. Ke, B. Liu, N. Ma, H. Xu, and S. Lei. Achieving forgetting prevention and knowledge transfer in continual learning. In NeurIPS, 2021.
  • [138] Z. Ke, Y. Shao, H. Lin, T. Konishi, G. Kim, and B. Liu. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations, 2022.
  • [139] T. Kew, M. Kostrzewa, and S. Ebling. 20 minuten: A multi-task news summarisation dataset for German. In H. Ghorbel, M. Sokhn, M. Cieliebak, M. Hürlimann, E. de Salis, and J. Guerne, editors, Proceedings of the 8th edition of the Swiss Text Analytics Conference, pages 1–13, Neuchatel, Switzerland, June 2023. Association for Computational Linguistics.
  • [140] M. Khan, P. Srivatsa, A. Rane, S. Chenniappa, A. Hazariwala, and P. Maes. Personalizing pre-trained models. arXiv preprint arXiv:2106.01499, 2021.
  • [141] D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, 2018.
  • [142] D. Khashabi, T. Khot, A. Sabharwal, and D. Roth. Learning what is essential in questions. In R. Levy and L. Specia, editors, Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 80–89, Vancouver, Canada, Aug. 2017. Association for Computational Linguistics.
  • [143] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal. Qasc: A dataset for question answering via sentence composition, 2020.
  • [144] G. Kim, C. Xiao, T. Konishi, Z. Ke, and B. Liu. A theoretical study on solving continual learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5065–5079. Curran Associates, Inc., 2022.
  • [145] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • [146] P. Koehn. Europarl: A parallel corpus for statistical machine translation. In Proceedings of Machine Translation Summit X: Papers, pages 79–86, Phuket, Thailand, Sept. 13-15 2005.
  • [147] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
  • [148] C. C. T. Kwok, O. Etzioni, and D. S. Weld. Scaling question answering to the web. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, page 150–161, New York, NY, USA, 2001. Association for Computing Machinery.
  • [149] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683, 2017.
  • [150] A. Lazaridou, A. Kuncoro, E. Gribovskaya, D. Agrawal, A. Liska, T. Terzi, M. Gimenez, C. de Masson d’Autume, T. Kocisky, S. Ruder, et al. Mind the gap: Assessing temporal generalization in neural language models. Advances in Neural Information Processing Systems, 34:29348–29363, 2021.
  • [151] B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.
  • [152] O. Levy, M. Seo, E. Choi, and L. Zettlemoyer. Zero-shot relation extraction via reading comprehension. arXiv preprint arXiv:1706.04115, 2017.
  • [153] B. Li, Y. Zhang, L. Chen, J. Wang, J. Yang, and Z. Liu. Otter: A multi-modal model with in-context instruction tuning, 2023.
  • [154] C.-A. Li and H.-Y. Lee. Examining forgetting in continual pre-training of aligned large language models, 2024.
  • [155] D. Li, A. S. Rawat, M. Zaheer, X. Wang, M. Lukasik, A. Veit, F. Yu, and S. Kumar. Large language models with controllable working memory. arXiv preprint arXiv:2211.05110, 2022.
  • [156] H. Li, Q. Ai, J. Chen, Q. Dong, Z. Wu, Y. Liu, C. Chen, and Q. Tian. Blade: Enhancing black-box large language models with small domain-specific models. arXiv preprint arXiv:2403.18365, 2024.
  • [157] H. Li, Y. Zhang, F. Koto, Y. Yang, H. Zhao, Y. Gong, N. Duan, and T. Baldwin. Cmmlu: Measuring massive multitask language understanding in chinese. arXiv preprint arXiv:2306.09212, 2023.
  • [158] J. Li, Y. Bian, G. Wang, Y. Lei, D. Cheng, Z. Ding, and C. Jiang. Cfgpt: Chinese financial assistant with large language model, 2023.
  • [159] J. Li, D. Li, S. Savarese, and S. Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  • [160] K. Li, Q. Hu, X. Zhao, H. Chen, Y. Xie, T. Liu, Q. Xie, and J. He. Instructcoder: Instruction tuning large language models for code editing, 2024.
  • [161] L. Li and X. Qiu. Continual model evolvement with inner-product restriction, 2023.
  • [162] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, N. Cheng, J. Wang, T. Zhou, and J. Xiao. From quantity to quality: Boosting llm performance with self-guided data selection for instruction tuning. ArXiv, abs/2308.12032, 2023.
  • [163] X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, Aug. 2021. Association for Computational Linguistics.
  • [164] Y. Li, Z. Li, K. Zhang, R. Dan, S. Jiang, and Y. Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023.
  • [165] Y. Li, G. Pang, W. Suo, C. Jing, Y. Xi, L. Liu, H. Chen, G. Liang, and P. Wang. Coleclip: Open-domain continual learning via joint task prompt and vocabulary learning. arXiv preprint arXiv:2403.10245, 2024.
  • [166] Z. Li and D. Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • [167] B. Y. Lin, S. Wang, X. Lin, R. Jia, L. Xiao, X. Ren, and S. Yih. On continual model refinement in out-of-distribution data streams. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3128–3139, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • [168] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
  • [169] K. Lin, O. Tafjord, P. Clark, and M. Gardner. Reasoning over paragraph effects in situations, 2019.
  • [170] Y. Lin, H. Lin, W. Xiong, S. Diao, J. Liu, J. Zhang, R. Pan, H. Wang, W. Hu, H. Zhang, H. Dong, R. Pi, H. Zhao, N. Jiang, H. Ji, Y. Yao, and T. Zhang. Mitigating the alignment tax of rlhf, 2024.
  • [171] Z. Lin, C. Deng, L. Zhou, T. Zhang, Y. Xu, Y. Xu, Z. He, Y. Shi, B. Dai, Y. Song, B. Zeng, Q. Chen, T. Shi, T. Huang, Y. Xu, S. Wang, L. Fu, W. Zhang, J. He, C. Ma, Y. Zhu, X. Wang, and C. Zhou. Geogalactica: A scientific large language model in geoscience, 2023.
  • [172] Z. Lin, Z. Gou, Y. Gong, X. Liu, Y. Shen, R. Xu, C. Lin, Y. Yang, J. Jiao, N. Duan, et al. Rho-1: Not all tokens are what you need. arXiv preprint arXiv:2404.07965, 2024.
  • [173] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning, 2023.
  • [174] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022.
  • [175] J. Liu, P. Zhou, Y. Hua, D. Chong, Z. Tian, A. Liu, H. Wang, C. You, Z. Guo, L. Zhu, and M. L. Li. Benchmarking large language models on cmexam – a comprehensive chinese medical exam dataset, 2023.
  • [176] X. Liu, X. Cao, H. Lu, J.-w. Xiao, A. D. Bagdanov, and M.-M. Cheng. Class incremental learning with pre-trained vision-language models. arXiv preprint arXiv:2310.20348, 2023.
  • [177] Y. Liu, R. J. Dolan, Z. Kurth-Nelson, and T. E. Behrens. Human replay spontaneously reorganizes experience. Cell, 178(3):640–652, 2019.
  • [178] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  • [179] K. Lo, L. L. Wang, M. Neumann, R. Kinney, and D. Weld. S2ORC: The semantic scholar open research corpus. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online, July 2020. Association for Computational Linguistics.
  • [180] V. Lomonaco, D. Maltoni, and L. Pellegrini. Rehearsal-free continual learning over small non-i.i.d. batches, 2020.
  • [181] D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.
  • [182] D. Loureiro, F. Barbieri, L. Neves, L. Espinosa Anke, and J. Camacho-collados. TimeLMs: Diachronic language models from Twitter. In V. Basile, Z. Kozareva, and S. Stajner, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 251–260, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • [183] D. Lu, H. Wu, J. Liang, Y. Xu, Q. He, Y. Geng, M. Han, Y. Xin, and Y. Xiao. Bbt-fin: Comprehensive construction of chinese financial domain pre-trained language model, corpus and benchmark. CoRR, abs/2302.09432, 2023.
  • [184] P. Lu, M. Caprio, E. Eaton, and I. Lee. Ibcl: Zero-shot model generation for task trade-offs in continual learning, 2023.
  • [185] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
  • [186] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang, G. Li, L. Zhou, L. Shou, L. Zhou, M. Tufano, M. Gong, M. Zhou, N. Duan, N. Sundaresan, S. K. Deng, S. Fu, and S. Liu. Codexglue: A machine learning benchmark dataset for code understanding and generation, 2021.
  • [187] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao, X. Geng, Q. Lin, S. Chen, and D. Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
  • [188] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu. Biogpt: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6), Sept. 2022.
  • [189] Y. Luo, Z. Yang, X. Bai, F. Meng, J. Zhou, and Y. Zhang. Investigating forgetting in pre-trained representations through continual learning, 2023.
  • [190] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, and Y. Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning, 2023.
  • [191] Y. Luo, J. Zhang, S. Fan, K. Yang, Y. Wu, M. Qiao, and Z. Nie. Biomedgpt: Open multimodal generative pre-trained transformer for biomedicine. arXiv preprint arXiv:2308.09442, 2023.
  • [192] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang. Wizardcoder: Empowering code large language models with evol-instruct, 2023.
  • [193] S. Ma, S. Huang, S. Huang, X. Wang, Y. Li, H.-T. Zheng, P. Xie, F. Huang, and Y. Jiang. Ecomgpt-ct: Continual pre-training of e-commerce large language models with semi-structured data, 2023.
  • [194] A. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pages 142–150, 2011.
  • [195] Z. Mai, R. Li, J. Jeong, D. Quispe, H. Kim, and S. Sanner. Online continual learning in image classification: An empirical survey. Neurocomputing, 469:28–51, 2022.
  • [196] J. Mao, J. Huang, A. Toshev, O. Camburu, A. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions, 2016.
  • [197] V. Mazzia, A. Pedrani, A. Caciolai, K. Rottmann, and D. Bernardi. A survey on knowledge editing of neural networks. arXiv preprint arXiv:2310.19704, 2023.
  • [198] D. McCaffary. Towards continual task learning in artificial neural networks: current approaches and insights from neuroscience. arXiv preprint arXiv:2112.14146, 2021.
  • [199] J. L. McClelland, B. L. McNaughton, and R. C. O’Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
  • [200] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. volume 24 of Psychology of Learning and Motivation, pages 109–165. Academic Press, 1989.
  • [201] S. V. Mehta, D. Patil, S. Chandar, and E. Strubell. An empirical investigation of the role of pre-training in lifelong learning. Journal of Machine Learning Research, 24(214):1–50, 2023.
  • [202] K. Meng, D. Bau, A. Andonian, and Y. Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  • [203] K. Meng, A. S. Sharma, A. Andonian, Y. Belinkov, and D. Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022.
  • [204] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.
  • [205] S. I. Mirzadeh, A. Chaudhry, D. Yin, H. Hu, R. Pascanu, D. Gorur, and M. Farajtabar. Wide neural networks forget less catastrophically. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 15699–15717. PMLR, 17–23 Jul 2022.
  • [206] A. Mishra, S. Shekhar, A. K. Singh, and A. Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, 2019.
  • [207] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Natural instructions: Benchmarking generalization to new tasks from natural language instructions. arXiv preprint arXiv:2104.08773, 2021.
  • [208] S. Mishra, A. Mitra, N. Varshney, B. Sachdeva, P. Clark, C. Baral, and A. Kalyan. Numglue: A suite of fundamental yet challenging mathematical reasoning tasks, 2022.
  • [209] E. Mitchell, C. Lin, A. Bosselut, C. Finn, and C. D. Manning. Fast model editing at scale. arXiv preprint arXiv:2110.11309, 2021.
  • [210] E. Mitchell, C. Lin, A. Bosselut, C. D. Manning, and C. Finn. Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831. PMLR, 2022.
  • [211] J. Mok, J. Do, S. Lee, T. Taghavi, S. Yu, and S. Yoon. Large-scale lifelong learning of in-context instructions and how to tackle it. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12573–12589, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [212] A. Moradi Dakhel, V. Majdinasab, A. Nikanjam, F. Khomh, M. C. Desmarais, and Z. M. J. Jiang. Github copilot ai pair programmer: Asset or liability? Journal of Systems and Software, 203:111734, 2023.
  • [213] N. Muennighoff, Q. Liu, A. Zebaze, Q. Zheng, B. Hui, T. Y. Zhuo, S. Singh, X. Tang, L. von Werra, and S. Longpre. Octopack: Instruction tuning code large language models, 2024.
  • [214] T. Nakamura, M. Mishra, S. Tedeschi, Y. Chai, J. T. Stillerman, F. Friedrich, P. Yadav, T. Laud, V. M. Chien, T. Y. Zhuo, et al. Aurora-m: The first open source multilingual language model red-teamed according to the us executive order. arXiv preprint arXiv:2404.00399, 2024.
  • [215] B. Neyshabur, H. Sedghi, and C. Zhang. What is being transferred in transfer learning? Advances in neural information processing systems, 33:512–523, 2020.
  • [216] T. D. Nguyen, Y. Ting, I. Ciuca, C. O’Neill, Z. Sun, M. Jablonska, S. Kruk, E. Perkowski, J. W. Miller, J. Li, J. Peek, K. Iyer, T. Rózanski, P. Khetarpal, S. Zaman, D. Brodrick, S. J. R. Méndez, T. Bui, A. Goodman, A. Accomazzi, J. P. Naiman, J. Cranney, K. Schawinski, and UniverseTBD. Astrollama: Towards specialized foundation models in astronomy. CoRR, abs/2309.06126, 2023.
  • [217] T. T. Nguyen, T. T. Huynh, P. L. Nguyen, A. W.-C. Liew, H. Yin, and Q. V. H. Nguyen. A survey of machine unlearning, 2022.
  • [218] J. Ni, J. Li, and J. McAuley. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
  • [219] Z. Ni, H. Shi, S. Tang, L. Wei, Q. Tian, and Y. Zhuang. Revisiting catastrophic forgetting in class incremental learning. arXiv preprint arXiv:2107.12308, 2021.
  • [220] Z. Ni, L. Wei, S. Tang, Y. Zhuang, and Q. Tian. Continual vision-language representation learning with off-diagonal information. In Proceedings of the 40th International Conference on Machine Learning, pages 26129–26149, 2023.
  • [221] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and Y. Zhou. Codegen2: Lessons for training llms on programming and natural languages. ICLR, 2023.
  • [222] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. Codegen: An open large language model for code with multi-turn program synthesis. ICLR, 2023.
  • [223] H. F. Ólafsdóttir, D. Bush, and C. Barry. The role of hippocampal replay in memory and planning. Current Biology, 28(1):R37–R50, 2018.
  • [224] OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt, 2022.
  • [225] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback, 2022.
  • [226] C. Pallier, S. Dehaene, J.-B. Poline, D. LeBihan, A.-M. Argenti, E. Dupoux, and J. Mehler. Brain imaging of language plasticity in adopted adults: Can a second language replace the first? Cerebral cortex, 13(2):155–161, 2003.
  • [227] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • [228] I. Paul, J. Luo, G. Glavaš, and I. Gurevych. Ircoder: Intermediate representations make language models robust multilingual code generators, 2024.
  • [229] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023.
  • [230] A. Pentina. Theoretical foundations of multi-task lifelong learning. PhD thesis, 2016.
  • [231] E. Perkowski, R. Pan, T. D. Nguyen, Y. Ting, S. Kruk, T. Zhang, C. O’Neill, M. Jablonska, Z. Sun, M. J. Smith, H. Liu, K. Schawinski, K. Iyer, I. Ciuca, and UniverseTBD. Astrollama-chat: Scaling astrollama with conversational and diverse datasets. CoRR, abs/2401.01916, 2024.
  • [232] F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller. Language models as knowledge bases? In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China, Nov. 2019. Association for Computational Linguistics.
  • [233] J. Pourcel, N.-S. Vu, and R. M. French. Online task-free continual learning with dynamic sparse distributed memory. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, editors, Computer Vision – ECCV 2022, pages 739–756, Cham, 2022. Springer Nature Switzerland.
  • [234] A. Prabhu, H. A. Al Kader Hammoud, P. K. Dokania, P. H. Torr, S.-N. Lim, B. Ghanem, and A. Bibi. Computationally budgeted continual learning: What does matter? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3698–3707, 2023.
  • [235] A. Prabhu, Z. Cai, P. Dokania, P. Torr, V. Koltun, and O. Sener. Online continual learning without the storage constraint, 2023.
  • [236] C. Qin and S. Joty. Lfpt5: A unified framework for lifelong few-shot language learning based on prompt tuning of t5. In International Conference on Learning Representations, 2021.
  • [237] Y. Qin, C. Qian, X. Han, Y. Lin, H. Wang, R. Xie, Z. Liu, M. Sun, and J. Zhou. Recyclable tuning for continual pre-training. arXiv preprint arXiv:2305.08702, 2023.
  • [238] Y. Qin, J. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou. ELLE: Efficient lifelong pre-training for emerging data. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2789–2810, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • [239] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [240] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [241] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • [242] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • [243] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
  • [244] P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad. arXiv preprint arXiv:1806.03822, 2018.
  • [245] R. Ramesh and P. Chaudhari. Model zoo: A growing "brain" that learns continually. arXiv preprint arXiv:2106.03027, 2021.
  • [246] H. Ramsauer, B. Schäfl, J. Lehner, P. Seidl, M. Widrich, T. Adler, L. Gruber, M. Holzleitner, M. Pavlović, G. K. Sandve, V. Greiff, D. Kreil, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. Hopfield networks is all you need, 2021.
  • [247] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017.
  • [248] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • [249] M. Riemer, I. Cases, R. Ajemian, M. Liu, I. Rish, Y. Tu, and G. Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv preprint arXiv:1810.11910, 2018.
  • [250] H. Ritter, A. Botev, and D. Barber. Online structured laplace approximations for overcoming catastrophic forgetting. Advances in Neural Information Processing Systems, 31, 2018.
  • [251] J. Roberts, T. Lüddecke, S. Das, K. Han, and S. Albanie. Gpt4geo: How a language model sees the world’s geography, 2023.
  • [252] S. Rongali, A. Jagannatha, B. P. S. Rawat, and H. Yu. Continual domain-tuning for pretrained language models, 2021.
  • [253] G. D. Rosin, I. Guy, and K. Radinsky. Time masking for temporal language models. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, page 833–841, New York, NY, USA, 2022. Association for Computing Machinery.
  • [254] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve. Code llama: Open foundation models for code, 2024.
  • [255] A. N. Rubungo, C. Arnold, B. P. Rand, and A. B. Dieng. Llm-prop: Predicting physical and electronic properties of crystalline solids from their text descriptions. CoRR, abs/2310.14029, 2023.
  • [256] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
  • [257] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. Winogrande: An adversarial winograd schema challenge at scale, 2019.
  • [258] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. Bers, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multitask prompted training enables zero-shot task generalization, 2022.
  • [259] F. Sarfraz, E. Arani, and B. Zonooz. Error sensitivity modulation based experience replay: Mitigating abrupt representation drift in continual learning. arXiv preprint arXiv:2302.11344, 2023.
  • [260] J. Savelka, K. D. Ashley, M. A. Gray, H. Westermann, and H. Xu. Explaining legal concepts with augmented large language models (gpt-4), 2023.
  • [261] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.
  • [262] T. Schuster, A. Fisch, and R. Barzilay. Get your vitamin c! robust fact verification with contrastive evidence. arXiv preprint arXiv:2103.08541, 2021.
  • [263] J. Schwarz, W. Czarnecki, J. Luketina, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell. Progress & compress: A scalable framework for continual learning. In International conference on machine learning, pages 4528–4537. PMLR, 2018.
  • [264] T. Scialom, T. Chakrabarty, and S. Muresan. Fine-tuned language models are continual learners. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6107–6122, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics.
  • [265] A. Shah and S. Chava. Zero is not hero yet: Benchmarking zero-shot performance of llms for financial tasks, 2023.
  • [266] A. Shah, S. Paturi, and S. Chava. Trillion dollar words: A new financial dataset, task & market analysis, 2023.
  • [267] Y. Shao, Y. Guo, D. Zhao, and B. Liu. Class-incremental learning based on label generation. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1263–1276, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [268] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • [269] J. Shen, N. Tenenholtz, J. B. Hall, D. Alvarez-Melis, and N. Fusi. Tag-llm: Repurposing general-purpose llms for specialized domains. arXiv preprint arXiv:2402.05140, 2024.
  • [270] H. Shi and H. Wang. A unified approach to domain incremental learning with memory: Theory and algorithm. Advances in Neural Information Processing Systems, 36, 2024.
  • [271] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards vqa models that can read, 2019.
  • [272] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
  • [273] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal, M. Schaekermann, A. Wang, M. Amin, S. Lachgar, P. A. Mansfield, S. Prakash, B. Green, E. Dominowska, B. A. y Arcas, N. Tomasev, Y. Liu, R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral, D. R. Webster, G. S. Corrado, Y. Matias, S. Azizi, A. Karthikesalingam, and V. Natarajan. Towards expert-level medical question answering with large language models. CoRR, abs/2305.09617, 2023.
  • [274] A. Sinitsin, V. Plokhotnyuk, D. Pyrkin, S. Popov, and A. Babenko. Editable neural networks. arXiv preprint arXiv:2004.00345, 2020.
  • [275] J. S. Smith, J. Tian, S. Halbe, Y.-C. Hsu, and Z. Kira. A closer look at rehearsal-free continual learning, 2023.
  • [276] L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Dumas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, A. Ravichander, K. Richardson, Z. Shen, E. Strubell, N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer, N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld, J. Dodge, and K. Lo. Dolma: an open corpus of three trillion tokens for language model pretraining research, 2024.
  • [277] C. Song, X. Han, Z. Zeng, K. Li, C. Chen, Z. Liu, M. Sun, and T. Yang. Conpet: Continual parameter-efficient tuning for large language models, 2023.
  • [278] D. Song, H. Guo, Y. Zhou, S. Xing, Y. Wang, Z. Song, W. Zhang, Q. Guo, H. Yan, X. Qiu, and D. Lin. Code needs comments: Enhancing code llms with comment augmentation, 2024.
  • [279] P. Sprechmann, S. M. Jayakumar, J. W. Rae, A. Pritzel, A. P. Badia, B. Uria, O. Vinyals, D. Hassabis, R. Pascanu, and C. Blundell. Memory-based parameter adaptation. In International Conference on Learning Representations, 2018.
  • [280] Z. Su, J. Li, Z. Zhang, Z. Zhou, and M. Zhang. Efficient continue training of temporal language model with structural information. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6315–6329, Singapore, Dec. 2023. Association for Computational Linguistics.
  • [281] Q. Sun, Z. Chen, F. Xu, K. Cheng, C. Ma, Z. Yin, J. Wang, C. Han, R. Zhu, S. Yuan, Q. Guo, X. Qiu, P. Yin, X. Li, F. Yuan, L. Kong, X. Li, and Z. Wu. A survey of neural code intelligence: Paradigms, advances and beyond, 2024.
  • [282] Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang. Ernie 2.0: A continual pre-training framework for language understanding. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8968–8975, Apr. 2020.
  • [283] K. Takahashi, T. Omi, K. Arima, and T. Ishigaki. Pretraining and updating language- and domain-specific large language model: A case study in japanese business domain. arXiv preprint arXiv:2404.08262, 2024.
  • [284] M. Tao, Y. Feng, and D. Zhao. Can bert refrain from forgetting on sequential tasks? A probing study. In The Eleventh International Conference on Learning Representations, 2022.
  • [285] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 2023.
  • [286] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic. Galactica: A large language model for science. CoRR, abs/2211.09085, 2022.
  • [287] DeepSeek-AI Team. Deepseek llm: Scaling open-source language models with longtermism, 2024.
  • [288] Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • [289] S. Team. Starcoder: may the source be with you!, 2023.
  • [290] S. Team. Starcoder 2 and the stack v2: The next generation, 2024.
  • [291] V. Thengane, S. Khan, M. Hayat, and F. Khan. Clip model is an efficient continual learner, 2022.
  • [292] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal. Fever: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355, 2018.
  • [293] D. Thulke, Y. Gao, P. Pelser, R. Brune, R. Jalota, F. Fok, M. Ramos, I. van Wyk, A. Nasir, H. Goldstein, et al. Climategpt: Towards ai synthesizing interdisciplinary research on climate change. arXiv preprint arXiv:2401.09646, 2024.
  • [294] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [295] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [296] G. M. Van de Ven, T. Tuytelaars, and A. S. Tolias. Three types of incremental learning. Nature Machine Intelligence, 4(12):1185–1197, 2022.
  • [297] E. Verwimp, R. Aljundi, S. Ben-David, M. Bethge, A. Cossu, A. Gepperth, T. L. Hayes, E. Hüllermeier, C. Kanan, D. Kudithipudi, C. H. Lampert, M. Mundt, R. Pascanu, A. Popescu, A. S. Tolias, J. van de Weijer, B. Liu, V. Lomonaco, T. Tuytelaars, and G. M. van de Ven. Continual learning: Applications and the road forward, 2024.
  • [298] M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization, pages 59–63, 2017.
  • [299] B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021.
  • [300] C. Wang, D. Engler, X. Li, J. Hou, D. J. Wald, K. Jaiswal, and S. Xu. Near-real-time earthquake-induced fatality estimation using crowdsourced data and large-language models, 2023.
  • [301] L. Wang, X. Zhang, Q. Li, J. Zhu, and Y. Zhong. Coscl: Cooperation of small continual learners is stronger than a big one. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, pages 254–271. Springer, 2022.
  • [302] L. Wang, X. Zhang, H. Su, and J. Zhu. A comprehensive survey of continual learning: Theory, method and application. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–20, 2024.
  • [303] N. Wang, H. Yang, and C. D. Wang. Fingpt: Instruction tuning benchmark for open-source large language models in financial datasets. CoRR, abs/2310.04793, 2023.
  • [304] P. Wang, Z. Li, N. Zhang, Z. Xu, Y. Yao, Y. Jiang, P. Xie, F. Huang, and H. Chen. Wise: Rethinking the knowledge memory for lifelong model editing of large language models. arXiv preprint arXiv:2405.14768, 2024.
  • [305] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, J. Ji, G. Cao, D. Jiang, and M. Zhou. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 1405–1418, Online, Aug. 2021. Association for Computational Linguistics.
  • [306] X. Wang, T. Chen, Q. Ge, H. Xia, R. Bao, R. Zheng, Q. Zhang, T. Gui, and X. Huang. Orthogonal subspace learning for language model continual learning. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10658–10671, Singapore, Dec. 2023. Association for Computational Linguistics.
  • [307] X. Wang, Y. Zhang, T. Chen, S. Gao, S. Jin, X. Yang, Z. Xi, R. Zheng, Y. Zou, T. Gui, Q. Zhang, and X. Huang. Trace: A comprehensive benchmark for continual learning in large language models, 2023.
  • [308] Y. Wang, H. Le, A. D. Gotmare, N. D. Q. Bui, J. Li, and S. C. H. Hoi. Codet5+: Open code large language models for code understanding and generation, 2023.
  • [309] Y. Wang, Y. Liu, C. Shi, H. Li, C. Chen, H. Lu, and Y. Yang. Inscl: A data-efficient continual learning paradigm for fine-tuning large language models with instructions, 2024.
  • [310] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, E. Pathak, G. Karamanolakis, H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, M. Patel, K. K. Pal, M. Moradshahi, M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, P. Verma, R. S. Puri, R. Karia, S. K. Sampat, S. Doshi, S. Mishra, S. Reddy, S. Patro, T. Dixit, X. Shen, C. Baral, Y. Choi, N. A. Smith, H. Hajishirzi, and D. Khashabi. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks, 2022.
  • [311] Y. Wang, W. Wang, S. Joty, and S. C. Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In EMNLP, 2021.
  • [312] Z. Wang, C.-L. Li, V. Perot, L. T. Le, J. Miao, Z. Zhang, C.-Y. Lee, and T. Pfister. Codeclm: Aligning language models with tailored synthetic data. arXiv preprint arXiv:2404.05875, 2024.
  • [313] Z. Wang, L. Liu, Y. Kong, J. Guo, and D. Tao. Online continual learning with contrastive vision transformer. In European Conference on Computer Vision, pages 631–650. Springer, 2022.
  • [314] Z. Wang, Z. Zhan, Y. Gong, G. Yuan, W. Niu, T. Jian, B. Ren, S. Ioannidis, Y. Wang, and J. Dy. Sparcl: Sparse continual learning on the edge. Advances in Neural Information Processing Systems, 35:20366–20380, 2022.
  • [315] Z. Wang, Z. Zhang, S. Ebrahimi, R. Sun, H. Zhang, C.-Y. Lee, X. Ren, G. Su, V. Perot, J. Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. European Conference on Computer Vision, 2022.
  • [316] Z. Wang, Z. Zhang, C.-Y. Lee, H. Zhang, R. Sun, X. Ren, G. Su, V. Perot, J. Dy, and T. Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022.
  • [317] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • [318] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners, 2022.
  • [319] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
  • [320] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • [321] Y. Wei, Z. Wang, J. Liu, Y. Ding, and L. Zhang. Magicoder: Source code is all you need, 2023.
  • [322] M. Weyssow, X. Zhou, K. Kim, D. Lo, and H. Sahraoui. On the usage of continual learning for out-of-distribution generalization in pre-trained language models of code. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, page 1470–1482, New York, NY, USA, 2023. Association for Computing Machinery.
  • [323] G. Winata, L. Xie, K. Radhakrishnan, S. Wu, X. Jin, P. Cheng, M. Kulkarni, and D. Preotiuc-Pietro. Overcoming catastrophic forgetting in massively multilingual continual learning. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 768–777, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [324] M. Wistuba, P. T. Sivaprasad, L. Balles, and G. Zappella. Continual learning with low rank adaptation. In NeurIPS 2023 Workshop on Distribution Shifts (DistShifts), 2023.
  • [325] M. Wistuba, P. T. Sivaprasad, L. Balles, and G. Zappella. Continual learning with low rank adaptation, 2023.
  • [326] C. Wu, Y. Gan, Y. Ge, Z. Lu, J. Wang, Y. Feng, P. Luo, and Y. Shan. Llama pro: Progressive llama with block expansion, 2024.
  • [327] C. Wu, W. Lin, X. Zhang, Y. Zhang, Y. Wang, and W. Xie. Pmc-llama: Towards building open-source language models for medicine. arXiv preprint arXiv:2305.10415, 2023.
  • [328] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann, P. Kambadur, D. S. Rosenberg, and G. Mann. Bloomberggpt: A large language model for finance. CoRR, abs/2303.17564, 2023.
  • [329] T. Wu, M. Caccia, Z. Li, Y.-F. Li, G. Qi, and G. Haffari. Pretrained language model in continual learning: A comparative study. In International conference on learning representations, 2021.
  • [330] T. Wu, L. Luo, Y.-F. Li, S. Pan, T.-T. Vu, and G. Haffari. Continual learning for large language models: A survey, 2024.
  • [331] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, and L. He. A survey of human-in-the-loop for machine learning. Future Generation Computer Systems, 135:364–381, 2022.
  • [332] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 374–382, 2019.
  • [333] Y. Wu, G. Wayne, A. Graves, and T. Lillicrap. The kanerva machine: A generative distributed memory. arXiv preprint arXiv:1804.01756, 2018.
  • [334] Z. Wu, Z. Weng, W. Peng, X. Yang, A. Li, L. S. Davis, and Y.-G. Jiang. Building an open-vocabulary video CLIP model with better architectures, optimization and data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [335] C. Xiao, X. Hu, Z. Liu, C. Tu, and M. Sun. Lawformer: A pre-trained language model for chinese legal long documents, 2021.
  • [336] J. Xie, Y. Liang, J. Liu, Y. Xiao, B. Wu, and S. Ni. Quert: Continual pre-training of language model for query understanding in travel domain search. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’23, page 5282–5291, New York, NY, USA, 2023. Association for Computing Machinery.
  • [337] Q. Xie, Q. Chen, A. Chen, C. Peng, Y. Hu, F. Lin, X. Peng, J. Huang, J. Zhang, V. Keloth, et al. Me llama: Foundation large language models for medical applications. arXiv preprint arXiv:2402.12749, 2024.
  • [338] Q. Xie, W. Han, X. Zhang, Y. Lai, M. Peng, A. Lopez-Lira, and J. Huang. PIXIU: A large language model, instruction data and evaluation benchmark for finance. CoRR, abs/2306.05443, 2023.
  • [339] S. M. Xie, S. Santurkar, T. Ma, and P. S. Liang. Data selection for language models via importance resampling. Advances in Neural Information Processing Systems, 36, 2024.
  • [340] T. Xie, Y. Wan, W. Huang, Z. Yin, Y. Liu, S. Wang, Q. Linghu, C. Kit, C. Grazian, W. Zhang, I. Razzak, and B. Hoex. DARWIN series: Domain specific large language models for natural science. CoRR, abs/2308.13565, 2023.
  • [341] Y. Xie, K. Aggarwal, and A. Ahmad. Efficient continual pre-training for building domain specific large language models, 2023.
  • [342] H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, Q. Wang, and D. Shen. Doctorglm: Fine-tuning your chinese doctor is not a herculean task. arXiv preprint arXiv:2304.01097, 2023.
  • [343] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  • [344] H. Xu, B. Liu, L. Shu, and P. S. Yu. Bert post-training for review reading comprehension and aspect-based sentiment analysis, 2019.
  • [345] S. Xue, F. Zhou, Y. Xu, H. Zhao, S. Xie, Q. Dai, C. Jiang, J. Zhang, J. Zhou, D. Xiu, and H. Mei. Weaverbird: Empowering financial decision-making with large language model, knowledge base, and search engine. CoRR, abs/2308.05361, 2023.
  • [346] Y. Yan, K. Xue, X. Shi, Q. Ye, J. Liu, and T. Ruan. Af adapter: Continual pretraining for building chinese biomedical language model. In 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 953–957, Los Alamitos, CA, USA, Dec. 2023. IEEE Computer Society.
  • [347] G. Yang, F. Pan, and W.-B. Gan. Stably maintained dendritic spines are associated with lifelong memories. Nature, 462(7275):920–924, 2009.
  • [348] P. Yang, D. Li, and P. Li. Continual learning for natural language generations with transformer calibration. In A. Fokkens and V. Srikumar, editors, Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 40–49, Abu Dhabi, United Arab Emirates (Hybrid), Dec. 2022. Association for Computational Linguistics.
  • [349] S. Yang, M. A. Ali, C.-L. Wang, L. Hu, and D. Wang. Moral: Moe augmented lora for llms’ lifelong learning, 2024.
  • [350] X. Yang, J. Gao, W. Xue, and E. Alexandersson. Pllama: An open-source large language model for plant science. CoRR, abs/2401.01600, 2024.
  • [351] Y. Yang, M. Jones, M. C. Mozer, and M. Ren. Reawakening knowledge: Anticipatory recovery from catastrophic interference via structured training, 2024.
  • [352] Y. Yang, Y. Tang, and K. Y. Tam. Investlm: A large language model for investment using financial domain instruction tuning. CoRR, abs/2309.13064, 2023.
  • [353] Y. Yang, J. Zhou, X. Ding, T. Huai, S. Liu, Q. Chen, L. He, and Y. Xie. Recent advances of foundation language models-based continual learning: A survey. arXiv preprint arXiv:2405.18653, 2024.
  • [354] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [355] Ç. Yıldız, N. K. Ravichandran, P. Punia, M. Bethge, and B. Ermis. Investigating continual pretraining in large language models: Insights and implications. arXiv preprint arXiv:2402.17400, 2024.
  • [356] J. Yin, S. Dash, F. Wang, and M. Shankar. FORGE: pre-training open foundation models for science. In D. Arnold, R. M. Badia, and K. M. Mohror, editors, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2023, Denver, CO, USA, November 12-17, 2023, pages 81:1–81:13. ACM, 2023.
  • [357] W. Yin, J. Li, and C. Xiong. ConTinTin: Continual learning from task instructions. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3062–3072, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • [358] F. Yu, A. Gao, and B. Wang. Outcome-supervised verifiers for planning in mathematical reasoning. CoRR, abs/2311.09724, 2023.
  • [359] J. Yu, Y. Zhuge, L. Zhang, D. Wang, H. Lu, and Y. He. Boosting continual learning of vision-language models via mixture-of-experts adapters. arXiv preprint arXiv:2403.11549, 2024.
  • [360] L. Yu, Q. Chen, J. Zhou, and L. He. Melo: Enhancing model editing with neuron-indexed dynamic lora. arXiv preprint arXiv:2312.11795, 2023.
  • [361] Y.-C. Yu, C.-P. Huang, J.-J. Chen, K.-P. Chang, Y.-H. Lai, F.-E. Yang, and Y.-C. F. Wang. Select and distill: Selective dual-teacher knowledge transfer for continual learning on vision-language models. arXiv preprint arXiv:2403.09296, 2024.
  • [362] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911, 2020.
  • [363] W. Yuan, Q. Zhang, T. He, C. Fang, N. Q. V. Hung, X. Hao, and H. Yin. Circle: continual repair across programming languages. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022, page 678–690, New York, NY, USA, 2022. Association for Computing Machinery.
  • [364] S. Yue, W. Chen, S. Wang, B. Li, C. Shen, S. Liu, Y. Zhou, Y. Xiao, S. Yun, W. Lin, et al. Disc-lawllm: Fine-tuning large language models for intelligent legal services. arXiv preprint arXiv:2309.11325, 2023.
  • [365] X. Yue, X. Qu, G. Zhang, Y. Fu, W. Huang, H. Sun, Y. Su, and W. Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
  • [366] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi. Defending against neural fake news. Advances in neural information processing systems, 32, 2019.
  • [367] F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence, 2017.
  • [368] Y. Zhai, S. Tong, X. Li, M. Cai, Q. Qu, Y. J. Lee, and Y. Ma. Investigating the catastrophic forgetting in multimodal large language models, 2023.
  • [369] D. Zhang, Z. Hu, S. Zhoubian, Z. Du, K. Yang, Z. Wang, Y. Yue, Y. Dong, and J. Tang. Sciglm: Training scientific language models with self-reflective instruction annotation and tuning. CoRR, abs/2401.07950, 2024.
  • [370] H. Zhang, L. Gui, Y. Zhai, H. Wang, Y. Lei, and R. Xu. Copf: Continual learning human preference through optimal policy fitting. arXiv preprint arXiv:2310.15694, 2023.
  • [371] H. Zhang, Y. Lei, L. Gui, M. Yang, Y. He, H. Wang, and R. Xu. Cppo: Continual learning for reinforcement learning with human feedback. In The Twelfth International Conference on Learning Representations, 2024.
  • [372] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, and G. Wang. Instruction tuning for large language models: A survey, 2024.
  • [373] X. Zhang, C. Tian, X. Yang, L. Chen, Z. Li, and L. R. Petzold. Alpacare: Instruction-tuned large language models for medical application. arXiv preprint arXiv:2310.14558, 2023.
  • [374] X. Zhang and Q. Yang. Xuanyuan 2.0: A large chinese financial chat model with hundreds of billions parameters. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM ’23, page 4435–4439, New York, NY, USA, 2023. Association for Computing Machinery.
  • [375] X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.
  • [376] Y. Zhang, X. Wang, and D. Yang. Continual sequence generation with adaptive compositional modules. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3653–3667, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • [377] Z. Zhang, M. Fang, L. Chen, and M.-R. Namazi-Rad. CITB: A benchmark for continual instruction tuning. In H. Bouamor, J. Pino, and K. Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9443–9455, Singapore, Dec. 2023. Association for Computational Linguistics.
  • [378] C. Zhao, Y. Li, and C. Caragea. C-STANCE: A large dataset for Chinese zero-shot stance detection. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13369–13385, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • [379] H. Zhao, H. Han, J. Shi, C. Du, J. Liang, and Y. Xiao. Large language model can continue evolving from mistakes. arXiv preprint arXiv:2404.08707, 2024.
  • [380] H. Zhao, S. Liu, C. Ma, H. Xu, J. Fu, Z.-H. Deng, L. Kong, and Q. Liu. GIMLET: A unified graph-text model for instruction-based molecule zero-shot learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [381] H. Zhao, H. Wang, Y. Fu, F. Wu, and X. Li. Memory-efficient class-incremental learning for image classification. IEEE Transactions on Neural Networks and Learning Systems, 33(10):5966–5977, 2022.
  • [382] S. Zhao, X. Zou, T. Yu, and H. Xu. Reconstruct before query: Continual missing modality learning with decomposed prompt collaboration, 2024.
  • [383] W. Zhao, S. Wang, Y. Hu, Y. Zhao, B. Qin, X. Zhang, Q. Yang, D. Xu, and W. Che. Sapt: A shared attention framework for parameter-efficient continual learning of large language models, 2024.
  • [384] J. Zheng, Q. Ma, Z. Liu, B. Wu, and H. Feng. Beyond anti-forgetting: Multimodal continual instruction tuning with positive forward transfer, 2024.
  • [385] J. Zheng, S. Qiu, and Q. Ma. Learn or recall? Revisiting incremental learning with pre-trained language models, 2023.
  • [386] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li, T. Su, Z. Yang, and J. Tang. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x, 2023.
  • [387] Z. Zheng, M. Ma, K. Wang, Z. Qin, X. Yue, and Y. You. Preventing zero-shot transfer degradation in continual learning of vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19125–19136, 2023.
  • [388] Z. Zheng, J. Zhang, T. Vu, S. Diao, Y. H. W. Tim, and S. Yeung. Marinegpt: Unlocking secrets of ocean to the public. CoRR, abs/2310.13596, 2023.
  • [389] B. Zhou, D. Khashabi, Q. Ning, and D. Roth. "going on a vacation" takes longer than "going for a walk": A study of temporal commonsense understanding, 2019.
  • [390] W. Zhou, D.-H. Lee, R. K. Selvam, S. Lee, B. Y. Lin, and X. Ren. Pre-training text-to-text transformers for concept-centric common sense. In International Conference on Learning Representations, 2021.
  • [391] D. Zhu, Z. Sun, Z. Li, T. Shen, K. Yan, S. Ding, K. Kuang, and C. Wu. Model tailor: Mitigating catastrophic forgetting in multi-modal large language models, 2024.
  • [392] T. Y. Zhuo, A. Zebaze, N. Suppattarachai, L. von Werra, H. de Vries, Q. Liu, and N. Muennighoff. Astraios: Parameter-efficient instruction tuning code large language models, 2024.