1 Two Sea Changes in Natural Language Processing
Fully supervised learning, where a task-specific model is trained solely on a dataset of input–output examples for the target task, has long played a central role in many machine learning tasks [
60], and
natural language processing (NLP) was no exception. Because such manually annotated datasets are ever-insufficient for learning high-quality models, early NLP models relied heavily on
feature engineering (Table
1(a); e.g., Guyon et al. [
39], Lafferty et al. [
63], Och et al. [
92], Zhang and Nivre [
150]), where NLP researchers or engineers used their domain knowledge to define and extract salient features from raw data and provide models with the appropriate inductive bias to learn from this limited data. With the advent of neural network models for NLP, salient features were learned jointly with the training of the model itself [
6,
16], and hence focus shifted to
architecture engineering, where inductive bias was rather provided through the design of a suitable network architecture conducive to learning such features (Table
1(b); e.g., Bahdanau et al. [
4], Chung et al. [
15], Hochreiter and Schmidhuber [
44], Kalchbrenner et al. [
54], Kim [
57], Vaswani et al. [
137]).
However, from 2017 to 2019 there was a sea change in the learning of NLP models, and this fully supervised paradigm is now playing an ever-shrinking role. Specifically, the standard shifted to the
pre-train and fine-tune paradigm (Table
1(c); e.g., Dong et al. [
22], Lewis et al. [
69], Peters et al. [
97], Radford et al. [
104], Yang et al. [
143]). In this paradigm, a model with a fixed
architecture is
pre-trained as a
language model (LM),
predicting the probability of observed textual data. Because the raw textual data necessary to train LMs is available in abundance, these LMs can be trained on large datasets, in the process learning robust general-purpose features of the language they model. This pre-trained LM is then adapted to different downstream tasks by introducing additional parameters and
fine-tuning them using task-specific objective functions. Within this paradigm, the focus turned mainly to
objective engineering, designing the training objectives used at both the pre-training and fine-tuning stages. For example, Zhang et al. [
148] show that introducing a loss function for predicting salient sentences from a document leads to a better pre-trained LM for text summarization. Notably, the main body of the pre-trained LM is generally (but not always; Peters et al. [
98]) fine-tuned as well to make it more suitable for solving the downstream task.
Now, as of this writing in 2021, we are in the middle of a second sea change, in which the “pre-train, fine-tune” procedure is replaced by one we dub “
pre-train, prompt, and predict.” In this paradigm, instead of adapting pre-trained LMs to downstream tasks via objective engineering, downstream tasks are reformulated to look more like those solved during the original LM training with the help of a textual
prompt. For example, when recognizing the emotion of a social media post, “I missed the bus today,” we may continue with the prompt “I felt so ” and ask the LM to fill in the blank with an emotion-bearing word. Or if we choose the prompt “English: I missed the bus today. French: ”, then an LM may be able to fill in the blank with a French translation. In this way, by selecting appropriate prompts we can manipulate the model's behavior so that the pre-trained LM itself can be used to
predict the desired output, sometimes even without any additional task-specific training (Table
1(d); e.g., Brown et al. [
9], Petroni et al. [
100], Radford et al. [
105], Schick and Schütze [
120]). The advantage of this method is that, given a suite of appropriate prompts, a single LM trained in an entirely unsupervised fashion can be used to solve a great number of tasks [
9,
131]. However, as with most conceptually enticing prospects, there is a catch—this method introduces the necessity for
prompt engineering, finding the most appropriate prompt to allow an LM to solve the task at hand.
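The prompting workflow just described can be sketched in a few lines. In this toy illustration, `toy_lm_fill` is a hypothetical stand-in for a real pre-trained LM's most probable completion of a blank, and the answer-to-label mapping is an invented example of a verbalizer:

```python
# Minimal sketch of "pre-train, prompt, and predict" for emotion/sentiment
# recognition. toy_lm_fill is a hypothetical stand-in for a pre-trained LM
# that fills a blank with its most probable word.
def toy_lm_fill(text_with_blank):
    # A real LM would rank vocabulary words by probability at the blank;
    # we hard-code a plausible completion purely for illustration.
    return "sad" if "missed the bus" in text_with_blank else "great"

# Map emotion-bearing answer words back to task labels.
ANSWER_TO_LABEL = {"sad": "negative", "angry": "negative", "great": "positive"}

def predict_emotion(post):
    prompt = post + " I felt so ___."       # append the textual prompt
    answer = toy_lm_fill(prompt)            # let the LM fill the blank
    return ANSWER_TO_LABEL.get(answer, "neutral")

print(predict_emotion("I missed the bus today."))  # -> negative
```

No task-specific training occurs here: the task is solved entirely by choosing the prompt and the answer mapping.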
This survey attempts to organize the current state of knowledge in this rapidly developing field by providing an overview and formal definition of prompting methods (Section
2). This is followed by in-depth discussion of prompting methods from basics such as prompt template engineering (Section
3) and prompt answer engineering (Section
4) to more advanced concepts such as multi-prompt learning methods (Section
5) and prompt-aware training methods (Section
6). We then organize the various applications to which prompt-based learning methods have been applied and discuss how they interact with the choice of prompting method (Section
7). Finally, we attempt to situate the current state of prompting methods in the research ecosystem, making connections to other research fields (Section
8) and suggesting some current challenging problems that may be ripe for further research (Section
9).
Finally, to help beginners who are interested in this field learn more effectively, we highlight some systematic resources about prompt learning (as well as pre-training) provided both within this survey and on companion websites:
•
A
website of prompt-based learning that contains frequent updates to this survey, related slides, and so on.
•
Figure
1: A typology of important concepts for prompt-based learning.
•
Tables 7 and 8: A systematic and comprehensive comparison among different prompting methods.
•
Table
5: An organization of commonly-used prompts.
•
Table
4: A timeline of prompt-based research works.
•
Table
1: A systematic and comprehensive comparison among different pre-trained LMs.
3 Prompt Template Engineering
Prompt template engineering is the process of creating a prompting function
\(f_{\text{prompt}}(\mathbf {x})\) that results in the most effective performance on the downstream task. In many previous works, this has involved human engineers or algorithms searching for the best template for each task the model is expected to perform. As shown in the “Prompt Template Engineering” section of Figure
1, one must first consider the
prompt shape and then decide whether to take a
manual or
automated approach to create prompts of the desired shape, as detailed below.
3.1 Prompt Shape
As noted above, there are two main varieties of prompts:
cloze prompts [
17,
100], which fill in the blanks of a textual string (e.g., “I love this movie, it is a
[Z] movie”), and
prefix prompts [
67,
71], which continue a string prefix (e.g., “I love this movie. What’s the sentiment of the review?
[Z]”). Which one is chosen will depend both on the task and on the model being used to solve the task. In general, for generation tasks, or tasks solved using a standard auto-regressive LM, prefix prompts tend to be more suitable, as they mesh well with the left-to-right nature of the model. For tasks solved using masked LMs, cloze prompts are a good fit, as they very closely match the form of the pre-training task. Full text reconstruction models are more versatile and can be used with either cloze or prefix prompts. Finally, for some tasks involving multiple inputs, such as
text pair classification, prompt templates must contain space for two inputs,
[X1] and
[X2], or more.
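Concretely, a template is just a string with slots: the prompting function fills only the input slots, while [Z] is left for the model to predict. The helper below is a toy sketch using the slot notation from the text:

```python
# Toy sketch of filling prompt templates. [X]/[X1]/[X2] mark input slots
# and [Z] marks the answer slot, following the notation used in the text.
def fill_inputs(template, **inputs):
    for name, value in inputs.items():
        template = template.replace("[" + name + "]", value)
    return template  # [Z] is intentionally left for the LM to fill

cloze  = "[X] Overall, it was a [Z] movie."
prefix = "[X] What's the sentiment of the review? [Z]"
pair   = "[X1] ? [Z] , [X2]"  # a two-input template, e.g., for text pair tasks

print(fill_inputs(cloze, X="I love this movie."))
# -> "I love this movie. Overall, it was a [Z] movie."
```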
3.2 Manual Template Engineering
Perhaps the most natural way to create prompts is to manually create intuitive templates based on human introspection. For example, the seminal LAMA dataset [
100] provides manually created cloze templates to probe knowledge in LMs. Brown et al. [
9] create manually crafted prefix prompts to handle a wide variety of tasks, including question answering, translation, and probing tasks for common sense reasoning. Schick and Schütze [
118] and Schick and Schütze [
117],
120] use pre-defined templates in a few-shot learning setting on text classification and conditional text generation tasks.
3.3 Automated Template Learning
While the strategy of manually crafting templates is intuitive and does allow solving various tasks with some degree of accuracy, there are also several issues with this approach: (1) Creating and experimenting with these prompts is an art that takes time and experience, particularly for some complicated tasks such as semantic parsing [
124]; and (2) even experienced prompt designers may fail to manually discover optimal prompts [
52].
To address these problems, a number of methods have been proposed to automate the template design process. In particular, the automatically induced prompts can be further separated into discrete prompts, where the prompt is an actual text string, and continuous prompts, where the prompt is instead described directly in the embedding space of the underlying LM.
One other orthogonal design consideration is whether the prompting function \(f_{\text{prompt}}(\mathbf {x})\) is static, using essentially the same prompt template for each input, or dynamic, generating a custom template for each input. Both static and dynamic strategies have been used for different varieties of discrete and continuous prompts, as we will mention below.
3.3.1 Discrete Prompts.
Works on discovering discrete prompts (a.k.a. hard prompts) automatically search for templates described in a discrete space, usually corresponding to natural language phrases. We detail several methods that have been proposed for this below.
•
D1: Prompt Mining. Jiang et al. [
52]’s
Mine approach is a mining-based method to automatically find templates given a set of training inputs
\(\mathbf {x}\) and outputs
\(\mathbf {y}\) . This method scrapes a large text corpus (e.g., Wikipedia) for strings containing
\(\mathbf {x}\) and
\(\mathbf {y}\) , and finds either the
middle words or
dependency paths between the inputs and outputs. Frequent middle words or dependency paths can serve as a template as in “
[X] middle words
[Z].”
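A minimal sketch of the middle-word variant of this mining step follows; the three-sentence corpus and the input–output pairs are fabricated for illustration, standing in for a large corpus such as Wikipedia:

```python
import re
from collections import Counter

# Toy sketch of mining-based template discovery: scan a corpus for strings
# containing both x and y, count the words appearing between them, and
# promote the most frequent middle to a template. Corpus/pairs are invented.
corpus = [
    "Paris is the capital of France .",
    "Tokyo is the capital of Japan .",
    "Berlin remains the capital of Germany .",
]
pairs = [("Paris", "France"), ("Tokyo", "Japan"), ("Berlin", "Germany")]

middles = Counter()
for sent in corpus:
    for x, y in pairs:
        m = re.search(re.escape(x) + r"\s+(.+?)\s+" + re.escape(y), sent)
        if m:
            middles[m.group(1)] += 1

middle, _ = middles.most_common(1)[0]
template = "[X] " + middle + " [Z]."
print(template)  # -> "[X] is the capital of [Z]."
```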
•
D2: Prompt Paraphrasing. Paraphrasing-based approaches take in an existing seed prompt (e.g., manually constructed or mined), paraphrase it into a set of other candidate prompts, and then select the one that achieves the highest training accuracy on the target task. This paraphrasing can be done in a number of ways, including round-trip translation of the prompt into another language and back [
52], using replacement of phrases from a thesaurus [
147], or using a neural prompt rewriter specifically optimized to improve accuracy of systems using the prompt [
43]. Notably, Haviv et al. [
43] perform paraphrasing
after the input
\(\mathbf {x}\) has been inserted into the prompt template, allowing a different paraphrase to be generated for each individual input.
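The candidate-generation and selection loop can be sketched as follows; the thesaurus and the accuracy scores below are fabricated stand-ins (a real system would measure the LM's accuracy with each candidate template on training data):

```python
# Sketch of paraphrasing-based prompt search: expand a seed template with
# thesaurus substitutions, then keep the candidate scoring highest on
# training data. Thesaurus and scores are fabricated for illustration.
THESAURUS = {"movie": ["film", "picture"]}

def paraphrase(template):
    candidates = [template]
    for word, synonyms in THESAURUS.items():
        if word in template:
            candidates.extend(template.replace(word, syn) for syn in synonyms)
    return candidates

# Stand-in for measured training accuracy of each candidate template.
TOY_ACCURACY = {"[X] It was a great film. [Z]": 0.9}

seed = "[X] It was a great movie. [Z]"
best = max(paraphrase(seed), key=lambda t: TOY_ACCURACY.get(t, 0.5))
print(best)  # -> "[X] It was a great film. [Z]"
```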
•
D3: Gradient-based Search. Wallace et al. [
138] apply a gradient-based search over actual tokens to find short sequences that can trigger the underlying pre-trained LM to generate the desired target prediction. The search proceeds iteratively, stepping through the tokens in the prompt. Building on this method, Shin et al. [
125] automatically search for template tokens using training samples from the downstream application and demonstrate strong performance in prompting scenarios.
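The iterative, position-by-position structure of such trigger searches can be sketched without gradients; here the gradient-guided candidate selection is replaced by exhaustive scoring of a tiny fabricated vocabulary against a toy objective (everything below is illustrative, not any cited method's actual implementation):

```python
# Sketch of iterative trigger-token search: step through prompt positions,
# trying replacement tokens and keeping the best. Real gradient-based
# methods pick candidates via gradients; here a toy objective stands in
# for the LM's probability of the desired target prediction.
VOCAB = ["movie", "terrible", "wonderful", "the", "absolutely"]

def toy_score(trigger_tokens):
    # Stand-in for how strongly the trigger elicits the target prediction.
    wanted = {"absolutely", "wonderful"}
    return sum(1 for t in trigger_tokens if t in wanted)

trigger = ["the", "the", "the"]            # initial trigger sequence
for _ in range(3):                         # a few refinement passes
    for i in range(len(trigger)):          # step through token positions
        trigger[i] = max(VOCAB,
                         key=lambda tok: toy_score(trigger[:i] + [tok] + trigger[i+1:]))
print(trigger)  # -> ['wonderful', 'wonderful', 'wonderful']
```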
•
D4: Prompt Generation. Other works treat the generation of prompts as a text generation task and use standard natural language generation models to perform this task. For example, Gao et al. [
32] introduce the seq2seq pre-trained LM T5 into the template search process. Since T5 has been pre-trained on a task of filling in missing spans, they use T5 to generate template tokens by (1) specifying the position to insert template tokens within a template
and (2) providing training samples for T5 to decode template tokens. Guo et al. [
36] use reinforcement learning [
132] to generate prompts to control the text generation process. Ben-David et al. [
5] propose a domain adaptation algorithm that trains T5 to generate unique
domain relevant features (DRFs) (a set of keywords that characterize domain information) for each input. These DRFs can then be concatenated with the input to form a template for use by downstream tasks.
•
D5: Prompt Scoring. Davison et al. [
19] investigate the task of knowledge base completion and design a template for an input (head-relation-tail triple) using LMs. They first hand-craft a set of templates as potential candidates and fill the input and answer slots to form a filled prompt. They then use a unidirectional LM to score those filled prompts, selecting the one with the highest LM probability. This results in a custom template for each individual input.
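The scoring step can be sketched with a toy language model; the add-one-smoothed unigram model and the fabricated corpus below stand in for a real unidirectional LM, and the candidate templates are invented:

```python
import math

# Sketch of prompt scoring: fill candidate templates with the same
# (head, tail) input and keep the filled prompt the language model scores
# highest. The "LM" is a toy add-one-smoothed unigram model over a
# fabricated corpus, standing in for a real unidirectional LM.
corpus = ("paris is the capital city of france and "
          "tokyo is the capital of japan").split()
counts = {w: corpus.count(w) for w in set(corpus)}

def avg_log_prob(sentence):
    words = sentence.lower().split()
    denom = len(corpus) + len(counts)  # add-one smoothing
    # Length-normalize so shorter prompts are not trivially favored.
    return sum(math.log((counts.get(w, 0) + 1) / denom) for w in words) / len(words)

head, tail = "Paris", "France"
candidates = [
    head + " is the capital of " + tail,   # fluent template
    head + " blorp zixy " + tail,          # nonsense template
]
best = max(candidates, key=avg_log_prob)
print(best)  # -> "Paris is the capital of France"
```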
3.3.2 Continuous Prompts.
Because the purpose of prompt construction is to find a method that allows an LM to effectively perform a task, rather than being for human consumption, it is not necessary to limit the prompt to human-interpretable natural language. Because of this, there are also methods that examine continuous prompts (a.k.a. soft prompts), which perform prompting directly in the embedding space of the model. Specifically, continuous prompts remove two constraints: (1) they relax the constraint that the embeddings of template words be the embeddings of natural language (e.g., English) words, and (2) they remove the restriction that the template be parameterized by the pre-trained LM's parameters. Instead, templates have their own parameters that can be tuned based on training data from the downstream task. We highlight several representative methods below.
•
C1: Prefix Tuning. Prefix Tuning [
71] is a method that prepends a sequence of continuous task-specific vectors to the input, while keeping the LM parameters frozen. Mathematically, this consists of optimizing over the following log-likelihood objective given a trainable prefix matrix
\(M_{\phi }\) and a fixed pre-trained LM parameterized by
\(\theta\):
\[ \max _{\phi } \log P(\mathbf {y} \mid \mathbf {x}; \theta ; \phi) = \max _{\phi } \sum _{y_{i}} \log P(y_{i} \mid h_{\lt i}; \theta ; \phi). \qquad (2) \]
In Equation (
2),
\(h_{\lt i} = [h_{\lt i}^{(1)}; \cdots ; h_{\lt i}^{(n)}]\) is the concatenation of all neural network layers at timestep
i. It is copied from
\(M_{\phi }\) directly if the corresponding timestep is within the prefix (
\(h_i\) is
\(M_{\phi }[i]\) ); otherwise, it is computed using the pre-trained LM.
Experimentally, Li and Liang [
71] observe that such continuous prefix-based learning is more sensitive to different initialization in low-data settings than the use of discrete prompts with real words. Similarly, Lester et al. [
67] prepend the input sequence with special tokens to form a template and tune the embeddings of these tokens directly. Compared to the method of Li and Liang [
71], this adds fewer parameters as it does not introduce additional tunable parameters within each network layer. Tsimpoukelli et al. [
135] train a vision encoder that encodes an image into a sequence of embeddings that can be used to prompt a frozen auto-regressive LM to generate the appropriate caption. They show that the resulting model can perform few-shot learning on vision-language tasks such as visual question answering. In contrast to the above two works, the prefix used in Reference [
135] is sample dependent, namely a representation of input images, instead of a task embedding.
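The copy-or-compute rule for \(h_i\) can be sketched directly; the prefix values and the averaging "LM" below are toy stand-ins for \(M_{\phi }\) and the frozen pre-trained model:

```python
# Sketch of the hidden-state rule in prefix tuning: positions inside the
# prefix copy their activation from the trainable matrix M_phi; later
# positions are computed by the frozen LM. All values here are toys.
M_phi = [[0.1, 0.2], [0.3, 0.4]]           # trainable prefix (2 positions)

def frozen_lm_step(history):
    # Toy stand-in for the frozen pre-trained LM's forward computation:
    # average the preceding hidden states.
    n = len(history)
    return [sum(h[d] for h in history) / n for d in range(len(history[0]))]

def hidden_state(i, history):
    if i < len(M_phi):                     # timestep within the prefix:
        return M_phi[i]                    # h_i is M_phi[i]
    return frozen_lm_step(history)         # otherwise computed by the LM

states = []
for i in range(4):
    states.append(hidden_state(i, states))
print(states[2])  # first non-prefix state, influenced by the prefix
```

Only `M_phi` would receive gradient updates during training; `frozen_lm_step` (the pre-trained parameters \(\theta\)) stays fixed.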
•
C2: Tuning Initialized with Discrete Prompts. There are also methods that initialize the search for a continuous prompt using a prompt that has already been created or discovered using discrete prompt search methods. For example, Zhong et al. [
152] first define a template using a discrete search method such as
AutoPrompt [
125], initialize virtual tokens based on this discovered prompt, and then fine-tune the embeddings to increase task accuracy. This work found that initializing with manual templates can provide a better starting point for the search process. Qin and Eisner [
103] propose to learn a mixture of soft templates for each input where the weights and parameters for each template are jointly learned using training samples. The initial set of templates they use are either manually crafted ones or those obtained using the “prompt mining” method. Similarly, Hambardzumyan et al. [
40] introduce the use of a continuous template whose shape follows a manual prompt template.
•
C3: Hard-Soft Prompt Hybrid Tuning. Instead of using a purely learnable prompt template, these methods insert some tunable embeddings into a hard prompt template. Liu et al. [
77] propose “P-tuning,” where continuous prompts are learned by inserting trainable variables into the embedded input. To account for interaction between prompt tokens, they represent prompt embeddings as the output of a BiLSTM [
35]. P-tuning also introduces the use of task-related anchor tokens (such as “capital” in relation extraction) within the template for further improvement. These anchor tokens are not tuned during training. Han et al. [
41] propose
prompt tuning with rules (PTR), which uses manually crafted sub-templates to compose a complete template using logic rules. To enhance the representational ability of the resulting template, they also insert several virtual tokens whose embeddings can be tuned together with the pre-trained LM's parameters using training samples. The template tokens in PTR thus contain both actual tokens and virtual tokens. Experimental results demonstrate the effectiveness of this prompt design method on relation classification tasks.