
Enhancing In-Context Learning via Implicit Demonstration Augmentation

Xiaoling Zhou¹, Wei Ye¹*, Yidong Wang¹, Chaoya Jiang¹, Zhemg Lee², Rui Xie¹, Shikun Zhang¹*
¹National Engineering Research Center for Software Engineering, Peking University, China
²Tianjin University, Tianjin, China
*Corresponding authors.
xiaolingzhou@stu.pku.edu.cn, {wye,zhangsk}@pku.edu.cn
Abstract

The emergence of in-context learning (ICL) enables large pre-trained language models (PLMs) to make predictions for unseen inputs without updating their parameters. Despite its potential, ICL’s effectiveness relies heavily on the quality, quantity, and permutation of demonstrations, often leading to suboptimal and unstable performance. In this paper, we tackle this challenge for the first time from the perspective of demonstration augmentation. Specifically, we begin by enriching the representations of demonstrations using their deep feature distribution. We then show theoretically that, as the number of augmented copies approaches infinity, the augmentation is approximately equivalent to a novel logit calibration mechanism that incorporates specific statistical properties of the demonstrations. This insight yields a simple yet highly efficient method that significantly improves both average and worst-case accuracy across diverse PLMs and tasks. Moreover, our method effectively reduces performance variance across different demonstrations, permutations, and templates, and can handle imbalanced class distributions.

1 Introduction

Large pre-trained language models (PLMs) have showcased exceptional in-context learning (ICL) abilities Brown et al. (2020); Wang et al. (2023); Rubin et al. (2022): ICL helps the model discern the underlying patterns within demonstrations and make more accurate predictions Chan et al. (2022); Wu et al. (2023). As a new paradigm, ICL offers compelling advantages, enabling natural language interaction with PLMs Wei et al. (2022); Yang et al. (2023) and reducing computational costs Li et al. (2023a); Rubin et al. (2022).

While promising, ICL’s performance is highly dependent on the provided demonstrations and templates Liu et al. (2022); Zhang et al. (2022b); Sorensen et al. (2022), which often results in subpar and unstable performance. This has motivated research on improving the quality Rubin et al. (2022); Li et al. (2023b), quantity Li et al. (2023a); Choi et al. (2022), and permutation Lu et al. (2022); Tang et al. (2023) of demonstrations. Other research avenues include prediction adjustment Zhao et al. (2021); Han et al. (2023); Fei et al. (2023) and learning process design (e.g., channel models Min et al. (2022a) and meta-training frameworks Min et al. (2022b)). Despite these efforts, ICL still struggles to capture sufficient knowledge from context efficiently and reliably, leaving performance stability a persistent bottleneck.

Figure 1: Illustration for demonstration augmentation using semantic directions (vectors) sampled from the deep feature distribution of demonstration examples.

In this study, we propose enriching contextual knowledge for PLMs by augmenting demonstrations. We first attempt to enhance the representation of demonstrations by transforming them along semantic directions sampled from the deep feature space of demonstration examples, as depicted in Figure 1. This operation stems from the observation that the deep features in a network are usually linearized Bengio et al. (2013); Cheung and Yeung (2021); Cho (2016), implying the existence of numerous semantic directions within the deep feature space, hence potentially enabling us to incorporate richer contextual knowledge without extending input length. From this novel perspective, we theoretically prove that when the number of augmented pieces approaches infinity, its effect approximately equals a logit adjustment operation. Specifically, we derive a refined Softmax function that integrates the statistical properties of demonstrations. Consequently, rather than explicitly executing the augmentation procedure, we can efficiently conduct implicit demonstration augmentation using the derived prediction function, obtaining an improved ICL method with theoretical guidance.

We conduct extensive experiments across seven PLMs and various classification tasks. The empirical results demonstrate that our approach remarkably enhances prediction accuracy and reduces performance variability across different demonstrations, permutations, and templates. Notably, our method is straightforward, effective, and generalizable, enabling seamless integration with other ICL methods to enhance their performance.

Our contributions can be summarized as follows:

  • We introduce Implicit Demonstration Augmentation-based ICL (IDAICL), a pioneering work that incorporates demonstration augmentation into ICL. Instead of solely enhancing demonstration quality, quantity, or order, our method explores context augmentation within the deep feature space, offering a new way to enrich demonstrations without being constrained by input length limits.

  • We theoretically establish that as the number of augmented pieces approaches infinity, our augmentation strategy approximates a logit-adjusted prediction function that integrates statistical properties derived from the input data distribution. Equipped with this function, IDAICL provides a straightforward yet theory-guided solution to enhance ICL.

  • Extensive experiments across diverse tasks and PLMs demonstrate that IDAICL considerably improves average and worst-case accuracy compared with existing ICL methods. Moreover, it effectively enhances performance stability.

2 Background and Related Work

2.1 In-Context Learning

Brown et al. (2020) showcased the ICL capability of PLMs, wherein PLMs make predictions based solely on a concatenation of training examples for few-shot learning, without updating parameters. Subsequent studies Holtzman et al. (2021); Min et al. (2022a, b) have developed this approach, yielding promising outcomes across various tasks. Nevertheless, recent research has uncovered certain limitations. First, the volume of input knowledge for each query is constrained by the maximum input length of PLMs Hao et al. (2022), and the computational cost grows with the number of demonstrations Li et al. (2023a), making it challenging to transfer substantial knowledge from demonstrations into PLMs. Additionally, ICL’s performance is sensitive to the input of PLMs Davison et al. (2019); Jiang et al. (2020), resulting in high variance and poor worst-case accuracy Perez et al. (2021); Lu et al. (2022).

Researchers have explored various techniques to address the biases and instability of ICL. These techniques encompass learning process design Min et al. (2022a, b), demonstration retrieval Rubin et al. (2022); Zhang et al. (2022b), prompt engineering Sorensen et al. (2022); Lu et al. (2022), and prediction calibration Zhao et al. (2021); Fei et al. (2023). However, these methods have yet to fully address the issue of severely limited knowledge transfer from demonstrations to large PLMs.

2.2 Data Augmentation

Figure 2: An overview of IDAICL. For each contextual input, we aim to augment the deep feature of demonstrations $\mathcal{M}$ times, using semantic vectors $\boldsymbol{\delta}$ drawn from the deep feature distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ of demonstration examples linked to all queries. As $\mathcal{M}$ approaches infinity, we derive a novel prediction function that incorporates two modulating factors, $M(\boldsymbol{\mu})$ and $N(\boldsymbol{\Sigma})$, to calibrate the original predictions.

Data augmentation Chen et al. (2023), which involves artificially creating training data through transformations, is a well-established research area in machine learning. Although data augmentation techniques have undergone extensive exploration in diverse machine learning domains Maharana et al. (2022); Shorten and Khoshgoftaar (2019), applying them to text data poses challenges due to the complexity of preserving labels during textual transformations Kobayashi (2018). Nonetheless, data augmentations in the latent space, such as adversarial training Zhang et al. (2022a); Zhu et al. (2020); Cheng et al. (2020), interpolation Chen et al. (2022b); Wu et al. (2022), and generative techniques Li et al. (2022); Malandrakis et al. (2019), have demonstrated notable enhancements when applied alongside large PLMs.

Recently, Wang et al. (2019) introduced the concept of implicit data augmentation for image classification. This approach transforms training data within the deep feature space and reduces to optimizing a novel robust loss function. Subsequent studies on image classification Chen et al. (2022c); Li et al. (2021); Zhou and Wu (2023a) have further improved upon this approach. This study introduces an algorithm that implicitly augments demonstrations in ICL.

3 Methodology

3.1 In-Context Learning with PLMs

Considering a PLM $\mathcal{G}$, this study focuses on the following task: given a query input text $\boldsymbol{x}$ and a candidate answer set $\mathcal{Y} = \{y_1, y_2, \cdots, y_{|\mathcal{Y}|}\}$, we aim to predict the answer $\hat{y}$ based on $m$ demonstration examples $\mathcal{C} = \{c_1, c_2, \cdots, c_m\}$, where each $c_i$ represents a training example $(\boldsymbol{x}_i, y_i)$ after template formulation and $m$ denotes the number of demonstration examples for each test sample. Formally, given a model $\mathcal{G}$, we first compute the probability of each answer $y_j$:

$$P_{\mathcal{G}}\left(y_j \mid \mathcal{C}, \boldsymbol{x}\right). \qquad (1)$$

Subsequently, the final prediction $\hat{y}$ is the answer with the highest probability in the candidate answer set $\mathcal{Y}$:

$$\hat{y} = \arg\max_{y_j \in \mathcal{Y}} P_{\mathcal{G}}\left(y_j \mid \mathcal{C}, \boldsymbol{x}\right). \qquad (2)$$

For simplicity, we denote the contextual input as $\tilde{\boldsymbol{x}} = [\mathcal{C}, \boldsymbol{x}]$ in the subsequent text. The probability of answer $y_j$, denoted $P_{\mathcal{G}}(y_j \mid \tilde{\boldsymbol{x}})$, is then computed with the Softmax function (we first consider answers consisting of a single token; the subsequent analysis applies equally to multi-token answers):

$$P_{\mathcal{G}}(y_j \mid \tilde{\boldsymbol{x}}) := P_{\mathcal{G}}(y_j \mid \boldsymbol{h}_{\tilde{\boldsymbol{x}}}) = \frac{e^{\boldsymbol{w}_{y_j}^T \boldsymbol{h}_{\tilde{\boldsymbol{x}}} + b_{y_j}}}{\sum_k e^{\boldsymbol{w}_k^T \boldsymbol{h}_{\tilde{\boldsymbol{x}}} + b_k}}, \qquad (3)$$

where $\boldsymbol{h}_{\tilde{\boldsymbol{x}}} = \mathcal{G}(\tilde{\boldsymbol{x}})$ denotes the hidden state of the last block at the final position of $\tilde{\boldsymbol{x}}$, and $\boldsymbol{w}_k$ and $b_k$ are the weight vector and bias of the final fully connected layer for the $k$-th token.
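As a minimal sketch of Eqs. (2) and (3), the NumPy code below computes answer probabilities from a final hidden state and picks the most probable candidate. It assumes the hidden state, weights, and biases are already available as arrays (random placeholders here) and that the Softmax normalizes over whichever rows are passed in, for example only the candidate answer tokens; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def answer_probs(h, W, b):
    """Eq. (3): Softmax over the logits w_k^T h + b_k.

    h: final-position hidden state of the last block, shape (d,)
    W: one row w_k per token in the normalization set, shape (K, d)
    b: matching biases b_k, shape (K,)
    """
    logits = W @ h + b
    logits = logits - logits.max()  # shift for numerical stability; Softmax is shift-invariant
    exp = np.exp(logits)
    return exp / exp.sum()

def predict(h, W, b):
    """Eq. (2): index of the candidate answer with the highest probability."""
    return int(np.argmax(answer_probs(h, W, b)))

rng = np.random.default_rng(0)
d, K = 8, 3  # toy hidden size and number of candidate answers
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
h = rng.normal(size=d)
p = answer_probs(h, W, b)
```

Because the Softmax is monotone in the logits, `predict` returns the same index as taking the argmax of the raw logits directly.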

3.2 Demonstration Augmentation

Recognizing the established efficacy of data augmentation in machine learning Feng et al. (2021), this study investigates demonstration augmentation and proposes enhancing the deep features of demonstrations by transforming them along semantic directions sampled from the deep feature space of demonstration examples. This strategy is motivated by the observation that deep features in networks are often linearized Bengio et al. (2013); Chen et al. (2022a). Building on this observation, we hypothesize that $\boldsymbol{h}_{\tilde{\boldsymbol{x}}}$ lies within the subspace spanned by $\boldsymbol{h}_{\mathcal{C}}$ and $\boldsymbol{h}_{\boldsymbol{x}}$: $\boldsymbol{h}_{\tilde{\boldsymbol{x}}} = \alpha\boldsymbol{h}_{\mathcal{C}} + \beta\boldsymbol{h}_{\boldsymbol{x}}$, where $\boldsymbol{h}_{\mathcal{C}}$ and $\boldsymbol{h}_{\boldsymbol{x}}$ denote the components of $\boldsymbol{h}_{\tilde{\boldsymbol{x}}}$ associated with the demonstrations and the query, respectively. This assumption is needed because token representations are intricately interrelated, whereas we augment only the component related to the demonstrations.
Notably, this decomposition does not need to be computed in practice. In the subsequent text, we simply refer to $\alpha\boldsymbol{h}_{\mathcal{C}}$ and $\beta\boldsymbol{h}_{\boldsymbol{x}}$ as $\boldsymbol{h}_{\mathcal{C}}$ and $\boldsymbol{h}_{\boldsymbol{x}}$.

To augment $\boldsymbol{h}_{\mathcal{C}}$, we randomly sample vectors from the deep feature space of demonstrations. Specifically, vectors are drawn from a multivariate normal distribution $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ denote the feature mean and covariance matrix. These statistics are estimated from the deep features of the demonstration set $\mathcal{D}$, which includes the demonstration examples linked to all queries. The feature mean $\boldsymbol{\mu}$ is computed as

$$\boldsymbol{\mu} = \frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|} \boldsymbol{h}_i, \qquad (4)$$

where $\boldsymbol{h}_i = \mathcal{G}(c_i)$ represents the hidden state of the last block at the final position for the $i$-th demonstration example $c_i$ in $\mathcal{D}$, and $|\mathcal{D}|$ denotes the size of $\mathcal{D}$. The covariance matrix $\boldsymbol{\Sigma}$ is computed as

$$\boldsymbol{\Sigma} = \frac{1}{|\mathcal{D}|}\sum_{i=1}^{|\mathcal{D}|} (\boldsymbol{h}_i - \boldsymbol{\mu})(\boldsymbol{h}_i - \boldsymbol{\mu})^T. \qquad (5)$$
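The estimates in Eqs. (4) and (5) can be sketched as follows, assuming the hidden states $\boldsymbol{h}_i$ of all demonstrations in $\mathcal{D}$ have already been extracted and stacked row-wise into a matrix (random values stand in for real PLM features here). With rows stacked this way, `centered.T @ centered` equals the sum of outer products $(\boldsymbol{h}_i - \boldsymbol{\mu})(\boldsymbol{h}_i - \boldsymbol{\mu})^T$; `demo_statistics` is a hypothetical helper name.

```python
import numpy as np

def demo_statistics(H):
    """Feature mean (Eq. 4) and covariance (Eq. 5) of demonstration hidden states.

    H: array of shape (|D|, d); row i is h_i = G(c_i) for demonstration c_i.
    """
    mu = H.mean(axis=0)
    centered = H - mu
    # Sum of outer products (h_i - mu)(h_i - mu)^T, divided by |D| (not |D| - 1)
    sigma = centered.T @ centered / H.shape[0]
    return mu, sigma

rng = np.random.default_rng(0)
H = rng.normal(size=(50, 4))  # toy stand-in: 50 demonstrations, hidden size 4
mu, sigma = demo_statistics(H)
```

Dividing by $|\mathcal{D}|$ matches Eq. (5) and corresponds to `np.cov(H, rowvar=False, bias=True)`.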

Subsequently, $\boldsymbol{h}_{\mathcal{C}}$ is shifted along the sampled semantic vectors, yielding augmented features $\tilde{\boldsymbol{h}}_{\mathcal{C}}$ that follow

$$\tilde{\boldsymbol{h}}_{\mathcal{C}} \sim \mathcal{N}\left(\boldsymbol{h}_{\mathcal{C}} + \lambda\boldsymbol{\mu}, \lambda\boldsymbol{\Sigma}\right), \qquad (6)$$

where $\lambda$ is a positive coefficient controlling the strength of semantic augmentation. In practice, it can simply be set to 0.5. Sensitivity tests for $\lambda$ are discussed in Section 5.4.
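A minimal sketch of drawing augmented features according to Eq. (6), with $\lambda = 0.5$ as suggested above. Rather than composing the shift by hand, it samples directly from the stated distribution using NumPy's `Generator.multivariate_normal`; the helper name and the toy statistics are illustrative assumptions.

```python
import numpy as np

def sample_augmented(h_c, mu, sigma, lam=0.5, num_samples=1, rng=None):
    """Draw augmented demonstration features from N(h_c + lam * mu, lam * sigma), Eq. (6)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.multivariate_normal(h_c + lam * mu, lam * sigma, size=num_samples)

rng = np.random.default_rng(0)
h_c = np.array([1.0, -1.0, 0.5])   # toy demonstration component of the hidden state
mu = np.array([0.2, 0.0, -0.4])    # toy feature mean (Eq. 4)
sigma = np.diag([0.5, 1.0, 0.2])   # toy covariance (Eq. 5)
samples = sample_augmented(h_c, mu, sigma, lam=0.5, num_samples=20_000, rng=rng)
```

With many samples, the empirical mean and covariance of `samples` approach $\boldsymbol{h}_{\mathcal{C}} + \lambda\boldsymbol{\mu}$ and $\lambda\boldsymbol{\Sigma}$.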

3.3 Novel Prediction Function

Selecting the answer with the highest probability is equivalent to favoring the answer with the lowest inverse probability. Therefore, the prediction can be determined by

$$\hat{y} = \arg\min_{y_j \in \mathcal{Y}} P_{\mathcal{G}}\left(y_j \mid \boldsymbol{h}_{\tilde{\boldsymbol{x}}}\right)^{-1}. \qquad (7)$$

Assume that each $\boldsymbol{h}_{\mathcal{C}}$ is augmented $\mathcal{M}$ times, resulting in an augmented demonstration feature set $\{\tilde{\boldsymbol{h}}_{\mathcal{C}}^1, \cdots, \tilde{\boldsymbol{h}}_{\mathcal{C}}^{\mathcal{M}}\}$ of size $\mathcal{M}$, where $\tilde{\boldsymbol{h}}_{\mathcal{C}}^i$ denotes the $i$-th augmented feature for $\boldsymbol{h}_{\mathcal{C}}$. The final prediction for the query $\boldsymbol{x}$ then depends on all augmented features of $\boldsymbol{h}_{\mathcal{C}}$ and can be expressed as

$$P_{y_j}^{\mathcal{M}}(\tilde{\boldsymbol{x}}) = \frac{1}{\mathcal{M}}\sum_{i=1}^{\mathcal{M}} P_{\mathcal{G}}\left(y_j \mid \tilde{\boldsymbol{h}}_{\mathcal{C}}^i, \boldsymbol{h}_{\boldsymbol{x}}\right)^{-1}, \qquad (8)$$
$$\hat{y} = \arg\min_{y_j \in \mathcal{Y}} P_{y_j}^{\mathcal{M}}(\tilde{\boldsymbol{x}}). \qquad (9)$$
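Before taking the limit, the explicit procedure in Eqs. (8) and (9) can be sketched as a Monte Carlo loop: draw $\mathcal{M}$ augmented copies of $\boldsymbol{h}_{\mathcal{C}}$ from Eq. (6), evaluate the inverse Softmax probability of Eq. (3) at $\tilde{\boldsymbol{h}}_{\mathcal{C}}^i + \boldsymbol{h}_{\boldsymbol{x}}$ (using the additive decomposition above), average, and take the argmin. All arrays are toy placeholders and the function names are hypothetical; note that the cost grows linearly with the number of copies.

```python
import numpy as np

def inverse_probs(h, W, b):
    """1 / P_G(y_j | h) for every candidate j, i.e. sum_k exp(logit_k - logit_j)."""
    logits = W @ h + b
    return np.exp(logits[None, :] - logits[:, None]).sum(axis=1)

def explicit_augmentation_predict(h_c, h_x, W, b, mu, sigma, lam=0.5, num_aug=1000, rng=None):
    """Eqs. (8)-(9) with num_aug (script M) explicitly sampled copies of h_c."""
    rng = np.random.default_rng() if rng is None else rng
    # Augmented demonstration features drawn from Eq. (6)
    aug = rng.multivariate_normal(h_c + lam * mu, lam * sigma, size=num_aug)
    # Eq. (8): average inverse probability, evaluated at h_tilde_C^i + h_x
    scores = np.mean([inverse_probs(a + h_x, W, b) for a in aug], axis=0)
    # Eq. (9): the answer with the smallest averaged inverse probability
    return int(np.argmin(scores)), scores

rng = np.random.default_rng(0)
d, K = 4, 3
W = rng.normal(size=(K, d))
b = rng.normal(size=K)
h_c, h_x = rng.normal(size=d), rng.normal(size=d)
mu, sigma = 0.1 * rng.normal(size=d), 0.1 * np.eye(d)
pred, scores = explicit_augmentation_predict(h_c, h_x, W, b, mu, sigma, rng=rng)
```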

Given that the performance of ICL benefits from a larger number of demonstration examples Liu et al. (2022); Wu et al. (2023), we consider augmenting the deep representation of demonstrations an infinite number of times. This allows us to derive an easily computable surrogate for the expected prediction, resulting in a highly efficient implementation. The whole pipeline of IDAICL is depicted in Figure 2.

As $\mathcal{M} \rightarrow \infty$, based on the above decomposition of $\boldsymbol{h}_{\tilde{\boldsymbol{x}}}$, the expected prediction for answer $y_j$ (denoted $P^{\infty}_{y_j}$) over the augmented features can be expressed as follows:

$$P_{y_j}^{\infty}(\tilde{\boldsymbol{x}}) = \mathbb{E}_{\tilde{\boldsymbol{h}}_{\mathcal{C}}}\Big[\sum_k e^{\Delta\boldsymbol{w}_{k,y_j}^T(\tilde{\boldsymbol{h}}_{\mathcal{C}} + \boldsymbol{h}_{\boldsymbol{x}}) + \Delta b_{k,y_j}}\Big], \qquad (10)$$

where $\Delta\boldsymbol{w}_{k,y_j} = \boldsymbol{w}_k - \boldsymbol{w}_{y_j}$ and $\Delta b_{k,y_j} = b_k - b_{y_j}$.
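The step from Eq. (8) to Eq. (10) relies on rewriting each inverse Softmax probability from Eq. (3) as a sum over logit differences; by the law of large numbers, the average in Eq. (8) then converges to the expectation in Eq. (10) as $\mathcal{M} \to \infty$. A short sketch of the identity, for any hidden state $\boldsymbol{h}$ (here $\boldsymbol{h} = \tilde{\boldsymbol{h}}_{\mathcal{C}} + \boldsymbol{h}_{\boldsymbol{x}}$):

```latex
P_{\mathcal{G}}(y_j \mid \boldsymbol{h})^{-1}
  = \frac{\sum_k e^{\boldsymbol{w}_k^T \boldsymbol{h} + b_k}}{e^{\boldsymbol{w}_{y_j}^T \boldsymbol{h} + b_{y_j}}}
  = \sum_k e^{(\boldsymbol{w}_k - \boldsymbol{w}_{y_j})^T \boldsymbol{h} + (b_k - b_{y_j})}
  = \sum_k e^{\Delta\boldsymbol{w}_{k,y_j}^T \boldsymbol{h} + \Delta b_{k,y_j}}.
```

The $k = y_j$ term always equals 1, so every inverse probability is at least 1.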

However, computing $P^{\infty}_{y_j}$ exactly is challenging, so we instead derive a surrogate for it. By the linearity of expectation, Eq. (10) can be rewritten as:

$$P_{y_j}^{\infty}(\tilde{\boldsymbol{x}}) = \sum_k \mathbb{E}_{\tilde{\boldsymbol{h}}_{\mathcal{C}}}\Big[e^{\Delta\boldsymbol{w}_{k,y_j}^T(\tilde{\boldsymbol{h}}_{\mathcal{C}} + \boldsymbol{h}_{\boldsymbol{x}}) + \Delta b_{k,y_j}}\Big]. \qquad (11)$$

Given that $\tilde{\boldsymbol{h}}_{\mathcal{C}}$ is a Gaussian random variable following $\mathcal{N}(\boldsymbol{h}_{\mathcal{C}} + \lambda\boldsymbol{\mu}, \lambda\boldsymbol{\Sigma})$, the scalar $\Delta\boldsymbol{w}_{k,y_j}^T\tilde{\boldsymbol{h}}_{\mathcal{C}}$ follows the univariate normal distribution $\mathcal{N}\big(\Delta\boldsymbol{w}_{k,y_j}^T(\boldsymbol{h}_{\mathcal{C}} + \lambda\boldsymbol{\mu}), \lambda\Delta\boldsymbol{w}_{k,y_j}^T\boldsymbol{\Sigma}\Delta\boldsymbol{w}_{k,y_j}\big)$. Then, using the moment-generating function

$$\mathbb{E}[e^{tX}] = e^{t\mu + \frac{1}{2}t^2\sigma^2}, \quad X \sim \mathcal{N}(\mu, \sigma^2), \qquad (12)$$

with $t=1$ and $X=\Delta\boldsymbol{w}_{k,y_j}^{T}\tilde{\boldsymbol{h}}_{\mathcal{C}}$, Eq. (11) can be derived as

$$P_{y_j}^{\infty}(\tilde{\boldsymbol{x}})=\sum_{k}M_{k,y_j}N_{k,y_j}\,e^{\Delta\boldsymbol{w}_{k,y_j}^{T}(\boldsymbol{h}_{\mathcal{C}}+\boldsymbol{h}_{\boldsymbol{x}})+\Delta b_{k,y_j}}, \tag{13}$$

where $M_{k,y_j}=\exp\left(\lambda\Delta\boldsymbol{w}_{k,y_j}^{T}\boldsymbol{\mu}\right)$ and $N_{k,y_j}=\exp\left(\frac{\lambda}{2}\Delta\boldsymbol{w}_{k,y_j}^{T}\boldsymbol{\Sigma}\Delta\boldsymbol{w}_{k,y_j}\right)$.
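To make the step from the Gaussian projection and Eq. (12) to the closed form in Eq. (13) concrete, the following minimal, dependency-free Python sketch checks both claims numerically. It samples augmented contexts $\tilde{\boldsymbol{h}}_{\mathcal{C}}\sim\mathcal{N}(\boldsymbol{h}_{\mathcal{C}}+\lambda\boldsymbol{\mu},\lambda\boldsymbol{\Sigma})$, (i) compares the empirical mean and variance of $\Delta\boldsymbol{w}_{k,y_j}^{T}\tilde{\boldsymbol{h}}_{\mathcal{C}}$ with the Gaussian moments stated above, and (ii) compares a Monte Carlo average against the closed form built from $M_{k,y_j}$ and $N_{k,y_j}$. It assumes (our reading, since Eq. (11) is not reproduced in this section) that $P_{y_j}^{\infty}$ is the expectation over $\tilde{\boldsymbol{h}}_{\mathcal{C}}$ of $\sum_k\exp(\Delta\boldsymbol{w}_{k,y_j}^{T}(\tilde{\boldsymbol{h}}_{\mathcal{C}}+\boldsymbol{h}_{\boldsymbol{x}})+\Delta b_{k,y_j})$. The weight differences, biases, statistics and $\lambda$ are hypothetical toy values passed in directly, and all helper names are illustrative, not from the paper's code.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def quad(u, a):
    """Quadratic form u^T A u."""
    n = len(u)
    return sum(u[i] * a[i][j] * u[j] for i in range(n) for j in range(n))


def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / l[j][j]
    return l


def draw_augmented_contexts(h_c, mu, sigma, lam, n_samples, seed=0):
    """Sample h~_C ~ N(h_C + lam * mu, lam * Sigma) using a Cholesky factor."""
    rng = random.Random(seed)
    mean = [c + lam * m for c, m in zip(h_c, mu)]
    chol = cholesky([[lam * s for s in row] for row in sigma])
    out = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        # Entries above the diagonal of chol are zero, so a full dot is fine.
        out.append([mean[i] + dot(chol[i], z) for i in range(len(mean))])
    return out


def p_inf_closed_form(dws, dbs, h_c, h_x, mu, sigma, lam):
    """Eq. (13): sum_k M_k * N_k * exp(dw_k^T (h_C + h_x) + db_k)."""
    h = [c + x for c, x in zip(h_c, h_x)]
    total = 0.0
    for dw, db in zip(dws, dbs):
        m_k = math.exp(lam * dot(dw, mu))            # M_{k, y_j}
        n_k = math.exp(0.5 * lam * quad(dw, sigma))  # N_{k, y_j}
        total += m_k * n_k * math.exp(dot(dw, h) + db)
    return total


# Hypothetical toy values: 2-D features and two competing labels k.
lam = 0.5
h_c, h_x = [0.2, 0.1], [-0.1, 0.3]
mu = [0.1, -0.3]
sigma = [[0.6, 0.2], [0.2, 0.4]]
dws = [[0.5, -0.2], [-0.3, 0.4]]  # stand-ins for Delta w_{k, y_j}
dbs = [0.1, -0.2]                 # stand-ins for Delta b_{k, y_j}

samples = draw_augmented_contexts(h_c, mu, sigma, lam, n_samples=100_000)

# (i) Projection dw^T h~_C: empirical moments vs. the stated Gaussian moments.
proj = [dot(dws[0], s) for s in samples]
emp_mean = sum(proj) / len(proj)
emp_var = sum((p - emp_mean) ** 2 for p in proj) / (len(proj) - 1)
th_mean = dot(dws[0], [c + lam * m for c, m in zip(h_c, mu)])
th_var = lam * quad(dws[0], sigma)

# (ii) Assumed reading of Eq. (11): P_inf = E[sum_k exp(dw_k^T (h~_C + h_x) + db_k)].
mc = sum(
    sum(math.exp(dot(dw, [a + x for a, x in zip(s, h_x)]) + db) for dw, db in zip(dws, dbs))
    for s in samples
) / len(samples)
cf = p_inf_closed_form(dws, dbs, h_c, h_x, mu, sigma, lam)

print(f"projection mean {emp_mean:.4f} (theory {th_mean:.4f}), var {emp_var:.4f} (theory {th_var:.4f})")
print(f"P_inf: Monte Carlo {mc:.4f}, closed form {cf:.4f}")
```

Because $M_{k,y_j}$ and $N_{k,y_j}$ depend only on $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, $\lambda$ and the weight differences, the closed form needs no sampling at all; the Monte Carlo loop is there only to confirm the limit numerically. Setting $\lambda=0$ removes the augmentation and recovers the plain sum of exponentials.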

| PLM | Method | m | SST-2 | SST-5 | MR | CR | Amazon | Subj | TREC | DBPedia | AGNews | CB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-2 0.8B | Vanilla ICL | 4 | 57.6 ± 7.1 | 30.4 ± 6.3 | 59.3 ± 6.5 | 56.8 ± 8.4 | 32.7 ± 8.5 | 57.6 ± 5.4 | 34.9 ± 10.3 | 40.5 ± 7.2 | 44.5 ± 7.9 | 35.1 ± 9.3 |
|  | IDAICL |  | 86.4 ± 1.4 | 38.3 ± 2.9 | 82.2 ± 2.3 | 78.4 ± 0.7 | 46.7 ± 3.5 | 77.0 ± 2.3 | 47.5 ± 2.0 | 81.3 ± 1.8 | 73.9 ± 2.4 | 41.5 ± 2.0 |
|  | Vanilla ICL | 8 | 69.7 ± 9.0 | 32.4 ± 8.6 | 63.9 ± 7.7 | 60.8 ± 8.1 | 34.1 ± 6.2 | 59.7 ± 8.7 | 40.4 ± 6.3 | 62.6 ± 13.6 | 49.2 ± 8.4 | 38.8 ± 7.6 |
|  | IDAICL |  | 88.0 ± 2.3 | 39.6 ± 1.9 | 84.9 ± 2.4 | 85.6 ± 2.5 | 47.9 ± 2.6 | 79.9 ± 0.8 | 50.3 ± 3.3 | 86.5 ± 2.9 | 76.8 ± 1.7 | 43.3 ± 3.4 |
|  | Vanilla ICL | 12 | 74.7 ± 8.3 | 33.7 ± 7.6 | 64.4 ± 9.4 | 68.7 ± 9.7 | 36.0 ± 6.6 | 60.7 ± 7.7 | 40.5 ± 7.8 | 64.5 ± 5.4 | 51.1 ± 8.0 | 40.4 ± 8.5 |
|  | IDAICL |  | 88.5 ± 2.1 | 40.1 ± 2.7 | 85.2 ± 3.1 | 86.8 ± 1.4 | 49.6 ± 2.2 | 80.4 ± 2.1 | 51.4 ± 1.6 | 87.3 ± 2.7 | 77.9 ± 2.0 | 44.6 ± 2.2 |
|  | MetaICL | 12 | 80.8 ± 6.2 | 35.8 ± 4.7 | 75.3 ± 5.6 | 77.6 ± 8.1 | 48.9 ± 6.7 | 73.5 ± 8.8 | 48.6 ± 6.1 | 80.4 ± 7.8 | 66.8 ± 0.7 | 43.1 ± 4.1 |
|  | +IDAICL |  | 89.3 ± 1.7 | <u>42.6</u> ± 2.4 | 85.8 ± 1.7 | 87.9 ± 1.5 | <u>51.7</u> ± 0.7 | <u>82.6</u> ± 2.4 | 53.7 ± 2.5 | <u>89.4</u> ± 4.1 | 78.3 ± 1.1 | **47.9** ± 2.8 |
|  | Channel ICL | 12 | 85.2 ± 3.6 | 38.4 ± 4.3 | 80.8 ± 4.7 | 82.0 ± 4.6 | 43.6 ± 5.1 | 69.8 ± 9.8 | 44.1 ± 8.7 | 77.6 ± 12.9 | 69.5 ± 6.7 | 42.4 ± 5.2 |
|  | +IDAICL |  | **90.5** ± 2.3 | 41.8 ± 2.7 | **87.7** ± 1.6 | **89.5** ± 1.2 | 50.8 ± 2.4 | 80.5 ± 0.9 | 52.9 ± 1.6 | 87.8 ± 2.4 | <u>81.0</u> ± 2.5 | 46.3 ± 3.3 |
|  | EPR | 12 | 81.9 ± 2.1 | 39.9 ± 1.8 | 78.1 ± 2.4 | 80.6 ± 0.6 | 49.1 ± 2.4 | 80.1 ± 2.2 | <u>76.2</u> ± 1.1 | 87.1 ± 1.0 | 80.9 ± 0.8 | 44.8 ± 2.3 |
|  | +IDAICL |  | <u>90.1</u> ± 1.1 | **43.9** ± 1.2 | <u>86.4</u> ± 2.0 | <u>88.6</u> ± 0.6 | **52.5** ± 1.7 | **83.6** ± 1.0 | **79.1** ± 0.9 | **90.8** ± 0.7 | **83.7** ± 0.5 | <u>46.7</u> ± 2.1 |
| GPT-2 1.5B | Vanilla ICL | 4 | 66.3 ± 8.6 | 30.3 ± 8.9 | 56.5 ± 6.6 | 53.4 ± 8.1 | 34.7 ± 7.5 | 54.2 ± 5.5 | 30.8 ± 8.1 | 61.9 ± 8.7 | 54.6 ± 9.9 | 40.8 ± 7.8 |
|  | IDAICL |  | 87.4 ± 1.5 | 38.8 ± 1.7 | 80.9 ± 1.2 | 82.1 ± 2.1 | 48.1 ± 0.6 | 77.8 ± 3.0 | 49.5 ± 1.9 | 87.4 ± 2.6 | 79.2 ± 1.8 | 54.1 ± 2.7 |
|  | Vanilla ICL | 8 | 57.2 ± 7.0 | 30.8 ± 6.1 | 64.9 ± 8.3 | 57.6 ± 6.4 | 38.6 ± 6.4 | 57.3 ± 10.3 | 39.5 ± 5.3 | 67.4 ± 8.1 | 56.3 ± 5.4 | 47.4 ± 5.1 |
|  | IDAICL |  | 89.5 ± 1.8 | 40.8 ± 1.9 | 82.1 ± 1.2 | 84.3 ± 2.1 | 50.2 ± 3.4 | 80.1 ± 2.9 | 51.5 ± 2.5 | 89.8 ± 1.7 | 80.3 ± 0.9 | 55.5 ± 0.6 |
|  | Vanilla ICL | 12 | 70.9 ± 9.6 | 34.7 ± 6.7 | 65.2 ± 5.6 | 59.9 ± 6.7 | 38.3 ± 10.2 | 59.6 ± 8.1 | 40.7 ± 7.5 | 72.5 ± 11.6 | 57.6 ± 9.5 | 48.5 ± 5.7 |
|  | IDAICL |  | 90.0 ± 2.8 | 41.1 ± 1.3 | 83.4 ± 2.3 | 85.6 ± 2.4 | 51.6 ± 2.9 | 80.5 ± 2.5 | 51.8 ± 3.6 | 90.5 ± 2.7 | 81.1 ± 3.0 | 55.7 ± 2.1 |
|  | MetaICL | 12 | 79.1 ± 7.0 | 38.6 ± 3.7 | 76.4 ± 6.3 | 75.3 ± 4.5 | 50.5 ± 7.1 | 73.9 ± 7.6 | 46.7 ± 6.3 | 86.8 ± 7.8 | 76.4 ± 5.4 | 53.1 ± 1.6 |
|  | +IDAICL |  | 89.6 ± 2.2 | <u>42.9</u> ± 2.3 | 84.2 ± 3.4 | <u>87.9</u> ± 1.1 | **53.8** ± 1.2 | <u>83.4</u> ± 3.2 | 53.6 ± 1.3 | <u>91.9</u> ± 0.9 | <u>84.3</u> ± 1.4 | <u>57.3</u> ± 1.5 |
|  | Channel ICL | 12 | 83.3 ± 5.9 | 37.5 ± 4.6 | 80.6 ± 4.1 | 77.1 ± 5.5 | 48.9 ± 6.7 | 68.2 ± 8.3 | 43.3 ± 7.2 | 70.4 ± 9.3 | 67.9 ± 5.5 | 53.6 ± 8.9 |
|  | +IDAICL |  | **91.2** ± 2.1 | 40.8 ± 1.5 | <u>86.5</u> ± 2.6 | **88.2** ± 1.8 | 52.4 ± 2.9 | 82.3 ± 2.4 | 50.5 ± 1.8 | 88.7 ± 1.2 | 82.6 ± 0.9 | 56.5 ± 2.1 |
|  | EPR | 12 | 82.8 ± 2.6 | 40.6 ± 2.1 | 79.5 ± 1.4 | 74.7 ± 2.7 | 50.7 ± 2.3 | 83.3 ± 0.7 | <u>82.2</u> ± 2.4 | 91.5 ± 0.8 | 83.2 ± 1.6 | 54.8 ± 1.9 |
|  | +IDAICL |  | <u>90.5</u> ± 1.5 | **43.8** ± 1.0 | **87.4** ± 0.9 | 86.5 ± 1.5 | <u>52.9</u> ± 1.8 | **85.8** ± 0.5 | **84.7** ± 1.1 | **93.5** ± 2.5 | **86.4** ± 2.2 | **57.5** ± 1.5 |
| GPT-Neo | MetaICL | 12 | 87.8 ± 6.7 | 42.5 ± 6.1 | 82.2 ± 5.9 | 80.7 ± 4.8 | 51.5 ± 5.3 | 72.2 ± 8.2 | 54.1 ± 6.8 | 84.4 ± 5.5 | 74.3 ± 8.2 | 50.3 ± 6.4 |
|  | +IDAICL |  | <u>92.1</u> ± 1.1 | 44.3 ± 2.3 | **88.8** ± 2.1 | **88.1** ± 1.8 | **53.2** ± 1.7 | 84.3 ± 2.1 | 64.3 ± 1.9 | 94.3 ± 1.2 | 86.5 ± 0.9 | **53.4** ± 2.1 |
|  | Channel ICL | 12 | 83.4 ± 5.4 | 39.8 ± 6.4 | 79.5 ± 5.7 | 79.4 ± 5.9 | 50.1 ± 3.8 | 70.6 ± 8.2 | 50.8 ± 5.1 | 78.3 ± 7.1 | 72.5 ± 6.9 | 48.7 ± 4.5 |
|  | +IDAICL |  | 91.5 ± 2.2 | 41.6 ± 1.8 | 85.4 ± 1.9 | <u>87.2</u> ± 2.5 | <u>52.7</u> ± 2.2 | 83.7 ± 1.4 | 62.8 ± 0.7 | 93.5 ± 3.3 | 84.6 ± 3.1 | 52.0 ± 1.8 |
|  | EPR | 12 | 88.2 ± 1.6 | <u>45.7</u> ± 2.2 | 81.8 ± 1.9 | 71.8 ± 2.9 | 49.9 ± 1.1 | <u>89.4</u> ± 2.4 | <u>92.3</u> ± 2.2 | <u>96.1</u> ± 1.2 | <u>88.8</u> ± 1.1 | 49.4 ± 0.7 |
|  | +IDAICL |  | **93.2** ± 0.8 | **47.2** ± 1.3 | <u>88.5</u> ± 1.2 | 86.6 ± 2.0 | 52.1 ± 0.4 | **93.1** ± 1.2 | **94.4** ± 2.4 | **97.8** ± 1.5 | **91.2** ± 0.7 | <u>52.1</u> ± 0.5 |
Table 1: Comparison results of three PLMs. The two numbers in each cell denote the mean accuracy (%) and its standard deviation over different seeds. The best and second-best results per PLM per dataset are highlighted in bold and underlined, respectively. "+IDAICL" means that the current approach is used in conjunction with IDAICL. The results for different numbers of demonstration examples (i.e., $m$ values) using the GPT-Neo model are illustrated in Figure 3.

Subsequently, our newly proposed prediction function, referred to as IDA-Softmax, is defined as

$$P_{y_j}^{\text{IDA}}(\tilde{\boldsymbol{x}}) := \sum_{k} M_{k,y_j} N_{k,y_j}\, e^{\Delta\boldsymbol{w}_{k,y_j}^{T}\boldsymbol{h}_{\tilde{\boldsymbol{x}}} + \Delta b_{k,y_j}}. \tag{14}$$

Consequently, instead of performing the augmentation explicitly, we can directly use IDA-Softmax, $P_{y_j}^{\text{IDA}}$, for prediction. IDA-Softmax essentially uses two modulating factors, tied to statistical properties derived from $\mathcal{D}$, to calibrate the sample logits. Previous studies Min et al. (2022c); Chan et al. (2022) have underscored the pivotal role of knowledge about the input data distribution in predictions made by PLMs. Intuitively, with these statistics incorporated into the prediction, PLMs can better capture the patterns and underlying structures within the data, such as the spatial relationships between demonstrations and queries, ultimately improving prediction performance.
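The NumPy sketch below is one way to evaluate Eq. (14) and to check numerically that it matches the limit of explicit augmentation. The factors $M_{k,y_j}$ and $N_{k,y_j}$ are defined in the derivation leading to Eq. (14); this sketch does not reproduce those definitions but instead assumes $\Delta\boldsymbol{w}_{k,y_j} = \boldsymbol{w}_k - \boldsymbol{w}_{y_j}$, $\Delta b_{k,y_j} = b_k - b_{y_j}$, and, purely for illustration, that each augmented copy adds a direction $\boldsymbol{\delta} \sim \mathcal{N}(\lambda\boldsymbol{\mu}, \lambda\boldsymbol{\Sigma})$ to $\boldsymbol{h}_{\tilde{\boldsymbol{x}}}$, so that $M$ and $N$ become Gaussian moment-generating factors. The function names and toy weights are hypothetical.

```python
import numpy as np


def modulating_factors(W, mu, Sigma, lam):
    """Hypothetical M and N factors, indexed [k, j] for Delta w_{k, y_j}.

    ASSUMPTIONS (illustrative, not the paper's exact definitions):
      * Delta w_{k, y_j} = w_k - w_{y_j}
      * each augmented copy adds delta ~ N(lam * mu, lam * Sigma) to h, so
        E[exp(dw^T delta)] = exp(lam * dw^T mu) * exp(0.5 * lam * dw^T Sigma dw).
    """
    dW = W[:, None, :] - W[None, :, :]                 # dW[k, j] = w_k - w_j
    quad = np.einsum("kjd,de,kje->kj", dW, Sigma, dW)  # dw^T Sigma dw
    M = np.exp(0.5 * lam * quad)
    N = np.exp(lam * (dW @ mu))                        # exp(lam * dw^T mu)
    return M, N


def ida_softmax(W, b, h, M, N):
    """Eq. (14): P_j = sum_k M[k, j] N[k, j] exp(dw_{k,j}^T h + db_{k,j})."""
    z = W @ h + b                  # ordinary logits, shape (C,)
    dz = z[:, None] - z[None, :]   # dz[k, j] = z_k - z_j = dw^T h + db (assumed)
    return (M * N * np.exp(dz)).sum(axis=0)


def explicit_augmentation(W, b, h, mu, Sigma, lam, n_copies, seed=0):
    """Monte Carlo estimate: average the same sum over explicit augmented copies."""
    rng = np.random.default_rng(seed)
    deltas = rng.multivariate_normal(lam * mu, lam * Sigma, size=n_copies)
    z = (h + deltas) @ W.T + b             # (n_copies, C) logits per copy
    dz = z[:, :, None] - z[:, None, :]     # (n_copies, C, C), index [n, k, j]
    return np.exp(dz).sum(axis=1).mean(axis=0)


# Toy classifier head and feature statistics (all values are placeholders).
rng = np.random.default_rng(42)
C, d, lam = 3, 4, 0.5
W = 0.3 * rng.normal(size=(C, d))
b = 0.1 * rng.normal(size=C)
h = rng.normal(size=d)
mu = 0.5 * rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = 0.2 * (A @ A.T / d + np.eye(d))   # symmetric positive definite

P_implicit = ida_softmax(W, b, h, *modulating_factors(W, mu, Sigma, lam))
P_explicit = explicit_augmentation(W, b, h, mu, Sigma, lam, n_copies=200_000)
print(P_implicit)
print(P_explicit)  # approaches P_implicit as n_copies grows
```

Under these assumptions, setting $\lambda = 0$ makes both factors equal to 1, and $P_{y_j}^{\text{IDA}}$ reduces to the reciprocal of the ordinary softmax probability of $y_j$; smaller values therefore correspond to more likely answers, consistent with the arg min in Eq. (16).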

Furthermore, to mitigate the imbalance among different answer types in demonstrations Holtzman et al. (2021); Zhao et al. (2021), we adopt a post-hoc adjustment approach inspired by Menon et al. (2021), which adjusts predictions according to the class proportions within $\mathcal{D}$. The prediction score for answer $y_j$ is thus computed as

$$\tilde{P}_{y_j}^{\text{IDA}}(\tilde{\boldsymbol{x}}) = P_{y_j}^{\text{IDA}}(\tilde{\boldsymbol{x}}) + \tau \log \pi_{y_j}, \tag{15}$$

where $\tau$ is a positive hyperparameter and $\pi_{y_j}$ denotes the proportion of answer $y_j$ in $\mathcal{D}$. In practice, $\tau$ can be fixed at 1. Because the final prediction minimizes $\tilde{P}_{y_j}^{\text{IDA}}$ and $\log\pi_{y_j}$ is more negative for rarer answers, this term compensates for predictions of minority classes. When the answers are uniformly distributed, $\tau\log\pi_{y_j}$ shifts all answer types equally and therefore leaves the prediction unchanged. The final prediction is then given by

$$\hat{y} = \arg\min_{y_j \in \mathcal{Y}} \tilde{P}_{y_j}^{\text{IDA}}(\tilde{\boldsymbol{x}}). \tag{16}$$
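A minimal sketch of the class-proportion adjustment in Eq. (15) and the arg min rule in Eq. (16) follows. The scores in `P_ida` are toy numbers rather than real model outputs, the label names are placeholders, and the small floor `eps` inside the logarithm is our own addition to avoid $\log 0$ when an answer is absent from the demonstrations.

```python
import math
from collections import Counter


def class_proportions(demo_labels, label_space):
    """pi_{y_j}: fraction of demonstrations whose answer is y_j."""
    counts = Counter(demo_labels)
    return {y: counts[y] / len(demo_labels) for y in label_space}


def ida_predict(P_ida, pi, tau=1.0, eps=1e-12):
    """Eqs. (15)-(16): arg min over y_j of P_ida[y_j] + tau * log(pi[y_j]).

    `eps` is our own floor (not in the paper) so that log(0) never occurs.
    """
    adjusted = {y: P_ida[y] + tau * math.log(max(pi[y], eps)) for y in P_ida}
    return min(adjusted, key=adjusted.get), adjusted


labels = ["Positive", "Negative"]
# Toy IDA-Softmax scores (smaller = more likely); placeholders, not model outputs.
P_ida = {"Positive": 1 / 0.55, "Negative": 1 / 0.45}

balanced_demos = ["Positive"] * 5 + ["Negative"] * 5
skewed_demos = ["Positive"] * 9 + ["Negative"] * 1   # negative proportion 0.1

pred_balanced, _ = ida_predict(P_ida, class_proportions(balanced_demos, labels))
pred_skewed, adjusted = ida_predict(P_ida, class_proportions(skewed_demos, labels))
print(pred_balanced, pred_skewed)  # Positive Negative
```

With balanced demonstrations, both scores shift by the same $\log 0.5$, so the prediction is unchanged; with one negative demonstration out of ten, the negative class receives a much larger compensation and, in this toy example, the arg min moves to it. Passing `tau=0.0` recovers the plain arg min over $P_{y_j}^{\text{IDA}}$.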

4 Experimental Setup

4.1 Models and Datasets

We evaluated IDAICL on seven large PLMs: GPT-2 Radford et al. (2019) (with 0.1B, 0.3B, 0.8B, and 1.5B parameters), GPT-Neo Black et al. (2021) (with 2.7B parameters), and LLaMA Touvron et al. (2023) (with 13B and 33B parameters). Following previous research Min et al. (2022a); Han et al. (2023); Lu et al. (2022), our evaluation covers ten text classification datasets. SST-2 Socher et al. (2013), SST-5 Socher et al. (2013), MR Pang and Lee (2005), CR Hu and Liu (2004), and Amazon McAuley and Leskovec (2013) are sentiment classification tasks. Subj Pang and Lee (2004), TREC Voorhees and Tice (2000), DBPedia Lehmann et al. (2015), and AGNews Zhang et al. (2015) are used for subjectivity, question, ontology, and news classification, respectively. Additionally, CB De Marneffe et al. (2019) is used for natural language inference. Among these datasets, SST-5, Amazon, TREC, and CB have imbalanced training data. Details of all datasets are provided in Section A of the Appendix.

PLM Method SST-2 SST-5 MR CR Subj TREC DBPedia AGNews CB Avg.
LLaMA 13B Vanilla ICL 95.6±7.1 29.5±6.2 90.0±5.8 91.4±7.4 72.9±6.9 62.8±9.1 80.9±7.6 80.2±5.9 51.5±8.2 72.8
ConCa **96.7**±5.4 40.3±6.2 91.7±7.3 90.8±4.2 79.6±9.1 68.2±5.6 _94.3_±4.1 _85.2_±7.5 46.6±5.0 77.0
PROCA 95.4±3.8 _43.4_±5.7 90.3±9.6 _92.1_±3.1 _84.8_±2.5 69.9±2.1 92.5±4.9 81.6±3.6 51.4±4.2 77.9
D-ConCa 96.3±3.8 42.5±4.5 _92.0_±4.1 90.5±2.9 82.9±4.5 _73.7_±3.9 87.4±7.2 82.5±3.3 _52.2_±4.1 77.8
IDAICL **96.7**±2.5 **47.1**±1.1 **93.0**±1.9 **93.3**±0.8 **87.8**±2.3 **76.0**±2.6 **94.9**±1.0 **87.7**±2.4 **59.4**±1.9 **81.8**
LLaMA 33B Vanilla ICL 95.5±7.2 29.4±5.6 91.7±5.4 _91.5_±8.1 85.1±6.0 70.9±4.4 86.6±4.5 76.2±6.1 59.2±5.3 76.2
ConCa _95.9_±6.5 39.1±4.4 90.3±7.2 91.2±3.6 74.6±5.7 76.7±6.2 _92.4_±3.9 87.3±5.7 57.9±6.0 78.4
PROCA 95.5±4.2 39.2±6.3 _92.4_±4.1 91.3±3.5 _88.3_±2.2 64.7±3.8 86.9±5.1 85.8±7.1 _59.9_±3.8 78.2
D-ConCa 95.4±3.8 _40.7_±4.5 92.1±4.2 91.0±2.9 76.4±3.6 **80.2**±2.1 87.6±4.2 _87.7_±4.3 56.5±3.4 78.6
IDAICL **96.5**±1.1 **46.8**±2.4 **93.6**±1.3 **92.3**±3.3 **89.3**±2.4 _79.1_±1.5 **95.6**±2.3 **88.4**±1.9 **64.6**±2.8 **82.9**
Table 2: Comparison results of Macro-F1 for the LLaMA models with 13B and 33B parameters, with $m$ set to 4.
Figure 3: Comparison results between Vanilla ICL and IDAICL across different values of $m$ on the GPT-Neo model. IDAICL significantly outperforms Vanilla ICL, particularly when the number of demonstration examples is small.

4.2 Compared Baselines

Besides Vanilla ICL, we compared and integrated IDAICL with three popular ICL algorithms that focus on learning process design and demonstration retrieval: MetaICL Min et al. (2022b), Channel ICL Min et al. (2022a), and Efficient Prompt Retrieval (EPR) Rubin et al. (2022). We also compared IDAICL with three advanced prediction calibration methods: Contextual Calibration (ConCa) Zhao et al. (2021), Prototypical Calibration (PROCA) Han et al. (2023), and Domain-Context Calibration (D-ConCa) Fei et al. (2023). Descriptions of all compared methods and the full experimental settings are given in Sections B and C of the Appendix.

Figure 4: (a) and (b): Macro-F1 on the SST-5 and AGNews datasets using the LLaMA model with 33B parameters under three demonstration selection settings, with $m$ set to 4. (c) and (d): Accuracy of Vanilla ICL and IDAICL on the SST-2 dataset using the GPT-2 model with 1.5B parameters across six templates, with $m$ set to 12. IDAICL is more robust than Vanilla ICL across different demonstration examples and templates.

5 Experimental Results

5.1 Main Results

Table 1 compares IDAICL with four ICL baselines (Vanilla ICL, MetaICL, Channel ICL, and EPR) on GPT-2 models (with 0.8B and 1.5B parameters) and the GPT-Neo model. These results lead to three main findings. First, IDAICL is consistently effective across model sizes and datasets, including datasets with imbalanced training data, highlighting its strong generalization capacity. Compared to Vanilla ICL, IDAICL improves performance by an average of 17.7% and 18.4% across datasets and $m$ values for GPT-2 with 0.8B and 1.5B parameters, respectively. Second, integrating IDAICL with the other ICL baselines (Channel ICL, MetaICL, and EPR) consistently yields notable improvements, emphasizing the efficacy of enhancing demonstrations for refined predictions. Adding IDAICL boosts average performance by 7.3% for MetaICL and 8.2% for Channel ICL. Finally, IDAICL notably improves worst-case accuracy and reduces performance variance across seeds, demonstrating its ability to improve prediction stability. Additional results on LLaMA and smaller GPT-2 models are available in Tables 7 and 8 of the Appendix.

5.2 Comparison with Calibration Methods

We compared IDAICL with three advanced prediction calibration methods (ConCa, PROCA, and D-ConCa) on three PLMs: GPT-2, GPT-Neo, and LLaMA. Table 2 presents the results for the LLaMA models, where IDAICL achieves the best performance in every case except TREC with the 33B model. These findings suggest that IDAICL, which leverages statistical information derived from the input data distribution for prediction calibration, generally outperforms methods that correct predictions using estimated biases. Further comparison results can be found in Table 9 of the Appendix.
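As a quick arithmetic check, the Avg. column in Table 2 appears to be the unweighted mean of the nine per-dataset Macro-F1 scores. The snippet below recomputes it for the LLaMA 13B rows of Vanilla ICL and IDAICL (values copied from Table 2) and identifies the dataset with the largest gain; the helper name `table_avg` is our own.

```python
# Macro-F1 (%) for LLaMA 13B, copied from Table 2.
datasets = ["SST-2", "SST-5", "MR", "CR", "Subj", "TREC", "DBPedia", "AGNews", "CB"]
vanilla_13b = [95.6, 29.5, 90.0, 91.4, 72.9, 62.8, 80.9, 80.2, 51.5]
idaicl_13b = [96.7, 47.1, 93.0, 93.3, 87.8, 76.0, 94.9, 87.7, 59.4]


def table_avg(scores):
    """Unweighted mean over datasets, rounded to one decimal as in the table."""
    return round(sum(scores) / len(scores), 1)


# Per-dataset improvement of IDAICL over Vanilla ICL, in Macro-F1 points.
gains = {name: round(i - v, 1) for name, v, i in zip(datasets, vanilla_13b, idaicl_13b)}
largest_gain = max(gains, key=gains.get)

print(table_avg(vanilla_13b), table_avg(idaicl_13b))  # 72.8 81.8
print(largest_gain, gains[largest_gain])               # SST-5 17.6
```

For LLaMA 13B this amounts to a 9.0-point average gain for IDAICL over Vanilla ICL, with the largest per-dataset gain on SST-5, one of the datasets with imbalanced training data.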

5.3 Stability Analysis

Previous studies Zhao et al. (2021); Sorensen et al. (2022); Min et al. (2022a); Zhang et al. (2022b) have highlighted the considerable variability of ICL performance. In this section, we verify that IDAICL effectively enhances performance stability across diverse scenarios.

Varying numbers of demonstrations

Table 1 reports results for different numbers of demonstrations, and Figure 3 plots the GPT-Neo results for a clearer depiction. As the number of demonstration examples $m$ increases, both Vanilla ICL and IDAICL improve; for IDAICL, this trend underscores the importance of comprehensive statistical properties of the input data. Notably, IDAICL improves performance stability across different numbers of demonstrations and consistently outperforms Vanilla ICL. The improvement is particularly pronounced for small values of $m$, indicating that IDAICL effectively enriches the knowledge available to PLMs.

Varying demonstrations

To confirm that augmenting demonstrations improves the robustness of ICL to the choice of demonstrations, we investigated three demonstration selection settings. Setting I: the training samples most similar to the test sample are chosen. Setting II: samples are randomly selected from the training data. Setting III: the training samples most dissimilar to the test sample are chosen. As shown in Figures 4(a) and (b), IDAICL significantly outperforms Vanilla ICL and is more robust across the three settings. Our results also suggest that selecting demonstrations similar to the test samples leads to better performance than selecting only dissimilar ones, which aligns with the findings of Wang et al. (2022).
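The three selection settings can be sketched as ranking training samples by their similarity to the test sample. The text does not specify the similarity measure or the encoder, so this sketch assumes cosine similarity over precomputed embeddings; `select_demonstrations` and the toy vectors are hypothetical.

```python
import numpy as np


def select_demonstrations(train_emb, test_emb, m, setting, seed=0):
    """Return indices of m training samples under Setting I, II, or III.

    ASSUMPTION: similarity is cosine similarity between precomputed embeddings;
    the paper does not specify the measure or the encoder.
    """
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    query = test_emb / np.linalg.norm(test_emb)
    sims = train @ query                     # cosine similarity to the test sample
    if setting == "I":                       # most similar samples first
        return np.argsort(-sims)[:m].tolist()
    if setting == "II":                      # uniformly random, without replacement
        rng = np.random.default_rng(seed)
        return rng.choice(len(train), size=m, replace=False).tolist()
    if setting == "III":                     # most dissimilar samples first
        return np.argsort(sims)[:m].tolist()
    raise ValueError(f"unknown setting: {setting!r}")


# Toy 2-D "embeddings" (placeholders for real sentence embeddings).
train_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [-0.9, -0.2]])
test_emb = np.array([1.0, 0.0])

for setting in ("I", "II", "III"):
    print(setting, select_demonstrations(train_emb, test_emb, m=2, setting=setting))
```

Setting I picks the two vectors pointing in nearly the same direction as the query, Setting III picks the two pointing away from it, and Setting II ignores similarity entirely.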

Figure 5: (a) and (b): Accuracy comparison on the SST-2 and MR datasets, where the proportion of the negative class in demonstrations (denoted $p$) varies from 0.1 to 0.5. (c) and (d): Confusion matrices for the CR and Subj datasets when the proportion of one category in demonstrations is set to 0.1 and 0.2. The analysis uses the GPT-2 model with 1.5B parameters, with $m$ set to 12. IDAICL is more robust to imbalanced class distributions within demonstrations.
Figure 6: Accuracy across different $\lambda$ and $\tau$ values, using GPT-2 with 0.8B parameters, with $m$ set to 12. $\lambda = 0$ and $\tau = 0$ signify that the two modulating factors and the class proportion term, respectively, are not used.

Varying templates

To assess IDAICL across templates, we employed fifteen templates on the SST-2 dataset, following Zhao et al. (2021). The templates are listed in Table 10 of the Appendix. Figures 4(c) and (d) show the performance of Vanilla ICL and IDAICL across six of these templates. Although some templates achieve higher average performance than others, IDAICL consistently improves both average and worst-case accuracy while reducing performance variance across templates. The complete results are available in Figure 7 of the Appendix.

Impact of imbalance in labels

Figures 5(a) and (b) compare Vanilla ICL, MetaICL, Channel ICL, and IDAICL across different degrees of imbalance. The performance of Vanilla ICL is clearly sensitive to class imbalance, whereas IDAICL and Channel ICL are robust to it. Moreover, the improvements from IDAICL are more notable at higher levels of imbalance. Figures 5(c) and (d) show the confusion matrices for the CR and Subj datasets, with the proportion of one category (i.e., "Negative" and "Subjective") in demonstrations set to 0.1 and 0.2. Compared to Vanilla ICL, IDAICL significantly improves the accuracy of the underrepresented classes, thereby improving fairness among classes. In the next section, we show that IDAICL's strong performance under imbalanced label distributions stems from both the statistical properties and the class proportion term.

Dataset 0-shot 1-shot 4-shot IDAICL
SST-2 63.2 61.3±9.4 57.6±7.1 **76.3**
SST-5 25.0 27.3±7.9 30.4±6.3 **33.5**
MR 58.9 54.3±6.8 59.3±6.5 **71.2**
Subj 48.9 47.1±8.3 57.6±5.4 **67.3**
Table 3: Accuracy comparison between Vanilla ICL (0-, 1-, and 4-shot) and IDAICL based solely on statistical properties (no demonstrations), using the GPT-2 model with 0.8B parameters.

5.4 Sensitivity and Ablation Studies

We conducted ablation studies on IDAICL to investigate the influence of the two modulating factors and the class proportion term. The parameters $\lambda$ and $\tau$ control the augmentation strength and the impact of the class proportion term, respectively. Figure 6(a) shows a significant performance drop when predictions are not calibrated using statistical properties derived from the demonstrations (i.e., $\lambda = 0$). The best performance is achieved at $\lambda = 0.5$.

Figure 6(b) shows the accuracy on the SST-2 and MR datasets with the negative class proportion in demonstrations set to 0.1. The results indicate that leveraging only the statistical properties (i.e., $\tau = 0$) already improves performance under imbalanced demonstrations, and that adding the class proportion term brings further improvements, with the best performance at $\tau = 1$. We therefore recommend setting $\lambda$ to 0.5 and $\tau$ to 1 in practical applications. More results are presented in Appendix F.

5.5 Further Discussion

To further investigate how the statistical properties of demonstrations affect model performance, we performed inference using only the queries and the statistical information, without including any demonstrations for the test samples. These statistics were estimated from the deep features of all training samples. As shown in Table 3, IDAICL relying solely on statistical properties clearly outperforms Vanilla ICL with zero, one, and even four demonstrations. This underscores the crucial role of prior statistics obtained from training data in PLMs' predictions. The result is plausible, as statistical properties inherently encode richer global information than individual demonstrations.
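The statistics in question are the feature mean and covariance matrix (as noted in the Limitations). A minimal NumPy sketch of estimating them from a matrix of deep features is given below; how the features are extracted from the PLM (for example, which layer or token position) is an assumption left outside the sketch, and the random features are placeholders.

```python
import numpy as np


def feature_statistics(features):
    """Return the mean vector and unbiased covariance matrix of deep features.

    `features` is an (n_samples, hidden_dim) array, one row per sample.
    """
    features = np.asarray(features, dtype=float)
    mu = features.mean(axis=0)
    centered = features - mu
    # Divide by n - 1 (unbiased estimate); guard against n = 1.
    sigma = centered.T @ centered / max(features.shape[0] - 1, 1)
    return mu, sigma


rng = np.random.default_rng(0)
hidden_dim = 8

# Placeholder features: "all training samples" vs. a handful of demonstrations.
train_features = rng.normal(size=(500, hidden_dim))
demo_features = train_features[:4]

mu_train, sigma_train = feature_statistics(train_features)
mu_demo, sigma_demo = feature_statistics(demo_features)

print(np.linalg.matrix_rank(sigma_train))  # full rank with 500 samples
print(np.linalg.matrix_rank(sigma_demo))   # at most 4 - 1 = 3
```

With only $n$ samples, the estimated covariance has rank at most $n - 1$, which is one reason statistics estimated from all training samples may carry more global information than those from a few demonstrations.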

6 Conclusion

This study introduces IDAICL, a novel ICL approach that enhances demonstrations using semantic directions sampled from the deep feature distribution of demonstration examples. Our augmentation strategy enriches the knowledge available to PLMs without extending the context length. We then theoretically derive a new prediction function by letting the number of augmented copies approach infinity, which eliminates the need for explicit augmentation and allows this function to be used directly for prediction. Extensive experiments across various tasks and PLMs demonstrate that IDAICL significantly improves both prediction accuracy and stability compared to other ICL baselines.

Limitations

While IDAICL proves competitive in few-shot learning, it has limitations that open avenues for future research. First, because it requires access to the parameters of the final fully connected layer of the PLM, IDAICL is applicable only to open-source models. Future work could develop alternative augmentation strategies tailored to black-box PLMs. Second, our evaluation of IDAICL covered seven PLMs and ten text classification tasks; we leave other PLMs and non-classification tasks for future work. Additionally, IDAICL relies on a small set of demonstrations to estimate the feature mean and covariance matrix. If such a set is unavailable or extremely scarce, IDAICL may need to be combined with demonstration generation methods.

Another avenue for future work is to explore more effective augmentation distributions. This includes finer-grained distributions, such as category-level or sample-level distributions that emphasize the unique characteristics of individual categories or samples, as well as distributions that extend beyond the constraints of the training data. Furthermore, given the effectiveness of data augmentation in model training, future research could explore using our derived prediction function in both the training and fine-tuning phases of large PLMs.

References

Dataset Task Avg. length Classes Balanced
SST-2 Socher et al. (2013) Sentiment analysis 12.4 2 Yes
SST-5 Socher et al. (2013) Sentiment analysis 23.1 5 No
MR Pang and Lee (2005) Sentiment analysis 25.7 2 Yes
CR Hu and Liu (2004) Sentiment analysis 22.1 2 Yes
Amazon McAuley and Leskovec (2013) Sentiment analysis 78.5 5 No
Subj Pang and Lee (2004) Subjectivity classification 28.9 2 Yes
TREC Voorhees and Tice (2000) Question classification 11.6 6 No
DBPedia Lehmann et al. (2015) Ontology classification 65.5 14 Yes
AGNews Zhang et al. (2015) News classification 53.8 4 Yes
CB De Marneffe et al. (2019) Natural language inference 69.7/8.4 3 No
Table 4: Statistics of the ten datasets. Average length is measured in GPT-2 tokens. For sentence-pair tasks, the average length of each sentence is reported separately.
Dataset Instances Label names
SST-2 1. This movie is amazing! (Label = "Positive") 2. Horrific movie, don’t see it. (Label = "Negative") Positive, Negative
SST-5 1. A pretensions – and disposable story — sink the movie. (Label = "Great") 2. Apparently reassembled from the cutting-room floor of any given daytime soap. (Label = "Terrible") Terrible, Bad, Okay, Good, Great
MR 1. Lame sweet home leaves no southern stereotype unturned. (Label = "Negative") 2. Not so much farcical as sour. (Label = "Negative") Negative, Positive
CR 1. It takes excellent pics and is very easy to use, if you read the manual. (Label = "Negative") 2. Bluetooth does not work on this phone. (Label = "Negative") Negative, Positive
Amazon 1. Don’t waste your money if you already have 2003… There isn’t one reason to get this update if you already have MS Money 2003 Deluxe and Business. (Label = "Terrible") 2. The game was in perfect condition! came before it said it should have by 2 days!! I love the game and I suggest it to a lot of my friends! (Label = "Great") Terrible, Bad, Okay, Good, Great
Subj 1. This is a story about the warm relationship between a little girl and her father despite the difficult conditions they have to live in. (Label = "Objective") 2. Too slow, too boring, and occasionally annoying. (Label = "Subjective") Subjective, Objective
TREC 1. When did the neanderthal man live? (Label = "Number") 2. How do you get a broken cork out of a bottle? (Label = "Description") Description, Entity, Expression, Human, Location, Number
DBPedia 1. CMC Aviation is a charter airline based in Nairobi Kenya. (Label = "Company") 2. Dialectica aemula is a moth of the Gracillariidae family. (Label = "Animal") Company, School, Artist, Athlete, Politics, Transportation, Building, Nature, Village, Animal, Plant, Album, Film, Book
AGNews 1. Walk in park for Yankees Drained by a difficult week, the New York Yankees needed an uplifting victory. (Label = "Sports") 2. NASA Mountain View claims world’s fastest computer. (Label = "Technology") World, Sports, Business, Technology
CB 1. It was a complex language. Not written down but handed down. One might say it was peeled down. The language was peeled down. (Label = "True") 2. “Do you mind if I use your phone?” Ronni could see that Guido’s brain was whirring. Guido’s brain was whirring. (Label = "True") True, False, Neither
Table 5: Examples and label names from all datasets.
Dataset Template Label mapping
SST-2 Review: {Sentence} Sentiment: {Label} Positive / Negative
SST-5 Review: {Sentence} Sentiment: {Label} terrible / bad / okay / good / great
MR Review: {Sentence} Sentiment: {Label} Positive / Negative
CR Review: {Sentence} Sentiment: {Label} Positive / Negative
Subj Input: {Sentence} Type: {Label} objective / subjective
TREC Question: {Sentence} Type: {Label} description / entity / expression / human / location / number
Amazon Review: {Sentence} Sentiment: {Label} terrible / bad / okay / good / great
AGNews Input: {Sentence} Type: {Label} world / sports / business / technology
DBPedia Input: {Sentence} Type: {Label} company / school / artist / athlete / politics / transportation / building / nature / village / animal / plant / album / film / book
CB Premise: {Sentence} Hypothesis: {Sentence} Prediction: {Label} true / false / neither
Table 6: Prompt templates and label mappings for each dataset.
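The templates in Table 6 can be filled mechanically. The sketch below shows one plausible way to assemble an ICL prompt for SST-2 from its listed template and label mapping. The newline between the two template fields, the blank-line separator between demonstrations, and the helper names `format_example` and `build_prompt` are assumptions for illustration, not the authors' exact formatting.

```python
from typing import List, Sequence, Tuple

# SST-2 entry from Table 6; placing the fields on separate lines is an assumption.
SST2_TEMPLATE = "Review: {sentence}\nSentiment: {label}"
SST2_LABEL_WORDS = ["Positive", "Negative"]  # order as listed in Table 6


def format_example(template: str, sentence: str, label: str = "") -> str:
    """Hypothetical helper: fill one template; an empty label leaves the answer slot open."""
    return template.format(sentence=sentence, label=label).rstrip()


def build_prompt(
    template: str,
    label_words: Sequence[str],
    demonstrations: List[Tuple[str, int]],
    query: str,
    separator: str = "\n\n",
) -> str:
    """Hypothetical helper: filled demonstrations followed by the unanswered query."""
    parts = [format_example(template, s, label_words[y]) for s, y in demonstrations]
    parts.append(format_example(template, query))
    return separator.join(parts)


if __name__ == "__main__":
    demos = [("This movie is amazing!", 0), ("Horrific movie, don't see it.", 1)]
    print(build_prompt(SST2_TEMPLATE, SST2_LABEL_WORDS, demos, "A gripping thriller."))
```

The prompt ends with "Sentiment:", so the PLM's next-token scores over the label words can be compared directly. CB's template has two {Sentence} slots and would need a second placeholder.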

Appendix A Details of Applied Datasets

Table 4 presents statistics for all datasets used in this study, including task descriptions, average sentence lengths, numbers of classes, and whether the classes are balanced. Table 5 provides example instances and label names for each dataset.

PLM Method m SST-2 SST-5 MR CR Subj TREC DBPedia AGNews CB Avg.
13B Vanilla ICL 4 95.6 (7.1) 29.5 (6.2) 90.0 (5.8) 91.4 (7.4) 72.9 (6.9) 62.8 (9.1) 80.9 (7.6) 80.2 (5.9) 51.5 (8.2) 72.8
IDAICL 96.7 (2.5) 47.1 (1.1) 93.0 (1.9) 93.3 (0.8) 87.8 (2.3) 76.0 (2.6) 94.9 (1.0) 87.7 (2.4) 59.4 (1.9) 81.8
Vanilla ICL 8 96.7 (7.1) 39.4 (5.6) 92.3 (6.2) 92.2 (4.8) 70.8 (5.1) 71.2 (9.1) 83.7 (4.2) 79.5 (6.3) 52.4 (3.7) 75.4
IDAICL 96.9 (2.1) 49.2 (1.9) 93.4 (1.6) 92.9 (1.9) 87.5 (3.0) 79.9 (2.1) 93.6 (0.9) 88.0 (1.7) 62.4 (2.5) 82.6
33B Vanilla ICL 4 95.5 (7.2) 29.4 (5.6) 91.7 (5.4) 91.5 (8.1) 85.1 (6.0) 70.9 (4.4) 86.6 (4.5) 76.2 (6.1) 59.2 (5.3) 76.2
IDAICL 96.5 (1.1) 46.8 (2.4) 93.6 (1.3) 92.3 (3.3) 89.3 (2.4) 79.1 (1.5) 95.6 (2.3) 88.4 (1.9) 64.6 (2.8) 82.9
Vanilla ICL 8 96.8 (7.3) 34.3 (5.4) 93.4 (5.8) 92.7 (6.4) 83.5 (5.5) 66.9 (4.8) 84.1 (6.2) 84.7 (5.5) 62.0 (5.2) 77.6
IDAICL 96.9 (2.3) 50.3 (1.5) 93.9 (2.2) 93.0 (1.4) 89.0 (1.0) 83.1 (1.7) 95.9 (2.0) 88.0 (1.2) 70.4 (1.8) 84.5
Table 7: Macro-F1 comparison between Vanilla ICL and IDAICL under varying values of m on the LLaMA models with 13B and 33B parameters.
PLM Method m SST-2 SST-5 MR CR Amazon Subj TREC DBPedia AGNews CB
GPT-2 0.1B Vanilla ICL 4 56.3 (7.1) 28.4 (8.8) 55.4 (7.4) 54.2 (6.2) 30.8 (8.4) 52.9 (7.9) 32.2 (5.1) 44.3 (6.2) 42.8 (9.3) 42.1 (9.6)
IDAICL 69.5 (2.6) 35.3 (1.1) 66.4 (2.3) 67.2 (2.7) 39.3 (2.9) 57.2 (2.6) 44.3 (1.8) 62.2 (2.3) 65.5 (2.7) 49.2 (1.9)
Vanilla ICL 8 60.8 (8.3) 30.6 (6.9) 57.5 (9.7) 56.0 (5.1) 33.6 (7.8) 53.7 (5.6) 33.0 (10.7) 52.1 (5.8) 45.6 (9.1) 45.4 (6.2)
IDAICL 71.4 (1.8) 36.1 (2.9) 67.6 (1.8) 68.6 (2.2) 40.0 (0.7) 58.5 (2.5) 45.6 (1.9) 63.6 (1.1) 66.9 (1.6) 50.6 (2.7)
Vanilla ICL 12 64.5 (6.0) 30.8 (7.1) 59.3 (5.6) 59.1 (8.4) 33.9 (5.5) 56.6 (8.9) 35.8 (7.1) 52.3 (11.4) 47.4 (6.0) 47.4 (7.7)
IDAICL 72.2 (1.1) 36.7 (2.2) 70.1 (1.7) 69.3 (1.8) 40.8 (1.2) 60.9 (1.5) 47.0 (2.7) 65.5 (1.9) 67.8 (2.2) 51.2 (3.3)
Vanilla ICL 16 64.3 (6.1) 33.5 (7.1) 59.9 (6.6) 61.7 (7.5) 34.6 (6.9) 56.1 (6.2) 36.9 (5.7) 54.1 (7.2) 47.9 (8.0) 48.9 (7.7)
IDAICL 72.9 (2.5) 38.0 (2.4) 69.7 (1.3) 69.9 (2.1) 41.7 (0.9) 60.6 (1.1) 46.6 (1.9) 65.9 (2.6) 65.7 (1.0) 51.8 (2.2)
GPT-2 0.3B Vanilla ICL 4 60.8 (7.5) 26.6 (6.8) 50.5 (7.1) 52.3 (6.1) 30.5 (5.2) 53.2 (8.3) 32.8 (8.1) 50.5 (4.8) 41.3 (5.9) 42.7 (7.1)
IDAICL 78.4 (1.7) 33.1 (2.5) 66.6 (0.9) 70.3 (2.3) 40.1 (1.5) 69.4 (1.7) 45.6 (3.3) 66.2 (2.1) 62.8 (3.7) 50.4 (1.8)
Vanilla ICL 8 58.9 (8.7) 29.4 (6.1) 52.4 (8.9) 54.8 (8.2) 32.7 (7.9) 53.5 (6.7) 34.0 (8.2) 59.1 (9.7) 43.8 (6.4) 46.9 (7.6)
IDAICL 80.8 (1.7) 34.8 (1.9) 69.5 (1.1) 71.5 (0.8) 41.5 (1.7) 70.3 (2.6) 46.2 (2.2) 68.1 (1.7) 63.3 (2.1) 51.5 (2.5)
Vanilla ICL 12 62.9 (14.4) 30.6 (7.8) 55.2 (6.2) 56.1 (6.7) 34.2 (7.5) 56.8 (7.1) 36.2 (9.8) 58.0 (7.3) 46.5 (9.3) 48.6 (6.6)
IDAICL 82.2 (2.3) 36.1 (1.8) 68.9 (2.4) 72.0 (1.5) 43.7 (0.6) 71.4 (2.4) 48.3 (1.3) 70.5 (1.9) 65.2 (2.2) 52.9 (1.4)
Vanilla ICL 16 67.4 (6.3) 31.7 (7.1) 57.6 (8.6) 56.6 (5.2) 34.7 (6.2) 57.0 (5.3) 38.1 (6.9) 59.3 (8.2) 45.2 (7.6) 49.4 (8.7)
IDAICL 81.5 (2.8) 36.8 (1.2) 70.4 (1.7) 72.9 (2.1) 43.1 (1.3) 71.9 (2.7) 48.7 (1.1) 70.9 (2.9) 65.8 (1.2) 52.4 (1.8)
Table 8: Accuracy comparison between Vanilla ICL and IDAICL under varying values of m on the GPT-2 models with 0.1B and 0.3B parameters.
Figure 7: Comparison between Vanilla ICL and IDAICL across fifteen templates, evaluated on the GPT-2 model with 1.5B parameters. IDAICL outperforms Vanilla ICL and is more robust to the choice of template.
PLM Method SST-5 MR AGNews TREC SST-2 Subj DBPedia Avg.
GPT-2 1.5B Vanilla ICL 30.8 (6.1) 64.9 (8.3) 57.5 (6.7) 40.4 (5.1) 57.2 (7.0) 57.3 (10.3) 67.6 (7.5) 53.7
ConCa 32.8 (7.1) 74.5 (5.1) 62.7 (6.1) 45.8 (2.5) 73.9 (8.6) 68.3 (7.4) 75.0 (4.0) 61.9
PROCA* _36.5_ (4.4) 80.8 (6.4) 75.5 (3.2) 46.0 (2.5) _88.0_ (1.3) **80.2** (3.3) _89.4_ (0.7) _70.9_
D-ConCa 31.7 (3.3) _80.9_ (3.7) _77.0_ (4.1) _47.1_ (2.8) 86.5 (4.4) 76.8 (5.2) 86.1 (6.3) 69.4
IDAICL **40.8** (1.9) **82.1** (1.2) **80.8** (2.4) **52.0** (2.5) **89.5** (1.8) _80.1_ (2.9) **91.0** (2.5) **73.8**
GPT-Neo Vanilla ICL 31.5 (6.4) 70.6 (8.1) 71.9 (6.8) 53.0 (6.9) 74.9 (8.3) 57.9 (6.3) 78.5 (6.5) 62.6
ConCa 33.94.3subscript33.94.3{33.9}_{{\color[rgb]{0,0,1}4.3}}33.9 start_POSTSUBSCRIPT 4.3 end_POSTSUBSCRIPT 78.25.3subscript78.25.378.2_{{\color[rgb]{0,0,1}5.3}}78.2 start_POSTSUBSCRIPT 5.3 end_POSTSUBSCRIPT 73.63.8subscript73.63.873.6_{{\color[rgb]{0,0,1}3.8}}73.6 start_POSTSUBSCRIPT 3.8 end_POSTSUBSCRIPT 55.97.2subscript55.97.2{55.9}_{{\color[rgb]{0,0,1}7.2}}55.9 start_POSTSUBSCRIPT 7.2 end_POSTSUBSCRIPT 82.09.5subscript82.09.5{82.0}_{{\color[rgb]{0,0,1}9.5}}82.0 start_POSTSUBSCRIPT 9.5 end_POSTSUBSCRIPT 71.36.4subscript71.36.4{71.3}_{{\color[rgb]{0,0,1}6.4}}71.3 start_POSTSUBSCRIPT 6.4 end_POSTSUBSCRIPT 90.03.6subscript90.03.6{90.0}_{{\color[rgb]{0,0,1}3.6}}90.0 start_POSTSUBSCRIPT 3.6 end_POSTSUBSCRIPT 69.3
PROCAsuperscriptA{\text{A}}^{*}A start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT 39.4¯4.0subscript¯39.44.0{\underline{39.4}}_{{\color[rgb]{0,0,1}4.0}}under¯ start_ARG 39.4 end_ARG start_POSTSUBSCRIPT 4.0 end_POSTSUBSCRIPT 77.813.9subscript77.813.9{{77.8}}_{{\color[rgb]{0,0,1}13.9}}77.8 start_POSTSUBSCRIPT 13.9 end_POSTSUBSCRIPT 78.92.5subscript78.92.5{78.9}_{{\color[rgb]{0,0,1}2.5}}78.9 start_POSTSUBSCRIPT 2.5 end_POSTSUBSCRIPT 56.03.6subscript56.03.6{56.0}_{{\color[rgb]{0,0,1}3.6}}56.0 start_POSTSUBSCRIPT 3.6 end_POSTSUBSCRIPT 91.91.2subscript91.91.2{\boldsymbol{91.9}}_{{\color[rgb]{0,0,1}1.2}}bold_91.9 start_POSTSUBSCRIPT 1.2 end_POSTSUBSCRIPT 81.3¯3.8subscript¯81.33.8{\underline{81.3}}_{{\color[rgb]{0,0,1}3.8}}under¯ start_ARG 81.3 end_ARG start_POSTSUBSCRIPT 3.8 end_POSTSUBSCRIPT 92.0¯1.5subscript¯92.01.5{\underline{92.0}}_{{\color[rgb]{0,0,1}1.5}}under¯ start_ARG 92.0 end_ARG start_POSTSUBSCRIPT 1.5 end_POSTSUBSCRIPT 73.9¯¯73.9\underline{73.9}under¯ start_ARG 73.9 end_ARG
D-ConCa 32.94.1subscript32.94.1{{32.9}}_{{\color[rgb]{0,0,1}4.1}}32.9 start_POSTSUBSCRIPT 4.1 end_POSTSUBSCRIPT 84.6¯2.8subscript¯84.62.8{\underline{84.6}}_{{\color[rgb]{0,0,1}2.8}}under¯ start_ARG 84.6 end_ARG start_POSTSUBSCRIPT 2.8 end_POSTSUBSCRIPT 81.2¯3.9subscript¯81.23.9{\underline{81.2}}_{{\color[rgb]{0,0,1}3.9}}under¯ start_ARG 81.2 end_ARG start_POSTSUBSCRIPT 3.9 end_POSTSUBSCRIPT 57.6¯4.7subscript¯57.64.7{\underline{57.6}}_{{\color[rgb]{0,0,1}4.7}}under¯ start_ARG 57.6 end_ARG start_POSTSUBSCRIPT 4.7 end_POSTSUBSCRIPT 91.6¯5.3subscript¯91.65.3{\underline{91.6}}_{{\color[rgb]{0,0,1}5.3}}under¯ start_ARG 91.6 end_ARG start_POSTSUBSCRIPT 5.3 end_POSTSUBSCRIPT 70.92.9subscript70.92.9{70.9}_{{\color[rgb]{0,0,1}2.9}}70.9 start_POSTSUBSCRIPT 2.9 end_POSTSUBSCRIPT 85.73.1subscript85.73.1{85.7}_{{\color[rgb]{0,0,1}3.1}}85.7 start_POSTSUBSCRIPT 3.1 end_POSTSUBSCRIPT 72.1
IDAICL 42.22.5subscript42.22.5{\boldsymbol{42.2}}_{{\color[rgb]{0,0,1}2.5}}bold_42.2 start_POSTSUBSCRIPT 2.5 end_POSTSUBSCRIPT 85.91.6subscript85.91.6{\boldsymbol{85.9}}_{{\color[rgb]{0,0,1}1.6}}bold_85.9 start_POSTSUBSCRIPT 1.6 end_POSTSUBSCRIPT 83.11.9subscript83.11.9{\boldsymbol{83.1}}_{{\color[rgb]{0,0,1}1.9}}bold_83.1 start_POSTSUBSCRIPT 1.9 end_POSTSUBSCRIPT 61.41.7subscript61.41.7{\boldsymbol{61.4}}_{{\color[rgb]{0,0,1}1.7}}bold_61.4 start_POSTSUBSCRIPT 1.7 end_POSTSUBSCRIPT 91.22.4subscript91.22.4{{91.2}}_{{\color[rgb]{0,0,1}2.4}}91.2 start_POSTSUBSCRIPT 2.4 end_POSTSUBSCRIPT 82.33.1subscript82.33.1{\boldsymbol{82.3}}_{{\color[rgb]{0,0,1}3.1}}bold_82.3 start_POSTSUBSCRIPT 3.1 end_POSTSUBSCRIPT 93.01.5subscript93.01.5{\boldsymbol{93.0}}_{{\color[rgb]{0,0,1}1.5}}bold_93.0 start_POSTSUBSCRIPT 1.5 end_POSTSUBSCRIPT 77.077.0\boldsymbol{77.0}bold_77.0
Table 9: Accuracy comparison between IDAICL and other prediction calibration approaches using GPT-2 (1.5B parameters) and GPT-Neo, with m set to 8. The templates follow those of Han et al. (2023). * indicates that the results are taken from the original paper.
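As a quick consistency check on Table 9, the sketch below recomputes the last column as the unweighted mean of the seven per-dataset accuracies (standard deviations omitted) and compares it with the reported value. It assumes the last column is a plain average over the seven datasets, which the caption does not state explicitly; the dictionary layout is illustrative only.

```python
# Mean accuracies transcribed from Table 9 (standard deviations omitted).
# Assumption: the final column is the unweighted mean of the seven datasets.
TABLE9 = {
    ("GPT-2 1.5B", "Vanilla ICL"): ([30.8, 64.9, 57.5, 40.4, 57.2, 57.3, 67.6], 53.7),
    ("GPT-2 1.5B", "ConCa"): ([32.8, 74.5, 62.7, 45.8, 73.9, 68.3, 75.0], 61.9),
    ("GPT-2 1.5B", "PROCA*"): ([36.5, 80.8, 75.5, 46.0, 88.0, 80.2, 89.4], 70.9),
    ("GPT-2 1.5B", "D-ConCa"): ([31.7, 80.9, 77.0, 47.1, 86.5, 76.8, 86.1], 69.4),
    ("GPT-2 1.5B", "IDAICL"): ([40.8, 82.1, 80.8, 52.0, 89.5, 80.1, 91.0], 73.8),
    ("GPT-Neo", "Vanilla ICL"): ([31.5, 70.6, 71.9, 53.0, 74.9, 57.9, 78.5], 62.6),
    ("GPT-Neo", "ConCa"): ([33.9, 78.2, 73.6, 55.9, 82.0, 71.3, 90.0], 69.3),
    ("GPT-Neo", "PROCA*"): ([39.4, 77.8, 78.9, 56.0, 91.9, 81.3, 92.0], 73.9),
    ("GPT-Neo", "D-ConCa"): ([32.9, 84.6, 81.2, 57.6, 91.6, 70.9, 85.7], 72.1),
    ("GPT-Neo", "IDAICL"): ([42.2, 85.9, 83.1, 61.4, 91.2, 82.3, 93.0], 77.0),
}


def check_averages(table, tol=0.05):
    """Return rows whose reported average differs from the recomputed mean by more than tol."""
    mismatches = []
    for key, (accs, reported) in table.items():
        recomputed = sum(accs) / len(accs)
        if abs(recomputed - reported) > tol:
            mismatches.append((key, round(recomputed, 2), reported))
    return mismatches


def gain_over_vanilla(table, model, method):
    """Difference in reported average accuracy between a method and Vanilla ICL."""
    return round(table[(model, method)][1] - table[(model, "Vanilla ICL")][1], 1)


print(check_averages(TABLE9))                              # []
print(gain_over_vanilla(TABLE9, "GPT-2 1.5B", "IDAICL"))   # 20.1
print(gain_over_vanilla(TABLE9, "GPT-Neo", "IDAICL"))      # 14.4
```

All ten reported averages agree with the recomputed means to within rounding, so the column can be read as a simple average over the seven datasets.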
Figure 8: Sensitivity of IDAICL to its two hyperparameters, λ and τ, using the GPT-2 model with 0.8B parameters and m set to 12. Optimal performance is achieved when λ ≈ 0.5 and τ ≈ 1.

Appendix B Details of Compared Baselines

The compared methods are described as follows:

  • Vanilla ICL: We use the PLMs as they are and implement ICL by conditioning them on a concatenation of m training examples, following Brown et al. (2020).

  • MetaICL: It meta-trains the model to perform in-context learning within a multi-task learning framework spanning a diverse range of meta-training tasks Min et al. (2022b).

  • Channel ICL: It employs a noisy channel approach for language model prompting in few-shot text classification Min et al. (2022a).

  • EPR: It uses language models to automatically label which examples serve as effective prompts and then trains a prompt retriever on this signal Rubin et al. (2022).

  • ConCa: It estimates the model's bias toward specific answers by feeding a content-free dummy test input and then calibrates the output probabilities accordingly Zhao et al. (2021).

  • PROCA: It calibrates predictions based on the likelihood of prototypical clusters Han et al. (2023).

  • D-ConCa: It first estimates the effects of various label biases using words randomly sampled from the task corpus, and then uses the estimated label bias to calibrate the model's output probabilities during inference Fei et al. (2023).
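To make the ConCa baseline concrete, here is a minimal sketch of contextual calibration in the form described by Zhao et al. (2021): the label probabilities obtained for a content-free input (such as "N/A") define a diagonal correction W = diag(p_cf)^-1 with zero bias, and the test-time label probabilities are recalibrated as softmax(W p). The probability vectors below are illustrative inputs; obtaining them from a PLM is outside the scope of this sketch.

```python
import math


def normalize(probs):
    """Renormalize probabilities restricted to the label words."""
    total = sum(probs)
    return [p / total for p in probs]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextual_calibrate(p_test, p_content_free):
    """Contextual calibration: softmax(W p) with W = diag(p_cf)^-1 and zero bias."""
    p = normalize(p_test)
    p_cf = normalize(p_content_free)
    scaled = [pi / ci for pi, ci in zip(p, p_cf)]  # W p for a diagonal W
    return softmax(scaled)


# Illustrative label probabilities for ("positive", "negative").
p_test = [0.7, 0.3]          # raw prediction favours class 0
p_content_free = [0.8, 0.2]  # the model already favours class 0 on a content-free input
calibrated = contextual_calibrate(p_test, p_content_free)
prediction = max(range(len(calibrated)), key=calibrated.__getitem__)
print(prediction)  # 1
```

Because the correction divides by the content-free probabilities, a class that the model favours regardless of content is down-weighted; D-ConCa follows the same pattern but estimates the bias from randomly sampled in-domain words instead of a single content-free string.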

Appendix C More Details of Experimental Settings

Our implementation is built on PyTorch Paszke et al. (2019) and Transformers Wolf et al. (2020). We follow the parameter configurations and details specified in previous research Min et al. (2022a). The number of demonstrations is set to m = 12 by default, and we also explore m ∈ {1, 4, 8, 12, 16} in the ablations, with the specific settings detailed in the respective sections. Unless a method employs a dedicated selection strategy, such as EPR Rubin et al. (2022), the demonstration examples for each test sample are randomly selected from the training data. The feature mean and covariance matrix are estimated from the demonstration set, which contains the demonstration examples of all test samples. Unlike previous studies, which assume that training examples are equally distributed across classes Gao et al. (2021); Logan IV et al. (2022), we make no such assumption, in order to enable a more realistic and demanding evaluation.
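The sketch below illustrates the two steps described above under stated assumptions: m demonstrations are drawn uniformly at random (with no class balancing) for each test sample, and the feature mean and covariance are then estimated over the pooled demonstrations of all test samples. The feature vectors are random stand-ins for the deep features a PLM would produce, the helper names are hypothetical, and counting each pooled training example once is an assumption rather than a detail confirmed by the text.

```python
import random


def sample_demonstrations(num_train, num_test, m, seed=0):
    """Draw m distinct training indices per test sample, uniformly at random (no class balancing)."""
    rng = random.Random(seed)
    return [rng.sample(range(num_train), m) for _ in range(num_test)]


def mean_and_covariance(features):
    """Sample mean and unbiased (n - 1) covariance of a list of d-dimensional vectors."""
    n, d = len(features), len(features[0])
    mean = [sum(f[j] for f in features) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for f in features:
        centered = [f[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centered[a] * centered[b]
    return mean, [[v / (n - 1) for v in row] for row in cov]


# Hypothetical stand-in features for 100 training examples (a PLM would supply these).
rng = random.Random(1)
train_features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(100)]

demo_sets = sample_demonstrations(num_train=100, num_test=20, m=12)
# Pool the demonstrations of all test samples; each training example is counted once (assumption).
pooled = sorted({i for demos in demo_sets for i in demos})
mean, cov = mean_and_covariance([train_features[i] for i in pooled])
```

Sampling indices rather than examples keeps the sketch independent of how the training data and features are stored.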

Each experiment is repeated with five different random seeds. For datasets with shorter texts, including SST-2 Socher et al. (2013), SST-5 Socher et al. (2013), MR Pang and Lee (2005), CR Hu and Liu (2004), and TREC Voorhees and Tice (2000), the batch size is set to 32 and the sequence length to 128. For datasets with longer input texts, including AGNews Zhang et al. (2015), DBPedia Lehmann et al. (2015), Subj Pang and Lee (2004), CB De Marneffe et al. (2019), and Amazon McAuley and Leskovec (2013), a batch size of 16 and a sequence length of 256 are used. The IDAICL hyperparameters λ and τ are fixed at 0.5 and 1, respectively, except in the sensitivity tests. The settings of the compared methods follow those specified in their original papers Min et al. (2022a, b); Rubin et al. (2022); Zhao et al. (2021); Han et al. (2023); Fei et al. (2023). Accuracy is the primary evaluation metric; Macro-F1 is also reported for the LLaMA model. For each task, a specific template is used for inference, as detailed in Table 6. We also examine the impact of different templates on the performance of IDAICL using those outlined by Zhao et al. (2021), which include question-answer templates, conversation-style templates, prompts resembling web pages, and variations on label names, as listed in Table 10.
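The per-dataset settings above can be collected into a small configuration lookup. This is a minimal sketch: the `RunConfig` class and `config_for` helper are hypothetical names, and the concrete seed values are placeholders since the text only states that five seeds are used.

```python
from dataclasses import dataclass

SHORT_TEXT = {"SST-2", "SST-5", "MR", "CR", "TREC"}
LONG_TEXT = {"AGNews", "DBPedia", "Subj", "CB", "Amazon"}


@dataclass(frozen=True)
class RunConfig:
    batch_size: int
    max_seq_len: int
    lam: float = 0.5                 # lambda: strength of implicit augmentation
    tau: float = 1.0                 # tau: influence of the class proportion term
    seeds: tuple = (0, 1, 2, 3, 4)   # five runs; the seed values themselves are placeholders


def config_for(dataset):
    """Return the batch size and sequence length used for a dataset."""
    if dataset in SHORT_TEXT:
        return RunConfig(batch_size=32, max_seq_len=128)
    if dataset in LONG_TEXT:
        return RunConfig(batch_size=16, max_seq_len=256)
    raise ValueError(f"no setting listed for {dataset!r}")


print(config_for("SST-2"))
print(config_for("DBPedia"))
```

Keeping λ and τ as defaults on the config makes the sensitivity tests a matter of overriding two fields while every other setting stays fixed.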

Appendix D More Comparison Results

Table 7 presents the comparison between Vanilla ICL and IDAICL on LLaMA models with 13B and 33B parameters across various datasets, and Table 8 presents the corresponding results for GPT-2 models with 0.1B and 0.3B parameters. IDAICL consistently outperforms Vanilla ICL across all datasets and model sizes, indicating that it generalizes well. IDAICL also exhibits lower performance variance and substantially improves worst-case performance. As shown in Table 9, IDAICL generally outperforms the other prediction calibration methods, which suggests that statistical properties derived from the input data distribution play a significant role in the predictions of PLMs.
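The three quantities used in these comparisons (average accuracy, performance variance, and worst-case accuracy) can be computed from per-run accuracies as in the following sketch. The accuracies are made-up placeholders rather than results from the paper, and using the population standard deviation is an assumption.

```python
import statistics


def summarize_runs(accuracies):
    """Average, standard deviation, and worst-case accuracy over repeated runs."""
    return {
        "mean": statistics.fmean(accuracies),
        "std": statistics.pstdev(accuracies),  # population std (assumption)
        "worst": min(accuracies),
    }


# Placeholder accuracies from five seeds (not results from the paper).
summary = summarize_runs([80.0, 82.0, 78.0, 84.0, 76.0])
print(summary)
```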

Figure 9: (a) and (b): Average accuracy across ten datasets for various values of λ and τ. The best average performance is attained at λ = 0.5 and τ = 1. (c) and (d): Confusion matrices for the SST-2 and MR datasets under two levels of imbalance, where the proportion of the negative class in demonstrations is set to 0.1 and 0.2, respectively. Compared with Vanilla ICL, IDAICL improves performance on the minority class. These experiments are conducted on the GPT-2 model with 1.5B parameters, with m set to 12.
Figure 10: Comparison between Vanilla ICL and IDAICL across various demonstrations and permutations, using the GPT-2 model with 0.8B parameters and m set to 12. IDAICL exhibits smaller performance variance across different demonstrations and permutations than Vanilla ICL.

Appendix E More Results for Varying Templates

Figure 7 compares Vanilla ICL and IDAICL under all fifteen prompt templates, showing that IDAICL consistently improves both average and worst-case accuracy across all templates. Furthermore, the performance variance of IDAICL across templates is notably smaller than that of Vanilla ICL, highlighting the robustness of IDAICL to the choice of template.

Appendix F More Sensitivity and Ablation Studies

We performed sensitivity tests on two hyperparameters of IDAICL, λ and τ, which govern the strength of implicit augmentation and the influence of the class proportion term, respectively. As depicted in Figure 8, optimal performance is achieved when λ ≈ 0.5 and τ ≈ 1 for both datasets. Figures 9(a) and (b) further show the average performance over ten datasets across different hyperparameter settings. Consistent with the earlier findings, the best average performance is obtained at λ = 0.5 and τ = 1; we therefore recommend setting λ to 0.5 and τ to 1 in practice. Moreover, performance remains stable for λ ∈ {0.25, 0.5, 0.75} and τ ∈ {0.5, 1, 1.5}, so both hyperparameters can be adjusted within these ranges.
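A sensitivity test of this kind amounts to evaluating every (λ, τ) pair on a grid and averaging accuracy over datasets. The sketch below assumes a hypothetical `evaluate(lam, tau, dataset)` callable that returns accuracy; the toy scorer is synthetic and only illustrates the selection logic, and the grid reuses the stable ranges quoted above.

```python
from itertools import product
from statistics import fmean

LAMBDAS = (0.25, 0.5, 0.75)
TAUS = (0.5, 1.0, 1.5)


def grid_search(evaluate, datasets, lambdas=LAMBDAS, taus=TAUS):
    """Return the (lam, tau) pair with the highest mean accuracy, plus the full grid."""
    grid = {
        (lam, tau): fmean(evaluate(lam, tau, ds) for ds in datasets)
        for lam, tau in product(lambdas, taus)
    }
    best = max(grid, key=grid.get)
    return best, grid


# Synthetic scorer peaking at lam = 0.5, tau = 1 (illustration only, not paper data).
def toy_evaluate(lam, tau, dataset):
    return 80.0 - 10 * (lam - 0.5) ** 2 - 5 * (tau - 1.0) ** 2


best, grid = grid_search(toy_evaluate, ["SST-2", "MR"])
print(best)
```

Keeping the whole grid, not just the argmax, is what allows the stability of nearby settings to be inspected as in Figures 8 and 9(a, b).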

Appendix G More Results for Imbalanced Labels

An imbalanced label distribution in the training data significantly affects the classification performance of a model Zhou and Wu (2023b); Zhou et al. (2022). Figures 9(c) and (d) depict the confusion matrices for the SST-2 and MR datasets under two imbalance levels, in which the proportion of the negative class in demonstrations is set to 0.1 and 0.2. These results show that IDAICL substantially improves performance on the underrepresented class compared with Vanilla ICL, indicating that it can mitigate class imbalance in demonstrations.
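Confusion matrices like those in Figures 9(c) and (d) tabulate gold labels against predicted labels; dividing each diagonal entry by its row total gives per-class recall, which is where gains on the underrepresented class become visible. A minimal stdlib sketch with made-up predictions (not the paper's outputs):

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are gold labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by row total; 0.0 for a class with no gold examples."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


labels = ["negative", "positive"]
y_true = ["negative"] * 4 + ["positive"] * 4
# Made-up predictions that mostly collapse onto the majority (positive) class.
y_pred = ["positive", "positive", "negative", "positive"] + ["positive"] * 4
cm = confusion_matrix(y_true, y_pred, labels)
print(cm)                    # [[1, 3], [0, 4]]
print(per_class_recall(cm))  # [0.25, 1.0]
```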

Appendix H Varying Demonstration Permutations

Prior research has shown that the performance of ICL is sensitive to the permutation of demonstrations Lu et al. (2022); Zhao et al. (2021). We therefore assessed the performance of IDAICL under varying demonstration permutations. Specifically, we selected ten different sets of twelve training examples from the SST-2 dataset, shuffled the order of each set ten times, and computed the accuracy for each permutation. As depicted in Figure 10, IDAICL performs relatively stably across different demonstrations and permutations, whereas Vanilla ICL exhibits high variance.
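The protocol above (ten demonstration sets of twelve examples, each shuffled ten times) can be sketched as follows. The `evaluate(order)` callable is a hypothetical stand-in for running ICL on SST-2 with a given demonstration order, and the toy scorer is synthetic; only the sampling and shuffling structure reflects the described setup.

```python
import random
from statistics import fmean, pstdev


def permutation_study(train_pool, evaluate, n_sets=10, set_size=12, n_perms=10, seed=0):
    """Accuracy for n_perms shuffled orders of each of n_sets demonstration sets."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sets):
        demos = rng.sample(train_pool, set_size)
        row = []
        for _ in range(n_perms):
            order = demos[:]
            rng.shuffle(order)
            row.append(evaluate(order))
        results.append(row)
    return results


def spread(results):
    """Mean, population std, and worst accuracy over all sets and permutations."""
    flat = [acc for row in results for acc in row]
    return fmean(flat), pstdev(flat), min(flat)


# Hypothetical scorer that depends only on which example comes first (illustration only).
def toy_evaluate(order):
    return 85.0 + (order[0] % 3)


pool = list(range(100))
results = permutation_study(pool, toy_evaluate)
mean_acc, std_acc, worst_acc = spread(results)
```

Comparing `std_acc` per row (fixed set, varying order) against the spread across rows separates sensitivity to ordering from sensitivity to which examples are chosen.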

Format ID | Prompt | Label names
1 | Review: This movie is amazing! Answer: Positive Review: Horrific movie, don’t see it. Answer: | Positive / Negative
2 | Review: This movie is amazing! Answer: good Review: Horrific movie, don’t see it. Answer: | good / bad
3 | My review for last night’s film: This movie is amazing! The critics agreed that this movie was good My review for last night’s film: Horrific movie, don’t see it. The critics agreed that this movie was | good / bad
4 | Here is what our critics think for this month’s films. One of our critics wrote "This movie is amazing!". Her sentiment towards the film was positive. One of our critics wrote "Horrific movie, don’t see it". Her sentiment towards the film was | positive / negative
5 | Critical reception [ edit ] In a contemporary review, Roger Ebert wrote "This movie is amazing!". Entertainment Weekly agreed, and the overall critical reception of the film was good. In a contemporary review, Roger Ebert wrote "Horrific movie, don’t see it". Entertainment Weekly agreed, and the overall critical reception of the film was | good / bad
6 | Review: This movie is amazing! Positive Review? Yes Review: Horrific movie, don’t see it. Positive Review? | Yes / No
7 | Review: This movie is amazing! Question: Is the sentiment of the above review Positive or Negative? Answer: Positive Review: Horrific movie, don’t see it. Question: Is the sentiment of the above review Positive or Negative? Answer: | Positive / Negative
8 | Review: This movie is amazing! Question: Did the author think that the movie was good or bad? Answer: good Review: Horrific movie, don’t see it. Question: Did the author think that the movie was good or bad? Answer: | good / bad
9 | Question: Did the author of the following tweet think that the movie was good or bad? Tweet: This movie is amazing! Answer: good Question: Did the author of the following tweet think that the movie was good or bad? Tweet: Horrific movie, don’t see it Answer: | good / bad
10 | This movie is amazing! My overall feeling was that the movie was good Horrific movie, don’t see it. My overall feeling was that the movie was | good / bad
11 | This movie is amazing! I liked the movie. Horrific movie, don’t see it. I | liked / hated
12 | This movie is amazing! My friend asked me if I would give the movie 0 or 5 stars, I said 5 Horrific movie, don’t see it. My friend asked me if I would give the movie 0 or 5 stars, I said | 0 / 5
13 | Input: This movie is amazing! Sentiment: Positive Input: Horrific movie, don’t see it. Sentiment: | Positive / Negative
14 | Review: This movie is amazing! Positive: True Review: Horrific movie, don’t see it. Positive: | True / False
15 | Review: This movie is amazing! Stars: 5 Review: Horrific movie, don’t see it. Stars: | 5 / 0
Table 10: The templates used to examine the influence of formats on the SST-2 dataset, following Zhao et al. (2021). An example from the training data is used for illustration.
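Each row of Table 10 pairs a demonstration format with a set of label names. The following is a minimal sketch of how a few-shot prompt could be assembled from such a template, using format 1 as the example. The `Template` class, its field names, and the newline and blank-line separators are assumptions for illustration, not the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    input_prefix: str   # text placed before the review
    output_prefix: str  # text placed before the label word
    label_words: tuple  # verbalizer; index = class id


# Format 1 from Table 10 (separators between fields are assumed).
FORMAT_1 = Template("Review: ", "Answer: ", ("Positive", "Negative"))


def build_prompt(template, demonstrations, test_input, sep="\n\n"):
    """Concatenate labeled demonstrations, then the unlabeled test input."""
    blocks = [
        f"{template.input_prefix}{text}\n{template.output_prefix}{template.label_words[label]}"
        for text, label in demonstrations
    ]
    # The test block ends at the output prefix so the model predicts a label word next.
    blocks.append(f"{template.input_prefix}{test_input}\n{template.output_prefix}".rstrip())
    return sep.join(blocks)


prompt = build_prompt(FORMAT_1, [("This movie is amazing!", 0)], "Horrific movie, don't see it.")
print(prompt)
```

Swapping in another row of Table 10 only changes the prefixes and `label_words`, which is what makes these formats convenient for isolating the effect of the template on ICL performance.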