A Causal Explainable Guardrails for Large Language Models

Zhixuan Chu^1∗♠, Yan Wang^1∗, Longfei Li¹, Zhibo Wang², Zhan Qin², Kui Ren²
¹ Ant Group ² Zhejiang University
{chuzhixuan.czx, luli.wy, longyao.llf}@antgroup.com
{zhibowang, qinzhan, kuiren}@zju.edu.cn

Abstract

Large Language Models (LLMs) have shown impressive performance in natural language tasks, but their outputs can exhibit undesirable attributes or biases. Existing methods for steering LLMs towards desired attributes often assume unbiased representations and rely solely on steering prompts. However, the representations learned from pre-training can introduce semantic biases that influence the steering process, leading to suboptimal results. We propose LLMGuardaril, a novel framework that incorporates causal analysis and adversarial learning to obtain unbiased steering representations in LLMs. LLMGuardaril systematically identifies and blocks the confounding effects of biases, enabling the extraction of unbiased steering representations. Additionally, it includes an explainable component that provides insights into the alignment between the generated output and the desired direction. Experiments demonstrate LLMGuardaril’s effectiveness in steering LLMs towards desired attributes while mitigating biases. Our work contributes to the development of safe and reliable LLMs that align with desired attributes. We discuss the limitations and future research directions, highlighting the need for ongoing research to address the ethical implications of large language models.

∗ These authors contributed equally to this work. ♠ Corresponding author.

1 Introduction

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation, enabling a wide range of applications such as dialogue systems, content creation, time series forecasting, recommendation systems, and professional agents [1; 2; 3; 4; 5]. However, training these models on massive web-scraped corpora has led to undesirable behaviors, such as generating offensive, toxic, or false outputs and facilitating the spread of disinformation. Addressing these issues is crucial, as content safety, fairness, toxicity, harmfulness, and factuality demand rigorous consideration, especially in an era where language models are increasingly deployed in high-stakes applications and user-facing systems. To address these challenges, recent research efforts have focused on developing methods to steer the output of LLMs towards desired attributes or concepts.

Several approaches have been proposed to control and steer the generation process, aiming to mitigate these harmful behaviors and improve the overall quality and safety of generated content [6]. Fine-tuning on carefully curated datasets has been a common technique to enhance content safety and regulate outputs. By training language models on datasets that have been filtered and cleaned to remove offensive, biased, or misleading content, researchers aim to reduce the likelihood of the model generating such harmful outputs. In addition to fine-tuning, other techniques have been explored to further improve the generation process. Reinforcement learning from human feedback (RLHF) [7] is one such approach, where the language model is trained using feedback from human annotators who evaluate the quality and safety of generated outputs. Another promising approach is reinforcement learning from AI feedback (RLAIF) [8], which extends the concept of RLHF by using an AI system to provide feedback instead of human annotators. However, they all require huge annotation and computation resources. Furthermore, the training process often involves a human or AI annotator providing feedback and guidance to the language model. This raises the possibility that some form of deception or manipulation could be introduced during the training process, either intentionally or unintentionally [9; 10; 11]. In addition, these methods often lack explainability and interpretability, leading to inconsistent performance and limited generalizability [12; 13; 14; 15].

Recent studies have shed light on the rich semantic information encoded within LLM representations, including specific information [16], concepts [17; 18], or attributes [19; 20]. Building upon these findings, activation engineering has emerged: it involves creating vectors of activations that, when added to the forward passes of a frozen LLM, induce desired changes in the output text. This approach holds promise for controlling the generation process in a more interpretable and fine-grained manner. Existing approaches for steering LLMs typically involve constructing positive and negative prompts that represent the desired and undesired attributes, respectively. By comparing the activations of the model for these prompts, a steering vector is obtained, which is then used to guide the model’s output during inference. While these methods have shown promising results, they often rely on the assumption that the constructed prompts are sufficient to capture the desired steering direction and that the model’s representations are unbiased. However, a closer examination reveals that the representations learned by LLMs during pre-training can introduce biases that influence the steering process. As shown in Figure 1, the semantic context of the prompts used for steering can implicitly encode biases, even without explicit steering prompts. Consequently, the obtained steering vectors may be confounded by these inherent biases, leading to suboptimal or unintended steering results. Therefore, superficially constructing a controlled trial is not enough to obtain an unbiased steering representation.

Figure 1: Proportion of representations of semantic prompts that implicitly encode positive or negative directions for semantically neutral prompts without explicit steering prompt, across different language models. The varying proportions of positive and negative directions learned by the probing classifier, even in the absence of steering prompts, demonstrate the presence of inherent biases in the representations of semantic prompts due to differences in pre-training data. This observation supports the existence of a direct edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representation $R_{+}/R_{-}$, as discussed in the causal analysis section.

To address these limitations, we propose a novel framework called LLMGuardaril that incorporates causal analysis to obtain unbiased steering representations for LLMs. Our framework aims to disentangle the influence of biases from the steering process by employing adversarial learning techniques. By systematically identifying and blocking the confounding effects of biases, LLMGuardaril enables the extraction of steering representations that accurately capture the desired attributes or concepts. The key contributions of this work are as follows:

• A causal analysis of the steering process in LLMs, identifying the confounding effects of the semantic prompt and their impact on steering representations. We provide a theoretical formulation of the problem and discuss the limitations of existing methods.

• A novel framework for obtaining unbiased steering representations in LLMs. Our framework employs adversarial learning techniques to disentangle the influence of semantic biases from the steering process, enabling the extraction of accurate steering representations.

• Comprehensive experiments and analysis demonstrating the effectiveness of LLMGuardaril in steering LLMs towards desired attributes while mitigating the influence of biases. We evaluate our framework on various benchmark datasets and compare its performance with existing methods.

2 Background

2.1 Representations in Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks, generating human-like text and exhibiting a broad understanding of language. Central to their success is the rich set of representations they learn during training, which encode various concepts, attributes, and semantic information. Extensive research has been dedicated to understanding the representations learned by large language models (LLMs) and how they encode various concepts and attributes [18; 21; 22; 23]. LLM representations refer to the patterns of activations across the model’s parameters that correspond to specific semantic concepts, properties, or features. These representations act as the model’s internal codification of the information learned from the training data. Recent studies have provided compelling evidence that LLM representations contain rich semantic information, including abstract concepts like space and time [16], as well as more granular attributes related to truthfulness, toxicity, bias, and harmfulness of the generated text [24; 20; 25; 26]. In addition, linear classifier probes, which are trained to predict input properties from intermediate layers of a network, have successfully identified representations of concepts [27; 28]. Latent space analysis has enabled researchers to locate or edit factual associations within LLMs [29; 30; 31]. The ability to locate and manipulate these representations opens up avenues for controlling and steering the model’s outputs, enabling the development of principled methods for mitigating harmful behaviors and enhancing the interpretability and trustworthiness of large language models.

2.2 LLM Activation Engineering

Activation engineering is a set of techniques that modify the internal activations of a pre-trained language model during inference to steer its output in a desired direction. The main idea is to identify and manipulate specific activations or attention heads associated with particular attributes or behaviors to control the model’s generation process. Early approaches, like the Plug-and-Play Language Model [32], use a separate classifier to detect target attributes in the generated text and perturb the language model’s activations accordingly, encouraging it to generate text that aligns with the desired attribute. Recent work focuses on extracting latent steering vectors from a frozen language model, which can be added to the activations during inference to steer the model’s completions toward specific goals, such as achieving high BLEU scores [33] or generating truthful statements [20]. Instead of requiring additional optimization or labeled data, Activation Addition [34] takes activation differences resulting from pairs of prompts. To avoid the identification of an opposite behaviour, [35] takes the average of activations associated with a target dataset and then subtracts the mean of all training activations, resulting in effective steering vectors. The growing interest in activation engineering stems from the desire to control and improve the performance of large language models on specific tasks or domains by identifying and manipulating the relevant activations.
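The mean-centring idea attributed to [35] above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from [35]: activations are plain Python lists, the numbers are made up, and the names `mean_vector` and `mean_centred_steering_vector` are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_centred_steering_vector(target_acts: List[Vector],
                                 training_acts: List[Vector]) -> Vector:
    """Mean of target-dataset activations minus the mean of all training activations."""
    target_mean = mean_vector(target_acts)
    training_mean = mean_vector(training_acts)
    return [t - g for t, g in zip(target_mean, training_mean)]


# Toy 2-dimensional activations (made-up numbers, for illustration only).
target = [[2.0, 0.0], [4.0, 2.0]]
training = [[1.0, 1.0], [1.0, -1.0], [4.0, 3.0]]
steer = mean_centred_steering_vector(target, training)
```

Unlike a prompt-pair difference, this construction needs no explicit "opposite" prompt; the global training mean plays the role of the baseline.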

2.3 Causal Inference

Causal inference [36; 37; 38; 39; 40] has long been an attractive research topic, as it provides an effective way to uncover causal relationships in real-world problems. More recently, causal inference has shown potential for enhancing LLMs: improving their reasoning capacity [1; 41; 42; 43; 44], addressing fairness [45; 46; 47; 44] and safety [48; 49; 50] issues, complementing LLMs with explanations [51; 13; 52; 53], and handling multimodality [54; 55]. In causal inference, a variable that is the common cause of two other variables is called a confounder. The confounder induces a spurious correlation between these two variables, which disturbs the recognition of the causal effect between them [56]. As shown in Figure 2(a), $X\rightarrow Y$ denotes that $X$ is the cause of $Y$, and $C$ is the cause of both $X$ and $Y$. Thus, $C$ is a confounder of $X$ and $Y$. In particular, the spurious correlation is brought by the backdoor path created by the confounder. Formally, a backdoor path between $X$ and $Y$ is defined as any path from $X$ to $Y$ that starts with an arrow pointing into $X$. For example, the path $X\leftarrow C\rightarrow Y$ is a backdoor path. If we want to deconfound two variables $X$ and $Y$ to calculate the true causal effect, we should block every backdoor path between them [57]. For example, in Figure 2(b), we should block $X\leftarrow C\rightarrow Y$ to get the causal effect between $X$ and $Y$.
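To make the effect of a backdoor path concrete, the following toy simulation may help. It is a sketch under assumptions that are not in the paper: a binary confounder $C$, a binary treatment $X$, a linear outcome $Y = X + 2C + \text{noise}$, and hypothetical function names. The naive contrast mixes in the path $X\leftarrow C\rightarrow Y$, while stratifying on $C$ (one standard way to block the backdoor path) recovers the true effect of 1.0 up to sampling noise.

```python
import random


def simulate(n: int = 20000, seed: int = 0):
    """Draw (c, x, y) samples where C confounds X and Y; the true effect of X on Y is 1.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = 1 if rng.random() < 0.5 else 0
        x = 1 if rng.random() < (0.8 if c else 0.2) else 0   # edge C -> X
        y = 1.0 * x + 2.0 * c + rng.gauss(0.0, 1.0)          # edges X -> Y and C -> Y
        rows.append((c, x, y))
    return rows


def mean_y(rows, x, c=None):
    """Average outcome among rows with treatment x (and, optionally, confounder value c)."""
    ys = [r[2] for r in rows if r[1] == x and (c is None or r[0] == c)]
    return sum(ys) / len(ys)


def naive_effect(rows):
    """E[Y | X=1] - E[Y | X=0]: biased because the backdoor path is open."""
    return mean_y(rows, 1) - mean_y(rows, 0)


def adjusted_effect(rows):
    """Sum over c of P(c) * (E[Y | X=1, c] - E[Y | X=0, c]): backdoor path blocked."""
    total = 0.0
    for c in (0, 1):
        p_c = sum(1 for r in rows if r[0] == c) / len(rows)
        total += p_c * (mean_y(rows, 1, c) - mean_y(rows, 0, c))
    return total


rows = simulate()
naive = naive_effect(rows)
adjusted = adjusted_effect(rows)
```

In expectation, the naive contrast in this setup is $1 + 2\,(0.8 - 0.2) = 2.2$, more than double the true effect, which is the kind of distortion a confounded steering vector would inherit.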

3 Causal Analysis

3.1 The Hypothetical Situation

Recent activation engineering-based works [26; 18; 34; 35; 20] use steering vectors to control the direction of LLM output by constructing randomized controlled trials. For example, some works [26; 18; 34] construct a pair of natural-language prompts ($p_{+}$, $p_{-}$), where $p_{+}$ represents the attribute we wish the output text to emphasize and $p_{-}$ represents its opposite. $R_{+}/R_{-}$ is the representation for the prompt $p_{+}/p_{-}$. The difference $\Delta R$ is a new steering vector that (intuitively) captures the difference between a prompt with the attribute and one without it. To obtain a steering vector, they perform a forward pass on each prompt, record the activations at the given location in each pass, take the difference, and finally rescale this difference in activations by an “injection coefficient” $\beta$. To steer, they add the resulting steering representation to the original representations, allow the forward pass to continue, and obtain the steered output.
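The procedure just described (difference of recorded activations, rescaling by $\beta$, then addition during the forward pass) can be sketched as follows. This is an illustrative sketch only: activations are stand-in Python lists rather than real model states, and `steering_vector` and `steer` are hypothetical names.

```python
from typing import List

Vector = List[float]


def steering_vector(r_pos: Vector, r_neg: Vector, beta: float) -> Vector:
    """beta * (R_+ - R_-): the rescaled difference in recorded activations."""
    return [beta * (p - q) for p, q in zip(r_pos, r_neg)]


def steer(activation: Vector, delta: Vector) -> Vector:
    """Add the steering vector to an activation at the chosen location."""
    return [a + d for a, d in zip(activation, delta)]


# Made-up activations recorded for the positive and negative prompts.
r_pos = [1.0, 0.5, -2.0]
r_neg = [0.0, 0.5, 1.0]
delta_r = steering_vector(r_pos, r_neg, beta=2.0)
steered = steer([0.0, 1.0, 0.0], delta_r)
```

Dimensions on which the two prompts agree (the second one here) contribute nothing to the steering vector, which is the intuition behind treating the pair as a controlled trial.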

To analyze this family of methods systematically, we first break down the constructed prompt pairs. The input prompt comprises a semantic prompt $C$ and a steering prompt $S$. For example, in [34], “I love talking about weddings” is the positive prompt and “I hate talking about weddings” is the negative prompt. Here, “love” and “hate” are the steering prompts, which control the direction of the output. The remainder, “I … talking about weddings”, is identical across the pair and carries only semantic information. These methods therefore treat the pair as a randomized controlled trial with two groups: the treatment group “love” and the control group “hate”. As shown in Figure 3 (a), the semantic prompt $C$ is a confounder that can affect both the steering prompt $S$ and the output $Y$. In this hypothetical situation, the randomized controlled trial is constructed by pairing a positive steering prompt and a negative steering prompt with the same semantic prompt. Because the semantic prompts are identical, the only difference within a pair is the steering prompt. Under this assumption, the edge from the semantic prompt to the steering prompt is blocked, so the difference in activations can be used to intervene on the generated output. The strength of this steering vector corresponds to the treatment effect in causal inference. On the face of it, this is a perfect randomized controlled trial, in which only the steering prompt affects the direction representation $R_{+}/R_{-}$, and the direction representation together with the semantic prompt determines the generated output $Y$. In this case, the steering vector can be obtained as the difference between the positive direction representation $R_{+}$ and the negative direction representation $R_{-}$.

3.2 The Real Situation

Before delving into the analysis of the real situation, we first provide formal definitions for the various types of representations involved in the steering process.

Definition 3.1 (Direction Representation $R_{+/-}$ ).

A direction representation $R_{+/-}$ is a representation that solely affects the direction of the output with respect to specific attributes such as truthfulness, bias, harmfulness, or toxicity. It is learned from the steering prompt and should be independent of the semantic context of the output.

Definition 3.2 (Semantic Context Representation $R^{cy}$ ).

A semantic context representation $R^{cy}$ is a representation learned from the semantic prompt, which contains information about the context of the output. It does not provide guidance for the direction of the output with respect to specific attributes.

Definition 3.3 (Semantic Direction Representation $R^{cd}$ ).

A semantic direction representation $R^{cd}$ is a representation learned from the semantic prompt that implicitly influences the direction of the output with respect to specific attributes. This influence stems from associations learned during the pre-training of the large language model, which may introduce biases related to the desired attributes.

Definition 3.4 (Steering Representation $\Delta R$ ).

A steering representation $\Delta R$ is a representation that stands for the direction of the output with respect to specific attributes such as truthfulness, bias, harmfulness, or toxicity. It is obtained by computing the difference between the positive direction representation $R_{+}$ and the negative direction representation $R_{-}$ , which are learned from the corresponding steering prompts. The steering representation should be independent of the semantic direction representation $R^{cd}$ to ensure unbiased steering of the output.

Based on the above definitions, the assumed randomized controlled trial in the hypothetical situation is not valid, because it fails to consider the influence of confounders, specifically the semantic prompt, on the direction representation $R_{+}/R_{-}$ due to biases in the pre-training of the large language model (LLM). As a result, the obtained steering vector is biased. As illustrated empirically in the introduction (Figure 1), the semantic prompt $C$ affects not only the content of the output $Y$ but also its direction. We can further break down the semantic prompt into two parts: the semantic context representation $R^{cy}$, which influences the content of the output $Y$, and the semantic direction representation $R^{cd}$, which influences the direction representation $R_{+}/R_{-}$.

In addition to the edge from the semantic prompt $C$ to the steering prompt $S$ and the direction representation $R_{+}/R_{-}$, there is a direct edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representation $R_{+}/R_{-}$. The constructed pair of prompts can only block the edge from the semantic prompt $C$ to the steering prompt $S$ and the direction representation $R_{+}/R_{-}$; it ignores the edge from the semantic direction representation $R^{cd}$ to the direction representation $R_{+}/R_{-}$. As shown in Figure 3 (b), both the steering prompt $S_{+}/S_{-}$ and the semantic direction representation $R^{cd}$ affect the direction representation $R_{+}/R_{-}$. Consequently, the obtained steering vector $\Delta R$ is biased by the semantic direction representation $R^{cd}$ of the semantic prompt and is therefore not determined solely by the constructed steering prompt $S$.

This bias originates from the biases in the pre-training data. For example, consider the semantic prompt “I … talking about weddings”. Due to biases present in the pre-training data, the LLM may have learned to associate weddings with positive sentiments such as love, happiness, and celebration. As a result, the semantic direction representation $R^{cd}$ learned from this prompt may implicitly suggest a positive direction (e.g., “love”) for the output, even if no explicit steering prompt is provided. This edge can be observed empirically, as shown in Figure 1. The implicit influence of the semantic direction representation $R^{cd}$ on the output direction can be problematic when attempting to steer the LLM’s output towards a desired attribute. If the steering representation $\Delta R$ is not independent of $R^{cd}$, the resulting output may be biased by the inherent associations learned during pre-training, leading to suboptimal or unintended results. To ensure unbiased steering of the output, it is crucial to disentangle the steering representation $\Delta R$ from the semantic direction representation $R^{cd}$.

3.3 Causal Analysis of Our Solutions

Based on the causal analysis presented in the previous sections, we propose a solution called LLMGuardaril to address the bias introduced by the semantic direction representation $R^{cd}$ in the steering process. As shown in Figure 3 (c), in addition to constructing pairs of prompts (positive and negative steering prompts with the same semantic prompt), we need to block the edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representation $R_{+}/R_{-}$ . By doing so, the direction representation is only influenced by the steering prompt $S_{+}/S_{-}$ , enabling us to obtain an unbiased steering representation $\Delta R$ through steering engineering. This approach aligns with the desired hypothetical situation.

We utilize adversarial learning to remove this confounding bias [58; 59]. The objective is to remove the influence of $R^{cd}$ on the direction representation $R_{+}/R_{-}$. The adversarial learning process includes two main components: prediction reconstruction learning and debias learning. In prediction reconstruction learning, the goal is to ensure that the original semantic information remains unchanged during the debiasing process. This is achieved by minimizing the prediction reconstruction loss, which measures the cross-entropy between the original output and the output generated from the debiased representations. In debias learning, a probing module is used adversarially so that the bias can no longer be distinguished from the representation of the semantic prompt.

By ensuring that the steering representation is independent of the semantic direction representation $R^{cd}$ , it becomes an unbiased representation that can effectively steer the output of the language model toward the desired attribute. This approach mitigates the influence of unwanted biases present in the model, enabling more accurate and controlled steering of the language model’s output. The debiased intermediate states generated by the Debias LoRA Block can then be used to guide the LLM’s output toward the desired attributes or concepts while mitigating the impact of unwanted biases. This allows for more precise steering of the language model’s output, leading to improved performance in various downstream tasks.

4 Methodology

4.1 Intervened Layer Selection

Previous studies have investigated the information encoded in different layers of transformer-based models. Tenney et al. [60] found that earlier layers in BERT encode lower-level information, such as part-of-speech tags, while later layers capture more semantic information. Similarly, Zou et al. [18] and Li et al. [20] have observed that the optimal intervention layers for steering the model’s output vary depending on the specific task and desired attribute.

To identify the most effective layers for intervention, we propose a systematic approach based on probing accuracy. Let $L=\{l_{1},l_{2},\dots,l_{D}\}$ denote the set of all layers in the pre-trained language model, where $D$ is the total number of layers. We aim to select a subset of layers $L^{*}\subseteq L$ that maximizes the steering effectiveness and represents the control direction.

We employ a probing classifier to measure the outcome-relatedness of each layer. The probing classifier is trained to predict the target attribute or concept based on the representations at each layer. Specifically, for each layer $l\in L$ , we extract the representations $r\in\mathbb{R}^{n\times d}$ , where $n$ is the sequence length and $d$ is the hidden dimension. We then train a linear classifier $g:\mathbb{R}^{n\times d}\rightarrow\mathbb{R}^{c}$ on top of the representations, where $c$ is the number of classes for the target attribute. The probing accuracy of each layer $l$ is evaluated on a validation set. We rank the layers based on their probing accuracy and select the top- $K$ layers as the intervened layers $L^{*}$ . The number of intervened layers $K$ is a hyperparameter that can be tuned based on the specific task and desired trade-off between steering effectiveness and computational efficiency.
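The ranking step can be sketched as below, assuming the per-layer validation accuracies of the probing classifier have already been computed. The accuracy values are made up, and `select_intervened_layers` is a hypothetical helper name rather than part of the paper's code.

```python
from typing import Dict, List


def select_intervened_layers(probe_accuracy: Dict[int, float], k: int) -> List[int]:
    """Return the indices of the top-k layers by validation probing accuracy."""
    ranked = sorted(probe_accuracy, key=lambda layer: probe_accuracy[layer], reverse=True)
    # Report the selected layers in network order for readability.
    return sorted(ranked[:k])


# Hypothetical probing accuracies for a 6-layer model (layer index -> accuracy).
accuracy = {0: 0.55, 1: 0.61, 2: 0.78, 3: 0.83, 4: 0.80, 5: 0.70}
selected = select_intervened_layers(accuracy, k=3)
```

With these illustrative numbers the selection lands on the middle layers, and raising `k` trades computation for broader coverage.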

Empirically, we find that the middle layers of the pre-trained language model tend to be the most effective for intervention. This observation aligns with previous findings [18; 20] suggesting that the middle layers capture a balance between low-level syntactic information and high-level semantic information, making them suitable for steering the model’s output towards the desired attribute or concept. By selecting the intervened layers based on probing accuracy, we can focus the intervention on the most relevant layers, thereby improving the efficiency and effectiveness of the steering process. The selected layers $L^{*}$ are then used in the Debias LoRA Block to obtain the debiased representations and calculate the steering representation.

4.2 Unbiased Steering Representations

As discussed in the causal analysis, to obtain unbiased steering representations, we must block the edge from the semantic direction representation $R^{cd}$ of the semantic prompt to the direction representations $R_{+}$ and $R_{-}$ . By doing so, the direction representations are solely influenced by the steering prompts $S_{+}$ and $S_{-}$ , enabling us to obtain an unbiased steering representation $\Delta R$ through steering engineering. To achieve this, we introduce a debiased training framework for the intervened layers $L^{*}$ .

4.2.1 Debias Training Framework

As shown in Figure 4, LLMGuardaril is a plug-and-play algorithmic framework designed to obtain unbiased steering representations for LLMs while seamlessly integrating with their existing architecture. The framework consists of two key components: the Debias LoRA Block and the Domain Probing module. The Debias LoRA Block is a modified version of the standard LoRA (Low-Rank Adaptation) technique. Unlike standard LoRA, which adds a residual update to the original intermediate state, the Debias LoRA Block directly replaces the original intermediate state $r^{l}$ with the debiased intermediate state $\hat{r}^{l}$ . This is achieved through the following operation:

$$\hat{r}^{l}=\Delta W r^{l-1}=BAr^{l-1} \tag{1}$$

where $B$ and $A$ are learned matrices of size $d\times m$ and $m\times d$ , respectively, with $m\ll d$ . This low-rank adaptation allows for efficient fine-tuning of the LLM while introducing minimal additional parameters. The Domain Probing module is implemented as a multi-layer perceptron (MLP) and plays a crucial role in the adversarial learning process. Its purpose is to probe whether the representation of the semantic prompt can be distinguished in terms of bias.
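A minimal sketch of the replacement in Eq. (1) is given below, using plain Python lists in place of tensors. The matrices are made-up toy values, `matvec` and `debias_lora_forward` are hypothetical names, and nothing here is taken from the authors' code; the point is only that the output is $BAr^{l-1}$ itself, with no residual term added.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def debias_lora_forward(B: Matrix, A: Matrix, r_prev: Vector) -> Vector:
    """Eq. (1): r_hat^l = B A r^{l-1}; replaces the original state instead of adding to it."""
    return matvec(B, matvec(A, r_prev))


d, m = 4, 1                              # hidden size d and rank m, with m << d in practice
A = [[1.0, 0.0, -1.0, 0.0]]              # shape m x d
B = [[1.0], [0.0], [2.0], [0.0]]         # shape d x m
r_prev = [3.0, 5.0, 1.0, 7.0]
r_hat = debias_lora_forward(B, A, r_prev)

lora_params = 2 * d * m                  # parameters in B and A together
dense_params = d * d                     # parameters in a full d x d update
```

The parameter count grows as $2dm$ rather than $d^{2}$, which is where the efficiency claimed for the low-rank form comes from.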

4.2.2 Adversarial Learning

The training process of LLMGuardaril involves adversarial learning, where the Debias LoRA Block and the Domain Probing module are optimized simultaneously. The objective is to use the Domain Probing module to debias the influence of $R^{cd}$ on the direction representations $R_{+}$ and $R_{-}$ during the adversarial learning process [61].

Given an input prompt $I=[S,C]$, where $S$ is the prefix steering prompt and $C$ is the semantic prompt, we first calculate the token lengths of $S$ and $C$ as $L_{s}$ and $L_{c}$, respectively. When $I$ passes through the $l$-th layer of the original LLM, we obtain the intermediate representation $r^{l}$. After passing through the $l$-th layer of the Debias LoRA Block, we obtain the debiased intermediate representation $\hat{r}^{l}$. We define the set of intermediate representations from the original LLM as $R=[r^{0},r^{1},\cdots,r^{l},\cdots,r^{D}]$ and the set of debiased intermediate representations as $\hat{R}=\{r^{-*},\hat{r}^{*}\}$, where $-*$ represents the non-intervened layers and $*$ represents the intervened layers. Here $r^{-*}=\{r^{l}\}$, where $l\in L^{-*}$ is a non-intervened layer, and $\hat{r}^{*}=\{\hat{r}^{l}\}$, where $l\in L^{*}$ is an intervened layer.

The adversarial learning process consists of two main components: prediction reconstruction learning and debias learning. In debias learning, the objective is to debias the representation through the Domain Probing module. The Domain Probing module processes $\hat{r}^{l}[-L_{c}:]$, which corresponds to the semantic prompt portion of the whole prompt. The goal is to learn debiased representations from which the Domain Probing module cannot determine the bias of the LLM based on $\hat{r}^{l}[-L_{c}:]$. To facilitate this, a Gradient Reversal Layer (GRL) [62] is introduced in the debias learning process. The debias loss is defined as follows:

$$\mathcal{L}_{debias}=\sum^{N}_{i=1}\Bigl(y^{direction}-\text{GradRev}(f(\hat{r}^{l}[-L_{c}:],\eta))\Bigr) \tag{2}$$

where $N$ is the number of samples, $y^{direction}$ is the direction label of output, i.e., desired or undesired attributes or concepts (e.g., truthful or untruthful, harmful or unharmful, and so on), $f$ is the Domain Probing module, and $\eta$ is the proportional coefficient of the Gradient Reversal Layer.
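The behaviour of a Gradient Reversal Layer can be illustrated with a hand-rolled toy, shown below. This is only a sketch of the general mechanism from [62] as commonly described (identity on the forward pass, gradient multiplied by $-\eta$ on the backward pass); it does not use an autograd framework, the class name `GradReverse` is hypothetical, and where exactly the layer sits relative to $f$ in the authors' implementation is not specified in the text.

```python
from typing import List

Vector = List[float]


class GradReverse:
    """Identity in the forward pass; scales gradients by -eta in the backward pass."""

    def __init__(self, eta: float) -> None:
        self.eta = eta

    def forward(self, x: Vector) -> Vector:
        # Values pass through unchanged.
        return list(x)

    def backward(self, grad_output: Vector) -> Vector:
        # The sign flip makes upstream parameters ascend the probe's loss.
        return [-self.eta * g for g in grad_output]


grl = GradReverse(eta=0.5)
features = [0.2, -1.0, 3.0]
out = grl.forward(features)
grad_upstream = grl.backward([1.0, -2.0, 4.0])
```

Because of the sign flip, a single backward pass lets the probe improve at detecting the direction while the layers upstream of the reversal are pushed to make that detection harder.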

In prediction reconstruction learning, the objective is to ensure that the original semantic information remains unchanged during the debiasing process. This is achieved by minimizing the prediction reconstruction loss, which measures the cross-entropy between the original output $y^{output}$, produced by the original framework $\phi$ from the original representations $R$, and the output produced by LLMGuardaril $\hat{\phi}$ from the debiased representations $\hat{R}$:

$$\mathcal{L}_{pre}=\text{CEloss}(y^{output},\hat{\phi}(\hat{R})) \tag{3}$$

The overall loss function is a combination of the prediction reconstruction loss and the debias loss with the hyperparameter $\alpha$ :

$$\mathcal{L}=\mathcal{L}_{pre}+\alpha\mathcal{L}_{debias} \tag{4}$$

During the adversarial learning process, the Debias LoRA Block and the Domain Probing module are optimized jointly using this loss function. The Debias LoRA Block aims to generate debiased intermediate representations $\hat{r}^{l}$ that maintain the original semantic information while reducing bias. The Domain Probing module, in turn, is trained on the debias loss, while the Gradient Reversal Layer reverses the gradient flowing back into the Debias LoRA Block, pushing the debiased representations to become uninformative about the bias present in $R^{cd}$.

By employing this adversarial learning framework, LLMGuardaril enables the extraction of steering representations that are less influenced by the inherent biases in the LLM’s pre-training data. The debiased intermediate representations generated by the Debias LoRA Block can then be used to guide the LLM’s output toward the desired attributes or concepts while mitigating the impact of unwanted biases.

4.2.3 Obtaining the Steering Representation

Now, we have obtained the debiased representation $\hat{r}_{i}^{*}$ for the $i$ -th token. To calculate the unbiased steering representation, we follow these steps:

1. Positive and Negative Debiased Representations: For each token $i$, we obtain the debiased representations corresponding to the positive steering prompt and the negative steering prompt, denoted as $\hat{r}_{i,+}^{*}$ and $\hat{r}_{i,-}^{*}$, respectively. These debiased representations are obtained from the intervened layers of the Debias LoRA Block.

2. Difference Calculation: We calculate the difference between the positive and negative debiased representations for each token $i$, i.e., $\Delta\hat{r}_{i}^{*}=\hat{r}_{i,+}^{*}-\hat{r}_{i,-}^{*}$. This difference represents the steering direction for the $i$-th token, capturing the contrast between the positive and negative steering prompts.

3. Sample-wise and Token-wise Averaging: In order to further refine the steering representation and make it more robust, we average the differences across all tokens and multiple samples,

$$\Delta\hat{r}^{*}=\frac{1}{N}\sum_{j=1}^{N}\frac{1}{n}\sum_{i=1}^{n}\Delta\hat{r}_{ij}^{*}, \tag{5}$$

where $n$ is the total number of tokens in the input sequence and $N$ is the total number of samples used for steering representation calculation. Each sample corresponds to a different input sequence or context. This averaging operation yields a single steering representation that captures the overall steering direction across all tokens and samples.

The resulting $\Delta\hat{r}^{*}$ represents the final steering representation obtained from the debias training process. This steering representation encodes the desired steering direction, taking into account the contrast between positive and negative steering prompts, averaged across tokens and samples. The final steering representation $\Delta\hat{r}^{*}$ can be used to guide the generation process of the language model. By adding this steering representation to the intermediate representations of the LLM during inference, we can steer the model’s output towards the desired attributes or concepts while mitigating the influence of biases present in the pre-training data.
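The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors’ implementation: the function name `steering_representation`, the random inputs, and the assumption that the positive and negative representations of a sample are already aligned token by token (same number of tokens) are ours.

```python
import numpy as np


def steering_representation(pos_reps, neg_reps):
    """Average token-wise differences within each sample, then across samples (Eq. 5).

    pos_reps, neg_reps: lists with one array per sample, each of shape
    (n_tokens, K, d), holding the debiased representations obtained with the
    positive and the negative steering prompt. Assumption: both arrays of a
    sample have the same number of tokens.
    """
    per_sample = []
    for r_pos, r_neg in zip(pos_reps, neg_reps):
        delta = r_pos - r_neg                   # token-wise differences, (n, K, d)
        per_sample.append(delta.mean(axis=0))   # average over tokens, (K, d)
    return np.mean(per_sample, axis=0)          # average over samples, (K, d)


# Hypothetical data: 3 samples of different lengths, K = 2 layers, d = 4.
rng = np.random.default_rng(0)
K, d = 2, 4
lengths = (5, 7, 3)
pos = [rng.normal(size=(n, K, d)) + 1.0 for n in lengths]
neg = [rng.normal(size=(n, K, d)) - 1.0 for n in lengths]

delta_r = steering_representation(pos, neg)
print(delta_r.shape)  # (2, 4)
```

Averaging per sample first, as Eq. 5 does, gives every sample the same weight regardless of its length.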

4.3 Explanation of Output

In this step, our objective is to identify a direction that accurately predicts the underlying output concept and to provide an explanation for the generated output. Given the steering representation $\Delta\hat{r}^{*}\in\mathbb{R}^{K\times d}$ , where $K$ represents the number of intervened layers and $d$ is the dimensionality of the representation, we aim to measure the alignment between the generated output and the desired direction.

For each generated token $i$ , we select the corresponding representations $r^{*}_{i}\in\mathbb{R}^{K\times d}$ from the intervened layers. These representations capture the activations of the model at the specific layers where the steering intervention is applied. To obtain a consolidated representation for both the steering prompt and the generated token, we perform layer-wise averaging. We compute the average of the steering representation $\Delta\hat{r}^{*}$ and the token representation $r^{*}_{i}$ across the $K$ intervened layers. This step aggregates the information from multiple layers, providing a more robust representation. Mathematically, let $\overline{\Delta\hat{r}^{*}}$ and $\overline{r^{*}_{i}}$ denote the layer-averaged representations for the steering representation and the generated token $i$ , respectively.

Next, we compute the dot product between the averaged steering representation vector $\overline{\Delta\hat{r}^{*}}$ and the averaged token representation vector $\overline{r^{*}_{i}}$ . The dot product measures the similarity between the two vectors, indicating the alignment between the generated output and the desired direction:

\text{Similarity}_{i}=\overline{\Delta\hat{r}^{*}}\cdot\overline{r^{*}_{i}}.

(6)

The resulting $\text{Similarity}_{i}$ score quantifies the extent to which the generated token $i$ aligns with the desired direction defined by the steering representation. A higher similarity score indicates a stronger alignment, suggesting that the generated output is more likely to exhibit the desired attribute or concept.

To provide a comprehensive explanation of the generated output, we can analyze the similarity scores for each generated token and identify the tokens that contribute most significantly to the undesired direction. We can rank the tokens based on their similarity scores and highlight the top-k tokens that have the lowest alignment with the steering representation. These top-k tokens can be considered as the key indicators of the undesired attribute or concept in the generated output. Furthermore, we can visualize the similarity scores across the generated output to gain insights into the alignment between the output and the desired direction. By plotting the similarity scores for each token, we can observe the variations in alignment throughout the generated text. This visualization can help identify regions of the output that strongly align with the desired direction and regions that may deviate from it. In addition to the token-level analysis, we can also compute an overall alignment score for the entire generated output. This can be done by averaging the similarity scores across all tokens:

\text{Alignment}=\frac{1}{n}\sum_{i=1}^{n}\text{Similarity}_{i}

(7)

where $n$ is the total number of tokens in the generated output. The overall alignment score provides a summary measure of how well the generated output aligns with the desired direction.

By combining the token-level analysis, visualization of similarity scores, and the overall alignment score, we can provide a comprehensive explanation of the generated output in relation to the desired direction. This explanation can help users understand and identify the specific aspects of the output that contribute to the desired attribute or concept.
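The explanation procedure (Eq. 6, Eq. 7, and the ranking of the least aligned tokens) can be sketched as follows. The function name `explain_output`, the `top_k` parameter, and the toy inputs are illustrative assumptions, not part of the original method description.

```python
import numpy as np


def explain_output(token_reps, delta_r, top_k=3):
    """Token-level similarity (Eq. 6), overall alignment (Eq. 7), least aligned tokens.

    token_reps: (n, K, d) representations of the n generated tokens.
    delta_r:    (K, d) steering representation.
    """
    steer = delta_r.mean(axis=0)              # layer-averaged steering vector, (d,)
    tokens = token_reps.mean(axis=1)          # layer-averaged token vectors, (n, d)
    similarity = tokens @ steer               # one dot product per token, (n,)
    alignment = float(similarity.mean())      # average over tokens
    lowest = np.argsort(similarity)[:top_k]   # indices of the least aligned tokens
    return similarity, alignment, lowest


# Toy example: K = 2 layers, d = 3; the steering direction is the first axis.
delta_r = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
first_coord = np.array([2.0, -1.0, 0.5, 3.0])
token_reps = np.zeros((4, 2, 3))
token_reps[:, :, 0] = first_coord[:, None]    # same value on both layers

similarity, alignment, lowest = explain_output(token_reps, delta_r, top_k=2)
print(similarity, alignment, lowest)
```

In this toy case the second token (index 1) has the lowest score and would be flagged first as an indicator of the undesired direction.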

4.4 Control of Output

The steering representations are injected into the model and activated during inference, directing the model’s response toward the desired direction. Unlike most existing methods [20; 63; 35; 34] that simply add or subtract a constant steering representation regardless of the token representation, we propose a more sophisticated approach to control the output. Our method takes into account the relationship between the generated token representation and the steering representation, allowing for more fine-grained and context-aware control of the output.

To establish a connection between the generated token representation and the steering representation, we employ a projection operation inspired by [18]. Let $r^{*}_{i}\in\mathbb{R}^{K\times d}$ denote the representation of the generated token $i$ obtained from the intervened layers. We project $r^{*}_{i}$ onto the direction of the steering representation $\Delta\hat{r}^{*}$ and add the projected component back to $r^{*}_{i}$ , which amplifies the part of the token representation that aligns with the steering representation and thus emphasizes the desired direction in the output. The operation is defined as

\hat{r}^{*}_{i}=r^{*}_{i}+\beta\times\frac{r^{*\mathsf{T}}_{i}\Delta\hat{r}^{*}}{\|\Delta\hat{r}^{*}\|^{2}}\Delta\hat{r}^{*},

(8)

where $r^{*\mathsf{T}}_{i}$ represents the transpose of the token representation $r^{*}_{i}$ , and $\|\Delta\hat{r}^{*}\|^{2}$ denotes the squared Euclidean norm of the steering representation. To steer, we multiply the projection by a coefficient $\beta$ that represents the intervention strength.
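To see what Eq. 8 does to a token representation, it helps to split $r^{*}_{i}$ into a component parallel to $\Delta\hat{r}^{*}$ and a component orthogonal to it. The short derivation below assumes that Eq. 8 is applied to one $d$ -dimensional vector per intervened layer; the text gives the shapes as $K\times d$ without spelling this out, so this reading is our assumption, as is the notation $r^{*}_{i,\parallel}$ and $r^{*}_{i,\perp}$ .

```latex
r^{*}_{i} = r^{*}_{i,\parallel} + r^{*}_{i,\perp},
\qquad
r^{*}_{i,\parallel} = \frac{r^{*\mathsf{T}}_{i}\Delta\hat{r}^{*}}{\|\Delta\hat{r}^{*}\|^{2}}\,\Delta\hat{r}^{*},
\qquad
r^{*\mathsf{T}}_{i,\perp}\Delta\hat{r}^{*} = 0

\hat{r}^{*}_{i}
  = r^{*}_{i} + \beta\, r^{*}_{i,\parallel}
  = (1+\beta)\, r^{*}_{i,\parallel} + r^{*}_{i,\perp}
```

Under this reading, the orthogonal component is left untouched and the parallel component is rescaled by $1+\beta$ : $\beta=0$ leaves the token unchanged, $\beta>0$ amplifies its existing component along the steering direction, and $\beta=-1$ removes that component entirely. Note that the shift has the same sign as the token’s current alignment, so the size and direction of the intervention depend on each token rather than being a constant offset.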

By adjusting the value of $\beta$ , we can control the extent to which the output is steered toward the desired direction. A larger value of $\beta$ will result in a stronger emphasis on the desired direction, while a smaller value will have a more subtle effect. The choice of $\beta$ is crucial for achieving the desired level of control over the output. It allows us to balance the influence of the steering representation with the original token representation, ensuring that the generated output remains coherent and relevant to the input context while incorporating the desired direction. Another consideration is the adaptability of $\beta$ to different contexts and desired directions. It may be beneficial to dynamically adjust the value of $\beta$ based on the characteristics of the input and the specific direction we aim to steer the output. For example, we can employ a context-dependent $\beta$ that varies based on the semantic similarity between the input and the desired direction, allowing for a more nuanced control of the output.

Furthermore, the projection operation can be extended to incorporate multiple steering representations simultaneously. In scenarios where we want to steer the output towards multiple desired directions, we can compute the projections onto each steering representation separately and combine them using appropriate weighting schemes. This enables a more comprehensive control over the output, allowing us to incorporate multiple desired attributes or concepts.

In summary, the control of output in our proposed method involves projecting the generated token representations onto the steering representation and scaling the projected component by a coefficient $\beta$ . This approach establishes a connection between the generated output and the desired direction, enabling a more fine-grained and context-aware control compared to existing methods. By carefully tuning the value of $\beta$ and potentially adapting it to different contexts, we can effectively steer the output toward the desired direction while maintaining the quality and coherence of the generated text. The projection operation can also be extended to incorporate multiple steering representations, allowing for more comprehensive control over the output.

5 Experiments

We evaluate the performance of our proposed LLMGuardaril framework on four key attributes that serve as guardrails for large language models: truthfulness, toxicity, bias, and harmfulness. These attributes are crucial for ensuring the safe and responsible deployment of language models in real-world applications.

5.1 Baselines

In this section, we introduce a diverse set of baseline methods that aim to steer the behavior of large language models towards desired attributes or concepts. By comparing LLMGuardaril against these baselines, we can assess its effectiveness in achieving the desired steering while maintaining the model’s general knowledge and capabilities.

Base model (Few-shot) [64] is an in-context learning method that does not require fine-tuning of the language model. This technique uses a small set of high-quality samples to construct prompts, which then guide the LLM toward answers that resemble the provided examples. To implement this approach, we first create a set of 3-5 queries. These queries are fed into GPT-4 to generate high-quality answers. The resulting query-answer pairs serve as in-distribution examples, which are used to construct prompts at inference time so that the model’s answers are better aligned with the desired output.

Linear Artificial Tomography (LAT-Reading) [18] is a representation reading technique that aims to locate emergent representations for high-level concepts and functions within a network. The LAT pipeline consists of three key steps: designing stimulus and task, collecting neural activity, and constructing a linear model. A stimulus set and task template are carefully designed to elicit distinct neural activity for the target concept or function. Next, neural activity is collected from specific token positions in the model’s hidden states. Finally, a linear model, such as PCA or a supervised method, is applied to the collected neural activity to identify directions that align with the target concept or function. The resulting “reading vectors” can be generally used to steer the output direction.
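The last step of the LAT pipeline can be illustrated with the unsupervised PCA variant: the reading vector is taken as the first principal component of the collected activations. This is a sketch under our own assumptions (synthetic activations, the helper name `reading_vector`), not the reference implementation from [18].

```python
import numpy as np


def reading_vector(activations):
    """First principal component of hidden states of shape (num_stimuli, d)."""
    centered = activations - activations.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


# Synthetic activations: most variance lies along a known "concept" direction.
rng = np.random.default_rng(0)
concept = np.array([1.0, 0.0, 0.0])
signs = np.array([1.0, -1.0] * 20)            # stimuli with / without the concept
acts = 3.0 * signs[:, None] * concept + 0.1 * rng.normal(size=(40, 3))

v = reading_vector(acts)
print(abs(v @ concept))                        # close to 1: direction recovered up to sign
```

The sign of a principal component is arbitrary, so in practice the direction has to be oriented with a few labeled stimuli before it is used for steering.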

LAT-Contrast [18] is a representation control technique that builds upon the reading vectors obtained through LAT. During inference, the same input is run through the model using a pair of contrastive prompts, producing two different representations. The difference between these representations forms a Contrast Vector, which serves as a stimulus-dependent controller. The Contrast Vectors are used to modify the model’s representations at each layer using operations such as linear combination, piece-wise operations, or projection. Compared to LAT-Reading, it requires over $3\times$ the inference compute, because every input must be processed with the positive prompt, the negative prompt, and the original prompt.

Low-Rank Representation Adaptation (LoRRA) [18] is another representation control technique that addresses the computational overhead associated with calculating Contrast Vectors during inference. LoRRA involves fine-tuning low-rank adapter matrices connected to the model’s attention weights using a specific loss function applied to representations. The loss function is designed to guide the model’s representations towards the desired target representations, which can be obtained using methods such as LAT-Contrast. During training, the adapters are updated to minimize the difference between the model’s current representations and the target representations. Once trained, the adapters are merged into the model, resulting in modified representations without additional computational burden during inference. LoRRA provides an efficient way to control the model by directly optimizing the model’s representations.

Activation Addition (ActAdd) [34] is a lightweight approach for controlling the behavior of pre-trained language models without fine-tuning or optimization. ActAdd operates by modifying the activations of a frozen model at inference time to steer the output in a desired direction, which is specified through natural language prompts. The method computes the difference in activations resulting from a pair of prompts that represent the desired attribute and its opposite, then adds this delta to the activations at a specific layer during inference. This steers the model’s output towards exhibiting the desired attribute while preserving its general knowledge and capabilities on unrelated tasks.

Mean-Centring [35] is a method for steering the behavior of pre-trained language models by modifying their activations at inference time. The key idea is to extract a “distillation vector” that captures the properties of a target dataset exhibiting the desired behavior. This is done by averaging the activations of the model when processing the target dataset and then subtracting the mean activations of the model on a general training dataset. This mean-centring step removes the model’s general activation bias and isolates the specific direction corresponding to the target behavior.
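The mean-centring step amounts to a difference of two means. The snippet below is a minimal sketch with made-up activations; the function name `mean_centred_vector` is ours.

```python
import numpy as np


def mean_centred_vector(target_acts, general_acts):
    """Mean activation on the target dataset minus mean activation on general data."""
    return target_acts.mean(axis=0) - general_acts.mean(axis=0)


# Made-up activations with d = 2.
general = np.array([[1.0, 1.0], [3.0, 1.0]])   # mean [2, 1]: general activation bias
target = np.array([[2.0, 4.0], [4.0, 4.0]])    # mean [3, 4]
print(mean_centred_vector(target, general))     # [1. 3.]
```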

Contrast-Consistent Search (CCS) [26] is an unsupervised method for discovering latent knowledge in the internal activations of pre-trained language models. Given a set of yes-no questions, CCS constructs “contrast pairs” by answering each question as both “yes” and “no”. It then extracts the model’s hidden representations for these contrast pairs, normalizes them to remove positional biases, and learns a linear mapping from the representations to probabilities of being true or false.
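For orientation, the CCS objective can be written down compactly. The loss below (a consistency term plus a confidence term) follows the formulation of the original CCS paper [26] as we understand it; it is not spelled out in this section, so the exact form, the helper names, and the synthetic data should be read as assumptions.

```python
import numpy as np


def normalize(x):
    """Standardize one side of the contrast pairs independently of the other."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def ccs_loss(theta, bias, x_yes, x_no):
    """Consistency + confidence loss of a linear probe on contrast pairs."""
    p_yes = sigmoid(normalize(x_yes) @ theta + bias)
    p_no = sigmoid(normalize(x_no) @ theta + bias)
    consistency = (p_yes - (1.0 - p_no)) ** 2   # p(yes) and p(no) should sum to 1
    confidence = np.minimum(p_yes, p_no) ** 2   # discourage the 0.5 / 0.5 solution
    return float(np.mean(consistency + confidence))


# Synthetic contrast pairs: a hidden "truth" direction flips sign between
# the "yes" and "no" completion of the same question.
rng = np.random.default_rng(0)
truth = np.array([1.0, -1.0] * 20)
direction = np.array([1.0, 0.0, 0.0])
x_yes = truth[:, None] * direction + 0.05 * rng.normal(size=(40, 3))
x_no = -truth[:, None] * direction + 0.05 * rng.normal(size=(40, 3))

uninformed = ccs_loss(np.zeros(3), 0.0, x_yes, x_no)     # probe outputs 0.5 everywhere
informed = ccs_loss(5.0 * direction, 0.0, x_yes, x_no)   # probe along the truth direction
print(uninformed, informed)
```

A probe that ignores the input is consistent but not confident, so it is penalized; a probe aligned with the hidden direction drives both terms close to zero without using any labels.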

5.2 Benchmarks and Evaluation Metrics

To assess the effectiveness of LLMGuardaril in steering language models towards desired attributes, we conduct experiments on several public datasets that focus on different aspects of content safety and bias [63]. These datasets are carefully selected to cover a wide range of challenging scenarios and provide a comprehensive evaluation of our proposed framework.

Truthfulness. The TruthfulQA benchmark [65] is utilized to evaluate LLMGuardaril’s performance in promoting truthful responses. This dataset is meticulously designed to be adversarial, incorporating false beliefs, misconceptions, and logical falsehoods across 38 categories. By testing on the full dataset of 817 questions, we assess the model’s ability to navigate through these challenges and provide accurate and informative answers. The primary metric, denoted as $\texttt{True}+\texttt{Info}$ , represents the percentage of responses that are both truthful and informative, as determined by two finetuned GPT-3-13B models (GPT-judge). Additionally, we report separate results for truthfulness and informativeness to provide a more granular analysis of the model’s performance.

Toxicity. To evaluate the effectiveness of LLMGuardaril in mitigating toxicity, we employ the ToxiGen dataset [66], which contains implicitly toxic and benign sentences mentioning 13 minority groups. We use a revised version of the dataset [67] that reduces noise by filtering out prompts with disagreement among annotators regarding the target demographic group. Through stratified sampling, we select a representative subset of 700 examples for our experiments. The main metric is the percentage of toxic generations, determined using HateBERT, a fine-tuned BERT model provided by the dataset. Furthermore, we report the percentage of refusal responses, identified by the presence of specific signal keywords, to assess the model’s ability to avoid engaging with potentially harmful prompts.

Bias. The BOLD benchmark [68] is employed to evaluate bias in generated responses. This large-scale dataset comprises 23,679 English Wikipedia prompts spanning five domains: race, gender, religion, political ideology, and profession. To manage experiment costs, we sample 120 prompts from each domain. The VADER sentiment score [69] serves as the primary metric, quantifying the sentiment directed towards the population mentioned in each prompt. VADER generates a sentiment score between -1 and 1, with 0 indicating a neutral sentiment. While the goal is to identify imbalances in sentiment across different groups, for conciseness, we report the mean sentiment score over the entire dataset as our main metric. Additionally, we provide the percentage of refusal responses to assess the model’s ability to avoid biased or discriminatory language.

Harmfulness. To evaluate LLMGuardaril’s performance in mitigating harmful content, we utilize the AdvBench dataset [70], which contains 500 harmful behaviors and instructions reflecting toxic, discriminatory, or cybercrime-related actions. The primary metric is the percentage of refusal responses, identified using the same key phrases for refusal as in the original dataset. Additionally, we employ HateBERT, a fine-tuned BERT model, to classify the toxicity of generated responses (excluding refusals). The percentage of toxic generations serves as an additional metric to assess the model’s ability to avoid producing harmful content.

By conducting experiments on these diverse benchmarks, we aim to provide a comprehensive evaluation of LLMGuardaril’s effectiveness in steering language models towards desired attributes while mitigating undesirable behaviors. The selected datasets cover a wide range of content safety and bias challenges, enabling us to assess the framework’s performance in promoting truthfulness, reducing toxicity and bias, and avoiding harmful content generation.
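The refusal and toxicity percentages reported for ToxiGen and AdvBench can be computed along the following lines. This is a sketch only: the phrase list is illustrative and not the one used by the datasets, `is_toxic` is a hypothetical stand-in for the HateBERT classifier, and taking both percentages over all responses is our assumption.

```python
# Illustrative refusal phrases; the benchmark's actual list is not reproduced here.
REFUSAL_PHRASES = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_refusal(response, phrases=REFUSAL_PHRASES):
    """Flag a response as a refusal if it contains any signal phrase."""
    text = response.lower()
    return any(phrase in text for phrase in phrases)


def refusal_and_toxic_rates(responses, is_toxic):
    """Return (refusal %, toxic %); refusals are excluded from toxicity scoring."""
    refused = [is_refusal(r) for r in responses]
    toxic = [(not ref) and is_toxic(r) for r, ref in zip(responses, refused)]
    n = len(responses)
    return 100.0 * sum(refused) / n, 100.0 * sum(toxic) / n


# Toy responses and a toy classifier (hypothetical stand-in for HateBERT).
responses = [
    "I'm sorry, but I cannot help with that.",
    "Sure, here is an insult about them.",
    "Here is a neutral answer.",
    "As an AI, I will not do that.",
]
refusal_pct, toxic_pct = refusal_and_toxic_rates(responses, lambda r: "insult" in r)
print(refusal_pct, toxic_pct)  # 50.0 25.0
```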

5.3 Experiment Settings

Prompt Design. To thoroughly assess the efficacy of LLMGuardaril, we employ a prompt structure that combines a prefix steering prompt, which can be either positive or negative, with an identical semantic prompt. This approach enables us to isolate the influence of the steering prompt on the model’s generated output while maintaining a consistent semantic context across experiments. As depicted in Figure 5, we meticulously develop the prefix steering prompt sets tailored to each benchmark dataset: TruthfulQA, BOLD, ToxiGen, and AdvBench. These prompts are strategically designed to steer the model towards generating content that aligns with the desired attributes or concepts specific to each dataset, such as truthfulness, lack of bias, non-toxicity, and avoidance of harmful content.

Model Selection. We evaluate the guardrail performance of LLMGuardaril using two prominent families of instruction-tuned large language models: Llama2 [71] and Vicuna-V1.5 [72]. The choice of these models is motivated by their widespread adoption and exceptional performance. To strike a balance between computational feasibility and comprehensive evaluation, our primary experiments focus on the 7B and 13B variants within each model family. These model sizes provide a reasonable trade-off between performance and resource requirements, enabling us to conduct extensive experiments while managing computational costs effectively. To gain a deeper understanding of LLMGuardaril’s scalability and its potential to generalize across different model sizes, we expand our experiments to encompass the entire spectrum of model variants within the Vicuna family, including Vicuna-33B.

Table 1: Performance comparison on four benchmark datasets. $\uparrow$ means higher is better and $\downarrow$ means lower is better.

BaseModel	Method	TruthfulQA True(%) $\uparrow$	TruthfulQA Info(%) $\uparrow$	TruthfulQA True+Info(%) $\uparrow$	ToxiGen Refusal(%) $\uparrow$	ToxiGen Toxic(%) $\downarrow$	BOLD Refusal(%) $\uparrow$	BOLD Avg.Sent. $\uparrow$	AdvBench Refusal(%) $\uparrow$	AdvBench Toxic(%) $\downarrow$
Vicuna-7b	Base	34.08	88.32	30.10	54.00	44.71	35.00	0.438	80.58	19.04
Vicuna-7b	Few Shot	37.13	91.11	33.83	66.65	32.30	39.72	0.498	81.30	17.63
Vicuna-7b	LAT-Reading	38.69	92.79	35.90	70.32	29.02	43.61	0.593	83.34	16.60
Vicuna-7b	LAT-Contrast	40.19	94.50	37.98	77.40	22.11	53.27	0.710	85.28	14.56
Vicuna-7b	LORRA	39.00	93.77	36.57	73.76	25.18	50.77	0.673	84.74	15.00
Vicuna-7b	ActAdd	35.63	91.02	32.43	63.38	36.50	37.10	0.444	81.21	18.79
Vicuna-7b	Mean-Centring	37.06	93.23	34.55	66.22	33.70	40.35	0.478	82.03	17.76
Vicuna-7b	CCS	37.30	95.44	35.60	75.91	23.70	50.40	0.680	84.88	15.02
Vicuna-7b	LLMGuardaril (ours)	44.74	95.63	42.78	85.59	14.02	59.55	0.738	86.76	12.90
Llama2-7b	Base	34.75	89.52	31.11	54.71	43.57	0.45	0.746	65.58	34.42
Llama2-7b	Few Shot	36.13	92.49	33.42	67.71	32.29	2.34	0.553	77.50	22.31
Llama2-7b	LAT-Reading	38.40	92.21	35.41	68.43	30.43	3.67	0.855	80.58	19.04
Llama2-7b	LAT-Contrast	38.62	94.77	36.60	76.43	23.57	3.91	0.884	78.31	21.50
Llama2-7b	LORRA	38.32	93.40	35.79	72.86	25.43	3.57	0.880	77.73	22.27
Llama2-7b	ActAdd	35.13	90.31	31.73	62.71	37.29	1.50	0.704	68.82	30.11
Llama2-7b	Mean-Centring	36.28	92.68	33.62	65.86	30.29	3.80	0.899	70.58	28.46
Llama2-7b	CCS	34.61	96.22	33.30	74.78	25.01	4.22	0.873	77.31	22.62
Llama2-7b	LLMGuardaril (ours)	42.31	95.60	40.45	86.29	13.01	8.00	0.895	80.85	19.15
Llama2-13b	Base	45.33	90.80	41.15	57.57	42.43	3.57	0.863	68.46	30.77
Llama2-13b	Few Shot	44.63	94.52	42.18	67.92	32.01	4.07	0.872	70.40	29.55
Llama2-13b	LAT-Reading	45.02	95.73	43.10	70.73	29.02	5.31	0.893	77.92	21.79
Llama2-13b	LAT-Contrast	47.04	96.36	45.33	78.66	20.54	6.69	0.899	78.97	20.33
Llama2-13b	LORRA	46.57	96.01	44.71	74.63	25.20	6.14	0.870	78.60	21.14
Llama2-13b	ActAdd	45.06	92.75	41.79	63.56	36.11	4.63	0.860	71.17	28.50
Llama2-13b	Mean-Centring	44.74	93.77	41.95	68.13	31.59	4.90	0.867	73.55	26.42
Llama2-13b	CCS	43.13	97.43	42.03	79.39	20.45	6.71	0.897	78.26	21.57
Llama2-13b	LLMGuardaril (ours)	48.22	96.77	46.67	88.85	10.15	8.64	0.899	80.96	19.04

5.4 Main Results

Table 1 compares LLMGuardaril against the baseline methods on four benchmark datasets: TruthfulQA, ToxiGen, BOLD, and AdvBench. These datasets evaluate the models’ ability to align with desired attributes such as truthfulness, non-toxicity, lack of bias, and avoidance of harmful content. Across the three model variants (Vicuna-7b, Llama2-7b, and Llama2-13b), LLMGuardaril achieves the best result on most metrics, including True and True+Info on TruthfulQA and both the refusal and toxicity rates on ToxiGen for every model. The exceptions are narrow: CCS attains higher informativeness on the two Llama2 models, Mean-Centring reaches a slightly higher average sentiment on Llama2-7b (0.899 vs. 0.895), LAT-Contrast ties on average sentiment for Llama2-13b (0.899), and LAT-Reading has a marginally lower AdvBench toxicity rate on Llama2-7b (19.04% vs. 19.15%). Overall, the results indicate that LLMGuardaril steers the language models towards the desired attributes more effectively than the baselines, mitigating undesirable behaviors and aligning the generated outputs with the target objectives. As depicted in Figure 5, we provide examples comparing the original LLM output without any guardrails to the intervened output generated by our LLMGuardaril. The original and intervened outputs use shading to highlight tokens based on their degree of alignment with the desired direction, as measured by the similarity scores between the token representations and the steering representation.

Table 2: Ablation studies of our LLMGuardaril model for the key components. $\uparrow$ means higher is better and $\downarrow$ means lower is better.

BaseModel	Method	ToxiGen Refusal(%) $\uparrow$	ToxiGen Toxic(%) $\downarrow$	BOLD Refusal(%) $\uparrow$	BOLD Avg.Sent. $\uparrow$	AdvBench Refusal(%) $\uparrow$	AdvBench Toxic(%) $\downarrow$
Llama2-7b	Base	54.71	43.57	0.45	0.746	65.58	34.42
Llama2-7b	LLMGuardaril (ours)	86.29	13.01	8.00	0.895	80.85	19.15
Llama2-7b	w/o Causal Debias Loss	74.34	25.35	5.77	0.890	76.67	21.34
Llama2-7b	w/o Prediction Loss	50.86	49.00	0.33	0.711	64.10	35.57
Llama2-7b	w/o Intervened Layer Selection	84.32	15.52	7.68	0.878	77.62	21.47
Llama2-13b	Base	57.57	42.43	3.57	0.863	68.46	30.77
Llama2-13b	LLMGuardaril (ours)	88.85	10.15	8.64	0.899	80.96	19.04
Llama2-13b	w/o Causal Debias Loss	76.59	23.16	6.20	0.876	78.41	20.76
Llama2-13b	w/o Prediction Loss	51.44	48.56	2.05	0.799	64.57	35.43
Llama2-13b	w/o Intervened Layer Selection	87.22	12.20	8.05	0.883	77.93	21.34

5.5 Ablation Study

5.5.1 Different Components

Table 2 presents the ablation studies conducted to evaluate the impact of different components in the LLMGuardaril framework. The experiments are performed on two model variants, Llama2-7b and Llama2-13b, using three benchmark datasets: ToxiGen, BOLD, and AdvBench. The ablation studies focus on three key components of LLMGuardaril:

Causal Debias Loss. This loss term is designed to debias the influence of the semantic direction representation on the direction representations during the adversarial learning process. By minimizing this loss, LLMGuardaril aims to mitigate the biases from the steering process and obtain unbiased steering representations.

Prediction Loss. The prediction loss ensures that the original semantic information remains unchanged during the debiasing process. It measures the cross-entropy between the original output generated by the LLM using the original representations and the output generated using the debiased representations. Minimizing this loss helps maintain the model’s general knowledge and capabilities while applying the steering.

Intervened Layer Selection. LLMGuardaril employs a systematic approach to identify the most effective layers for intervention based on probing accuracy. The selected layers are then used in the Debias LoRA Block to obtain the debiased representations and calculate the steering representation. Without this intervened layer selection module, we directly select the middle layers, for example, 6 layers for Llama2-7B (total 32 layers) and 10 layers for Llama2-13B (total 40 layers).

The ablation studies are conducted by removing each component individually and evaluating the model’s performance. The results are compared to the base model and the complete LLMGuardaril framework. Overall, the ablation studies demonstrate the significance of each component in LLMGuardaril. The Causal Debias Loss is crucial for obtaining unbiased steering representations and effectively mitigating undesired attributes. The Prediction Loss plays a vital role in maintaining the model’s general knowledge and ensuring the quality of the generated outputs. The Intervened Layer Selection contributes to the overall performance by identifying the most effective layers for intervention. The complete LLMGuardaril framework, incorporating all these components, achieves the best results across the benchmark datasets, validating the effectiveness of our proposed approach.
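The contrast between the two layer-selection strategies in this ablation can be sketched as follows. The details are assumptions on our part: we take “based on probing accuracy” to mean keeping the layers with the highest probe accuracy, we take “middle layers” to mean a centered contiguous window, layer indices are 0-based, and the accuracy values are invented for illustration.

```python
def select_layers(probe_accuracy, k):
    """Indices of the k layers with the highest probing accuracy, in layer order."""
    ranked = sorted(range(len(probe_accuracy)),
                    key=lambda layer: probe_accuracy[layer],
                    reverse=True)
    return sorted(ranked[:k])


def middle_layers(num_layers, k):
    """Fallback used in the ablation: a centered window of k consecutive layers."""
    start = (num_layers - k) // 2
    return list(range(start, start + k))


# Invented probing accuracies for a 6-layer toy model.
accuracy = [0.50, 0.60, 0.90, 0.85, 0.70, 0.95]
print(select_layers(accuracy, k=3))   # [2, 3, 5]
print(middle_layers(32, 6))           # [13, 14, 15, 16, 17, 18]
print(middle_layers(40, 10))          # [15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
```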

Table 3: Ablation studies of our LLMGuardaril for different output control operation choices. $\uparrow$ means higher is better and $\downarrow$ means lower is better.

BaseModel	Method	ToxiGen Refusal(%) $\uparrow$	ToxiGen Toxic(%) $\downarrow$	BOLD Refusal(%) $\uparrow$	BOLD Avg.Sent. $\uparrow$	AdvBench Refusal(%) $\uparrow$	AdvBench Toxic(%) $\downarrow$
Llama2-7b	Base	54.71	43.57	0.45	0.746	65.58	34.42
Llama2-7b	w/ Addition	80.84	19.10	6.99	0.860	77.72	21.68
Llama2-7b	w/ Product	84.33	15.45	7.64	0.865	79.92	19.36
Llama2-7b	w/ Projection (ours)	86.29	13.01	8.00	0.895	80.85	19.15
Llama2-13b	Base	57.57	42.43	3.57	0.863	68.46	30.77
Llama2-13b	w/ Addition	82.77	17.01	7.55	0.877	77.90	21.70
Llama2-13b	w/ Product	85.67	14.02	8.39	0.882	79.45	20.23
Llama2-13b	w/ Projection (ours)	88.85	10.15	8.64	0.899	80.96	19.04

5.5.2 Different Output Control Operations

We explore three different operations for controlling the model output using the steering representations: addition, product, and projection. Let $r^{*}_{i}\in\mathbb{R}^{K\times d}$ denote the representation of the generated token $i$ obtained from the intervened layers and $\Delta\hat{r}^{*}$ denote the steering representation. The three operations are defined as follows: Addition $\hat{r}^{*}_{i}=r^{*}_{i}+\beta\Delta\hat{r}^{*}$ ; Product $\hat{r}^{*}_{i}=\beta r^{*}_{i}\cdot\Delta\hat{r}^{*}$ ; Projection $\hat{r}^{*}_{i}=r^{*}_{i}+\beta\times\frac{r^{*\mathsf{T}}_{i}\Delta\hat{r}^{*}}{\|\Delta\hat{r}^{*}\|^{2}}\Delta\hat{r}^{*}$ . In all three operations, $\beta$ is a coefficient that represents the intervention strength. By adjusting $\beta$ , we can control the extent to which the output is steered toward the desired direction. Table 3 shows that the projection operation achieves the best performance on every metric and dataset for both Llama2-7b and Llama2-13b. The superior performance of the projection operation can be attributed to its ability to emphasize the component of the token representation that aligns with the steering representation. By projecting the token representation onto the direction of the steering representation, it effectively amplifies the desired direction in the output while preserving the relevant information from the original token representation. This allows for a more targeted and context-aware control of the output compared to simple addition or product operations.
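The difference between the three operations is easiest to see on a single $d$ -dimensional vector. The sketch below applies each operation to one token that is partly aligned with the steering direction and one that is orthogonal to it. Two points are our assumptions: the operations are applied per layer to vectors of size $d$ , and the “ $\cdot$ ” in the product operation is read as an elementwise product so that the result keeps the shape of a representation.

```python
import numpy as np


def addition(r, delta, beta):
    """Constant shift: every token moves by the same vector beta * delta."""
    return r + beta * delta


def product(r, delta, beta):
    """Assumption: elementwise product, so the output keeps shape (d,)."""
    return beta * r * delta


def projection(r, delta, beta):
    """Shift along delta, scaled by how strongly the token already aligns with it."""
    coeff = (r @ delta) / (delta @ delta)
    return r + beta * coeff * delta


delta = np.array([2.0, 0.0])        # toy steering representation
aligned = np.array([1.0, 1.0])      # has a component along delta
orthogonal = np.array([0.0, 1.0])   # no component along delta
beta = 2.0

for name, op in [("addition", addition), ("product", product), ("projection", projection)]:
    print(name, op(aligned, delta, beta), op(orthogonal, delta, beta))
```

In this toy case addition moves both tokens by the same offset, the elementwise product discards the coordinate in which the steering vector is zero, and projection changes only the token that already has a component along the steering direction, which is the token-dependent behavior the paragraph above describes.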

Table 4: Scaling Law studies of our LLMGuardaril model for different-sized LLMs. $\uparrow$ means higher is better and $\downarrow$ means lower is better.

BaseModel	Method	ToxiGen Refusal(%) $\uparrow$	ToxiGen Toxic(%) $\downarrow$	BOLD Refusal(%) $\uparrow$	BOLD Avg.Sent. $\uparrow$	AdvBench Refusal(%) $\uparrow$	AdvBench Toxic(%) $\downarrow$
Vicuna-7b	Base	54.00	44.71	35.00	0.438	80.58	19.04
Vicuna-7b	LLMGuardaril (ours)	85.59	14.02	59.55	0.738	86.76	12.90
Vicuna-13b	Base	56.65	43.25	38.83	0.553	81.74	17.74
Vicuna-13b	LLMGuardaril (ours)	86.60	13.33	60.35	0.740	87.22	12.60
Vicuna-33b	Base	58.02	41.24	40.42	0.575	82.34	17.37
Vicuna-33b	LLMGuardaril (ours)	87.45	12.34	61.22	0.749	87.96	12.00

5.6 The Study of Scaling Law

Table 4 presents the scaling law study conducted to evaluate the performance of LLMGuardaril across different model sizes within the Vicuna family. The experiments are performed on three benchmark datasets: ToxiGen, BOLD, and AdvBench. The base model performance is compared to LLMGuardaril for each model size: Vicuna-7b, Vicuna-13b, and Vicuna-33b. The scaling law study aims to investigate the effect of increasing model size on the performance of LLMGuardaril in steering the model’s output towards desired attributes. It explores whether the benefits of LLMGuardaril scale with model size and whether larger models exhibit better alignment with the target attributes.

Overall, the scaling law study indicates that LLMGuardaril remains effective across model sizes within the Vicuna family. Its absolute scores improve monotonically from Vicuna-7b to Vicuna-33b on all six metrics, while its margin over the base model stays large but does not grow with size (for example, the ToxiGen refusal rate gains are 31.59, 29.95, and 29.43 points for the 7b, 13b, and 33b models, respectively). The results suggest that LLMGuardaril can be applied to even larger models to achieve better alignment with target attributes while mitigating undesired behaviors.
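When reading Table 4, it is useful to separate absolute scores from the margin over the base model. The short script below recomputes the margins for the two ToxiGen columns; the numbers are copied from Table 4, while the dictionary layout and the helper name `margins` are ours.

```python
# ToxiGen columns of Table 4: (Refusal %, Toxic %) for the base model and LLMGuardaril.
table4_toxigen = {
    "Vicuna-7b": {"base": (54.00, 44.71), "ours": (85.59, 14.02)},
    "Vicuna-13b": {"base": (56.65, 43.25), "ours": (86.60, 13.33)},
    "Vicuna-33b": {"base": (58.02, 41.24), "ours": (87.45, 12.34)},
}


def margins(table):
    """Per model: (gain in refusal rate, reduction in toxic rate), in percentage points."""
    out = {}
    for model, rows in table.items():
        refusal_gain = round(rows["ours"][0] - rows["base"][0], 2)
        toxic_drop = round(rows["base"][1] - rows["ours"][1], 2)
        out[model] = (refusal_gain, toxic_drop)
    return out


for model, (gain, drop) in margins(table4_toxigen).items():
    print(f"{model}: refusal +{gain:.2f} pts, toxic -{drop:.2f} pts")
```

The margins come out at roughly 29 to 32 points of refusal rate and 29 to 31 points of toxicity reduction for all three sizes, so the improvement in absolute scores with model size largely tracks the improvement of the base models.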

Table 5: Out-of-domain (OOD) experiments of our LLMGuardaril model on the ToxiGen dataset. $\uparrow$ means higher is better and $\downarrow$ means lower is better.

BaseModel	Method	Target: Black Refusal(%) $\uparrow$	Target: Black Toxic(%) $\downarrow$	Target: Muslim Refusal(%) $\uparrow$	Target: Muslim Toxic(%) $\downarrow$	Target: Native Am Refusal(%) $\uparrow$	Target: Native Am Toxic(%) $\downarrow$	Target: Latino Refusal(%) $\uparrow$	Target: Latino Toxic(%) $\downarrow$	Target: Jewish Refusal(%) $\uparrow$	Target: Jewish Toxic(%) $\downarrow$
Llama2-7b	Base	64.72	35.20	53.48	26.11	78.24	21.76	29.13	69.66	80.61	19.17
Llama2-7b	LAT-Reading	80.44	19.25	69.88	30.02	90.05	9.78	44.15	53.15	87.90	11.30
Llama2-7b	LORRA	84.75	15.00	73.40	25.33	91.15	8.33	46.33	53.67	90.74	9.26
Llama2-7b	Mean-Centring	75.43	24.50	67.14	30.18	84.07	15.12	40.50	58.88	89.00	10.55
Llama2-7b	LLMGuardaril (ours)	90.73	8.13	75.21	23.15	100	0	56.20	42.62	100	0
Llama2-13b	Base	68.55	31.45	55.37	44.63	84.49	15.51	35.44	61.50	81.31	17.66
Llama2-13b	LAT-Reading	83.70	16.10	71.16	28.60	91.05	8.53	47.93	51.20	88.10	11.65
Llama2-13b	LORRA	87.45	12.55	76.87	22.18	96.55	3.10	60.89	38.05	92.77	7.10
Llama2-13b	Mean-Centring	77.64	22.30	69.90	30.10	88.74	11.26	45.49	52.40	84.78	15.00
Llama2-13b	LLMGuardaril (ours)	89.54	10.41	80.87	18.17	100	0	73.17	25.00	100	0

5.7 OOD Experiments of LLMGuardaril

Table 5 presents the results of out-of-domain (OOD) experiments for the LLMGuardaril model and several baseline methods on the ToxiGen dataset. The experiments aim to assess the model’s ability to generalize the steering representations learned from one demographic group (women) to other demographic groups (Black, Muslim, Native American, Latino, and Jewish).

LLMGuardaril demonstrates strong generalization capability by effectively reducing toxicity and increasing refusal rates for all target demographic groups, even though the steering representations were learned using examples from the women group. This suggests that the learned steering representations capture general patterns of toxicity and harmfulness that can be applied to different demographic contexts. In addition, LLMGuardaril consistently outperforms the baseline methods (Base, LAT-Reading, LORRA, and Mean-Centring) across all target demographic groups, achieving higher refusal rates and lower toxicity percentages. This highlights the effectiveness of the causal analysis and adversarial learning techniques employed by LLMGuardaril in obtaining unbiased steering representations that generalize well to unseen demographic groups.
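As an arithmetic check on this comparison, the following Python sketch transcribes the refusal and toxicity percentages from Table 5 and tests whether LLMGuardaril attains the highest refusal rate and the lowest toxicity for every target group. The helper names (`wins_everywhere`, `gain_over_base`) are illustrative and not part of the paper.

```python
TARGETS = ["Black", "Muslim", "Native Am", "Latino", "Jewish"]
OURS = "LLMGuardaril (ours)"

# (Refusal %, Toxic %) for each target group, in TARGETS order, copied from Table 5.
TABLE5 = {
    "Llama2-7b": {
        "Base": [(64.72, 35.20), (53.48, 26.11), (78.24, 21.76), (29.13, 69.66), (80.61, 19.17)],
        "LAT-Reading": [(80.44, 19.25), (69.88, 30.02), (90.05, 9.78), (44.15, 53.15), (87.90, 11.30)],
        "LORRA": [(84.75, 15.00), (73.40, 25.33), (91.15, 8.33), (46.33, 53.67), (90.74, 9.26)],
        "Mean-Centring": [(75.43, 24.50), (67.14, 30.18), (84.07, 15.12), (40.50, 58.88), (89.00, 10.55)],
        OURS: [(90.73, 8.13), (75.21, 23.15), (100.0, 0.0), (56.20, 42.62), (100.0, 0.0)],
    },
    "Llama2-13b": {
        "Base": [(68.55, 31.45), (55.37, 44.63), (84.49, 15.51), (35.44, 61.50), (81.31, 17.66)],
        "LAT-Reading": [(83.70, 16.10), (71.16, 28.60), (91.05, 8.53), (47.93, 51.20), (88.10, 11.65)],
        "LORRA": [(87.45, 12.55), (76.87, 22.18), (96.55, 3.10), (60.89, 38.05), (92.77, 7.10)],
        "Mean-Centring": [(77.64, 22.30), (69.90, 30.10), (88.74, 11.26), (45.49, 52.40), (84.78, 15.00)],
        OURS: [(89.54, 10.41), (80.87, 18.17), (100.0, 0.0), (73.17, 25.00), (100.0, 0.0)],
    },
}


def wins_everywhere(rows):
    """True if OURS has the highest (or tied) refusal and lowest (or tied) toxicity for every target."""
    for i in range(len(TARGETS)):
        best_refusal = max(values[i][0] for values in rows.values())
        lowest_toxic = min(values[i][1] for values in rows.values())
        if rows[OURS][i] != (best_refusal, lowest_toxic):
            return False
    return True


def gain_over_base(rows):
    """Per target: (refusal gain, toxicity reduction) of OURS over Base, in percentage points."""
    gains = {}
    for i, target in enumerate(TARGETS):
        base_refusal, base_toxic = rows["Base"][i]
        our_refusal, our_toxic = rows[OURS][i]
        gains[target] = (round(our_refusal - base_refusal, 2), round(base_toxic - our_toxic, 2))
    return gains


if __name__ == "__main__":
    for model, rows in TABLE5.items():
        print(model, "best on every target:", wins_everywhere(rows))
        for target, (refusal_gain, toxic_drop) in gain_over_base(rows).items():
            print(f"  {target}: refusal +{refusal_gain}, toxic -{toxic_drop}")
```

With the Table 5 values, the check passes for both Llama2-7b and Llama2-13b. For Llama2-7b on the Black target group, for example, the refusal rate rises by 26.01 percentage points over Base and the toxicity rate falls by 27.07 points.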

5.8 Sensitivity Analysis of LLMGuardaril

Figures 6 and 7 present the sensitivity analysis of the hyperparameters $\alpha$ and $\beta$ on the ToxiGen and AdvBench datasets. The hyperparameter $\alpha$ controls the balance between the prediction reconstruction loss and the debias loss in the overall loss function. The findings suggest that the prediction reconstruction loss is a vital element of the training process and carries more weight than the debias loss in maintaining the model’s effectiveness. The hyperparameter $\beta$ represents the intervention strength when controlling the model output using the steering representations. Based on the trends observed in the sensitivity analyses, the optimal value for $\beta$ appears to be around 2. This value achieves a good balance between refusing harmful prompts and minimizing toxic outputs while maintaining the model’s performance. The optimal ranges identified for $\alpha$ and $\beta$ can guide the selection of appropriate values for these hyperparameters in practical applications of the LLMGuardaril model.
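To make the roles of the two hyperparameters concrete, here is a minimal Python sketch. It rests on two assumptions that this section does not state: that the overall loss has the weighted form $L = L_{pred} + \alpha L_{debias}$, and that output control adds the steering representation $\Delta R$ to a hidden state with strength $\beta$. The exact loss and control operations are those defined in Section 4; the function names `total_loss` and `steer` are hypothetical.

```python
def total_loss(pred_loss, debias_loss, alpha):
    """Assumed form of the overall objective: L = L_pred + alpha * L_debias.

    alpha weights the debias (adversarial) term relative to the
    prediction reconstruction term.
    """
    return pred_loss + alpha * debias_loss


def steer(hidden, delta_r, beta):
    """Assumed additive intervention: h' = h + beta * delta_r.

    hidden and delta_r are equal-length lists of floats standing in for a
    layer's hidden state and the steering representation; beta is the
    intervention strength.
    """
    if len(hidden) != len(delta_r):
        raise ValueError("hidden and delta_r must have the same length")
    return [h + beta * d for h, d in zip(hidden, delta_r)]


if __name__ == "__main__":
    # Toy values chosen only for illustration; they are not results from the paper.
    print(total_loss(pred_loss=1.0, debias_loss=0.5, alpha=0.1))
    print(steer([1.0, 2.0], [0.5, -0.5], beta=2.0))
```

Under these assumptions, a small $\alpha$ keeps the prediction reconstruction term dominant, and $\beta = 2$ shifts the hidden state by twice the steering representation.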

6 Discussion

In this paper, we introduced LLMGuardaril, a novel framework for obtaining unbiased steering representations in large language models (LLMs) by incorporating causal analysis and adversarial learning techniques. Our approach aims to mitigate the influence of semantic biases on the steering process and provide explainable insights into the generated output. In this section, we discuss the potential social impact of our work, address the limitations of our framework, and outline future research directions.

6.1 Potential Social Impact

The development of safe and reliable LLMs has significant social implications. As these models become increasingly prevalent in various domains, such as content generation, dialogue systems, and decision support, it is crucial to ensure that their outputs align with desired attributes and do not perpetuate harmful biases. Our proposed LLMGuardaril framework contributes to this goal by enabling the steering of LLMs toward desired directions while mitigating the influence of unwanted biases.

The explainability component of LLMGuardaril has the potential to foster trust and transparency in the use of LLMs. By providing insights into the alignment between the generated output and the desired direction, our framework allows users to understand the specific aspects of the output that contribute to the desired attributes. This explainability feature can be particularly valuable in sensitive domains, such as healthcare, finance, and legal systems, where the ability to interpret and validate the model’s outputs is essential.

Furthermore, the unbiased steering representations obtained through LLMGuardaril can help mitigate the propagation of societal biases and stereotypes in the generated text. By disentangling the influence of semantic biases from the steering process, our framework can contribute to the development of more equitable and inclusive language models. This has the potential to promote fairness and reduce discrimination in applications that rely on LLMs for decision-making or content generation.

However, it is important to acknowledge that the impact of LLMs on society is multifaceted and requires ongoing research and ethical considerations. While frameworks like LLMGuardaril can help in steering LLMs towards desired attributes, it is crucial to ensure that the desired directions themselves are carefully defined and aligned with societal values and norms. The responsible development and deployment of LLMs necessitate a collaborative effort among researchers, policymakers, and stakeholders to address the broader ethical implications.

6.2 Limitations

While LLMGuardaril demonstrates promising results in obtaining unbiased steering representations and providing explainable insights, there are certain limitations to our framework that should be acknowledged.

First, the effectiveness of LLMGuardaril relies on the availability and quality of the steering prompts used to represent the desired and undesired attributes. Constructing appropriate prompts that accurately capture the intended steering direction can be challenging and may require domain expertise. The performance of our framework may be sensitive to the choice of prompts, and poorly designed prompts can lead to suboptimal steering results.

Second, the adversarial learning approach employed in LLMGuardaril involves a training process that can be computationally intensive, especially for large-scale LLMs. The training time and resource requirements may pose practical constraints on the applicability of our framework in certain scenarios. Further research is needed to optimize the training process and explore more efficient techniques for obtaining unbiased steering representations.

Third, while LLMGuardaril aims to mitigate the influence of semantic biases on the steering process, it may not completely eliminate all forms of bias. The framework relies on the assumption that adversarial learning can effectively disentangle the semantic biases from the steering representations. However, there may be residual biases or subtle interactions that are not fully captured by our approach. Continuous evaluation and refinement of the framework are necessary to address potential limitations and improve its robustness.

6.3 Future Work

The proposed LLMGuardaril framework opens up several avenues for future research in the field of language model steering and explainable AI. Here, we outline some potential directions for further exploration.

First, extending LLMGuardaril to handle multiple steering attributes simultaneously is an important direction. In real-world scenarios, it may be desirable to steer LLMs towards multiple desired attributes or concepts concurrently. Investigating techniques to incorporate multiple steering representations and develop appropriate weighting schemes to balance their influence on the generated output is a valuable research direction.

Second, exploring the integration of LLMGuardaril with other approaches for controlling the output of LLMs, such as fine-tuning or prompt engineering, can lead to more comprehensive and effective steering strategies. Combining the strengths of different approaches may yield improved performance and greater flexibility in guiding the model’s output towards desired attributes.

Third, further research on the explainability component of LLMGuardaril can focus on developing more advanced techniques for analyzing the alignment between the generated output and the steering representations. Investigating methods to provide more fine-grained explanations, such as identifying specific phrases or patterns that contribute to the desired or undesired attributes, can enhance the interpretability of the steering process.

Fourth, conducting user studies to evaluate the usability and effectiveness of LLMGuardaril in real-world applications is crucial. Engaging with domain experts and end-users to gather feedback on the explainability and control aspects of the framework can provide valuable insights for improvement and adaptation to specific use cases.

Finally, addressing the broader ethical implications of language model steering and explainable AI is an important research direction. Collaborating with researchers from diverse fields, including ethics, social sciences, and law, can help in developing guidelines and best practices for the responsible development and deployment of LLMs. Engaging in multidisciplinary research efforts can contribute to the development of AI systems that are not only effective but also aligned with societal values and norms.

In conclusion, LLMGuardaril presents a novel framework for obtaining unbiased steering representations in LLMs by incorporating causal analysis and adversarial learning techniques. Our approach aims to mitigate the influence of semantic biases on the steering process and provide explainable insights into the generated output. While the framework demonstrates promising results, there are limitations and opportunities for future research. The potential social impact of our work highlights the importance of developing safe and reliable LLMs that align with desired attributes and promote fairness and inclusivity. Ongoing research and collaborative efforts are necessary to address the broader ethical implications and ensure the responsible development and deployment of language models in real-world applications.

7 Conclusion

In this paper, we presented LLMGuardaril, a novel framework for obtaining unbiased steering representations in large language models (LLMs) by incorporating causal analysis and adversarial learning techniques. Our approach addresses the limitations of existing methods that rely on the assumption of unbiased representations and the sufficiency of steering prompts alone. By systematically identifying and blocking the confounding effects of semantic biases, LLMGuardaril enables the extraction of steering representations that accurately capture the desired attributes or concepts.

As the field of natural language processing continues to advance, frameworks like LLMGuardaril will play a crucial role in shaping the future of language model steering and explainable AI. By combining theoretical foundations, innovative techniques, and ethical considerations, we can work towards the development of LLMs that not only excel in their tasks but also align with societal values and norms. The journey towards unbiased and explainable language models is an ongoing endeavor, and LLMGuardaril serves as a significant milestone in this direction.

References

[1] Zhixuan Chu, Huaiyu Guo, Xinyuan Zhou, Yijia Wang, Fei Yu, Hong Chen, Wanqing Xu, Xin Lu, Qing Cui, Longfei Li, et al. Data-centric financial large language models. arXiv preprint arXiv:2310.17784, 2023.
[2] Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, et al. Time-llm: Time series forecasting by reprogramming large language models. arXiv preprint arXiv:2310.01728, 2023.
[3] Zhixuan Chu, Hongyan Hao, Xin Ouyang, Simeng Wang, Yan Wang, Yue Shen, Jinjie Gu, Qing Cui, Longfei Li, Siqiao Xue, et al. Leveraging large language models for pre-trained recommender systems. arXiv preprint arXiv:2308.10837, 2023.
[4] Siqiao Xue, Yan Wang, Zhixuan Chu, Xiaoming Shi, Caigao Jiang, Hongyan Hao, Gangwei Jiang, Xiaoyun Feng, James Zhang, and Jun Zhou. Prompt-augmented temporal point process for streaming event sequence. Advances in Neural Information Processing Systems, 36:18885–18905, 2023.
[5] Zhixuan Chu, Yan Wang, Feng Zhu, Lu Yu, Longfei Li, and Jinjie Gu. Professional agents–evolving large language models into autonomous experts with human-level competencies. arXiv preprint arXiv:2402.03628, 2024.
[6] Yanchu Guan, Dong Wang, Zhixuan Chu, Shiyu Wang, Feiyue Ni, Ruihua Song, Longfei Li, Jinjie Gu, and Chenyi Zhuang. Intelligent virtual assistants with llm-based process automation. arXiv preprint arXiv:2312.06677, 2023.
[7] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[8] Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023.
[9] Christian Tarsney. Deception and manipulation in generative ai. arXiv preprint arXiv:2401.11335, 2024.
[10] Zhen Tan, Alimohammad Beigi, Song Wang, Ruocheng Guo, Amrita Bhattacharjee, Bohan Jiang, Mansooreh Karami, Jundong Li, Lu Cheng, and Huan Liu. Large language models for data annotation: A survey. arXiv preprint arXiv:2402.13446, 2024.
[11] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023.
[12] Q Vera Liao and Jennifer Wortman Vaughan. Ai transparency in the age of llms: A human-centered research roadmap. arXiv preprint arXiv:2306.01941, 2023.
[13] Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1–38, 2024.
[14] Thomas Hickling, Abdelhafid Zenati, Nabil Aouf, and Phillippa Spencer. Explainability in deep reinforcement learning: A review into current methods and applications. ACM Computing Surveys, 56(5):1–35, 2023.
[15] Zhixuan Chu, Mengxuan Hu, Qing Cui, Longfei Li, and Sheng Li. Task-driven causal feature distillation: Towards trustworthy risk prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 11642–11650, 2024.
[16] Wes Gurnee and Max Tegmark. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.
[17] Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, pages 746–751, 2013.
[18] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405, 2023.
[19] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pages 4349–4357, 2016.
[20] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024.
[21] Zhixuan Chu, Yan Wang, Qing Cui, Longfei Li, Wenqing Chen, Sheng Li, Zhan Qin, and Kui Ren. Llm-guided multi-view hypergraph learning for human-centric explainable recommendation. arXiv preprint arXiv:2401.08217, 2024.
[22] Heng-Shiou Sheu, Zhixuan Chu, Daiqing Qi, and Sheng Li. Knowledge-guided article embedding refinement for session-based news recommendation. IEEE Transactions on Neural Networks and Learning Systems, 33(12):7921–7927, 2021.
[23] Yan Wang, Zhixuan Chu, Xin Ouyang, Simeng Wang, Hongyan Hao, Yue Shen, Jinjie Gu, Siqiao Xue, James Y Zhang, Qing Cui, et al. Enhancing recommender systems with large language model reasoning graphs. arXiv preprint arXiv:2308.10835, 2023.
[24] Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. Inside: Llms’ internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024.
[25] Amos Azaria and Tom Mitchell. The internal state of an llm knows when its lying. arXiv preprint arXiv:2304.13734, 2023.
[26] Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.
[27] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2016.
[28] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022.
[29] Evan Hernandez, Belinda Z Li, and Jacob Andreas. Inspecting and editing knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023.
[30] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
[31] Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. Mquake: Assessing knowledge editing in language models via multi-hop questions. arXiv preprint arXiv:2305.14795, 2023.
[32] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.
[33] Nishant Subramani, Nivedita Suresh, and Matthew E Peters. Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124, 2022.
[34] Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
[35] Ole Jorgensen, Dylan Cope, Nandi Schoots, and Murray Shanahan. Improving activation steering in language models with mean-centring. arXiv preprint arXiv:2312.03813, 2023.
[36] Judea Pearl. Causality: models, reasoning and inference, volume 29. Springer, 2000.
[37] Donald B Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331, 2005.
[38] Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD), 15(5):1–46, 2021.
[39] Zhixuan Chu and Sheng Li. Causal effect estimation: Recent progress, challenges, and opportunities. Machine Learning for Causal Inference, pages 79–100, 2023.
[40] Wenqing Chen and Zhixuan Chu. Causal inference and natural language processing. In Machine Learning for Causal Inference, pages 189–206. Springer, 2023.
[41] Jinglong Gao, Xiao Ding, Bing Qin, and Ting Liu. Is chatgpt a good causal reasoner? a comprehensive evaluation. arXiv preprint arXiv:2305.07375, 2023.
[42] Jiayao Zhang, Hongming Zhang, Weijie Su, and Dan Roth. Rock: Causal inference principles for reasoning about commonsense causality. In International Conference on Machine Learning, pages 26750–26771. PMLR, 2022.
[43] Hang Chen, Bingyu Liao, Jing Luo, Wenjing Zhu, and Xinyu Yang. Learning a structural causal model for intuition reasoning in conversation. IEEE Transactions on Knowledge and Data Engineering, 2024.
[44] Jiayi Liu, Wei Wei, Zhixuan Chu, Xing Gao, Ji Zhang, Tan Yan, and Yulin Kang. Incorporating causal analysis into diversified and logical response generation. In Proceedings of the 29th International Conference on Computational Linguistics, pages 378–388, 2022.
[45] Gabriel Stanovsky, Noah A Smith, and Luke Zettlemoyer. Evaluating gender bias in machine translation. arXiv preprint arXiv:1906.00591, 2019.
[46] Lei Ding, Dengdeng Yu, Jinhan Xie, Wenxing Guo, Shenggang Hu, Meichen Liu, Linglong Kong, Hongsheng Dai, Yanchun Bao, and Bei Jiang. Word embeddings via causal inference: Gender bias reducing and semantic information preserving. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11864–11872, 2022.
[47] Nicholas Meade, Elinor Poole-Dayan, and Siva Reddy. An empirical survey of the effectiveness of debiasing techniques for pre-trained language models. arXiv preprint arXiv:2110.08527, 2021.
[48] Rongzhou Bao, Jiayi Wang, and Hai Zhao. Defending pre-trained language models from adversarial word substitutions without performance sacrifice. arXiv preprint arXiv:2105.14553, 2021.
[49] Boxi Cao, Hongyu Lin, Xianpei Han, Fangchao Liu, and Le Sun. Can prompt probe pretrained language models? understanding the invisible risks from a causal view. arXiv preprint arXiv:2203.12258, 2022.
[50] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. Hallusionbench: You see what you think? or you think what you see? an image-context reasoning benchmark challenging for gpt-4v (ision), llava-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.
[51] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
[52] Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112, 2023.
[53] Guangya Wan, Yuqi Wu, Mengxuan Hu, Zhixuan Chu, and Sheng Li. Bridging causal discovery and large language models: A comprehensive survey of integrative approaches and future directions. arXiv preprint arXiv:2402.11068, 2024.
[54] Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, and Hyunwoo J Kim. Large language models are temporal and causal reasoners for video question answering. arXiv preprint arXiv:2310.15747, 2023.
[55] Nick Pawlowski, James Vaughan, Joel Jennings, and Cheng Zhang. Answering causal questions with augmented llms. 2023.
[56] Zhixuan Chu, Stephen L Rathbun, and Sheng Li. Matching in selective and balanced representation space for treatment effects estimation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 205–214, 2020.
[57] Judea Pearl and Dana Mackenzie. The Book of Why. Basic Books, New York, 2018.
[58] Zhixuan Chu, Stephen L Rathbun, and Sheng Li. Learning infomax and domain-independent representations for causal effect inference with real-world data. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pages 433–441. SIAM, 2022.
[59] Zhixuan Chu, Stephen L Rathbun, and Sheng Li. Multi-task adversarial learning for treatment effect estimation in basket trials. In Conference on Health, Inference, and Learning, pages 79–91. PMLR, 2022.
[60] Ian Tenney, Dipanjan Das, and Ellie Pavlick. Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950, 2019.
[61] Zhixuan Chu, Stephen L Rathbun, and Sheng Li. Graph infomax adversarial learning for treatment effect estimation with networked observational data. arXiv preprint arXiv:2106.02881, 2021.
[62] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR, 2015.
[63] Haoran Wang and Kai Shu. Backdoor activation attack: Attack large language models using activation steering for safety-alignment. arXiv preprint arXiv:2311.09433, 2023.
[64] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[65] Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
[66] Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
[67] Saghar Hosseini, Hamid Palangi, and Ahmed Hassan Awadallah. An empirical study of metrics to measure representational harms in pre-trained language models. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 121–134, Toronto, Canada, July 2023. Association for Computational Linguistics.
[68] Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 862–872, 2021.
[69] C.J. Hutto. Vader-sentiment-analysis, 2022.
[70] Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
[71] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[72] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
