Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
15 views

Building Guardrails for Large Language Models

Uploaded by

mateuschalao2
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views

Building Guardrails for Large Language Models

Uploaded by

mateuschalao2
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

Building Guardrails for Large Language Models

Yi Dong * 1 Ronghui Mu * 1 Gaojie Jin 2 Yi Qi 1 Jinwei Hu 1 Xingyu Zhao 3 Jie Meng 4 Wenjie Ruan 1
Xiaowei Huang 1

Abstract privacy and robustness (Huang et al., 2023d). In societal


contexts, worries also include the potential misuse by mali-
arXiv:2402.01822v2 [cs.CL] 29 May 2024

As Large Language Models (LLMs) become more


cious actors for activities such as spreading misinformation
integrated into our daily lives, it is crucial to
or aiding criminal activities, as indicated in studies by Kreps
identify and mitigate their risks, especially when
et al. (2022); Goldstein et al. (2023); Kang et al. (2023). In
the risks can have profound impacts on human
the scientific context, LLMs can be used in professional
users and societies. Guardrails, which filter the
contexts, where there are dedicated ethical considerations
inputs or outputs of LLMs, have emerged as a
and risks in scientific research (Birhane et al., 2023).
core safeguarding technology. This position pa-
per takes a deep look at current open-source so- To address these issues, model developers have implemented
lutions (Llama Guard, Nvidia NeMo, Guardrails a variety of safety protocols intended to confine the behav-
AI), and discusses the challenges and the road iors of these models to a more secure range of functions. The
towards building more complete solutions. Draw- complexity of LLMs, characterized by intricate networks
ing on robust evidence from previous research, we and numerous parameters, along with the closed-source na-
advocate for a systematic approach to construct ture (such as ChatGPT), present substantial hurdles. These
guardrails for LLMs, based on comprehensive complexities require different strategies compared to the
consideration of diverse contexts across various pre-LLM era, which focus on white-box techniques, en-
LLMs applications. We propose employing socio- hancing models by various regularisations and architecture
technical methods through collaboration with a adaptations during training. Therefore, in parallel to the rein-
multi-disciplinary team to pinpoint precise tech- forcement learning from human feedback (RLHF) and other
nical requirements, exploring advanced neural- training skills such as in-context training, the community
symbolic implementations to embrace the com- moves towards employing black-box, post-hoc strategies,
plexity of the requirements, and developing verifi- notably guardrails (Welbl et al., 2021; Gehman et al., 2020),
cation and testing to ensure the utmost quality of which monitors and filters the inputs and outputs of trained
the final product. LLMs. A guardrail is an algorithm that takes as input a set
of objects (e.g., the input and/or the output of LLMs) and
determines if and how some enforcement actions can be
1. Introduction taken to reduce the risks embedded in the objects. For exam-
ple, if an input to the LLMs is related to child exploitation,
Recent times have witnessed a notable increase in the uti- the guardrail may stop the input from being processed by
lization of Large Language Models (LLMs) like ChatGPT, the LLMs or adapt the output so that it becomes harmless
attributed to their extensive and general capabilities (Ope- (Perez et al., 2022). In other words, guardrails are to identify
nAI, 2023). However, the rapid deployment and integration the potential misuse in the query stage and try to prevent the
of LLMs have raised significant concerns regarding their model from providing the answer that should not be given.
risks including, but not limited to, ethical use, data biases,
The difficulty in constructing guardrails often lies in estab-
*
Equal contribution 1 Department of Computer Science, Uni- lishing the requirements for them. E.g., AI regulations can
versity of Liverpool, UK 2 Key Laboratory of System Software be different across different countries, and in the context
(Chinese Academy of Sciences) and State Key Laboratory of Com- of a company, data privacy can be less serious than it is
puter Science, Institute of Software, Chinese Academy of Sciences
3
WMG, University of Warwick, Warwick, UK 4 Institute of Digital in the public domain. Nevertheless, a guardrail of LLMs
Technologies, Loughborough University London, UK. Correspon- may include requirements from one or more of the fol-
dence to: Xiaowei Huang <xiaowei.huang@liverpool.ac.uk>. lowing categories: (i) Free from unintended responses e.g.,
offensive and hate speech (Section 3.1); (ii) Compliance
Proceedings of the 41 st International Conference on Machine to ethical principles such as fairness, privacy, and copy-
Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by
the author(s). right (Section 3.2, 3.3); (iii) Hallucinations and uncertainty

1
Building Guardrails for Large Language Models

(Section 3.4). In this paper, we do not include the typical dicts their classification on a set of user-specified categories.
requirement, i.e., accuracy, as they are benchmarks of the Figure 1 shows its workflow. Due to the zero/few-shot abili-
LLMs and arguably not the responsibilities of the guardrails. ties of LLMs, Llama Guard can be adapted–by defining the
That said, there might not be a clear cut on the responsibili- user-specified categories –to different taxonomies and sets
ties (notably, robustness) between LLMs and the guardrails, of guidelines that meet requirements for different applica-
and the two models shall collaborate to achieve a joint set of tions and users. This is a Type 1 neural-symbolic system
objectives. Nevertheless, for concrete applications, the re- (Lamb et al., 2021), i.e., typical deep learning methods
quirements need to be precisely defined, together with their where the input and output of a learning agent are symbolic.
corresponding metrics, and a multi-disciplinary approach It lacks guaranteed reliability since the classification results
is called for. The mitigation of a given requirement (such depend on the LLM’s understanding of the categories and
as hallucinations, toxicity, fairness, biases, etc) is already the model’s predictive accuracy.
non-trivial, as discussed in Section 3. The need to work
with multiple requirements makes it worse, especially when
some requirements can be conflicting. Such complexity re-
quires a sophisticated solution design method to manage.
In terms of the design of guardrails, while there might not
be “one method that rules them all”, a plausible design of
the guardrail is neural-symbolic, with learning agents and
symbolic agents collaborating in processing both the in-
puts and the outputs of LLMs. There are multiple types of
neural-symbolic agents (Lamb et al., 2021). However, the Figure 1. Llama Guard Guardrail Workflow
existing guardrail solutions such as Llama Guard (Inan et al.,
2023), Nvidia NeMo (Rebedea et al., 2023), and Guardrails N vidia N eM o, described in (Rebedea et al., 2023), func-
AI (Rajpal, 2023) use the simplest, loosely coupled ones. tions as an intermediary layer that enhances the control and
Given the complexity of the guardrails, it will be interesting safety of LLMs. NeMo is designed as a versatile toolkit
to investigate other, more deeply coupled, neural-symbolic that facilitates the creation, training, and deployment of
solution designs. state-of-the-art LLMs, including but not limited to GPT.
LLMs are extensively used throughout the guardrail process
This paper argues that, like safety-critical software, a for various tasks across multiple stages. For example, in
systematic process to cover the development cycle (ranging a conversation scenario, LLM is utilized in the following
from specification, to design, implementation, integration, three phases: (I) Generating user intent, where it refines
verification, validation, and production release) is required user intent using provided examples and potential intents,
to carefully build the guardrails, as indicated in industrial producing deterministic results by setting the temperature to
standards such as ISO-26262 and DO-178B/C. The goal zero. (II) Generating next step: In this phase, Nemo searches
of this paper is to review the state-of-the-art (Section 2), the most relevant similar flows, and integrates these similar
present technical challenges on implementing individual flows together into an example, which is then fed into the
requirements (Section 3), and then discuss several issues LLM. The output of LLM call is termed as “bot intent”. (III)
regarding the systematic design of a guardrail for a specific Generating the bot message, taking the most relevant five
application context (Section 4). bot intents and relevant data chunks as inputs to provide
context.
2. Existing Implementation Solutions Unlike traditional models that rely on initial layer embed-
This section reviews three existing implementation solutions dings, NeMo utilizes similarity functions to capture the most
for guardrails1 , and discusses their pros and cons. pertinent semantics, employing the “sentence transformers /
all-MiniLM-L6-v2” model for this purpose. This model aids
Llama Guard (Inan et al., 2023), developed by Meta on in embedding inputs into a dense vector space, enhancing
the Llama2-7b architecture, focuses on enhancing Human- the efficacy of nearest neighbor searches using the Annoy
AI conversation safety. It is a fine-tuned model that takes algorithm. Additionally, NeMo employs Colang, an exe-
the input and output of the victim model as input and pre- cutable programme language designed by Nvidia Nvidia
1
There are other guardrails available in the market, such as (2023), to establish constraints, in order to guide LLMs
Open AI’s solution, Microsoft Azure AI Content Safety, Google within set dialogical boundaries. When the customer’s input
Guardrails for Generative AI. However, they are either not open- prompt comes, NeMo embeds the prompt as a vector, and
sourced or lack details and contents for reproduction. Our discus- then uses K-nearest neighbor (KNN) method to compare it
sion is limited to the three guardrails that are open-source and have
been successfully replicated in our experiments.
with the stored vector-based user canonical forms, retriev-
ing the embedding vectors that are ‘the most similar’ to

2
Building Guardrails for Large Language Models

the embedded input prompt. After that, Nemo starts the without comprehensive studies on if and how such infrastruc-
flow execution to generate output from the canonical form. ture can be utilized to implement a satisfactory guardrail.
During the flow execution process, the LLMs are used to Research is needed to understand detailed issues regard-
generate a safe answer if requested by the Colang program. ing the infrastructures, including their capability (in deal-
The process is presented in Figure 2. Building on the above ing with, e.g., configuration redundancy and conversational
customizable workflow, NeMo also includes a set of pre- capability limitations), generalization (in dealing with un-
implemented moderations dedicated to e.g., fact-checking, foreseen scenarios), and expressivity (of enabling suitable
hallucination prevention in responses, and content modera- interactions of symbolic and learning components). More
tion. NeMo is also a Type-1 neural-symbolic system, with importantly, a systematic approach of building guardrails
its effectiveness closely tied to the performance of the KNN based on the infrastructures is called for.
method.
Overall, in this section we review three existing strategies for
implementing guardrails, each with its own set of pros and
cons. Subsequent sections will delve into methodologies for
constructing guardrail components, tailored to meet specific
requirements. Especially, Section 3 provides an overview of
the current research landscape on individual requirements,
and Section 4 delivers a broader systems-thinking approach
to consider multiple requirements altogether.

Figure 2. Nvidia NeMo Guardrails Workflow


3. Technical Challenges of Implementing
Guardrails AI enables the user to add structure, type and Individual Requirements
quality guarantees to the outputs of LLMs (Rajpal, 2023).
It operates in three steps: 1) defining the “RAIL” spec, 2) This section will review the technical challenges of imple-
initializing the “guard”, and 3) wrapping the LLMs. In the menting individual requirements, highlighting the intriguing
first step, Guardrails AI defines a set of RAIL specifications, complexity of dealing with a requirement. We consider
which are used to describe the return format limitations. four categories of requirements that might be requested
This information is required to be written in a specific XML in a specific context or application. Table 1 provides a
format, facilitating subsequent output checks, e.g., structure summarisation of existing representative works. For every
and types. The second step involves activating the defined category of requirements, it classifies techniques into three
spec as a guard. For applications that require categorized groups. For vulnerability detection, the victim LLMs are
processing, such as toxicity checks, additional classifier typically treated as a blackbox, and thus they can be either
models can be introduced to categorize the input and output with or without guardrails. Protection via LLMs enhance-
text. The third step is triggered when the guard detects an ment includes techniques that tune the weights of LLMs. In
error. Here, the Guardrails AI can automatically generate a contrast, For protection via I/O engineering, we consider
corrective prompt, pursuing the LLMs to regenerate the cor- any techniques that work on input and output, e.g., prompt
rect answer. The output is then re-checked to ensure it meets engineering and output filter.
the specified requirements. Currently, the methods based on
Guardrails AI are only applicable for text-level checks and 3.1. Free from Unintended Response
cannot be used in multimodal scenarios involving images or
Recent studies have highlighted a growing concern about
audio. Unlike the previous two methods, Guardrail AI is a
the ability of LLMs like ChatGPT to generate toxic con-
Type-2 neural-symbolic system, which consists of a back-
tents, even with guardrails in place (Burgess, 2023; Chris-
bone symbolic algorithm supported by learning algorithms
tian, 2023; Zou et al., 2023). Most research uses prompt
(in this case, those additional classifier models).
engineering methods to cause LLMs to create unintended
content, a process often referred to as “jailbreaking”.
Vulnerability Detection Kang et al. (2023); Wei et al.
(2023); Shen et al. (2023); Deng et al. (2023) have demon-
strated that the LLMs can be manipulated to produce ma-
licious contents using specific prompts. In addition, Kang
et al. (2023) used TEXT- DAVINCI -003 prompt, Wei et al.
Figure 3. Guardrails AI Workflow (2023) explored failure models, Shen et al. (2023) employed
“DAN” (Do Anything now”), Zou et al. (2023) introduced
Nevertheless, these solutions only provide the basic infras-
automated prompt generation based on gradient, Deng et al.
tructure (language for rule description, example workflow),

3
Building Guardrails for Large Language Models
Vulnerability Detection Protection via LLMs Enhancement Protection via I/O Engineering
(Kang et al., 2023) (Wei et al., 2023) (Li et al., 2018)(Liu et al., 2020) (Jain et al., 2023)(Kumar et al., 2023)
(Shen et al., 2023)(Deng et al., 2023) (Miyato et al., 2016)(Ganguli et al., 2022) (Robey et al., 2023)(Kim et al., 2023)
Free from
(Yong et al., 2023)(Vega et al., 2023) (Touvron et al., 2023)(Perez et al., 2022) (Rajpal, 2023)(Inan et al., 2023)
Unintended
(Zhang & Ippolito, 2023)(Albert, 2024) (Askell et al., 2021)(Nakano et al., 2021) (Rebedea et al., 2023)
Response
(Koh et al., 2023)(Motoki et al., 2023) (Ranaldi et al., 2023)(Limisiewicz et al., 2023) (Huang et al., 2023a)
Fairness (Limisiewicz et al., 2023)(Badyal et al., 2023) (Xie & Lukasiewicz, 2023)(Ernst et al., 2023) (Tao et al., 2023)(Oba et al., 2023)
(Yeh et al., 2023)(Shaikh et al., 2022) (Ungless et al., 2022)(Ramezani & Xu, 2023) (Dwivedi et al., 2023)
(Zou et al., 2023) (Wang et al., 2023c)
(Zanella-Béguelin et al., 2020)(Shi et al., 2022) (Ozdayi et al., 2023)
(Li et al., 2023b)(Huang et al., 2022)
Privacy (Igamberdiev & Habernal, 2023)(Yu et al., 2022) (Li et al., 2023c)
(Li et al., 2023a) (Lukas et al., 2023)
(Mireshghallah et al., 2022)(Xiao et al., 2023) (Duan et al., 2023)
(Wang et al., 2024a)(Mireshghallah et al., 2023)
(Ji et al., 2023)(Manakul et al., 2023) (Meng et al., 2022b)(Chuang et al., 2023) (Press et al., 2022)(Gao et al., 2023)
(Bang et al., 2023)(Chen & Shu, 2023) (Meng et al., 2022a) (Bayat et al., 2023) (Pinter & Elhadad, 2023)(He et al., 2022)
Hallucination
(Xu et al., 2024)(Huang et al., 2023b) (Wang et al., 2024b)(Elaraby et al., 2023) (Zhao et al., 2023)(Ram et al., 2023)
(Chern et al., 2023) (Cohen et al., 2023) (Liang et al., 2024)(Razumovskaia et al., 2023) (Dhuliawala et al., 2023)(Wang et al., 2023b)

Table 1. Literature on detecting and mitigating individual risks.

(2023) proposed a balanced way by combining manual and scenario where the model is not open. Consequently, several
automatic prompt generation together, and Vega et al. (2023) I/O engineering approaches that work on the input/output
created few-shot priming attack and forced the LLMs to start prompts have emerged. Jain et al. (2023) explored various
generating from the middle of a sentence. Zhang & Ippolito defense technologies, including preprocessing and rephras-
(2023) evaluated the effectiveness of these prompt manip- ing input prompts. Kumar et al. (2023) used a safety filter on
ulation attacks. Beyond that, Yong et al. (2023) bypassed input prompts for certified robustness. Robey et al. (2023)
GPT-4’s safeguard by translating the English inputs into introduced randomized smoothing technology to defend
low-source languages. During our tests, we observed that against such attacks by modifying input prompts and using
certain vulnerabilities in LLMs that were previously known majority voting for detection. Additionally, guardrail tools
have been addressed, possibly due to the updates made such as Guardrails AI and Nemo also offer detection and
by developers to enhance security measures. Nonetheless, protection functions for harmful and toxic outputs.
a considerable number of individuals referred to as “jail-
Our Perspective As Tramer et al. (2020) have pointed out,
breakers” remain capable of effectively deceiving ChatGPT,
while the defenses are effective against certain attacks, they
which are tested in the publicly accessible project (Albert,
remain vulnerable to stronger ones. This could turn into a
2024), as demonstrated in Appendix B.
continuous and infinite cycle of attacks and defenses. Con-
Protection via LLMs Enhancement The LLMs can be sequently, a more robust solution is required, ideally of-
enhanced by inherent safety training technologies. It can be fering provable guarantees to confirm the LLMs’ robust-
achieved via the augmentation of training data by adding ness against all adversarial attacks within a permissible
adversarial examples (Li et al., 2018; Ganguli et al., 2022; perturbation limit. Toward this goal, we notice that existing
Perez et al., 2022; Mozes et al., 2023). Moreover, vari- guardrails seldom consider providing such guarantees. First
ous efforts have been made to enhance safety during the and foremost, it is necessary to develop metrics for toxicity
RLHF process. Touvron et al. (2023) proposed to incor- and other criteria to address unintended responses. In terms
porate a safety reward into the RLHF process to prevent of these metrics, rather than relying on purely empirical
harmful outputs. Askell et al. (2021) improved the RLHF measures which may improve the performance but cannot
process by implementing context distillation in the training lead to guarantees, we can consider certified robustness
dataset. In the context of LLMs, Nakano et al. (2021) used bounds, either statistical bound (Cohen et al., 2019; Zhao
the Reject Sampling mechanism to select the least harmful et al., 2022) or deterministic bound (Huang et al., 2017; Sun
responses, thereby shaping the training dataset for RLHF. & Ruan, 2023), as scores to measure the guardrail perfor-
The robustness of language models can also be improved mance. Additionally, we can also incorporate the metrics
by modifying the training loss functions (Liu et al., 2020; (or the bounds) into the training process of the LLMs for
Miyato et al., 2016). However, these adaptions are inef- improvement, or use it in the fine-tuning process.
fective for the LLMs due to the catastrophic forgetting in
the training process (Jain et al., 2023). Furthermore, these 3.2. Fairness
approaches require retraining of the LLMs to defend against
the attacks, which can be unsuitable due to high-cost and Fairness in LLMs has been studied from different angles,
closed-source nature. such as gender bias (Malik, 2023; Sun et al., 2023; Ovalle
et al., 2023), cultural bias (Tao et al., 2023; Gupta et al.,
Protection via I/O Engineering While detection and model 2023), dataset bias (Sheppard et al., 2023), and social bias
enhancement are crucial, they alone are insufficient to safe- (Sheng et al., 2023; Manerba et al., 2023; Tang et al., 2023;
guard against the evolving nature of threats, especially in the Gonçalves & Strubell, 2023; Nagireddy et al., 2023; Bi

4
Building Guardrails for Large Language Models

et al., 2023). Understanding and addressing biases in LLMs designed code generation templates to mitigate the bias in
requires solid theoretical frameworks and comprehensive code generation tasks. Tao et al. (2023) found that cultural
analysis. Gallegos et al. (2023) provided a comprehensive prompting is a simple and effective method to reduce cul-
overview of social biases and fairness in natural language tural bias in the latest LLMs, although it may be ineffective
processing, offering a framework for identifying and catego- or even exacerbate bias in some countries. Oba et al. (2023)
rizing different types of harms, intuitive taxonomies for bias proposed a method to address gender bias that does not re-
evaluation metrics and datasets, and a guide for mitigations. quire access to model parameters. It shows that text-based
preambles generated from manually designed templates can
Vulnerability Detection Badyal et al. (2023) purposefully
effectively suppress gender biases with minimal adverse
incorporated biases into the responses of LLMs to craft
effects on downstream task performance. Dwivedi et al.
distinct personas for use in interactive media. Koh et al.
(2023) guided LLMs to generate more equitable content by
(2023) focused on identifying and quantifying instances of
employing an innovative approach of prompt engineering
social bias in models like ChatGPT, especially in sensitive
and in-context learning, significantly reducing gender bias,
applications such as job and college admissions screening.
especially in traditionally problematic.
Limisiewicz et al. (2023) proposed a novel method for de-
tecting gender bias in language models. Motoki et al. (2023) Our Perspective To effectively mitigate bias, it’s crucial
examined the presence of political bias in ChatGPT, focus- to develop guardrails through a comprehensive approach
ing on aspects such as race, gender, religion, and political that intertwines various strategies. This begins with metic-
orientation. Additionally, they explored the role of random- ulously monitoring and filtering training data to ensure it
ness in responses, by collecting multiple answers to the is diverse and devoid of biased or discriminatory content.
same questions, which enables a more robust analysis of po- The essence of this step lies in either removing biased data
tential biases. Yeh et al. (2023) examined the bias of LLMs or enriching the dataset with more inclusive and varied
by controlling the input, highlighting that LLMs can still information. Alongside this, algorithmic adjustments are
produce biased responses despite the progress in bias reduc- necessary, which involve fine-tuning the model’s param-
tion. Shaikh et al. (2022) designed a Bias Index to quantify eters to prevent the overemphasis of certain patterns that
and address biases inherent in LLMs including GPT-4. It could lead to biased outcomes. Incorporating bias detection
has also been observed that the biased response can be gen- tools is another pivotal aspect. These tools are designed
erated inadvertently, sometimes in the form of seemingly to scrutinize the model’s outputs, identifying and flagging
harmless jokes (Zhou & Sanfilippo, 2023) (demonstrated potentially biased content for human review and correction.
in Appendix B). Such instances may not be sufficiently We believe that a key to the long-term efficacy of these
addressed by existing guardrail systems. guardrails is the adoption of a continuous learning approach.
This involves regularly updating the model with new data,
Protection via LLMs Enhancement Many studies have
insights, and feedback and adapting to evolving societal
concentrated on reducing bias through model adaption ap-
norms and values. This dynamic process ensures that the
proaches. Limisiewicz et al. (2023) provided a bias mitigat-
guardrails against bias remain robust and relevant. More-
ing method, DAMA, that can reduce bias while maintaining
over, the above issues can and should be addressed with a
model performance on downstream tasks. Ranaldi et al.
multidisciplinary team, as discussed in Section 4.2. Also,
(2023) investigated the bias in CtB-LLMs and demonstrate
similar to the discussion in Section 3.1, we believe in prin-
the effectiveness of debiasing techniques. They find that
cipled methods to evaluate fairness when the definitions are
bias is not solely dependent on the number of parameters
clearly settled. It is however expected that the definition will
but also on factors like perplexity, and that techniques like
be distribution-based, rather than point-based as unintended
debiasing of OPT using LoRA can significantly reduce bias.
responses, which need to estimate posterior distributions
Ungless et al. (2022) demonstrated that the Stereotype Con-
and to measure the distance between two distributions.
tent Model, which posits that minority groups are often
perceived as cold or incompetent, applies to contextual-
ized word embeddings and presents a successful fine-tuning 3.3. Privacy and Copyright
method to reduce such biases. Moreover, Ernst et al. (2023) Legislations such as the EU AI Act, General Data Protection
proposed a novel adversarial learning debiasing method, Regulation (GDPR), and California Consumer Privacy Act
applied during the pre-training of LLMs. Ramezani & Xu (CCPA) have established rigorous standards for data sharing
(2023) mitigated cultural bias through fine-tuning models and retention. These frameworks mandate strict compliance
on culturally relevant data. with data protection and privacy guidelines. Privacy-related
Protection via I/O Engineering In addition to fine-tuning research focuses on the risks of either leaking training data
methods, several studies exploring the control of input and or the trained model. The former includes the attacks and
output. Huang et al. (2023a) suggested to use purposely defense on e.g., determining if a data point is within the
training dataset (Shokri et al., 2017), reconstructing a train-

5
Building Guardrails for Large Language Models

ing data point from a subset of the features (Zhang et al., as we know the list of green tokens, it is easy to determine
2020), or reconstructing some of the training data (Balle if an output is watermarked or not. We can also use the
et al., 2022). The latter infers information from the model, watermarks to track the point of origin or the owner of wa-
see e.g., (Wang et al., 2021). In the following, we focus on termarked text for copyright purposes, and this has been
the privacy on the training data. applied to protect the copyright of generated prompts (Yao
et al., 2023). We believe in an agreed watermarking mech-
Vulnerability Detection LLMs face the challenges in re-
anism between the data owners and the LLMs developers,
leasing the personal identifiable information (PII) (Li et al.,
such that the users embed a personalized watermark into
2023b;a; Lukas et al., 2023; Huang et al., 2022; Wang et al.,
their documents or texts when they deem them private or
2024a), highlighting the need for caution and robust data
with copyright, and the LLMs developers will not use water-
handling protocols. They are pre-trained on extensive tex-
marked data for their training. More importantly, the LLMs
tual datasets (Narayanan et al., 2021) and can inadvertently
developers should take the responsibility of enabling (1)
reveal sensitive information about data subjects (Plant et al.,
an automatic verification to determine if a user-provided,
2022). Specifically, Li et al. (2023b) considered the risks of
watermarked text is within the training data, and (2) model
leaking personal information in e.g., text completion task
unlearning (Nguyen et al., 2022), which allows the removal
where the adversary attempts to recover private informa-
of users’ personally owned texts from training data.
tion by using tricky prompt as the prefix, and Wang et al.
(2024a) used an aggregated score to evaluate the LLM’s
privacy. Mireshghallah et al. (2023) also exhaustively tested 3.4. Hallucinations and Uncertainty
the latest ChatGPT about their capability of keeping a secret. LLMs have a notable inclination to generate hallucinations
Protection via LLMs Enhancement Numerous studies (Ji et al., 2023; Bang et al., 2023), leading to contents that
have focused on implementing privacy defense technologies deviate from real-world facts or user inputs. The hallucina-
to safeguard data and model privacy and counter privacy tions in conditional text generation are closely tied to high
breaches, with the Differential Privacy (DP) based methods model uncertainty (Huang et al., 2023b). The absence of
(Abadi et al., 2016) as the most studied. For general NLP uncertainty measures for LLMs significantly hampers the
models, Li et al. (2022) indicated that a direct application of reliability of information generated by LLMs.
DP-SGD (Abadi et al., 2016) may not achieve satisfactory Vulnerability Detection Chen & Shu (2023) first identified
performance, and suggests a few tricks. Igamberdiev & the challenges in detecting the misinformation in ChatGPT,
Habernal (2023) implemented a model for text rewriting resulting in a growing number of research to explore the
along with Local Differential Privacy (LDP), both with and factual hallucination that is inconsistent with real-world
without pretraining. For LLMs, the focus has been on the facts. Chern et al. (2023) proposed a cohesive framework
integration of DP into the fine-tuning process (Yu et al., by utilizing a range of external tools for gathering evidence
2022; Shi et al., 2022; Mireshghallah et al., 2022). Other to identify factual inaccuracies. Some methods aim to de-
than DP-based methods which deal with general differential tect hallucinations without relying on external sources by
privacy, Xiao et al. (2023) considered contextual privacy, focusing on the model’s uncertainty in generating factual
which measures the sensitivity of a piece of information content. Manakul et al. (2023) proposed to identify hallu-
upon the context, and injects domain-specific knowledge cinations by generating multiple responses and evaluating
into the fine-tuning process. the consistency of factual statements. Apart from evalu-
Protection via I/O Engineering Ozdayi et al. (2023) pro- ating uncertainty through the self-consistency of multiple
posed a method to prepend a trained prompt to the incoming generations from a single LLM, one can adopt a multi-
prompt before passing them to the model, where the training agent approach by including additional LLMs (Cohen et al.,
of the prefix prompt is to minimise the extent of extractable 2023). Worsely, Xu et al. (2024), claim that LLMs cannot
memorized content in the model. Li et al. (2023c) and Duan completely eliminate hallucinations. They define a formal
et al. (2023) also proposed the prompt-tuning methodology world where hallucination is characterized as inconsisten-
that adhere to differential privacy principles. cies between computable LLMs and a computable ground
truth function.
Our Perspective Other than constructing privacy-
preserving LLMs, watermarking techniques can play a Protection via LLMs Enhancement Meng et al. (2022b)
more important role in LLMs, for not only privacy but also proposed to mitigating data-related hallucinations in LLMs
copyright protection. A typical watermarking mechanism by increasing the amount of factual data during the pre-
(Kirchenbauer et al., 2023) embedded watermarks into the training phase, and this proposal was later refined by Meng
output of LLMs by selecting a randomized set of “green” et al. (2022a). Modifying the training dataset can partially re-
tokens before a word is generated, and then softly promot- duce the model’s knowledge gap effectively. Besides, Liang
ing the use of green tokens during sampling. So, as long et al. (2024) developed an automated hallucination annota-

6
Building Guardrails for Large Language Models

tion tool, DreamCatcher, and proposing a Reinforcement 2008) could enhance the evaluation of fairness, creativity,
Learning from Knowledge Feedback training framework, and privacy of LLMs in generating questions by considering
effectively improving their performance in tasks related to the uncertainty level and all possible responses.
factuality and honesty. Wang et al. (2024b) introduced the
ReCaption framework, which combines rewriting captions 4. Challenges on Designing Guardrails
using ChatGPT with fine-tuning large vision-language mod-
els, successfully reducing fine-grained object hallucinations Based on the discussions about tackling individual require-
in LVLMs More related works can be found in (Tonmoy ments as discussed in Section 3, this section advocates the
et al., 2024). building of a guardrail by considering multiple requirements
in a systematic way. We discuss four topics: conflicting
Protection via I/O Engineering Apart from the refining
requirements (Section 4.1), multidisciplinary approach (Sec-
methods, Pinter & Elhadad (2023) found that these methods
tion 4.2), implementation strategy (Section 4.3), and rigor-
might pose potential risks when trying to combat LLMs
ous engineering process (Section 4.4).
hallucinations. They recommend using retrieval-augmented
methods, which seek to add external knowledge acquired
from retrieval directly to the LLMs’ prompt (He et al., 2022; 4.1. Conflicting Requirements
Press et al., 2022; Ram et al., 2023). Based on the Chain- This section discusses the tension between safety and in-
of-Thought technology, Dhuliawala et al. (2023) introduced telligence, as an example for the conflicting requirements.
the “Chain-of-Verification” method to effectively reduce Conflicting requirements are typical, including e.g., fairness
the generation of inaccurate information in LLMs. Wang and privacy (Xiang, 2022), privacy and robustness (Song
et al. (2023b) then proposed a faithful knowledge distilla- et al., 2019), robustness and XAI (Huang et al., 2023c),
tion method that significantly enhances the credibility and and robustness and fairness (Bassi et al., 2024). The inte-
accuracy of LLMs. Zhao et al. (2023) proposed a Verify-and- gration of guardrails with LLMs may lead to a discernible
Edit framework based on GPT-3, which enhances the factual conservative shift in the generation of responses to open-
accuracy of predictions in open-domain question-answering ended text-generation questions (Röttger et al., 2023). The
tasks. Additionally, Gao et al. (2023) pioneered the “Retrofit shift has been witnessed in ChatGPT over time. Chen et al.
Attribution using Research and Revision” system, which (2023) documented a notable change in ChatGPT’s perfor-
improves the outputs by automatically attributing and post- mance between March and June 2023. Specifically, when
editing generated text to correct inaccuracies. responding to sensitive queries, the model’s character count
Our Perspective As suggested earlier, uncertainty can be decreased significantly, plummeting from an excess of 600
utilized to deal with hallucinations. The primary challenges characters to approximately 140. Additionally, in the con-
of LLMs uncertainty stem from the critical roles of meaning text of opinion-based questions and answers surveys, the
and form in language. This relates to what linguists and model is more inclined to abstain from responding.
philosophers refer to as a sentence’s semantic content and Given the brevity and conservativeness of responses gen-
its syntactic or lexical structure. Foundation models primar- erated by ChatGPT, it raises the question: How can ex-
ily produce token-likelihoods, indicating lexical confidence. ploratory depth be maintained in responses, particularly
However, in most applications, it is the meanings that are for open-ended test generation tasks? Furthermore, does
of paramount importance. Kuhn et al. (2022) presented the the application of guardrails constrain ChatGPT’s capac-
concept of semantic entropy, an entropy that integrates lin- ity to deliver more intuitive responses? On the other hand,
guistic invariances brought about by the same meaning. The Narayanan & Kapoor (2023) critically examined this paper,
fundamental method involves a semantic equivalence rela- and emphasized the difference between an LLM’s capa-
tion to express that two sentences have the same meaning. bilities and its behavior. In psychological studies (Michie
In addition, we need to consider the uncertainty of the mea- et al., 2011), behaviour is believed to be determined by not
surements of LLM. For example, for the assessment of only capability (refer to knowledge, skills, etc) but also
toxicity levels, there are quantitative methods like tracking opportunity for external factors and motivation for inter-
the frequency of toxic words or using sentiment analysis nal processes. In the context of LLMs, the opportunity
scores, and qualitative approaches such as evaluations by includes social norms and cultural practices that need to be
experts. It is crucial to verify that these metrics are consis- taken care of by the guardrails. Although capabilities typi-
tent and applicable across a variety of contexts and content cally remain constant, behavior can alter due to fine-tuning,
types. We also highlight the need to account for the inherent which can be interpreted as the “uncertainty” challenges in
uncertainty of LLMs, an aspect not sufficiently addressed in LLMs. They suggest that changes in GPT-4’s performance
previous guardrail designs. Incorporating uncertainty mea- are likely linked more to evaluation data and fine-tuning
surements such as conformal predictions (Shafer & Vovk, methods rather than a decline in its fundamental abilities.
They also acknowledge that such behavioral drift poses a

7
Building Guardrails for Large Language Models

challenge in developing reliable chatbot products. The adop- be different across different LLMs, and research is needed
tion of guardrail has also led to the model adopting a more to scientifically determine requirements. The above chal-
succinct communication approach, thereby offering fewer lenges (multiple categories, domain-specific, and potentially
details or electing for non-response in certain queries. The conflicting requirements) are compounded by the fact that
decision of “to do or not to do” can be a challenging task many requirements, such as fairness and toxicity, are hard to
when designing the guardrail. While the easiest approach be precisely defined, especially without a concrete context.
is to decline an answer to any sensitive questions, is it the The existing methods, such as the popular one that sets a
most intelligent one? That is, we need to determine if the threshold on predictive toxicity level (Perez et al., 2022), do
application of guardrail always has a positive impact on not have valid justification and assurance.
LLMs that is within our expectation.
Our Perspective Developing LLMs ethically involves ad-
Our Perspective For the safety and intelligence tension, hering to principles such as fairness, accountability, and
prior research has suggested to incorporate a creativity as- transparency. These principles ensure that LLMs do not
sessment mechanism into the guardrail development for perpetuate biases or cause unintended harm. The works
LLMs. To measure the creativity capability of LLMs, by e.g., Sun et al. (2023) and Ovalle et al. (2023) provide
Chakrabarty et al. (2023) employed the Consensual Assess- insights into how these principles can be operationalized in
ment Technique (Amabile, 1982), a well-regarded approach the context of LLMs. Establishing community standards
in creativity evaluation, focusing on several key aspects: is vital for the responsible development of LLMs. These
fluency, flexibility, originality, and elaboration, which col- standards, derived from a consensus among stakeholders,
lectively contribute to a comprehensive understanding of the including developers, users, and those impacted by AI, can
LLMs’ creative output in storytelling. Narayanan & Kapoor guide the ethical development and deployment of LLMs.
(2023) showed that although some LLMs may demonstrate They ensure that LLMs are aligned with societal values and
adeptness in specific aspects of creativity, there is a signif- ethical norms, as discussed in broader AI ethics literature
icant gap between their capabilities and human expertise (ActiveFence, 2023). Moreover, the ethical development of
when evaluated comprehensively. We also need to assess LLMs is not a one-time effort but requires ongoing evalu-
which requirements are critical and which can be adjusted ation and refinement. This involves regular assessment of
or compromised for different tasks and contexts. While LLMs outputs, updating models to reflect changing societal
these conflicts may not be entirely resolvable, particularly norms, and incorporating feedback from diverse user groups
within a general framework applicable across various con- to ensure that LLMs remain fair and unbiased.
texts, more targeted approaches in specific scenarios might
Socio-technical theory (Trist & Bamforth, 1957), in which
offer better chance of conflict resolution. Such approaches
both ‘social’ and ‘technical’ aspects are brought together
demand ongoing research to develop concrete principles,
and treated as interdependent parts of a complex system,
methods, and standards that a multidisciplinary team can
have been promoted (Filgueiras et al., 2023; Jr. et al., 2020)
implement and adhere to. Guardrails, while effective in
for machine learning to deal with properties related to hu-
particular situations, are not a universal solution capable of
man and societal values, including e.g., fairness (Dolata
addressing all potential conflicts. Instead, they should be
et al., 2022), biases (Schwartz et al., 2022), and ethics (Mbi-
designed to manage specific, well-defined scenarios.
azi et al., 2023). To manage the complexity, the whole
system approach (Crabtree et al., 2011), which promotes
4.2. Multidisciplinary Approach an ongoing and dynamic way of working and enables local
While current LLMs guardrails include mechanisms to de- stakeholders to come together for an integrated solution, has
tect harmful contents, they still pose a risk of generating been successfully working on healthcare systems (Brand
biased or misleading responses. It is reasonable to expect et al., 2017). We believe, a multi-disciplinary group of ex-
the future guardrails to integrate not only harm detections perts will work out, and rightly justify and validate, the
but also other mechanisms to deal with, e.g., ethics, fairness, concrete requirements for a specific context, by applying
and creativity. We have provided in the Introduction three the socio-technical theory and the whole system approach.
categories of requirements to be considered for a guardrail.
Moreover, LLMs may not be universally effective across all 4.3. Neural-Symbolic Approach for Implementation
domains, and it has been a trend to consider domain-specific
Existing guardrail frameworks such as those introduced in
LLMs (Pal et al., 2023). In domain-specific scenarios, spe-
Section 2 employ a language (such as RAIL or Colang)
cialized rules may conflict with the general principles. For
to describe the behavior of a guardrail. A set of rules and
instance, in crime prevention, the use of certain terminolo-
guidelines are expressed with the language, such that each
gies that are generally perceived as harmful, such as ‘guns’
of them is applied independently. It is unclear if and how
or ‘crime’, is predominant and should not be precluded.
such a mechanism can be used to deal with more complex
To this end, the concrete requirements for guardrails will

8
Building Guardrails for Large Language Models

cases where the rules and guidelines have conflicts. As set of non-dominated solutions, where no solution is better
mentioned in Section 4.2, such complex cases are common than others across all objectives that are considered. Some
in building guardrails. Moreover, it is unclear if they are efforts have been taken, e.g., (Huang et al., 2023c) adapts
sufficiently flexible, and capable of adapting, to semantic an evolutionary algorithm to find Pareto front for robust-
shifts over time and across different scenarios and datasets. ness and XAI. Statistical certification involves using sta-
tistical methods to ensure that a single requirement meets
Our Perspective First, a principled approach is needed to
a specified standard with a certain level of confidence. It
resolve conflicts in requirements, as suggested in (van Lam-
is typically applied when there is uncertainty in the mea-
sweerde et al., 1998) for requirement engineering, which
surements or when the requirement is subject to variability.
is based on the combination of logic and decision theory.
Combining these techniques can find the trade-offs, provide
Second, a guardrail requires the cooperation of symbolic
confidence in the viability of solutions with respect to indi-
and learning-based methods. For example, we may expect
vidual requirements, and support more informed and adap-
that, the learning agents deal with the frequently-seen cases
tive decision-making processes. Attention should also be
(where there are plenty of data) to improve the overall per-
paid to understanding the theoretical limits of the evaluation
formance w.r.t. the above-mentioned requirements, and the
methods. For example, it is known that different verification
symbolic agents take care of the rare cases (where there
methods will provide different levels of guarantees on their
are few or no data) to improve the performance in dealing
results, with (”davidad” Dalrymple et al., 2024) defining 11
with corner cases in an interpretable way. In general, before
levels (0-10), e.g., the commonly applied attacks are only
we can confirm, and reliably evaluate, the cognitive ability
at level-1, some testing methods (Sun et al., 2019; Wicker
of learning agents, the symbolic agents can embed human-
et al., 2018) are at level-5 or level-6, and methods based on
like cognition (e.g., the analogical connections between
sampling with global optimisation guarantees or statistical
concepts in similar abstract contexts) through structures
guarantees such as (Cohen et al., 2019; Dong et al., 2023;
such as knowledge graphs. Not only can they improve the
Ruan et al., 2018; Wang et al., 2023a) are at between level-7
guardrails’ capability, but they also enable the end users with
and level-9. Last but not least, safety argument (Zhao et al.,
more explainability, which is important due to the guardrails’
2020; Dong et al., 2023) will be needed to not only structure
responsibility in providing safety and trust to AI. Due to the
the reasoning and evidence collection but also ensure the
complex conflict resolution methods, more closely-coupled
communication with the stakeholders.
neural-symbolic methods might be needed to deal with the
tension between effective learning and sound reasoning,
such as those Type-6 systems (Lamb et al., 2021) that can 5. Conclusion
deal with true symbolic reasoning inside a neural engine,
This paper advocates for a systematic approach to build-
e.g., Pointer Networks (Vinyals et al., 2015).
ing guardrails, beyond the current solutions which only
offer the simplest mechanisms to describe rules and connect
4.4. Systems Development Life Cycle (SDLC) learning and symbolic components. Guardrails are highly
The criticality of guardrails requires a careful engineering complex due to their role of managing interactions between
process to be applied, and for this, a revisit of the SDLC, LLMs with humans. A systematic approach, supported by a
which is a complex project management model to encom- multidisciplinary team, can fully consider and manage the
pass guardrail creation from its initial idea to its finalized complexity and provide assurance to the final product.
deployment and maintenance has the potential, and the V-
model (Oppermann, 2023), which builds the relations of Acknowledgements
each development process with its testing activities, can be
useful to ensure the quality of the final product. This project has received funding from The Alan Turing
Institute under grant agreement No ARC-001 and the U.K.
Our Perspective Rigorous verification and testing will be EPSRC through End-to-End Conceptual Guarding of Neural
needed (Huang et al., 2023d), which requires a comprehen- Architectures [EP/T026995/1].
sive set of evaluation methods. For individual requirements,
certification with statistical guarantees can be useful, such as
the randomized smoothing (Cohen et al., 2019) and global Impact Statement
robustness (Dong et al., 2023). For the evaluation of multi- This paper shares our views about how to build a responsi-
ple, conflicting requirements, a combination of the Pareto ble safeguarding mechanism for Large Language Models
front based evaluation methods for multiple requirements (LLMs), a generative AI technique. In this sense, it holds
(Ngatchou et al., 2005) and the statistical certification for a positive societal impacts. Nevertheless, to expose the prob-
single requirement is needed. The Pareto front, a concept lems, the paper also includes example questions and model
from the field of multi-objective optimization, represents a outputs that may be perceived as offensive.

9
Building Guardrails for Large Language Models

References Birhane, A., Kasirzadeh, A., Leslie, D., and Wachter, S.


Science in the age of large language models. Nature
Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B.,
Reviews Physics, 5(5):277–280, May 2023. ISSN 2522-
Mironov, I., Talwar, K., and Zhang, L. Deep learn-
5820. doi: 10.1038/s42254-023-00581-4.
ing with differential privacy. In Proceedings of the
2016 ACM SIGSAC Conference on Computer and Com- Brand, S., Thompson Coon, J., Fleming, L., Carroll, L.,
munications Security, CCS ’16, pp. 308–318, New Bethel, A., and Wyatt, K. Whole-system approaches to
York, NY, USA, 2016. Association for Computing improving the health and wellbeing of healthcare workers:
Machinery. ISBN 9781450341394. doi: 10.1145/ A systematic review. PLoS ONE, 12(12):e0188418, 2017.
2976749.2978318. URL https://doi.org/10. doi: 10.1371/journal.pone.0188418.
1145/2976749.2978318.
Burgess, M. The hacking of chatgpt is just getting started.
ActiveFence. Llm safety review: Benchmarks and analysis. Wired, available at: www. wired. com/story/chatgpt-
https://www.activefence.com/, 2023. jailbreak-generative-ai-hacking, 2023.
Albert, A. jailbreakchat., 2024. URL https://www. Chakrabarty, T., Laban, P., Agarwal, D., Muresan, S.,
jailbreakchat.com. Accessed: 2024-01-06. and Wu, C.-S. Art or artifice? large language mod-
Amabile, T. M. Social psychology of creativity: A consen- els and the false promise of creativity. arXiv preprint
sual assessment technique. Journal of personality and arXiv:2309.14556, 2023.
social psychology, 43(5):997, 1982.
Chen, C. and Shu, K. Can llm-generated misinformation be
Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., detected? arXiv preprint arXiv:2309.13788, 2023.
Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma,
Chen, L., Zaharia, M., and Zou, J. How is chatgpt’s behavior
N., et al. A general language assistant as a laboratory for
changing over time? arXiv preprint arXiv:2307.09009,
alignment. arXiv preprint arXiv:2112.00861, 2021.
2023.
Badyal, N., Jacoby, D., and Coady, Y. Intentional biases
in llm responses. In 2023 IEEE 14th Annual Ubiqui- Chern, I., Chern, S., Chen, S., Yuan, W., Feng, K., Zhou,
tous Computing, Electronics & Mobile Communication C., He, J., Neubig, G., Liu, P., et al. Factool: Factuality
Conference (UEMCON), pp. 0502–0506. IEEE, 2023. detection in generative ai–a tool augmented framework
for multi-task and multi-domain scenarios. arXiv preprint
Balle, B., Cherubin, G., and Hayes, J. Reconstructing train- arXiv:2307.13528, 2023.
ing data with informed adversaries. In 2022 IEEE Sympo-
sium on Security and Privacy (SP), pp. 1138–1156, 2022. Christian, J. Amazing “jailbreak” bypasses chatgpt’s ethics
doi: 10.1109/SP46214.2022.9833677. safeguards. Futurism, February, 4:2023, 2023.

Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, Chuang, Y.-S., Xie, Y., Luo, H., Kim, Y., Glass, J., and
B., Lovenia, H., Ji, Z., Yu, T., Chung, W., et al. A multi- He, P. Dola: Decoding by contrasting layers improves
task, multilingual, multimodal evaluation of chatgpt on factuality in large language models. arXiv preprint
reasoning, hallucination, and interactivity. arXiv preprint arXiv:2309.03883, 2023.
arXiv:2302.04023, 2023.
Cohen, J., Rosenfeld, E., and Kolter, Z. Certified ad-
Bassi, P. R. A. S., Dertkigil, S. S. J., and Cavalli, A. Im- versarial robustness via randomized smoothing. In
proving deep neural network generalization and robust- Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceed-
ness to background bias via layer-wise relevance propa- ings of the 36th International Conference on Machine
gation optimization. Nature Communications, 15(1):291, Learning, volume 97 of Proceedings of Machine Learn-
2024. doi: 10.1038/s41467-023-44371-z. URL https: ing Research, pp. 1310–1320. PMLR, 09–15 Jun 2019.
//doi.org/10.1038/s41467-023-44371-z. URL https://proceedings.mlr.press/v97/
cohen19c.html.
Bayat, F. F., Qian, K., Han, B., Sang, Y., Belyi, A., Khor-
shidi, S., Wu, F., Ilyas, I. F., and Li, Y. Fleek: Factual er- Cohen, R., Hamri, M., Geva, M., and Globerson, A. Lm vs
ror detection and correction with evidence retrieved from lm: Detecting factual errors via cross examination. arXiv
external knowledge. arXiv preprint arXiv:2310.17119, preprint arXiv:2305.13281, 2023.
2023.
Crabtree, B. F., Miller, W. L., and Stange, K. C. The chronic
Bi, G., Shen, L., Xie, Y., Cao, Y., Zhu, T., and He, X. A care model and diabetes management in us primary care
group fairness lens for large language models. arXiv settings: A systematic review. Diabetes Care, 34(4):
preprint arXiv:2312.15478, 2023. 1058–1063, 2011. doi: 10.2337/dc10-1145.

10
Building Guardrails for Large Language Models

”davidad” Dalrymple, D., Skalse, J., Bengio, Y., Russell, Gallegos, I. O., Rossi, R. A., Barrow, J., Tanjim, M. M.,
S., Tegmark, M., Seshia, S., Omohundro, S., Szegedy, Kim, S., Dernoncourt, F., Yu, T., Zhang, R., and Ahmed,
C., Goldhaber, B., Ammann, N., Abate, A., Halpern, N. K. Bias and fairness in large language models: A
J., Barrett, C., Zhao, D., Zhi-Xuan, T., Wing, J., and survey. arXiv preprint arXiv:2309.00770, 2023.
Tenenbaum, J. Towards guaranteed safe ai: A framework
for ensuring robust and reliable ai systems, 2024. Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y.,
Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse,
Deng, B., Wang, W., Feng, F., Deng, Y., Wang, Q., and K., et al. Red teaming language models to reduce harms:
He, X. Attack prompt generation for red teaming Methods, scaling behaviors, and lessons learned. arXiv
and defending large language models. arXiv preprint preprint arXiv:2209.07858, 2022.
arXiv:2310.12505, 2023.
Gao, L., Dai, Z., Pasupat, P., Chen, A., Chaganty, A. T.,
Fan, Y., Zhao, V., Lao, N., Lee, H., Juan, D.-C., et al.
Dhuliawala, S., Komeili, M., Xu, J., Raileanu, R., Li, X.,
Rarr: Researching and revising what language models
Celikyilmaz, A., and Weston, J. Chain-of-verification
say, using language models. In Proceedings of the 61st
reduces hallucination in large language models. arXiv
Annual Meeting of the Association for Computational
preprint arXiv:2309.11495, 2023.
Linguistics (Volume 1: Long Papers), pp. 16477–16508,
Dolata, M., Feuerriegel, S., and Schwabe, G. A sociotech- 2023.
nical view of algorithmic fairness. Information Systems Gehman, S., Gururangan, S., Sap, M., Choi, Y., and
Journal, 32(4):754–818, 2022. doi: https://doi.org/10. Smith, N. A. Realtoxicityprompts: Evaluating neural
1111/isj.12370. URL https://onlinelibrary. toxic degeneration in language models. arXiv preprint
wiley.com/doi/abs/10.1111/isj.12370. arXiv:2009.11462, 2020.
Dong, Y., Huang, W., Bharti, V., Cox, V., Banks, A., Goldstein, J. A., Sastry, G., Musser, M., DiResta, R.,
Wang, S., Zhao, X., Schewe, S., and Huang, X. Re- Gentzel, M., and Sedova, K. Generative language
liability assessment and safety arguments for machine models and automated influence operations: Emerg-
learning components in system assurance. ACM Trans. ing threats and potential mitigations. arXiv preprint
Embed. Comput. Syst., 22(3), apr 2023. ISSN 1539-9087. arXiv:2301.04246, 2023.
doi: 10.1145/3570918. URL https://doi.org/10.
1145/3570918. Gonçalves, G. and Strubell, E. Understanding the effect
of model compression on social bias in large language
Duan, H., Dziedzic, A., Papernot, N., and Boenisch, F. models. arXiv preprint arXiv:2312.05662, 2023.
Flocks of stochastic parrots: Differentially private prompt
learning for large language models. arXiv preprint Gupta, S., Shrivastava, V., Deshpande, A., Kalyan, A., Clark,
arXiv:2305.15594, 2023. P., Sabharwal, A., and Khot, T. Bias runs deep: Implicit
reasoning biases in persona-assigned llms. arXiv preprint
Dwivedi, S., Ghosh, S., and Dwivedi, S. Breaking the bias: arXiv:2311.04892, 2023.
Gender fairness in llms using prompt engineering and in- He, H., Zhang, H., and Roth, D. Rethinking with retrieval:
context learning. Rupkatha Journal on Interdisciplinary Faithful large language model inference. arXiv preprint
Studies in Humanities, 15(4), 2023. arXiv:2301.00303, 2022.
Elaraby, M., Lu, M., Dunn, J., Zhang, X., Wang, Y., and Liu, Huang, D., Bu, Q., Zhang, J., Xie, X., Chen, J., and Cui,
S. Halo: Estimation and reduction of hallucinations in H. Bias assessment and mitigation in llm-based code
open-source weak large language models. arXiv preprint generation. arXiv preprint arXiv:2309.14345, 2023a.
arXiv:2308.11764, 2023.
Huang, J., Shao, H., and Chang, K. C.-C. Are
Ernst, J. S., Marton, S., Brinkmann, J., Vellasques, E., Fou- large pre-trained language models leaking your per-
card, D., Kraemer, M., and Lambert, M. Bias mitiga- sonal information? In Goldberg, Y., Kozareva,
tion for large language models using adversarial learning. Z., and Zhang, Y. (eds.), Findings of the Associa-
2023. tion for Computational Linguistics: EMNLP 2022,
pp. 2038–2047, Abu Dhabi, United Arab Emirates,
Filgueiras, F., Mendonca, R., and Almeida, V. Governing December 2022. Association for Computational Lin-
artificial intelligence through a sociotechnical lens. IEEE guistics. doi: 10.18653/v1/2022.findings-emnlp.
Internet Computing, 27(05):49–52, sep 2023. ISSN 1941- 148. URL https://aclanthology.org/2022.
0131. doi: 10.1109/MIC.2023.3310110. findings-emnlp.148.

11
Building Guardrails for Large Language Models

Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers,
Chen, Q., Peng, W., Feng, X., Qin, B., et al. A survey I., and Goldstein, T. A watermark for large language
on hallucination in large language models: Principles, models. In Krause, A., Brunskill, E., Cho, K., En-
taxonomy, challenges, and open questions. arXiv preprint gelhardt, B., Sabato, S., and Scarlett, J. (eds.), Pro-
arXiv:2311.05232, 2023b. ceedings of the 40th International Conference on Ma-
chine Learning, volume 202 of Proceedings of Machine
Huang, W., Zhao, X., Jin, G., and Huang, X. Safari: Ver-
Learning Research, pp. 17061–17084. PMLR, 23–29 Jul
satile and efficient evaluations for robustness of inter-
2023. URL https://proceedings.mlr.press/
pretability. In Proceedings of the IEEE/CVF International
v202/kirchenbauer23a.html.
Conference on Computer Vision (ICCV), pp. 1988–1998,
October 2023c. Koh, N. H., Plata, J., and Chai, J. Bad: Bias detection
Huang, X., Kwiatkowska, M., Wang, S., and Wu, M. Safety for large language models in the context of candidate
verification of deep neural networks. In Majumdar, R. and screening. arXiv preprint arXiv:2305.10407, 2023.
Kunčak, V. (eds.), Computer Aided Verification, pp. 3–29, Kreps, S., McCain, R. M., and Brundage, M. All the news
Cham, 2017. Springer International Publishing. ISBN that’s fit to fabricate: Ai-generated text as a tool of media
978-3-319-63387-9. misinformation. Journal of experimental political science,
Huang, X., Ruan, W., Huang, W., Jin, G., Dong, Y., Wu, C., 9(1):104–117, 2022.
Bensalem, S., Mu, R., Qi, Y., Zhao, X., et al. A survey
Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty:
of safety and trustworthiness of large language models
Linguistic invariances for uncertainty estimation in natu-
through the lens of verification and validation. arXiv
ral language generation. In International Conference on
preprint arXiv:2305.11391, 2023d.
Learning Representations, 2022.
Igamberdiev, T. and Habernal, I. Dp-bart for privatized text
rewriting under local differential privacy. arXiv preprint Kumar, A., Agarwal, C., Srinivas, S., Feizi, S., and
arXiv:2302.07636, 2023. Lakkaraju, H. Certifying llm safety against adversarial
prompting. arXiv preprint arXiv:2309.02705, 2023.
Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K.,
Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testug- Lamb, L. C., d’Avila Garcez, A., Gori, M., Prates, M. O.,
gine, D., et al. Llama guard: Llm-based input-output Avelar, P. H., and Vardi, M. Y. Graph neural networks
safeguard for human-ai conversations. arXiv preprint meet neural-symbolic computing: a survey and perspec-
arXiv:2312.06674, 2023. tive. In Proceedings of the Twenty-Ninth International
Joint Conference on Artificial Intelligence, IJCAI’20,
Jain, N., Schwarzschild, A., Wen, Y., Somepalli, G., 2021. ISBN 9780999241165.
Kirchenbauer, J., Chiang, P.-y., Goldblum, M., Saha, A.,
Geiping, J., and Goldstein, T. Baseline defenses for ad- Li, H., Chen, Y., Luo, J., Kang, Y., Zhang, X., Hu,
versarial attacks against aligned language models. arXiv Q., Chan, C., and Song, Y. Privacy in large lan-
preprint arXiv:2309.00614, 2023. guage models: Attacks, defenses and future directions.
CoRR, abs/2310.10383, 2023a. doi: 10.48550/ARXIV.
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E.,
2310.10383. URL https://doi.org/10.48550/
Bang, Y. J., Madotto, A., and Fung, P. Survey of halluci-
arXiv.2310.10383.
nation in natural language generation. ACM Computing
Surveys, 55(12):1–38, 2023. Li, H., Guo, D., Fan, W., Xu, M., Huang, J., Meng, F.,
Jr., D. M., Prabhakaran, V., Kuhlberg, J., Smart, A., and and Song, Y. Multi-step jailbreaking privacy attacks on
Isaac, W. S. Extending the machine learning abstraction chatgpt. In Bouamor, H., Pino, J., and Bali, K. (eds.),
boundary: A complex systems approach to incorporate Findings of the Association for Computational Linguis-
societal context. CoRR, abs/2006.09663, 2020. URL tics: EMNLP 2023, Singapore, December 6-10, 2023,
https://arxiv.org/abs/2006.09663. pp. 4138–4153. Association for Computational Linguis-
tics, 2023b. URL https://aclanthology.org/
Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., and 2023.findings-emnlp.272.
Hashimoto, T. Exploiting programmatic behavior of
llms: Dual-use through standard security attacks. arXiv Li, J., Ji, S., Du, T., Li, B., and Wang, T. Textbugger: Gen-
preprint arXiv:2302.05733, 2023. erating adversarial text against real-world applications.
arXiv preprint arXiv:1812.05271, 2018.
Kim, J., Derakhshan, A., and Harris, I. G. Robust safety
classifier for large language models: Adversarial prompt Li, X., Tramer, F., Liang, P., and Hashimoto, T. Large lan-
shield. arXiv preprint arXiv:2311.00172, 2023. guage models can be strong differentially private learners.

12
Building Guardrails for Large Language Models

In International Conference on Learning Representations, Mireshghallah, F., Backurs, A., Inan, H. A., Wutschitz, L.,
2022. URL https://openreview.net/forum? and Kulkarni, J. Differentially private model compression.
id=bVuP3ltATMz. Advances in Neural Information Processing Systems, 35:
29468–29483, 2022.
Li, Y., Tan, Z., and Liu, Y. Privacy-preserving prompt
tuning for large language model services. arXiv preprint Mireshghallah, N., Kim, H., Zhou, X., Tsvetkov, Y., Sap, M.,
arXiv:2305.06212, 2023c. Shokri, R., and Choi, Y. Can llms keep a secret? testing
privacy implications of language models via contextual
Liang, Y., Song, Z., Wang, H., and Zhang, J. Learn- integrity theory. arXiv preprint arXiv:2310.17884, 2023.
ing to trust your feelings: Leveraging self-awareness
in llms for hallucination mitigation. arXiv preprint Miyato, T., Dai, A. M., and Goodfellow, I. Adversarial
arXiv:2401.15449, 2024. training methods for semi-supervised text classification.
arXiv preprint arXiv:1605.07725, 2016.
Limisiewicz, T., Mareček, D., and Musil, T. Debiasing
algorithm through model adaptation. arXiv preprint Motoki, F., Pinho Neto, V., and Rodrigues, V. More human
arXiv:2310.18913, 2023. than human: Measuring chatgpt political bias. Available
at SSRN 4372349, 2023.
Liu, X., Cheng, H., He, P., Chen, W., Wang, Y., Poon, H.,
Mozes, M., He, X., Kleinberg, B., and Griffin, L. D. Use
and Gao, J. Adversarial training for large neural language
of llms for illicit purposes: Threats, prevention measures,
models. arXiv preprint arXiv:2004.08994, 2020.
and vulnerabilities. arXiv preprint arXiv:2308.12833,
Lukas, N., Salem, A., Sim, R., Tople, S., Wutschitz, L., 2023.
and Zanella-Béguelin, S. Analyzing leakage of person-
Nagireddy, M., Chiazor, L., Singh, M., and Baldini, I. So-
ally identifiable information in language models. arXiv
cialstigmaqa: A benchmark to uncover stigma amplifi-
preprint arXiv:2302.00539, 2023.
cation in generative language models. arXiv preprint
Malik, A. Evaluating large language models through gender arXiv:2312.07492, 2023.
and racial stereotypes. arXiv preprint arXiv:2311.14788, Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim,
2023. C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al.
Webgpt: Browser-assisted question-answering with hu-
Manakul, P., Liusie, A., and Gales, M. J. Selfcheckgpt: Zero-
man feedback. arXiv preprint arXiv:2112.09332, 2021.
resource black-box hallucination detection for generative
large language models. arXiv preprint arXiv:2303.08896, Narayanan, A. and Kapoor, S. Is GPT-4 getting
2023. worse over time? AI Snake Oil, July 2023.
URL https://www.aisnakeoil.com/p/
Manerba, M. M., Stańczak, K., Guidotti, R., and Augen-
is-gpt-4-getting-worse-over-time?
stein, I. Social bias probing: Fairness benchmarking
subscribe_prompt=free.
for language models. arXiv preprint arXiv:2311.09090,
2023. Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Pat-
wary, M., Korthikanti, V., Vainbrand, D., and Catanzaro,
Mbiazi, D., Bhange, M., Babaei, M., Sheth, I., and Kenfack, B. Scaling language model training to a trillion parame-
P. J. Survey on ai ethics: A socio-technical perspective, ters using megatron, 2021.
2023.
Ngatchou, P., Zarei, A., and El-Sharkawi, A. Pareto
Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating multi objective optimization. In Proceedings of the
and editing factual associations in gpt. Advances in Neu- 13th International Conference on, Intelligent Systems
ral Information Processing Systems, 35:17359–17372, Application to Power Systems, pp. 84–91, 2005. doi:
2022a. 10.1109/ISAP.2005.1599245.
Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Nguyen, T. T., Huynh, T. T., Nguyen, P. L., Liew, A. W.-C.,
Bau, D. Mass-editing memory in a transformer. arXiv Yin, H., and Nguyen, Q. V. H. A survey of machine
preprint arXiv:2210.07229, 2022b. unlearning, 2022.

Michie, S., Van Stralen, M. M., and West, R. The behaviour Nvidia. Colang. https://github.com/NVIDIA/
change wheel: a new method for characterising and de- NeMo-Guardrails/blob/main/docs/user_
signing behaviour change interventions. Implementation guides/colang-language-syntax-guide.
science, 6:1–12, 2011. md, 2023.

13
Building Guardrails for Large Language Models

Oba, D., Kaneko, M., and Bollegala, D. In-contextual bias Ramezani, A. and Xu, Y. Knowledge of cultural moral
suppression for large language models. arXiv preprint norms in large language models. arXiv preprint
arXiv:2309.07251, 2023. arXiv:2306.01857, 2023.

OpenAI, R. Gpt-4 technical report. arXiv, pp. 2303–08774, Ranaldi, L., Ruzzetti, E. S., Venditti, D., Onorati, D., and
2023. Zanzotto, F. M. A trip towards fairness: Bias and
de-biasing in large language models. arXiv preprint
Oppermann, A. What is the v-model in software arXiv:2305.13862, 2023.
development? https://builtin.com/
software-engineering-perspectives/ Razumovskaia, E., Vulić, I., Marković, P., Cichy, T., Zheng,
v-model, 2023. Accessed: 2024.2.1. Q., Wen, T.-H., and Budzianowski, P. Dial beinfo for
faithfulness: Improving factuality of information-seeking
Ovalle, A., Mehrabi, N., Goyal, P., Dhamala, J., Chang,
dialogue via behavioural fine-tuning. arXiv preprint
K.-W., Zemel, R., Galstyan, A., Pinter, Y., and Gupta,
arXiv:2311.09800, 2023.
R. Are you talking to [’xem’] or [’x’,’em’]? on tokeniza-
tion and addressing misgendering in llms with pronoun Rebedea, T., Dinu, R., Sreedhar, M., Parisien, C., and Cohen,
tokenization parity. arXiv preprint arXiv:2312.11779, J. Nemo guardrails: A toolkit for controllable and safe
2023. llm applications with programmable rails. arXiv preprint
arXiv:2310.10501, 2023.
Ozdayi, M. S., Peris, C., Fitzgerald, J., Dupuy, C., Maj-
mudar, J., Khan, H., Parikh, R., and Gupta, R. Con- Robey, A., Wong, E., Hassani, H., and Pappas, G. J. Smooth-
trolling the extraction of memorized data from large llm: Defending large language models against jailbreak-
language models via prompt-tuning. arXiv preprint ing attacks. arXiv preprint arXiv:2310.03684, 2023.
arXiv:2305.11759, 2023.
Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi,
Pal, S., Bhattacharya, M., Lee, S.-S., and Chakraborty, C.
F., and Hovy, D. Xstest: A test suite for identifying
A domain-specific next-generation large language model
exaggerated safety behaviours in large language models.
(llm) or chatgpt is required for biomedical engineering
arXiv preprint arXiv:2308.01263, 2023.
and research. Annals of Biomedical Engineering, 2023.
doi: 10.1007/s10439-023-03306-x. URL https:// Ruan, W., Huang, X., and Kwiatkowska, M. Reachability
doi.org/10.1007/s10439-023-03306-x. analysis of deep neural networks with provable guaran-
tees. In IJCAI, pp. 2651–2659, 2018.
Perez, E., Huang, S., Song, F., Cai, T., Ring, R., Aslanides,
J., Glaese, A., McAleese, N., and Irving, G. Red teaming Schwartz, R., Vassilev, A., Greene, K., Perine,
language models with language models. arXiv preprint L., Burt, A., and Hall, P. Towards a stan-
arXiv:2202.03286, 2022. dard for identifying and managing bias in ar-
Pinter, Y. and Elhadad, M. Emptying the ocean with tificial intelligence. Special Publication (NIST
a spoon: Should we edit models? arXiv preprint SP), 2022. URL https://tsapps.nist.gov/
arXiv:2310.11958, 2023. publication/get_pdf.cfm?pub_id=934464.

Plant, R., Giuffrida, V., and Gkatzia, D. You are what you Shafer, G. and Vovk, V. A tutorial on conformal prediction.
write: Preserving privacy in the era of large language Journal of Machine Learning Research, 9(3), 2008.
models. arXiv preprint arXiv:2204.09391, 2022.
Shaikh, O., Zhang, H., Held, W., Bernstein, M., and Yang,
Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., D. On second thought, let’s not think step by step!
and Lewis, M. Measuring and narrowing the com- bias and toxicity in zero-shot reasoning. arXiv preprint
positionality gap in language models. arXiv preprint arXiv:2212.08061, 2022.
arXiv:2210.03350, 2022.
Shen, X., Chen, Z., Backes, M., Shen, Y., and Zhang, Y. ”
Rajpal, S. Guardrails ai. https://www. do anything now”: Characterizing and evaluating in-the-
guardrailsai.com/, 2023. wild jailbreak prompts on large language models. arXiv
preprint arXiv:2308.03825, 2023.
Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua,
A., Leyton-Brown, K., and Shoham, Y. In-context Sheng, Y., Cao, S., Li, D., Zhu, B., Li, Z., Zhuo, D., Gonza-
retrieval-augmented language models. arXiv preprint lez, J. E., and Stoica, I. Fairness in serving large language
arXiv:2302.00083, 2023. models. arXiv preprint arXiv:2401.00588, 2023.

14
Building Guardrails for Large Language Models

Sheppard, B., Richter, A., Cohen, A., Smith, E. A., Kneese, Tonmoy, S., Zaman, S., Jain, V., Rani, A., Rawte, V.,
T., Pelletier, C., Baldini, I., and Dong, Y. Subtle misogyny Chadha, A., and Das, A. A comprehensive survey of hal-
detection and mitigation: An expert-annotated dataset. lucination mitigation techniques in large language models.
arXiv preprint arXiv:2311.09443, 2023. arXiv preprint arXiv:2401.01313, 2024.

Shi, W., Shea, R., Chen, S., Zhang, C., Jia, R., and Yu, Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
Z. Just fine-tune twice: Selective differential privacy for A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,
large language models. arXiv preprint arXiv:2204.07667, Bhosale, S., et al. Llama 2: Open foundation and fine-
2022. tuned chat models. arXiv preprint arXiv:2307.09288,
2023.
Shokri, R., Stronati, M., Song, C., and Shmatikov, V. Mem-
bership inference attacks against machine learning mod- Tramer, F., Carlini, N., Brendel, W., and Madry, A. On adap-
els. In 2017 IEEE Symposium on Security and Pri- tive attacks to adversarial example defenses. Advances in
vacy (SP), pp. 3–18, Los Alamitos, CA, USA, may Neural Information Processing Systems, 33:1633–1645,
2017. IEEE Computer Society. doi: 10.1109/SP.2017. 2020.
41. URL https://doi.ieeecomputersociety. Trist, E. L. and Bamforth, K. W. Studies in the quality of life:
org/10.1109/SP.2017.41. Delivered by the institute of personnel management in
november 1957. Lecture Series, 1957. Tavistock Institute
Song, L., Shokri, R., and Mittal, P. Privacy risks of securing
of Human Relations.
machine learning models against adversarial examples.
In Proceedings of the 2019 ACM SIGSAC Conference on Ungless, E. L., Rafferty, A., Nag, H., and Ross, B. A robust
Computer and Communications Security, CCS ’19, pp. bias mitigation procedure based on the stereotype content
241–257, New York, NY, USA, 2019. Association for model. arXiv preprint arXiv:2210.14552, 2022.
Computing Machinery. ISBN 9781450367479. doi: 10.
1145/3319535.3354211. URL https://doi.org/ van Lamsweerde, A., Darimont, R., and Letier, E. Managing
10.1145/3319535.3354211. conflicts in goal-driven requirements engineering. IEEE
Transactions on Software Engineering, 24(11):908–926,
Sun, H., Pei, J., Choi, M., and Jurgens, D. Aligning 1998. doi: 10.1109/32.730542.
with whom? large language models have gender and
racial biases in subjective nlp tasks. arXiv preprint Vega, J., Chaudhary, I., Xu, C., and Singh, G. Bypassing the
arXiv:2311.09730, 2023. safety training of open-source llms with priming attacks.
arXiv preprint arXiv:2312.12321, 2023.
Sun, S. and Ruan, W. TextVerifier: Robustness veri-
Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks.
fication for textual classifiers with certifiable guaran-
In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M.,
tees. In Rogers, A., Boyd-Graber, J., and Okazaki,
and Garnett, R. (eds.), Advances in Neural Information
N. (eds.), Findings of the Association for Com-
Processing Systems, volume 28. Curran Associates, Inc.,
putational Linguistics: ACL 2023, pp. 4362–4380,
2015. URL https://proceedings.neurips.
Toronto, Canada, July 2023. Association for Computa-
cc/paper_files/paper/2015/file/
tional Linguistics. doi: 10.18653/v1/2023.findings-acl.
29921001f2f04bd3baee84a12e98098f-Paper.
267. URL https://aclanthology.org/2023.
pdf.
findings-acl.267.
Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang,
Sun, Y., Huang, X., Kroening, D., Sharp, J., Hill, M., and C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong,
Ashmore, R. Structural test coverage criteria for deep S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z.,
neural networks. ACM Trans. Embed. Comput. Syst., 18 Cheng, Y., Koyejo, S., Song, D., and Li, B. Decodingtrust:
(5s), oct 2019. ISSN 1539-9087. doi: 10.1145/3358233. A comprehensive assessment of trustworthiness in gpt
URL https://doi.org/10.1145/3358233. models. arXiv preprint arXiv: 2306.11698, 2024a.
Tang, R., Zhang, X., Lin, J., and Ture, F. What do llamas Wang, F., Xu, P., Ruan, W., and Huang, X. Towards veri-
really think? revealing preference biases in language fying the geometric robustness of large-scale neural net-
model representations. arXiv preprint arXiv:2311.18812, works. In IJCAI2023, 2023a.
2023.
Wang, K.-C., FU, Y., Li, K., Khisti, A. J., Zemel, R., and
Tao, Y., Viberg, O., Baker, R. S., and Kizilcec, R. F. Audit- Makhzani, A. Variational model inversion attacks. In
ing and mitigating cultural bias in llms. arXiv preprint Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan,
arXiv:2311.14096, 2023. J. W. (eds.), Advances in Neural Information Processing

15
Building Guardrails for Large Language Models

Systems, 2021. URL https://openreview.net/ Yeh, K.-C., Chi, J.-A., Lian, D.-C., and Hsieh, S.-K. Eval-
forum?id=c0O9vBVSvIl. uating interfaced llm bias. In Proceedings of the 35th
Conference on Computational Linguistics and Speech
Wang, L., He, J., Li, S., Liu, N., and Lim, E.-P. Mitigat- Processing (ROCLING 2023), pp. 292–299, 2023.
ing fine-grained hallucination by fine-tuning large vision-
language models with caption rewrites. In International Yong, Z.-X., Menghini, C., and Bach, S. H. Low-
Conference on Multimedia Modeling, pp. 32–45. Springer, resource languages jailbreak gpt-4. arXiv preprint
2024b. arXiv:2310.02446, 2023.

Wang, P., Wang, Z., Li, Z., Gao, Y., Yin, B., and Ren, X. Yu, D., Naik, S., Backurs, A., Gopi, S., Inan, H. A., Ka-
Scott: Self-consistent chain-of-thought distillation. arXiv math, G., Kulkarni, J., Lee, Y. T., Manoel, A., Wutschitz,
preprint arXiv:2305.01879, 2023b. L., Yekhanin, S., and Zhang, H. Differentially private
fine-tuning of language models. In The Tenth Interna-
Wang, Y., Li, H., Han, X., Nakov, P., and Baldwin, T. Do- tional Conference on Learning Representations, ICLR
not-answer: A dataset for evaluating safeguards in llms. 2022, Virtual Event, April 25-29, 2022. OpenReview.net,
arXiv preprint arXiv:2308.13387, 2023c. 2022. URL https://openreview.net/forum?
id=Q42f0dfjECO.
Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken:
How does llm safety training fail? arXiv preprint Zanella-Béguelin, S., Wutschitz, L., Tople, S., Rühle, V.,
arXiv:2307.02483, 2023. Paverd, A., Ohrimenko, O., Köpf, B., and Brockschmidt,
M. Analyzing information leakage of updates to natu-
Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J.,
ral language models. In Proceedings of the 2020 ACM
Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B.,
SIGSAC conference on computer and communications
and Huang, P.-S. Challenges in detoxifying language
security, pp. 363–375, 2020.
models. arXiv preprint arXiv:2109.07445, 2021.
Zhang, Y. and Ippolito, D. Prompts should not be seen
Wicker, M., Huang, X., and Kwiatkowska, M. Feature-
as secrets: Systematically measuring prompt extraction
guided black-box safety testing of deep neural networks.
attack success. arXiv preprint arXiv:2307.06865, 2023.
In Beyer, D. and Huisman, M. (eds.), Tools and Algo-
rithms for the Construction and Analysis of Systems, pp.
Zhang, Y., Jia, R., Pei, H., Wang, W., Li, B., and Song,
408–426, Cham, 2018. Springer International Publishing.
D. The secret revealer: Generative model-inversion
ISBN 978-3-319-89960-2.
attacks against deep neural networks. In 2020 IEEE/CVF
Xiang, A. Being ’seen’ vs. ’mis-seen’: Tensions be- Conference on Computer Vision and Pattern Recognition,
tween privacy and fairness in computer vision. Har- CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp.
vard Journal of Law & Technology, 36(1), Fall 250–258. Computer Vision Foundation / IEEE, 2020.
2022. Available at SSRN: https://ssrn.com/ doi: 10.1109/CVPR42600.2020.00033. URL https:
abstract=4068921 or http://dx.doi.org/ //openaccess.thecvf.com/content_CVPR_
10.2139/ssrn.4068921. 2020/html/Zhang_The_Secret_Revealer_
Generative_Model-Inversion_Attacks_
Xiao, Y., Jin, Y., Bai, Y., Wu, Y., Yang, X., Luo, X., Yu, W., Against_Deep_Neural_Networks_CVPR_
Zhao, X., Liu, Y., Chen, H., et al. Large language models 2020_paper.html.
can be good privacy protection learners. arXiv preprint
arXiv:2310.02469, 2023. Zhao, H., Ma, C., Dong, X., Luu, A. T., Deng, Z.-H., and
Zhang, H. Certified robustness against natural language
Xie, Z. and Lukasiewicz, T. An empirical analysis of attacks by causal intervention. In Chaudhuri, K., Jegelka,
parameter-efficient methods for debiasing pre-trained lan- S., Song, L., Szepesvari, C., Niu, G., and Sabato, S.
guage models. arXiv preprint arXiv:2306.04067, 2023. (eds.), Proceedings of the 39th International Conference
on Machine Learning, volume 162 of Proceedings of
Xu, Z., Jain, S., and Kankanhalli, M. Hallucination is Machine Learning Research, pp. 26958–26970. PMLR,
inevitable: An innate limitation of large language models. 17–23 Jul 2022. URL https://proceedings.mlr.
arXiv preprint arXiv:2401.11817, 2024. press/v162/zhao22g.html.

Yao, H., Lou, J., Ren, K., and Qin, Z. Promptcare: Prompt Zhao, R., Li, X., Joty, S., Qin, C., and Bing, L. Verify-
copyright protection by watermark injection and verifica- and-edit: A knowledge-enhanced chain-of-thought frame-
tion, 2023. work. arXiv preprint arXiv:2305.03268, 2023.

16
Building Guardrails for Large Language Models

Zhao, X., Banks, A., Sharp, J., Robu, V., Flynn, D., Fisher,
M., and Huang, X. A safety framework for critical sys-
tems utilising deep neural networks. In SafeComp2020,
pp. 244–259, 2020.
Zhou, K. Z. and Sanfilippo, M. R. Public perceptions of
gender bias in large language models: Cases of chatgpt
and ernie. arXiv preprint arXiv:2309.09120, 2023.

Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Uni-


versal and transferable adversarial attacks on aligned lan-
guage models. arXiv preprint arXiv:2307.15043, 2023.

17
Building Guardrails for Large Language Models

Llama Guard Nvidia NeMo Guardrails AI


Monitoring rules ! ! !
Enforcement rules % ! !
Multi-modal support ! ! %
Output check ! % !
Scalability support – % !
Table 2. Compared Results of Guardrail Frameworks under Qualitative Analysis Dimensions
A. Comparison of Llama Guard, Nemo and Guardrails AI
We build the qualitative analysis dimensions based on the workflow of the guardrails (refer to Figure 1, Figure 2 and Figure
3), as shown in Table 2. The first factor to take into account is the capability of customizing rules for guardrails. Customized
rules are considered into two dimensions, where Monitoring rules refers to the ability to allow users to customize the
functions performed by guardrails, and Enforcement rules denotes the capacity to compel the production of predefined
content upon detection of content. It is noted here that LLama Guard only classifies the output text, but does not enforce
the output. Multi-modal support considers whether the input-output properties of the guardrail support multi-modality.
Guardrails AI can only support text-based checks. In terms of the Output check, Nemo’s output follows the flow execution
of the Colang program, but there is no further validation if it imported GPT’s generation. Scalability support demonstrates
whether the guardrail framework is applicable to the specific LLM. Llama Guard checks users’ input and LLM’s output,
and does not interact directly with LLMs, so it is not considered for this dimension. Nvidia NeMo is only available with
ChatGPT and Guardrails AI provides better scalability support.

B. Demonstration of the current challenges in ChatGPT


In this section, we showcase the negative aspects of ChatGPT’s responses, as depicted in Figure 4. These aspects include
unintended responses, biases, privacy breaches, and hallucinations. Additionally, we demonstrate the challenges faced by
current guardrailed chatbots when it comes to refusing responses and delivering overly cautious responses in Figure 5.
In Figure 4(a), when we change the input prompt to a “Hypothetical response”, ChatGPT provides a step-by-step guide
on an illegal act, such as hotwiring a car, raising significant safety concerns. In the example illustrated in Figure 4(b), an
unfair response may inadvertently come across as a joke, assuming that fairness is widely accepted or understood without
further explanation. This oversight can have negative repercussions on communities, especially children, as it perpetuates
harmful biases without adequate context or explanation. Regarding privacy leakage, we demonstrate an example in Figure
4(c), revealing that ChatGPT is unable to keep a secret within the conversation, even when we mention that the message will
be shared with all attendees. In 4(d), we observe that when ChatGPT is asked to provide references, some information in the
references can be inaccurate, raising concerns about the reliability in the scientific information.
In the context of opinion-based questions and answers surveys, the model is more inclined to abstain from responding, as
demonstrated in Figure 5. As we can see in the example presented in Figure 5 (b), ChatGPT 4 tends to decline to answer
potentially sensitive questions, even abstaining from delivering positive responses.

C. Evaluate current attack method


In this section, we showcase that ChatGPT 3.5 and 4 have successfully addressed and resolved certain state-of-the-art attack
methods. Representative examples are shown in Figures 6(a)-6(d).

18
Building Guardrails for Large Language Models

(a) Jailbreaking

(b) Fairness

(c) Privacy Leakage

(d) Hallucination

Figure 4. Harmful Response on ChatGPT 3.5

19
Building Guardrails for Large Language Models

(a) Refuse Response (b) Conservative

Figure 5. Safer or Intelligence? How to Respond

(a) Instruction-following attack (Kang et al., 2023) (b) Jailbroken (Wei et al., 2023)

(c) Jailbroken: Do Anything Now (Shen et al., 2023) (d) Hypothetical response (Wei et al., 2023)

Figure 6. Guarded Example of Attacks on ChatGPT

20

You might also like