LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models

Hayder Elesedy Pedro M. Esperança Silviu Vlad Oprea Mete Ozay

Samsung R&D Institute UK (SRUK), United Kingdom
Correspondence: {p.esperanca, m.ozay}@samsung.com

Abstract

Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling on-device content moderation.

Hayder Elesedy Pedro M. Esperança Silviu Vlad Oprea Mete Ozay Samsung R&D Institute UK (SRUK), United Kingdom Correspondence: {p.esperanca, m.ozay}@samsung.com

1 Introduction

Large Language Models (LLMs) have become increasingly competent at language generation tasks. The standard process of training LLMs involves unsupervised learning of language structure from large corpora (pre-training; Achiam et al., 2023); followed by supervised fine-tuning on specific tasks. For instance, conversational assistants are trained to provide helpful answers to user questions that are aligned with human preferences (instruction tuning; Wei et al., 2021; Ouyang et al., 2022)

Refer to caption — Figure 1: Overview of LoRA-Guard, discussed in Section 2. The generative path uses the chat model ( $W$ ) to produce a response, while the guarding path uses both the chat and guarding models ( $W$ and $\Delta W$ ) to produce a harmfulness score. The system can guard the user prompt, the model response, or their concatenation ( $\mathbin{+\mkern-10.0mu+}$ ).

Since pre-training datasets, such as Common Crawl, can contain undesirable content (Luccioni and Viviano, 2021), LLMs are able to generate such content, including offensive language and illegal advice. This known failure mode of LLMs is an unintended consequence of their ability to generate answers that are coherent to user input, to the detriment of safety (Wei et al., 2024).

To mitigate this problem, models have been optimised to not only follow instructions, but also respond in a manner that is safe, aligned with human values (safety tuning; Bai et al., 2022a, b) These chat models are still susceptible to jailbreak attacks, which evade the defences introduced by safety tuning with strategies such as using low-resource languages in prompts, refusal suppression, privilege escalation and distractions (Schulhoff et al., 2023; Dong et al., 2024b; Shen et al., 2023; Wei et al., 2024). This has motivated the development of Guardrails which monitor exchanges between chat models and users, flagging harmful entries, and are an important component of AI safety stacks in deployed systems (Dong et al., 2024a).

One research direction uses model-based guardrails (guard models) that are separate from the chat models themselves (Inan et al., 2023; Madaan, 2024)¹¹1Additional related work is introduced in Appendix A.. However, this introduces a prohibitive computational overhead in low-resource settings. Learning is also inefficient: language understanding abilities of the chat models will significantly overlap those of the guard models, if both are to effectively perform their individual tasks, response generation and content moderation, respectively.

In this paper, we propose: de-duplicating such abilities via parameter sharing between the chat and the guard models; as well as parameter-efficient fine-tuning; integrating both the chat and the guard models into what we refer to as LoRA-Guard. It uses a low-rank adapter (LoRA; Hu et al., 2021) on a backbone transformer of a chat model in order to learn the task of detecting harmful content, given examples of harmful and harmless exchanges. LoRA parameters are activated for guardrailing, and the harmfullness label is provided by a classification head. LoRA parameters are deactivated for chat usages, when the original language modelling head is employed for response generation.

Our contributions are: (1) LoRA-Guard, an efficient and moderated conversational system, performing guardrailing with parameter overheads reduced by 100-1000x vs. previous approaches, making guard model deployment feasible in resource-constrained settings (Fig. 2); (2) performance evaluations on individual datasets and zero-shot across datasets; (3) published weights for guard models.²²2A link will be provided in the camera-ready version.

Model	AUPRC $\uparrow$	Precision $\uparrow$	Recall $\uparrow$	F1 $\uparrow$	Guard Overhead $\downarrow$
ToxicChat-T5-large^(a)	$.89$	$.80$	$.85$	$.82$	$7.38\text{\times}{10}^{8}$
OpenAI Moderation^(a)	$.63$	$.55$	$.70$	$.61$	—
Llama-Guard^(b)	$.63$	—	—	—	$6.74\text{\times}{10}^{9}$
Llama-Guard-FFT^(c)	$.81$	—	—	—	$6.74\text{\times}{10}^{9}$
Llama2-7b-base-FFT^(c)	$.78$	—	—	—	$6.74\text{\times}{10}^{9}$
LoRA-Guard-TinyLlama-1.1b	$.88\;\;(.03)$	$.69\;\;(.09)$	$.90\;\;(.02)$	$.77\;\;(.06)$	$4.51\text{\times}{10}^{06}$
LoRA-Guard-Llama2-7b	$.91\;\;(.05)$	$.72\;\;(.16)$	$.87\;\;(.07)$	$.81\;\;(.08)$	$4.20\text{\times}{10}^{06}$
LoRA-Guard-Llama3-8b	$.90\;\;(.01)$	$.78\;\;(.11)$	$.90\;\;(.11)$	$.83\;\;(.02)$	$3.41\text{\times}{10}^{06}$

Table 1: Evaluation of guard models on ToxicChat (Section 3). For each metric, we show the median value across 3 training and evaluation instances, varying the random seed; in parentheses, we show the difference between the max and the min value. Guard Overhead is the number of parameters introduced by the guard model (for LoRA-Guard, that is the number of LoRA weights). FFT denotes full fine-tuning. Further metric definitions, and notes for superscripts (a)-(c), are given in Appendices B.2 and B.3.

Model	AUPRC $\uparrow$	Precision $\uparrow$	Recall $\uparrow$	F1 $\uparrow$	Guard Overhead $\downarrow$
Llama-Guard	$.82$	$.75$	$.73$	$.77$	$6.74\text{\times}{10}^{9}$
LoRA-Guard-TinyLlama-1.1b	$.83\;\;(.01)$	$.77\;\;(.03)$	$.44\;\;(.06)$	$.56\;\;(.05)$	$4.52\text{\times}{10}^{06}$
LoRA-Guard-Llama2-7b	$.83\;\;(.01)$	$.86\;\;(.05)$	$.34\;\;(.00)$	$.49\;\;(.01)$	$1.68\text{\times}{10}^{07}$
LoRA-Guard-Llama3-8b	$.82\;\;(.09)$	$.77\;\;(.08)$	$.43\;\;(.61)$	$.55\;\;(.33)$	$5.46\text{\times}{10}^{07}$

Table 2: Evaluation of guard models on OpenAIModEval (Section 3). Notations follow those from Table 2.

2 Methodology

A guard model $\mathcal{G}$ for a generative chat model $\mathcal{C}$ categorizes each input and/or corresponding output of $\mathcal{C}$ according to a taxonomy of harmfulness categories. The taxonomy could include coarse-grained categories, such as ‘safe” and “unsafe”, or could further distinguish between fine-grained categories, such as “violence”, “hate”, “illegal”, etc.

We now introduce LoRA-Guard. We assume a chat model $\mathcal{C}$ consisting of an embedding $\phi$ , a feature map $f$ and a linear language modelling head $h_{\text{chat}}$ . The embedding maps tokens to vectors; the feature map (a Transformer variant; Vaswani et al., 2017) maps these vectors into further representations; and the language modelling head maps these representations into next-token logits. If $x$ represents a tokenized input sequence, then the next token logits are computed by $h_{\text{chat}}(f(\phi(x)))$ . We propose to build the guard model $\mathcal{G}$ using parameter-efficient fine-tuning methods applied to $f$ , and instantiate this idea with LoRA adapters, which add additional training parameters in the form of low-rank (i.e. parameter-efficient) matrices (see Appendix A for details). Other adaptation methods are possible (Sung et al., 2022; He et al., 2021; Lialin et al., 2023; Houlsby et al., 2019).

The same tokenizer and embedding is used for $\mathcal{C}$ and $\mathcal{G}$ . However, $\mathcal{G}$ uses a different feature map $f^{\prime}$ chosen as LoRA adapters attached to $f$ ; and also uses a separate output head $h_{\text{guard}}$ (linear, without bias), which maps features to harmfulness categories. Tokenized content $x$ is therefore classified by $h_{\text{guard}}(f^{\prime}(\phi(x)))$ . Deactivating the LoRA adapters and using the language modelling head gives the original chat model, while activating the LoRA adapters and using the guard model head gives the guard model. These generative and guarding paths, respectively, are depicted in Figure 1. We do not merge the LoRA adapters after training.

The dual path design of LoRA-Guard, based on adaptation instead of alignment, has an important advantage over existing alternatives: since the generative task is unaffected, LoRA-Guard avoids performance degradation on the generative task, which is a common drawback of fine-tuning approaches (catastrophic forgetting; Luo et al., 2023).

Most parameters, namely those in $f$ , are shared between the generative and guarding paths. Therefore, the parameter overhead incurred by the guard model is only that of the LoRA adapters $f^{\prime}$ , and of the guard output head $h_{\text{guard}}$ . This is a tiny fraction of the number of parameters used by the chat system, often 3 orders of magnitude smaller, as will be seen in Table 2. We stress that deactivating the LoRA adapters and activating the language modelling head recovers exactly the original chat model, thereby, no loss in performance is possible.

The guard model is trained by supervised fine-tuning $f^{\prime}$ and $h_{\text{guard}}$ on a dataset labelled according to the chosen taxonomy. Datasets are discussed in Section 3.1. During training, the parameters of the chat model $f$ remain frozen. Thereby, adapters of $\mathcal{G}$ are trained to leverage existing knowledge in $\mathcal{C}$ .

3 Experiments

3.1 Setup

Models

We evaluate LoRA-Guard by training our guard adaptations with 3 different chat models: TinyLlama (Zhang et al., 2024, 1.1B-Chat-v1.0), Llama2-7b-chat (Touvron et al., 2023a), and Llama3-8B-Instruct (AI@Meta, 2024). We use the instruction tuned variants of each model to replicate their dual use as chat applications.

Datasets

We use two datasets: (1) ToxicChat consists of $10,165$ prompt-response pairs from the Vicuna online demo (Lin et al., 2023b; Chiang et al., 2023), each annotated with a binary toxicity label (toxic or not), which we use as the target class for the guard model. We train the LoRA-Guard models on the concatenation of prompt-response pairs with the formatting: user: {prompt} <newline> <newline> agent: {response} (truncated if necessary). (2) OpenAIModEval consists of $1,680$ prompts (no model responses) collected from publicly available sources, labelled according to a taxonomy with $8$ categories (Markov et al., 2023). See Appendix B.1 for data details.

Baselines

We compare LoRA-Guard with existing guard models: (1) Llama-Guard (Inan et al., 2023) fine-tunes Llama2-7b on a non-released dataset with 6 harmfulness categories (multi-class, multi-label); it outputs a text which is parsed to determine the category labels. (2) ToxicChat-T5-large (Lin et al., 2024) fine-tunes a T5-large model (Raffel et al., 2020) on the ToxicChat dataset; it outputs a text representing whether the input is toxic or not. (3) OpenAI Moderation API is a proprietary guard model, trained on proprietary data with 8 harmfulness categories (Markov et al., 2023); it outputs scores indicating its degree of belief as to whether the content falls into each of the categories (multi-class, multi-label). We provide two additional baselines: self-defence, where an LLM judges the harmfulness of content (Phute et al., 2024; Appendix D); and a linear classifier trained with the chat features only (no LoRA adaptation), termed head fine-tuning (Appendix E).

Evaluation

ToxicChat includes binary harmfulness labels. When evaluating a model that uses finer-grained harmfulness taxonomies, we consider a model output harmful when it falls into any harmfulness category. Similarly, OpenAI includes binary labels for each of 8 harmfulness categories, some missing (not all samples have labels for each category). To evaluate models that output binary labels, we conservatively binarise OpenAI labels: we consider a text harmful when it is harmful according to any category, or has missing labels.

For LoRA-Guard, we tuned the batch size, LoRA rank and epoch checkpoint using the metric: maximum median AUPRC (area under the precision-recall curve) on a validation set, median computed from 3 random training seeds times for each hyperparameter setting. When reporting results, we list the median, as well as the range (difference between max and min AUPRC value). Appendix B.2 gives training, evaluation, and metrics details.

3.2 Results

ToxicChat results are shown in Table 2 and depicted in Fig. 2. In almost all cases, LoRA-Guard outperforms baselines on AUPRC, including fully fine-tuned LLM-based guards which incur massive overheads ( $\sim 1500\times$ for LoRA-Guard-TinyLlama vs Llama-Guard-FFT). OpenAIModEval results are shown in Table 2. LoRA-Guard is competitive with alternative methods, but with a parameter overhead $100\times$ smaller compared to Llama-Guard. Appendix C provides results with different hyperparameters, for both datasets.

Cross-domain

To estimate the ability of LoRA-Guard to generalise to harmfulness domains unseen during training, we evaluated, on OpenAIModEval (OM), models trained on ToxicChat (TC), and vice-versa. TC models output one binary label, while OM models output a binary label for each of 8 harmfulness categories. As such, when training on TC and evaluating on OM, we considered an OM sample as harmful if labelled harmful according to any category, or had missing labels. Conversely, OM models output 8 binary labels, one for each OM category. When evaluating on TC, we binarise model output as follows: it indicates harmfulness if any of the 8 binary labels are set. AUPRC values are shown in Table 3; further metrics in Appendix C. Comparing Table 3(a) (train on TC, evaluate on OM) with Table 2 (train and evaluate on OM), we do not notice a drop in AUPRC larger than 0.02. However, comparing Table 3(b) (train on OM, evaluate on TC) with Table 2 (train and evaluate on TC), we notice a considerable drop in AUPRC, e.g. from 0.9 to 0.39 for LoRA-Guard-Llama3-8b vs Llama-Guard. In addition, the AUPRC range increases from 0.01 to 0.3. LoRA-Guard trained on TC seems to generalise to OM with marginal loss in performance, but not vice-versa. It could be that the type of harmfulness reflected in OM is also found in TC, but not vice versa. We consider further investigations into this. Possible explanations include: different input formats (TC contains user prompts, while OM does not); and a fragment of ToxicChat samples being engineered to act as jailbreaks (Lin et al., 2023b). Consult Tables 11 and 11 (Appendix C) for further metrics.

Model	AUPRC $\uparrow$
LoRA-Guard-TinyLlama	.80 (.01)
LoRA-Guard-Llama2-7b	.79 (.02)
LoRA-Guard-Llama3-8b	.81 (.01)

(a) Trained on ToxicChat, evaluated on OpenAIModEval

Model	AUPRC $\uparrow$
LoRA-Guard-TinyLlama	.19 (.03)
LoRA-Guard-Llama2-7b	.35 (.07)
LoRA-Guard-Llama3-8b	.39 (.30)

(b) Trained on OpenAIModEval, evaluated on ToxicChat

Table 3: Cross-domain evaluation (Section 3.2).

4 Conclusion

LoRA-Guard is a moderated conversational system that greatly reduces the guardrailing parameter overhead, by a factor of 100-1000x in our experiments, reducing training/inference time and memory requirements, while maintaining or improving performance. This can attributed to its knowledge sharing and parameter-efficient learning mechanisms. Fine-tuning catastrophic forgetting is also implicitly prevented by the dual-path design (cf. Fig. 1). We consider LoRA-Guard to be an important stepping stone towards guardrailing on resource-constrained portables—an essential task given the increased adoption of on-device LLMs.

Potential Risks

Future work can consider improving cross-domain generalisation, e.g. by finding the minimum amount of samples from the target domain that could be used to adapt LoRA-Guard to that domain. It is risky to deploy LoRA-Guard to arbitrary domains without such efforts.

5 Limitations

LoRA-Guard has some limitations: First, our system requires access to the chat system weights, so is only applicable in these cases and cannot be applied to black-box systems.

Second, the taxonomy is fixed in our system, and adaptation to different taxonomies requires retraining unlike Llama-Guard which can adapt via in-context learning. Though our guard output head is chosen to be a classifier mapping feature into class probabilities, an output head that produces text as in Llama-Guard is still possible in our framework.

Third, we warn against generalization across different domains due to dataset differences and lack of robustness training. The phenomenon is common in machine learning systems and can lead to wrong predictions when applied to data that is considerably different from the training data. In our case, this can lead to over-cautious predictions (misclassifying harmless samples as harmful) which causes the system to refuse to deliver harmless content; as well as under-cautious predictions (misclassifying harmful samples as harmless) which causes the system to deliver harmful content. The implications of delivering harmful content are more serious, though we emphasize that a guard model can only fail to detect harmful content, whose origin is the chat model response or the user prompts. Nevertheless, to achieve a more trustworthy guard system which we can confidently deploy requires more robust training and further evaluations. This will be improved in future work.

6 Ethical Considerations

We believe an important conversation in the field of AI safety is around the taxonomy of harmfulness categories that is used to direct the development of safety mechanism, such as guardrails.

There are categories of harmfulness that are more self-evident than others, such as those categories imposed by the moral law and the judiciary system. Others, however, are more particular to specific cultures and demographic groups. If LLM systems are to be adopted across cultures and demographic groups, we argue guardrails should be aware of the norms of conduct withing those groups.

The method we suggest in this paper might contribute to a wider adoption of content-moderated LLMs, due to reducing the computational overhead, making guardrails more available on portable devices; thus available to more people, e.g. those who do not necessarily enjoy a fast internet connection such that guardrailing can be done “in the cloud”. However, our method is oblivious to the norms that it is being adapted to as such. The ability to perform guardrailing is only a part of the process. The other part is having the resources that reflect culture and demographic norms, such as demographic-specific datasets, on which guardrails can be trained. We suggest this as an essential direction of future research. We advise caution with deploying a general-purpose guardrails across multiple cultural and demographic groups.

We comply with licence conditions for all pre-trained models and datasets used in the work. We accessed these artefects via HuggingFace (Wolf et al., 2019a) as follows:

•

ToxicChat-T5-Large model³³3https://huggingface.co/lmsys/toxicchat-t5-large-v1.0
•

Llama2-7b model⁴⁴4https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
•

Llama3-8b model ⁵⁵5https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
•

TinyLlama model⁶⁶6https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
•

ToxicChat dataset⁷⁷7https://huggingface.co/datasets/lmsys/toxic-chat
•

OpenAIModEval dataset⁸⁸8https://huggingface.co/datasets/mmathys/openai-moderation-api-evaluation

Regarding our artifacts to be published, we comply with intended use for derivative work.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
Alon and Kamfonas (2023) Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132.
Andriushchenko et al. (2024) Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. 2024. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151.
Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Barak (2023) Boaz Barak. 2023. Another jailbreak for GPT4: Talk to it in Morse code.
Chao et al. (2023) Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. 2023. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
Donahue et al. (2014) Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. 2014. DeCAF: A deep convolutional activation feature for generic visual recognition. In International conference on machine learning, pages 647–655. PMLR.
Dong et al. (2024a) Yi Dong, Ronghui Mu, Gaojie Jin, Yi Qi, Jinwei Hu, Xingyu Zhao, Jie Meng, Wenjie Ruan, and Xiaowei Huang. 2024a. Building guardrails for large language models. arXiv preprint arXiv:2402.01822.
Dong et al. (2024b) Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. 2024b. Attacks, defenses and evaluations for llm conversation safety: A survey. arXiv preprint arXiv:2402.09283.
Enkrypt AI (2024) Enkrypt AI. 2024. Protect your generative AI system with Guardrails.
Gheini et al. (2021) Mozhdeh Gheini, Xiang Ren, and Jonathan May. 2021. Cross-attention is all you need: Adapting pretrained transformers for machine translation. arXiv preprint arXiv:2104.08771.
Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings.
Gugger et al. (2022) Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
Guzey (2023) Alexey Guzey. 2023. A two sentence jailbreak for GPT-4 and Claude & why nobody knows how to fix it.
He et al. (2021) Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2021. Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366.
He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034.
Helbling et al. (2023) Alec Helbling, Mansi Phute, Matthew Hull, and Duen Horng Chau. 2023. LLM self defense: By self examination, LLMs know they are being tricked. arXiv preprint arXiv:2308.07308.
Houlsby et al. (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
Inan et al. (2023) Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama Guard: LLM-based input-output safeguard for Human-AI conversations. arXiv preprint arXiv:2312.06674.
Jain et al. (2023) Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. 2023. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
Jiang et al. (2024) Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. 2024. ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. arXiv preprint arXiv:2402.11753.
Kang et al. (2023) Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. 2023. Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
Lapid et al. (2023) Raz Lapid, Ron Langberg, and Moshe Sipper. 2023. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446.
Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Lhoest et al. (2021) Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. 2021. Datasets: A community library for natural language processing. arXiv preprint arXiv:2109.02846.
Li et al. (2023) Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. 2023. RAIN: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124.
Lialin et al. (2023) Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. 2023. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647.
Lin et al. (2023a) Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, and Yejin Choi. 2023a. The unlocking spell on base LLMs: Rethinking alignment via in-context learning. arXiv preprint arXiv:2312.01552.
Lin et al. (2023b) Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. 2023b. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. arXiv preprint arXiv:2310.17389.
Lin et al. (2024) Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. 2024. Toxicchat-t5-large model card. https://huggingface.co/lmsys/toxicchat-t5-large-v1.0. Accessed: 5 June 2024.
Liu et al. (2023) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023. AutoDAN: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Luccioni and Viviano (2021) Alexandra Sasha Luccioni and Joseph D Viviano. 2021. What’s in the box? a preliminary analysis of undesirable content in the Common Crawl Corpus. arXiv preprint arXiv:2105.02732.
Luo et al. (2023) Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. 2023. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747.
Madaan (2024) Shubh Goyal; Medha Hira; Shubham Mishra; Sukriti Goyal; Arnav Goel; Niharika Dadu; Kirushikesh DB; Sameep Mehta; Nishtha Madaan. 2024. LLMGuard: Guarding against unsafe LLM behavior.
Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
Markov et al. (2023) Todor Markov, Chong Zhang, Sandhini Agarwal, Florentine Eloundou Nekoul, Theodore Lee, Steven Adler, Angela Jiang, and Lilian Weng. 2023. A holistic approach to undesired content detection in the real world. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 15009–15018.
Mehrotra et al. (2023) Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2023. Tree of attacks: Jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119.
Mowshowitz (2022) Zvi Mowshowitz. 2022. Jailbreaking ChatGPT on release day.
OpenAI Moderation API (2024) OpenAI Moderation API. 2024. Moderation api.
Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
Paulus et al. (2024) Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. 2024. AdvPrompter: Fast adaptive adversarial prompting for LLMs. arXiv preprint arXiv:2404.16873.
Perez and Ribeiro (2022) Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527.
Perspective API (2024) Perspective API. 2024. Perspective API.
Phute et al. (2024) Mansi Phute, Alec Helbling, Matthew Hull, ShengYun Peng, Sebastian Szyller, Cory Cornelius, and Duen Horng Chau. 2024. LLM Self Defense: By self examination, LLMs know they are being tricked. In ICLR 2024 TinyPaper.
Popa and Poddar (2024) Raluca Ada Popa and Rishabh Poddar. 2024. Securing generative AI in the enterprise.
Protect AI (2024) Protect AI. 2024. LLM Guard: The security toolkit for LLM interactions.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.
Rajpal (2023) S. G. Rajpal. 2023. Guardrails ai.
Rao et al. (2023) Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. 2023. Tricking LLMs into disobedience: Understanding, analyzing, and preventing jailbreaks. arXiv preprint arXiv:2305.14965.
Raschka (2023) Sebastian Raschka. 2023. Practical tips for finetuning llms using lora (low-rank adaptation). https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms. Accessed: 5 June 2024.
Rebedea et al. (2023) Traian Rebedea, Razvan Dinu, Makesh Sreedhar, Christopher Parisien, and Jonathan Cohen. 2023. NeMo Guardrails: A toolkit for controllable and safe llm applications with programmable rails. arXiv preprint arXiv:2310.10501.
Russinovich et al. (2024) Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2024. Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack. arXiv preprint arXiv:2404.01833.
Schulhoff et al. (2023) Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, and Jordan Boyd-Graber. 2023. Ignore This Title and HackAPrompt: Exposing systemic vulnerabilities of LLMs through a global scale prompt hacking competition. arXiv preprint arXiv:2311.16119.
Shah et al. (2023) Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. 2023. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
Shen et al. (2023) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2023. “Do Anything Now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825.
Sung et al. (2022) Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. 2022. LST: Ladder side-tuning for parameter and memory efficient transfer learning. Advances in Neural Information Processing Systems, 35:12991–13005.
Takemoto (2024) Kazuhiro Takemoto. 2024. All in how you ask for it: Simple black-box method for jailbreak attacks. arXiv preprint arXiv:2401.09798.
Touvron et al. (2023a) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023a. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Varshney et al. (2023) Neeraj Varshney, Pavel Dolin, Agastya Seth, and Chitta Baral. 2023. The art of defending: A systematic evaluation and analysis of LLM defense strategies on safety and over-defensiveness. arXiv preprint arXiv:2401.00287.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
walkerspider (2022) walkerspider. 2022. DAN is my new friend.
Wang et al. (2023) Zezhong Wang, Fangkai Yang, Lu Wang, Pu Zhao, Hongru Wang, Liang Chen, Qingwei Lin, and Kam-Fai Wong. 2023. Self-Guard: Empower the LLM to safeguard itself. arXiv preprint arXiv:2310.15851.
Wei et al. (2024) Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2024. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36.
Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Wei et al. (2023) Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
WitchBOT (2023) WitchBOT. 2023. You can use GPT-4 to create prompt injections against GPT-4.
Witten (2022) Zack Witten. 2022. Thread of known chatgpt jailbreaks.
Wolf et al. (2019a) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019a. Huggingface’s transformers: State-of-the-art natural language processing. CoRR, abs/1910.03771.
Wolf et al. (2019b) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019b. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
Xie et al. (2024) Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Gong. 2024. Gradsafe: Detecting unsafe prompts for llms via safety-critical gradient analysis. arXiv preprint arXiv:2402.13494.
Xie et al. (2023) Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending ChatGPT against jailbreak attack via self-reminders. Nature Machine Intelligence, 5(5):1486–1496.
Yong et al. (2023) Zheng-Xin Yong, Cristina Menghini, and Stephen H Bach. 2023. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.
Yu et al. (2023) Jiahao Yu, Xingwei Lin, and Xinyu Xing. 2023. GPTFUZZER: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
Yuan et al. (2023) Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. 2023. GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. arXiv preprint arXiv:2308.06463.
Zeng et al. (2024) Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024. How johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs. arXiv preprint arXiv:2401.06373.
Zhang et al. (2024) Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. 2024. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385.
Zhou et al. (2024) Yujun Zhou, Yufei Han, Haomin Zhuang, Taicheng Guo, Kehan Guo, Zhenwen Liang, Hongyan Bao, and Xiangliang Zhang. 2024. Defending jailbreak prompts via in-context adversarial game. arXiv preprint arXiv:2402.13148.
Zhu et al. (2023) Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. 2023. AutoDAN: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140.
Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Appendix A Related Work

Attacks.

Jailbreak attacks have been shown to effectively generate harmful content (Rao et al., 2023; Kang et al., 2023). The overarching goal of jailbreak is to trick the model into ignoring or deprioritizing its safety mechanisms, thus open up the door for harmful content to be generated.

Simple approaches such as manual prompting have shown remarkable result considering their simplicity (walkerspider, 2022; Mowshowitz, 2022; Witten, 2022; Guzey, 2023; Zeng et al., 2024). Some example strategies include: instructing to model to ignore previous (potentially safety) instructions (Perez and Ribeiro, 2022; Shen et al., 2023; Schulhoff et al., 2023); asking the model to start the answer with “Absolutely! Here’s ” to condition the generation process to follow a helpful direction (Wei et al., 2024); using low-resource languages of alternative text modes such as ciphers, for which pre-training data exists but safety data may be lacking (Yong et al., 2023; Barak, 2023; Yuan et al., 2023; Jiang et al., 2024); inducing persona modulation or role-playing (Shah et al., 2023; Yuan et al., 2023); using an LLM assistant to generate jailbreak prompts (WitchBOT, 2023; Shah et al., 2023); or using iterative prompt refinement to evade safeguards (Takemoto, 2024; Russinovich et al., 2024).

More complex approaches involve automated rather than manually-crafted prompts. Automation can be achieved through LLM assistants which generate and/or modify prompts (Chao et al., 2023; Mehrotra et al., 2023; Shah et al., 2023; Yu et al., 2023) or using optimization algorithms. Black-box optimization approaches rely exclusively on model outputs such as those available from closed-access models. Lapid et al. (2023); Liu et al. (2023) use genetic algorithms, and Mehrotra et al. (2023); Takemoto (2024) use iterative refinement to optimize adversarial prompts. In contrast, white-box optimization approaches assume open-access to the LLMs and thus can use gradient information. Zou et al. (2023) use Greedy Coordinate Gradient to find a prompt suffix that causes LLMs to produce objectionable content, and Zhu et al. (2023) uses uses a dual-goal attack that is capable of jailbreaking as well as stealthiness, thus avoiding perplexity filters that can easily detect unreadable gibberish text. In between black-box and white-box there are also grey-box optimization approaches which use token probabilities (Andriushchenko et al., 2024; Paulus et al., 2024).

Defences.

In addition to the development of safety alignment approaches (Ouyang et al., 2022; Bai et al., 2022b), other defence mechanisms have been proposed to detect undesirable content—we will refer to these collectively as Guardrails (Markov et al., 2023; Dong et al., 2024a).

Some Guardrails are based on the self-defence principle whereby an LLM is used to evaluate the safety of user-provided prompts or model-generated responses (Helbling et al., 2023; Wang et al., 2023; Li et al., 2023); other approaches are based on self-reminders placed in system prompts which remind LLMs to answer according to safety guidelines (Xie et al., 2023); others use in-context learning to strengthen defences without retraining or fine-tuning (Wei et al., 2023; Lin et al., 2023a; Zhou et al., 2024; Varshney et al., 2023); yet others use perplexity-based filters detect jailbreaks which are not optimized for stealthiness (Jain et al., 2023; Alon and Kamfonas, 2023); and others detect unsafe prompts by scrutinizing the gradients of safety-critical parameters in LLMs (Xie et al., 2024).

A number of APIs and commercial solutions addressing safety also exist, with varying degree of openness as to the methods employed: Nvidia’s NeMo Guardrails (Rebedea et al., 2023), OpenAI’s Moderation API (OpenAI Moderation API, 2024), GuardrailsAI (Rajpal, 2023), Perspective API (Perspective API, 2024), Protect AI (Protect AI, 2024), Opaque (Popa and Poddar, 2024), Enkrypt AI (Enkrypt AI, 2024).

The closest defence works to our proposed LoRA-Guard are Llama-Guard (Inan et al., 2023) and Self-Guard (Wang et al., 2023). Llama-Guard is content moderation model, specifically a Llama2-7B model (Touvron et al., 2023b) that was fine-tuned for harmful content detection. It employs a 7-billion parameter guard model in addition to the 7-billion parameter chat model, resulting in double the memory requirements which renders the approach inefficient in resource-constrained scenarios. Self-Guard fine-tunes the entire model without introducing additional parameters, though the fine-tuning alters the chat model which could lead to catastrophic forgetting when fine-tuning on large datasets (Luo et al., 2023).

Parameter-Efficient Fine-Tuning.

To address the increasing computational costs of fully fine-tuning LLMs, parameter-efficient fine-tuning methods have been proposed (He et al., 2021; Lialin et al., 2023). Selective fine-tuning selects a subset of the model parameters to be fine-tuned (Donahue et al., 2014; Gheini et al., 2021). Prompt tuning prepends the model input embeddings with a trainable “soft prompt” tensor (Lester et al., 2021). Adapters add additional training parameters to existing layers while keeping the remaining parameters fixed (Houlsby et al., 2019). Low-rank adaptation is currently the most widely user adapter method, and involves adding a small number of trainable low-rank matrices to the model’s weights, resulting in efficient updates without affecting the original model parameters (Hu et al., 2021). Ladder side-tuning disentangles the backwards pass of the original and new parameters for more efficient back-propagation (Sung et al., 2022).

LoRA.

Low-Rank Adaptation (LoRA; Hu et al., 2021) is a popular method for parameter-efficient fine-tuning of neural network models. LoRA is performed by freezing the weights of the pre-trained model and adding trainable low-rank perturbations, replacing pre-trained weights $W\in\mathbb{R}^{m\times n}$ with $W+\frac{\alpha}{r}AB$ where $A\in\mathbb{R}^{m\times r}$ , $B\in\mathbb{R}^{r\times n}$ , $r$ is the rank of the perturbations, and $\alpha$ is a scaling constant. During training, $W$ is frozen, and $A$ and $B$ are trainable parameters. We refer to $r$ , the rank of the perturbations, as the LoRA rank. Training the low-rank perturbations rather than the original parameters can vastly reduce the number of trainable parameters, often without affecting performance compared to a full fine-tune (Hu et al., 2021). After training, the low-rank perturbations can optionally be merged (by addition) into the pre-trained parameters meaning that the fine-tuning process incurs zero additional inference latency in general. In this work, we store the LoRA perturbations $\Delta W=\frac{\alpha}{r}AB$ separately from the pretrained parameters, so that we may activate and deactivate it for guard and chat applications respectively.

Appendix B Methods

B.1 Datasets

ToxicChat

We use the January 2024 (0124) version available on HuggingFace.⁹⁹9https://huggingface.co/datasets/lmsys/toxic-chat The dataset is provided in a split of 5082 training examples and 5083 test examples. On each training run, we further randomly subdivide the full train split into training and validation datasets with 4096 and 986 examples respectively. We refer to the initial 5082 training examples as the full train split and to the 4096 examples on which the model is actually trained as the training split. We use the toxicity annotation as a target label, which is a binary indicator of whether the prompt-response pair is determined to be toxic.

OpenAIModEval

(OpenAI Moderation Evaluation) The 8 categories determining harmful content are sexual, hate, violence, harassment, self-harm, sexual/minors, hate/threatening and violence/graphic. For any prompts which the labelling process was sufficiently confident of a (non-)violation of a given category, the prompt attributed a binary label for that category. Where the labelling process is not confident, no label is attributed, meaning many prompts have missing labels for some categories.

The dataset was used as an evaluation dataset by Markov et al. (2023) to assess the performance of the OpenAI moderation API, but we further split it into train, validation and test portions to evaluate LoRA-Guard. We first split the dataset into a full train split and a test split of sizes $1224$ and $456$ respectively. This split is fixed across all experiments and the indices of the test split are given in Appendix G For each run, we then randomly split the full train split further into train and validation sets of size $1004$ and $200$ respectively. The prompts are formatted as user: {prompt} before being passed to the model.

B.2 Training and Evaluation

Implementation

We use the PyTorch model implementations provided by the HuggingFace transformers library (Wolf et al., 2019b) and LoRA adapters provided in the HuggingFace PEFT module (Mangrulkar et al., 2022). Datasets are accessed through HuggingFace datasets (Lhoest et al., 2021) module. For multi-GPU training with data parallel and gradient accumulation, we use the HuggingFace accelerate package (Gugger et al., 2022).

ToxicChat

We train the guard models using the LoRA-Guard method on top of each of the chat models specified earlier. Training is performed on 8 NVIDIA A40s using data parallel with per-device batch size of 2, right padding and gradient accumulation (the number of accumulation steps determines the overall batch size), except for the TinyLlama runs where we used only 2 A40s and a per-device batch size of 8. All computation is done in the PyTorch 16 bit brain float data type bfloat16. We vary the batch size and LoRA rank across experiments, and run each configuration for 3 independent random seeds. The LoRA $\alpha$ parameter is set to twice the rank on each experiment (following Raschka (2023)) and the LoRA layers use dropout with probability $0.05$ . We initialise the guard model output heads using Xavier uniform initialisation (Glorot and Bengio, 2010). In the notation of Appendix A:LoRA, we initialise the LoRA parameters by setting $B$ to 0 and using Kaiming uniform initialisation (He et al., 2015) for $A$ . LoRA adaptation is applied only to the query and key values of attention parameters in the chat models (no other layers or parameters are adapted). We train the model for 20 epochs on the training split using AdamW (Loshchilov and Hutter, 2017) with learning rate $3\text{\times}{10}^{-4}$ and cross-entropy loss. We weight the positive term in the loss by the ratio of the number of negative examples to that of positive examples in the training split. At the end of each epoch, we perform a sweep across the entire train, validation and test splits calculating various performance metrics with a classification threshold of $0.5$ . We report the test set performance of the model checkpoint (end of epoch) with the highest score for area under the precision recall curve (AUPRC) on the validation set.

OpenAI Moderation Evaluation

Except as detailed below, all training details are the same as for ToxicChat, detailed in this Appendix. The models are obtained using a guard model head with $8$ outputs, each of which corresponds to a different category in the taxonomy. We treat the problem as multilabel classification and use the binary cross entropy loss for each label, where the positive term is weighted by the ratio of non-positive examples to positive examples for that category. When training, the models receive no gradients for categories where the given example does have a target label.

We compare our models to LlamaGuard evaluated on our test split (see Appendix G), where the system prompt has been adapted to the OpenAI taxonomy. The chat template used in the tokenizer is given in Fig. 3.

For the LoRA-Guard evaluations, we chose the best performing batch size, LoRA rank and epoch checkpoint determined by max median of the mean AUPRC across categories (computed independently for each category) on a validation set evaluated across 3 seeds.

We report metrics on our test split according to binary labels of whether the prompt is toxic or not. We consider a prompt unsafe unless it is labelled as safe according to all of categories in the taxonomy (this is a conservative approach to harmful content). For the LoRA-Guard models, an example is predicted as unsafe if it is predicted as belonging to any of the categories and we compute the classification score for an example as the max of the scores over the categories. For Llama-Guard, since it outputs text rather than classification scores, the classification score is the score of the token unsafe in the first token produced in generation.

Cross-domain

First we evaluated, on OpenAIModEval, LoRA-Guard models trained on ToxicChat. Given an utterance, LoRA-Guard trained on ToxicChat produces a single output that we interpret as the probability that the utterance is harmful. However, OpenAIModEval contains a binary label for each of 8 harmfulness categories. In addition, some labels are missing, representing the fact that the annotator was undecided with regards to the corresponding category. With this in mind, we binarised OpenAIModEval labels as follows: if any of the 8 labels is 1 (indicating a harmful utterance), or is missing, the final label is 1 (harmful), otherwise it is 0 (not harmful).

Next, we evaluated, on ToxicChat, LoRA-Guard models trained on OpenAIModEval. Given an utterance, LoRA-Guard trained on OpenAIModEval produces 8 outputs. We interpret each output as the probability that the utterance belongs to the corresponding harmfulness category. However, ToxicChat contains binary labels. To binarise the LoRA-Guard output as follows: the probability that the utterance is harmful is the largest of the 8 output probabilities.

Metrics

Use report several metrics across our experiments: Precision measures the ratio of correctly predicted positive instances to the total predicted positive instances: Precision = True Positives / (True Positives + False Positives). Recall measures the ratio of correctly predicted positive instances to the total actual positive instances. Formula: Recall = True Positives / (True Positives + False Negatives). F1 Score is a harmonic mean of precision and recall, providing a balance between the two metrics: F1 Score = 2 * (Precision * Recall) / (Precision + Recall). AUPRC (area under the precision-recall curve) represents the overall performance of a classifier by considering different threshold values. The PR curve plots precision against recall for varying thresholds, and AUPRC calculates the area under this curve.

B.3 Tables Notes

(a) Results taken from the table on the HuggingFace webpage: https://huggingface.co/lmsys/toxicchat-t5-large-v1.0. The OpenAI evaluations were performed on Jan 25 2024 using score threshold of 0.02. (b) Results taken from (Inan et al., 2023, Table 2). Scores are for classifying only the prompts in the dataset. LlamaGuard is not trained on ToxicChat. (c) Results read from (Inan et al., 2023, Figure 3). These models are fully fine tuned on the full training set of ToxicChat. It is not stated explicitly around (Inan et al., 2023, Figure 3), but by comparison with (Inan et al., 2023, Table 2) we assume that the scores are for classifying only the prompts in the dataset.

Appendix C Additional LoRA-Guard Results

We present a breakdown of the results for the LoRA-Guard training runs with various hyperparameters. For ToxicChat see Tables 6, 6 and 6. For OpenAIModEval see Tables 9, 9 and 9. For Cross-domain see Tables 11 and 11.

Appendix D LLM Self Defense

We additionally compared against the LLM Self Defense approach suggested by Phute et al. (2024). Under this approach, given a prompt-response pair to be classified for harmfulness, we first incorporate it into an instruction prompt with the following pre-defined format:

⬇

Here is some text:

user: {prompt}

agent: {response}

Question: Does this text describe

harmful behaviour? Say ’yes, this is

harmful’ or ’no, this is not harmful’

and give an explanation

Answer:

This approach of asking the question after presenting the content is noted as most effective by Phute et al. (2024).

Next, we provide this instruction prompt as input to the original LLM. The original prompt-response pair is considered harmful if the LLM output begins with Yes. Results are shown in Table 12.

Appendix E Output Head Tuning Baseline

In this section we present results for training only the guard model output head (without any LoRA adaptation) for the various models. In the notation of the previous section, the head fine-tuning models correspond to $h_{\text{guard}}\circ f\circ\phi$ and only $h_{\text{guard}}$ is trained for the guard task.

The data processing, training and evaluation procedures at the same as for the respective LoRA-Guard experiments except that each training run was performed on a single NVIDIA RTX4090.

In the tables labelled linear output head tuning we report training a linear guard model head. In the tables labelled MLP we instead use a small multi-layer perceptron (MLP) with two hidden layers and layer width 1000.

The results are given in Tables 13, 14, 15 and 16.

Appendix F LlamaGuard System Prompt for OpenAI Moderation Evaluation Dataset

See Fig. 3.

Appendix G Open AI Test Split Indices

The indices we use as the test split for the OpenAIModEval dataset are:
3, 6, 10, 12, 15, 20, 22, 23, 27, 35, 38, 41, 42, 50, 56, 57, 58, 63, 64, 65, 66, 67, 68, 69, 75, 78, 91, 92, 94, 96, 97, 100, 101, 103, 105, 108, 111, 112, 116, 117, 118, 120, 122, 123, 132, 143, 145, 156, 157, 161, 166, 167, 168, 172, 174, 184, 185, 195, 199, 200, 207, 210, 212, 214, 216, 217, 219, 220, 224, 256, 258, 264, 266, 267, 268, 270, 274, 287, 291, 299, 305, 309, 311, 317, 318, 320, 323, 327, 331, 332, 334, 345, 347, 348, 352, 356, 378, 381, 383, 390, 392, 393, 396, 402, 404, 419, 420, 421, 426, 427, 430, 431, 443, 448, 450, 461, 466, 480, 482, 486, 489, 492, 493, 496, 497, 498, 500, 504, 510, 514, 518, 519, 521, 526, 531, 534, 539, 544, 546, 548, 555, 557, 561, 565, 578, 583, 585, 588, 589, 602, 603, 607, 611, 615, 617, 622, 627, 629, 630, 631, 632, 636, 639, 650, 654, 661, 665, 666, 668, 675, 676, 678, 679, 682, 684, 686, 690, 692, 693, 695, 696, 722, 723, 725, 733, 735, 736, 746, 747, 751, 757, 762, 765, 766, 773, 778, 780, 784, 795, 798, 802, 803, 820, 822, 823, 824, 827, 831, 832, 833, 835, 836, 841, 842, 845, 847, 851, 854, 858, 859, 867, 870, 877, 878, 880, 885, 888, 893, 894, 895, 899, 901, 904, 906, 911, 913, 914, 923, 924, 927, 932, 933, 939, 940, 941, 943, 944, 945, 952, 957, 958, 974, 975, 985, 991, 994, 995, 996, 997, 998, 999, 1003, 1016, 1023, 1025, 1029, 1030, 1042, 1043, 1044, 1046, 1050, 1052, 1053, 1057, 1062, 1066, 1067, 1071, 1075, 1076, 1079, 1085, 1086, 1093, 1096, 1102, 1120, 1121, 1126, 1128, 1137, 1139, 1146, 1149, 1154, 1155, 1156, 1163, 1165, 1170, 1171, 1175, 1185, 1190, 1197, 1198, 1199, 1201, 1202, 1205, 1206, 1208, 1209, 1216, 1218, 1219, 1222, 1223, 1225, 1227, 1230, 1237, 1239, 1250, 1251, 1255, 1256, 1261, 1264, 1265, 1268, 1273, 1274, 1275, 1276, 1280, 1281, 1282, 1288, 1293, 1294, 1299, 1301, 1303, 1304, 1309, 1311, 1312, 1318, 1322, 1333, 1340, 1342, 1343, 1346, 1351, 1352, 1354, 1355, 1358, 1362, 1363, 1365, 1373, 1376, 1379, 1381, 1384, 1385, 1387, 1391, 1409, 1416, 1420, 1423, 1424, 1426, 1427, 1428, 1432, 1437, 1440, 1447, 1448, 1449, 1451, 1453, 1454, 1455, 1456, 1458, 1464, 1466, 1473, 1474, 1476, 1480, 1486, 1491, 1504, 1510, 1514, 1515, 1516, 1522, 1524, 1531, 1533, 1535, 1538, 1540, 1543, 1544, 1545, 1548, 1557, 1560, 1564, 1569, 1572, 1575, 1576, 1580, 1581, 1582, 1584, 1586, 1591, 1594, 1597, 1599, 1601, 1602, 1611, 1617, 1620, 1622, 1623, 1630, 1637, 1638, 1640, 1642, 1650, 1652, 1659, 1660, 1661, 1662, 1663, 1669, 1670, 1675, 1676, 1677.

BS	$r$	AUPRC	Precision	Recall	F1	Guard Overhead
$16$	$8$	$.85\;\;(.01)$	$.73\;\;(.15)$	$.86\;\;(.14)$	$.76\;\;(.07)$	$1.13\text{\times}{10}^{06}$
$16$	$32$	$.85\;\;(.05)$	$.59\;\;(.20)$	$.86\;\;(.12)$	$.73\;\;(.12)$	$4.51\text{\times}{10}^{06}$
$16$	$128$	$.64\;\;(.33)$	$.54\;\;(.26)$	$.73\;\;(.15)$	$.62\;\;(.24)$	$1.80\text{\times}{10}^{07}$
$64$	$8$	$.83\;\;(.01)$	$.64\;\;(.09)$	$.89\;\;(.04)$	$.74\;\;(.05)$	$1.13\text{\times}{10}^{06}$
$64$	$32$	$.88\;\;(.03)$	$.69\;\;(.09)$	$.90\;\;(.02)$	$.77\;\;(.06)$	$4.51\text{\times}{10}^{06}$
$64$	$128$	$.84\;\;(.07)$	$.57\;\;(.36)$	$.93\;\;(.08)$	$.71\;\;(.27)$	$1.80\text{\times}{10}^{07}$

Table 4: LoRA-Guard with TinyLlama evaluation on the ToxicChat test set. We report the median on the test set with the range in parentheses for the best performing epoch checkpoint determined by max median of the AUPRC on a validation set evaluated across 3 seeds. The guard overhead is the number of additional parameters needed to run the guard model with respect to the chat model.

BS	$r$	AUPRC	Precision	Recall	F1	Guard Overhead
$16$	$8$	$.91\;\;(.05)$	$.72\;\;(.16)$	$.87\;\;(.07)$	$.81\;\;(.08)$	$4.20\text{\times}{10}^{06}$
$16$	$32$	$.90\;\;(.18)$	$.68\;\;(.15)$	$.92\;\;(.15)$	$.79\;\;(.14)$	$1.68\text{\times}{10}^{07}$
$16$	$128$	$.74\;\;(.74)$	$.39\;\;(.50)$	$.88\;\;(.97)$	$.56\;\;(.64)$	$6.71\text{\times}{10}^{07}$
$64$	$8$	$.88\;\;(.02)$	$.70\;\;(.12)$	$.91\;\;(.06)$	$.79\;\;(.05)$	$4.20\text{\times}{10}^{06}$
$64$	$32$	$.90\;\;(.01)$	$.71\;\;(.17)$	$.91\;\;(.07)$	$.79\;\;(.08)$	$1.68\text{\times}{10}^{07}$
$64$	$128$	$.76\;\;(.10)$	$.53\;\;(.24)$	$.86\;\;(.10)$	$.66\;\;(.20)$	$6.71\text{\times}{10}^{07}$

Table 5: LoRA-Guard with Llama2-7b evaluation on the ToxicChat test set. We report the median on the test set with the range in parentheses for the best performing epoch checkpoint determined by max median of the AUPRC on a validation set evaluated across 3 seeds. The guard overhead is the number of additional parameters needed to run the guard model with respect to the chat model.

BS	$r$	AUPRC	Precision	Recall	F1	Guard Overhead
$16$	$8$	$.90\;\;(.01)$	$.78\;\;(.11)$	$.90\;\;(.11)$	$.83\;\;(.02)$	$3.41\text{\times}{10}^{06}$
$16$	$32$	$.91\;\;(.02)$	$.75\;\;(.05)$	$.90\;\;(.01)$	$.82\;\;(.03)$	$1.36\text{\times}{10}^{07}$
$16$	$128$	$.74\;\;(.14)$	$.56\;\;(.27)$	$.81\;\;(.21)$	$.66\;\;(.18)$	$5.45\text{\times}{10}^{07}$
$64$	$8$	$.90\;\;(.04)$	$.77\;\;(.10)$	$.87\;\;(.07)$	$.82\;\;(.05)$	$3.41\text{\times}{10}^{06}$
$64$	$32$	$.87\;\;(.09)$	$.66\;\;(.03)$	$.92\;\;(.11)$	$.75\;\;(.04)$	$1.36\text{\times}{10}^{07}$
$64$	$128$	$.84\;\;(.09)$	$.57\;\;(.19)$	$.95\;\;(.15)$	$.71\;\;(.10)$	$5.45\text{\times}{10}^{07}$

Table 6: LoRA-Guard with Llama3-8b evaluation on the ToxicChat test set. We report the median on the test set with the range in parentheses for the best performing epoch checkpoint determined by max median of the AUPRC on a validation set evaluated across 3 seeds. The guard overhead is the number of additional parameters needed to run the guard model with respect to the chat model.

BS	$r$	AUPRC	Precision	Recall	F1	Guard Overhead
$16$	$8$	$.84\;\;(.02)$	$.86\;\;(.08)$	$.39\;\;(.16)$	$.53\;\;(.11)$	$1.14\text{\times}{10}^{06}$
$16$	$32$	$.83\;\;(.01)$	$.82\;\;(.06)$	$.38\;\;(.10)$	$.51\;\;(.10)$	$4.52\text{\times}{10}^{06}$
$16$	$128$	$.82\;\;(.01)$	$.85\;\;(.13)$	$.37\;\;(.29)$	$.52\;\;(.18)$	$1.80\text{\times}{10}^{07}$
$64$	$8$	$.83\;\;(.03)$	$.79\;\;(.05)$	$.52\;\;(.14)$	$.63\;\;(.08)$	$1.14\text{\times}{10}^{06}$
$64$	$32$	$.83\;\;(.01)$	$.77\;\;(.03)$	$.44\;\;(.06)$	$.56\;\;(.05)$	$4.52\text{\times}{10}^{06}$
$64$	$128$	$.80\;\;(.02)$	$.75\;\;(.02)$	$.50\;\;(.17)$	$.60\;\;(.12)$	$1.80\text{\times}{10}^{07}$

Table 7: LoRA-Guard with TinyLlama evaluation on our test split of the OpenAIModEval Dataset. For each parameterisation we choose the best performing epoch checkpoint determined by max median of the mean AUPRC across categories (computed independently for each category) on a validation set evaluated across 3 seeds and report the median AUPRC (calculated according to Section B.2) on the test set with the range in parentheses. The guard overhead is the number of additional parameters needed to run the guard model with respect to the corresponding chat model.

BS	$r$	AUPRC	Precision	Recall	F1	Guard Overhead
$16$	$8$	$.82\;\;(.02)$	$.82\;\;(.02)$	$.42\;\;(.08)$	$.55\;\;(.07)$	$3.44\text{\times}{10}^{06}$
$16$	$32$	$.81\;\;(.05)$	$.80\;\;(.12)$	$.38\;\;(.59)$	$.52\;\;(.33)$	$1.37\text{\times}{10}^{07}$
$16$	$128$	$.80\;\;(.03)$	$.77\;\;(.04)$	$.48\;\;(.17)$	$.59\;\;(.12)$	$5.46\text{\times}{10}^{07}$
$64$	$8$	$.83\;\;(.01)$	$.78\;\;(.08)$	$.52\;\;(.16)$	$.63\;\;(.10)$	$3.44\text{\times}{10}^{06}$
$64$	$32$	$.81\;\;(.02)$	$.78\;\;(.04)$	$.49\;\;(.12)$	$.61\;\;(.08)$	$1.37\text{\times}{10}^{07}$
$64$	$128$	$.82\;\;(.09)$	$.77\;\;(.08)$	$.43\;\;(.61)$	$.55\;\;(.33)$	$5.46\text{\times}{10}^{07}$

Table 8: LoRA-Guard with Llama2-7b evaluation on our test split of the OpenAIModEval dataset. For each parameterisation we choose the best performing epoch checkpoint determined by max median of the mean AUPRC across categories (computed independently for each category) on a validation set evaluated across 3 seeds and report the median AUPRC (calculated according to Section B.2) on the test set with the range in parentheses. The guard overhead is the number of additional parameters needed to run the guard model with respect to the corresponding chat model.

BS	$r$	AUPRC	Precision	Recall	F1	Guard Overhead
$16$	$8$	$.83\;\;(.01)$	$.87\;\;(.06)$	$.34\;\;(.01)$	$.49\;\;(.01)$	$4.23\text{\times}{10}^{06}$
$16$	$32$	$.83\;\;(.01)$	$.86\;\;(.05)$	$.34\;\;(.00)$	$.49\;\;(.01)$	$1.68\text{\times}{10}^{07}$
$16$	$128$	$.75\;\;(.02)$	$.76\;\;(.00)$	$1.00\;\;(.03)$	$.86\;\;(.01)$	$6.71\text{\times}{10}^{07}$
$64$	$8$	$.81\;\;(.03)$	$.78\;\;(.01)$	$.49\;\;(.06)$	$.60\;\;(.05)$	$4.23\text{\times}{10}^{06}$
$64$	$32$	$.82\;\;(.01)$	$.78\;\;(.06)$	$.42\;\;(.28)$	$.55\;\;(.17)$	$1.68\text{\times}{10}^{07}$
$64$	$128$	$.82\;\;(.07)$	$.76\;\;(.02)$	$.59\;\;(.48)$	$.66\;\;(.25)$	$6.71\text{\times}{10}^{07}$

Table 9: LoRA-Guard with Llama3-8b evaluation on our test split of the OpenAI Moderation Evaluation Dataset. For each parameterisation we choose the best performing epoch checkpoint determined by max median of the mean AUPRC across categories (computed independently for each category) on a validation set evaluated across 3 seeds and report the median AUPRC (calculated according to Section B.2) on the test set with the range in parentheses. The guard overhead is the number of additional parameters needed to run the guard model with respect to the corresponding chat model.

Model	AUPRC $\uparrow$	Precision $\uparrow$	Recall $\uparrow$	F1 $\uparrow$
LoRA-Guard-TinyLlama	.8 (.01)	.76 (.02)	.44 (.03)	.56 (.02)
LoRA-Guard-Llama2-7b	.79 (.02)	.79 (.04)	.36 (.14)	.50 (.11)
LoRA-Guard-Llama3-8b	.81 (.01)	.80 (.10)	.32 (.12)	.47 (.10)

Table 10: Trained on ToxicChat, evaluated on OpenAI.

Model	AUPRC $\uparrow$	Precision $\uparrow$	Recall $\uparrow$	F1 $\uparrow$
LoRA-Guard-TinyLlama	.19 (.03)	.21 (.03)	.32 (.11)	.24 (.04)
LoRA-Guard-Llama2-7b	.35 (.07)	.44 (.10)	.33 (.07)	.37 (.08)
LoRA-Guard-Llama3-8b	.39 (.30)	.46 (.52)	.35 (.73)	.37 (.26)

Table 11: Trained on OpenAI, evaluated on ToxicChat.

Model	Precision	Recall	F1
LoRA-Guard-TinyLlama	0.01	0	0.01
LoRA-Guard-Llama2-7b	0.53	0.38	0.44
LoRA-Guard-Llama3-8b	0.33	0.69	0.44

(a) ToxicChat

Model	Precision	Recall	F1
LoRA-Guard-TinyLlama	0	0	0
LoRA-Guard-Llama2-7b	0.79	0.46	0.58
LoRA-Guard-Llama3-8b	0.75	0.55	0.64

(b) OpenAI

Table 12: Self-reflection baselines on ToxicChat (table above) and OpenAI (table below), as discussed in Appendix D.

Model	Batch Size	AUPRC	Precision	Recall	F1	Guard Overhead
TinyLlama	$8$	$.53\;\;(.05)$	$.32\;\;(.02)$	$.88\;\;(.02)$	$.47\;\;(.02)$	$2.05\text{\times}{10}^{03}$
TinyLlama	$16$	$.58\;\;(.05)$	$.38\;\;(.02)$	$.85\;\;(.04)$	$.52\;\;(.01)$	$2.05\text{\times}{10}^{03}$
TinyLlama	$32$	$.59\;\;(.04)$	$.42\;\;(.06)$	$.84\;\;(.03)$	$.56\;\;(.05)$	$2.05\text{\times}{10}^{03}$
TinyLlama	$64$	$.60\;\;(.04)$	$.42\;\;(.03)$	$.84\;\;(.05)$	$.55\;\;(.03)$	$2.05\text{\times}{10}^{03}$
TinyLlama	$128$	$.62\;\;(.05)$	$.42\;\;(.03)$	$.83\;\;(.02)$	$.56\;\;(.03)$	$2.05\text{\times}{10}^{03}$
TinyLlama	$524$	$.58\;\;(.04)$	$.40\;\;(.02)$	$.83\;\;(.04)$	$.55\;\;(.02)$	$2.05\text{\times}{10}^{03}$
Llama2-7b	$8$	$.71\;\;(.02)$	$.49\;\;(.03)$	$.88\;\;(.04)$	$.63\;\;(.03)$	$4.10\text{\times}{10}^{03}$
Llama2-7b	$16$	$.73\;\;(.02)$	$.55\;\;(.04)$	$.86\;\;(.02)$	$.67\;\;(.03)$	$4.10\text{\times}{10}^{03}$
Llama2-7b	$32$	$.75\;\;(.01)$	$.58\;\;(.03)$	$.85\;\;(.05)$	$.69\;\;(.01)$	$4.10\text{\times}{10}^{03}$
Llama2-7b	$64$	$.75\;\;(.02)$	$.59\;\;(.03)$	$.84\;\;(.04)$	$.69\;\;(.01)$	$4.10\text{\times}{10}^{03}$
Llama2-7b	$128$	$.75\;\;(.03)$	$.59\;\;(.07)$	$.86\;\;(.02)$	$.70\;\;(.04)$	$4.10\text{\times}{10}^{03}$
Llama2-7b	$524$	$.74\;\;(.04)$	$.55\;\;(.08)$	$.84\;\;(.01)$	$.66\;\;(.06)$	$4.10\text{\times}{10}^{03}$
Llama3-8b	$8$	$.73\;\;(.01)$	$.51\;\;(.03)$	$.87\;\;(.05)$	$.64\;\;(.01)$	$4.10\text{\times}{10}^{03}$
Llama3-8b	$16$	$.75\;\;(.01)$	$.59\;\;(.05)$	$.84\;\;(.03)$	$.70\;\;(.03)$	$4.10\text{\times}{10}^{03}$
Llama3-8b	$32$	$.76\;\;(.01)$	$.59\;\;(.07)$	$.86\;\;(.06)$	$.70\;\;(.03)$	$4.10\text{\times}{10}^{03}$
Llama3-8b	$64$	$.77\;\;(.02)$	$.59\;\;(.05)$	$.85\;\;(.04)$	$.70\;\;(.02)$	$4.10\text{\times}{10}^{03}$
Llama3-8b	$128$	$.76\;\;(.02)$	$.59\;\;(.02)$	$.85\;\;(.04)$	$.70\;\;(.00)$	$4.10\text{\times}{10}^{03}$
Llama3-8b	$524$	$.73\;\;(.03)$	$.58\;\;(.06)$	$.85\;\;(.02)$	$.69\;\;(.04)$	$4.10\text{\times}{10}^{03}$

Table 13: Linear output head tuning on the ToxicChat dataset.

Model	Batch Size	AUPRC	Precision	Recall	F1	Guard Overhead
TinyLlama	$8$	$.67\;\;(.01)$	$.52\;\;(.15)$	$.77\;\;(.18)$	$.62\;\;(.03)$	$3.05\text{\times}{10}^{06}$
TinyLlama	$16$	$.66\;\;(.03)$	$.61\;\;(.12)$	$.66\;\;(.12)$	$.63\;\;(.04)$	$3.05\text{\times}{10}^{06}$
TinyLlama	$32$	$.69\;\;(.03)$	$.65\;\;(.04)$	$.64\;\;(.05)$	$.64\;\;(.01)$	$3.05\text{\times}{10}^{06}$
TinyLlama	$64$	$.69\;\;(.02)$	$.63\;\;(.01)$	$.68\;\;(.04)$	$.65\;\;(.02)$	$3.05\text{\times}{10}^{06}$
TinyLlama	$128$	$.69\;\;(.04)$	$.65\;\;(.04)$	$.65\;\;(.06)$	$.65\;\;(.01)$	$3.05\text{\times}{10}^{06}$
TinyLlama	$524$	$.68\;\;(.02)$	$.65\;\;(.03)$	$.63\;\;(.03)$	$.64\;\;(.03)$	$3.05\text{\times}{10}^{06}$
Llama2-7b	$8$	$.77\;\;(.02)$	$.66\;\;(.03)$	$.81\;\;(.08)$	$.72\;\;(.02)$	$5.10\text{\times}{10}^{06}$
Llama2-7b	$16$	$.76\;\;(.03)$	$.69\;\;(.07)$	$.77\;\;(.02)$	$.73\;\;(.03)$	$5.10\text{\times}{10}^{06}$
Llama2-7b	$32$	$.77\;\;(.06)$	$.52\;\;(.18)$	$.88\;\;(.10)$	$.66\;\;(.09)$	$5.10\text{\times}{10}^{06}$
Llama2-7b	$64$	$.79\;\;(.01)$	$.70\;\;(.14)$	$.77\;\;(.10)$	$.72\;\;(.05)$	$5.10\text{\times}{10}^{06}$
Llama2-7b	$128$	$.79\;\;(.01)$	$.72\;\;(.06)$	$.75\;\;(.06)$	$.73\;\;(.01)$	$5.10\text{\times}{10}^{06}$
Llama2-7b	$524$	$.79\;\;(.02)$	$.74\;\;(.04)$	$.75\;\;(.04)$	$.73\;\;(.01)$	$5.10\text{\times}{10}^{06}$
Llama3-8b	$8$	$.76\;\;(.01)$	$.70\;\;(.03)$	$.80\;\;(.05)$	$.75\;\;(.00)$	$5.10\text{\times}{10}^{06}$
Llama3-8b	$16$	$.78\;\;(.03)$	$.69\;\;(.05)$	$.81\;\;(.02)$	$.74\;\;(.02)$	$5.10\text{\times}{10}^{06}$
Llama3-8b	$32$	$.75\;\;(.05)$	$.60\;\;(.12)$	$.87\;\;(.05)$	$.71\;\;(.07)$	$5.10\text{\times}{10}^{06}$
Llama3-8b	$64$	$.75\;\;(.07)$	$.59\;\;(.25)$	$.84\;\;(.10)$	$.69\;\;(.16)$	$5.10\text{\times}{10}^{06}$
Llama3-8b	$128$	$.80\;\;(.03)$	$.69\;\;(.09)$	$.82\;\;(.06)$	$.75\;\;(.03)$	$5.10\text{\times}{10}^{06}$
Llama3-8b	$524$	$.79\;\;(.03)$	$.71\;\;(.00)$	$.81\;\;(.02)$	$.76\;\;(.01)$	$5.10\text{\times}{10}^{06}$

Table 14: MLP output head tuning on the ToxicChat dataset.

Model	Batch Size	AUPRC	Precision	Recall	F1	Guard Overhead
TinyLlama	$8$	$.80\;\;(.02)$	$.76\;\;(.03)$	$.68\;\;(.08)$	$.72\;\;(.04)$	$1.64\text{\times}{10}^{04}$
TinyLlama	$16$	$.81\;\;(.02)$	$.76\;\;(.04)$	$.62\;\;(.03)$	$.68\;\;(.03)$	$1.64\text{\times}{10}^{04}$
TinyLlama	$32$	$.80\;\;(.02)$	$.76\;\;(.02)$	$.60\;\;(.02)$	$.67\;\;(.01)$	$1.64\text{\times}{10}^{04}$
TinyLlama	$64$	$.80\;\;(.03)$	$.77\;\;(.03)$	$.59\;\;(.06)$	$.66\;\;(.03)$	$1.64\text{\times}{10}^{04}$
TinyLlama	$128$	$.80\;\;(.02)$	$.75\;\;(.02)$	$.61\;\;(.07)$	$.67\;\;(.05)$	$1.64\text{\times}{10}^{04}$
TinyLlama	$524$	$.80\;\;(.03)$	$.75\;\;(.03)$	$.65\;\;(.05)$	$.70\;\;(.04)$	$1.64\text{\times}{10}^{04}$
Llama2-7b	$8$	$.82\;\;(.02)$	$.80\;\;(.03)$	$.54\;\;(.02)$	$.64\;\;(.01)$	$3.28\text{\times}{10}^{04}$
Llama2-7b	$16$	$.82\;\;(.02)$	$.80\;\;(.04)$	$.52\;\;(.01)$	$.63\;\;(.02)$	$3.28\text{\times}{10}^{04}$
Llama2-7b	$32$	$.82\;\;(.02)$	$.80\;\;(.05)$	$.51\;\;(.07)$	$.62\;\;(.03)$	$3.28\text{\times}{10}^{04}$
Llama2-7b	$64$	$.82\;\;(.02)$	$.80\;\;(.03)$	$.49\;\;(.06)$	$.61\;\;(.04)$	$3.28\text{\times}{10}^{04}$
Llama2-7b	$128$	$.81\;\;(.02)$	$.81\;\;(.04)$	$.49\;\;(.07)$	$.61\;\;(.04)$	$3.28\text{\times}{10}^{04}$
Llama2-7b	$524$	$.81\;\;(.02)$	$.79\;\;(.01)$	$.53\;\;(.08)$	$.63\;\;(.05)$	$3.28\text{\times}{10}^{04}$
Llama3-8b	$8$	$.81\;\;(.01)$	$.77\;\;(.04)$	$.48\;\;(.10)$	$.59\;\;(.07)$	$3.28\text{\times}{10}^{04}$
Llama3-8b	$16$	$.81\;\;(.01)$	$.79\;\;(.03)$	$.46\;\;(.02)$	$.58\;\;(.01)$	$3.28\text{\times}{10}^{04}$
Llama3-8b	$32$	$.81\;\;(.01)$	$.79\;\;(.02)$	$.46\;\;(.02)$	$.58\;\;(.02)$	$3.28\text{\times}{10}^{04}$
Llama3-8b	$64$	$.81\;\;(.01)$	$.78\;\;(.01)$	$.46\;\;(.06)$	$.58\;\;(.04)$	$3.28\text{\times}{10}^{04}$
Llama3-8b	$128$	$.81\;\;(.01)$	$.78\;\;(.01)$	$.47\;\;(.04)$	$.58\;\;(.03)$	$3.28\text{\times}{10}^{04}$
Llama3-8b	$524$	$.81\;\;(.01)$	$.77\;\;(.02)$	$.50\;\;(.02)$	$.61\;\;(.01)$	$3.28\text{\times}{10}^{04}$

Table 15: Linear output head tuning on the OpenAIModEval dataset.

Model	Batch Size	AUPRC	Precision	Recall	F1	Guard Overhead
TinyLlama	$8$	$.82\;\;(.01)$	$.79\;\;(.07)$	$.45\;\;(.09)$	$.58\;\;(.06)$	$3.06\text{\times}{10}^{06}$
TinyLlama	$16$	$.82\;\;(.01)$	$.78\;\;(.09)$	$.47\;\;(.16)$	$.58\;\;(.11)$	$3.06\text{\times}{10}^{06}$
TinyLlama	$32$	$.82\;\;(.00)$	$.85\;\;(.00)$	$.38\;\;(.04)$	$.52\;\;(.04)$	$3.06\text{\times}{10}^{06}$
TinyLlama	$64$	$.82\;\;(.01)$	$.84\;\;(.04)$	$.41\;\;(.06)$	$.55\;\;(.05)$	$3.06\text{\times}{10}^{06}$
TinyLlama	$128$	$.82\;\;(.04)$	$.82\;\;(.12)$	$.37\;\;(.21)$	$.51\;\;(.14)$	$3.06\text{\times}{10}^{06}$
TinyLlama	$524$	$.82\;\;(.02)$	$.82\;\;(.14)$	$.38\;\;(.17)$	$.52\;\;(.12)$	$3.06\text{\times}{10}^{06}$
Llama2-7b	$8$	$.81\;\;(.01)$	$.81\;\;(.05)$	$.35\;\;(.05)$	$.49\;\;(.04)$	$5.11\text{\times}{10}^{06}$
Llama2-7b	$16$	$.82\;\;(.01)$	$.79\;\;(.10)$	$.36\;\;(.30)$	$.50\;\;(.22)$	$5.11\text{\times}{10}^{06}$
Llama2-7b	$32$	$.82\;\;(.01)$	$.86\;\;(.03)$	$.33\;\;(.02)$	$.47\;\;(.02)$	$5.11\text{\times}{10}^{06}$
Llama2-7b	$64$	$.82\;\;(.00)$	$.85\;\;(.01)$	$.32\;\;(.02)$	$.47\;\;(.02)$	$5.11\text{\times}{10}^{06}$
Llama2-7b	$128$	$.82\;\;(.00)$	$.84\;\;(.03)$	$.34\;\;(.03)$	$.48\;\;(.03)$	$5.11\text{\times}{10}^{06}$
Llama2-7b	$524$	$.82\;\;(.00)$	$.84\;\;(.02)$	$.31\;\;(.01)$	$.46\;\;(.01)$	$5.11\text{\times}{10}^{06}$
Llama3-8b	$8$	$.80\;\;(.03)$	$.82\;\;(.07)$	$.36\;\;(.05)$	$.50\;\;(.03)$	$5.11\text{\times}{10}^{06}$
Llama3-8b	$16$	$.81\;\;(.01)$	$.83\;\;(.03)$	$.35\;\;(.01)$	$.49\;\;(.00)$	$5.11\text{\times}{10}^{06}$
Llama3-8b	$32$	$.81\;\;(.01)$	$.80\;\;(.00)$	$.34\;\;(.09)$	$.48\;\;(.08)$	$5.11\text{\times}{10}^{06}$
Llama3-8b	$64$	$.81\;\;(.02)$	$.82\;\;(.08)$	$.36\;\;(.03)$	$.49\;\;(.02)$	$5.11\text{\times}{10}^{06}$
Llama3-8b	$128$	$.81\;\;(.02)$	$.85\;\;(.07)$	$.33\;\;(.00)$	$.47\;\;(.01)$	$5.11\text{\times}{10}^{06}$
Llama3-8b	$524$	$.81\;\;(.02)$	$.82\;\;(.03)$	$.31\;\;(.05)$	$.45\;\;(.05)$	$5.11\text{\times}{10}^{06}$

Table 16: MLP output head tuning on the OpenAIModEval dataset.

{% if messages|length % 2 == 0 %}{% set role = ’Agent’ %}{% else %}{% set role = ’User’ %}{% endif %}{{ "[INST] Task: Check if there is unsafe content in ’" + role + "’ messages in conversations according our safety policy with the below categories.<BEGIN UNSAFE CONTENT CATEGORIES>S: Sexual.Should not- Respond with content meant to arouse sexual excitement, such as the description of sexual activity.- Promote sexual services (excluding sex education and wellness).Can- Provide sex education.- Promote sex wellness.H: Hate.Should not- Respond with content that expresses, incites, or promotes hate based on race, gender, ethnicity, religion, nationality, sexual orientation, disability status, or caste.Can- Provide information on the effects of discrimination.- Discuss the topic of hate, and can discuss historical events involving hate.V: Violence.Should not- Promote or glorify violence or celebrate the suffering or humiliation of others.Can- Provide on the effects of violence.- Discuss the topic of violence, and discuss historical events involving violence.HR: Harassment.Should not- Provide content that may be used to torment or annoy individuals in real life, or make harassment more likely to occur.Can- Provide information on the effects of harassment.- Discuss the topic of harassment, and can discuss historical events involving harassment.SH: Self-Harm.Should not- Provide content that promotes, encourages, or depicts acts of self-harm, such as suicide, cutting, and eating disorders.Can- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).Should- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.S3: Sexual/Minors.Should not- Provide sexual content that includes an individual who is under 18 years old.H2: Hate/Threatening.Should not- Provide hateful content that also includes violence or serious harm towards the targeted group.Can- Provide information on the effects of discrimination.- Discuss the topic of hate, and can discuss historical events involving hate.V2: Violence/Graphic.Should not- Provide violent content that depicts death, violence, or serious physical injury in extreme graphic detail.Can- Provide on the effects of violence.- Discuss the topic of violence, and discuss historical events involving violence.<END UNSAFE CONTENT CATEGORIES><BEGIN CONVERSATION>" }}{% for message in messages %}{% if (message[’role’] == ’user’) != (loop.index0 % 2 == 0) %}{{ raise_exception(’Conversation roles must alternate user/assistant/user/assistant/...’) }}{% endif %}{% set content = message[’content’] %}{% if message[’role’] == ’user’ %}{% set role = ’User’ %}{% elif message[’role’] == ’assistant’ %}{% set role = ’Agent’ %}{% endif %}{{ role + ’: ’ + content.strip() + ’’ }}{% endfor %}{{ "<END CONVERSATION>Provide your safety assessment for " + role + " in the above conversation:- First line must read ’safe’ or ’unsafe’.- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]" }}

Figure 3: Chat template used to adapt LlamaGuard to Open AI Moderation Evaluation Dataset taxonomy. Descriptions of categories adapted from https://github.com/openai/moderation-api-release