Implementation and source code of the algorithms from the paper: "Stealth edits for provably fixing or attacking large language models".
---
Before attempting stealth edits, please first install the environment:

```shell
conda env create --name=llm-sa -f environment.yml
conda activate llm-sa
```
---
The model `llama-3-8b` requires you to apply for access. Please follow the instructions here. You will also need to install `huggingface-cli` and provide a user access token.

---
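As an illustration of the access setup (a sketch using the standard Hugging Face CLI, not a command taken from this repository), installing the CLI and logging in with a token looks like this:

```shell
# Install the Hugging Face CLI, then log in with a user access token
# created under your Hugging Face account settings.
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

After logging in, gated models such as `llama-3-8b` can be downloaded once your access request has been approved.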
To start playing with stealth edits and attacks, please refer to the Colab Demo and the Huggingface Demo. You can also run the demo locally:

```shell
python app.py
```
To reproduce the experiments in the paper, please first run the extraction script:

```shell
bash scripts/extract.sh
```

and then run edits and/or attacks and evaluation with the following scripts:

```shell
bash scripts/edit.sh
bash scripts/eval.sh
```
It is recommended to distribute the experiments on multiple nodes.
To cite this work:

```bibtex
@article{sutton2024stealth,
  title = {Stealth Edits for Provably Fixing or Attacking Large Language Models},
  author = {Sutton, Oliver J. and Zhou, Qinghua and Wang, Wei and Higham, Desmond J. and Gorban, Alexander N. and Bastounis, Alexander and Tyukin, Ivan Y.},
  year = {2024},
  month = jun,
  number = {arXiv:2406.12670},
  eprint = {2406.12670},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2406.12670},
  urldate = {2024-06-20},
  archiveprefix = {arXiv},
}
```