Implementation and source code of the algorithms from the paper: "Stealth edits for provably fixing or attacking large language models".
---
Before attempting stealth edits, please first install the environment:

```shell
conda env create --name=llm-sa -f environment.yml
conda activate llm-sa
```
---
The model `llama-3-8b` requires you to apply for access. Please follow the instructions here. You will also need to install `huggingface-cli` and provide a user access token.

---
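As an illustration of the access setup (a sketch using the standard Hugging Face CLI, not a command taken from this repository), installing the CLI and logging in with a token looks like this:

```shell
# Install the Hugging Face CLI, then log in with a user access token
# created under your Hugging Face account settings.
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

After logging in, gated models such as `llama-3-8b` can be downloaded once your access request has been approved.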
To start playing with stealth edits and attacks, please refer to the Colab Demo and the Huggingface Demo. You can also run the demo locally:

```shell
python app.py
```
To reproduce the experiments in the paper, please first run the extraction script:

```shell
bash scripts/extract.sh
```

and then run edits and/or attacks and evaluation with the following scripts:

```shell
bash scripts/edit.sh
bash scripts/eval.sh
```
It is recommended to distribute the experiments on multiple nodes.
To cite this work:

```bibtex
@article{sutton2024stealth,
  title = {Stealth Edits for Provably Fixing or Attacking Large Language Models},
  author = {Sutton, Oliver J. and Zhou, Qinghua and Wang, Wei and Higham, Desmond J. and Gorban, Alexander N. and Bastounis, Alexander and Tyukin, Ivan Y.},
  year = {2024},
  month = jun,
  number = {arXiv:2406.12670},
  eprint = {2406.12670},
  primaryclass = {cs},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2406.12670},
  urldate = {2024-06-20},
  archiveprefix = {arXiv},
}
```