This repository contains the code and instructions on how to reproduce the results of the paper ”Memorization of Named Entities in Fine-tuned BERT Models" published at CD-MAKE 2023. The goal of the study was to analyze the extent of named entity memorization in fine-tuned BERT models. We ran experiments on two datasets, using three different fine-tuning methods and two prompting strategies. We split the repository into different subfolders and scripts, based on the different components.
arxiv link: https://arxiv.org/abs/2212.03749
├── Main # Main scripts for running the experiments
| ├── common_crawl # Script for sourcing the CC dataset, location of the downloaded .wet file
│ ├── datasets # Datasets used in the study
│ ├── blogauth # Train-test split of the BlogAuthorship dataset with the preprocessing script
| ├── enron # Train-test split of the Enron dataset with the preprocessing script
| ├── entities # Collected entity lists stored as .json, scripts for collecting the entities
| ├── finetuned_models # Saved fine-tuned models
| ├── samples # Generated text samples as .txt files
└── README # Project structure overview
The subfolders datasets and entities contain additional README files with information on how to prepare the experiments.
The finetune_classification.py script contains our implementation of running the fine-tuning setups.
- Make sure the ./datasets contains the training and test set
- Check for dependencies:
numpy
,pytorch
,transformers
,opacus
,sklearn
A specific fine-tuning setup can be configured by setting the following global variables in the script:
- dataset
- bert_model_type
- TRAIN_BATCH_SIZE
- TEST_BATCH_SIZE
- LEARNING_RATE
- n_epochs
- freeze_pretrained_layers (optional)
- differential_privacy (optional)
- noise_multiplier (optional)
- grad_clipping_threshold (optional)
The text_generation.py script contains our implementation of running the text generation setups.
- Make sure the ./finetuned_models contains the models to be used
- Check for dependencies:
numpy
,pytorch
,transformers
A specific text generation setup can be configured by setting the following global variables in the script:
- fine_tuning
- dataset
- prompt_type
- n_samples
In this executable python script contains the implementation for evaluating our experimental configurations.
- Make sure the ./entities and ./samples folders contains the entity lists and text samples
- Check for dependencies:
numpy
,ahocorapy
A specific evaluation setup can be configured by setting the following global variables in the script:
- entity_file
- text
- k_value