This is the main repository for the reproduction code of PCBMs. The data preprocessing pipeline and partitioned datasets for Metashift, Metashift Survey, COCO-Stuff, and SIIM-ISIC can be found on HuggingFace.
The example notebook has everything you need to start testing the code; give it a try in Google Colab.
- Go to Google Colab.
- In the `Open Notebook` tab, select GitHub.
- Select the repository.
- Select the notebook `notebooks/main.ipynb`.
If you choose to test on your local computer, then follow the instructions below.
You have two options to install the dependencies: poetry (recommended) or conda.
Check the documentation on how to set up and install Poetry.
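As a minimal sketch (one common installation route, not something this repo prescribes), Poetry can be installed with pipx:

```bash
pipx install poetry
```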
- Create a virtual environment.
  ```bash
  python -m venv .venv
  ```
- Activate the Python environment.
  ```bash
  # For UNIX-based systems only
  source .venv/bin/activate
  # For Windows cmd only
  .venv\Scripts\activate.bat
  # For Windows PowerShell
  .venv\Scripts\Activate.ps1
  ```
- Install the dependencies in the virtual environment.
  ```bash
  poetry install
  ```

And you're done!
For adding/removing packages and other functionality, check the docs.
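For instance (standard Poetry commands; `numpy` here is just an example package, not a project dependency):

```bash
poetry add numpy      # add a dependency to pyproject.toml and install it
poetry remove numpy   # remove it again
```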
- Create conda environment and install packages
  ```bash
  conda env create -f environment.yaml
  ```
- Activate environment
  ```bash
  conda activate PCBM
  ```
All datasets will reside in `artifacts/data`. When committing changes to the repository, please make sure this directory does not get pushed; otherwise it will clog the repo.
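One simple way to guard against this (assuming the path above, and that it is not already ignored by the repo) is to add the directory to your `.gitignore`:

```bash
# Ignore the local dataset directory
echo "artifacts/data/" >> .gitignore
```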
If desired, you can run the data downloader scripts from the main notebook. Otherwise, if you prefer running the files via the terminal, follow the instructions below:
- Activate your environment (conda or venv).
- Run the download script below:
  ```bash
  ./scripts/download_broden
  ```
- You will find the downloaded data in `broden_concepts`.
- Activate your environment (conda or venv).
- Run the download script below:
  ```bash
  ./scripts/download_cocostuff
  ```
- You will find the downloaded data in `COCO_STUFF`.
- Run the download script below:
  ```bash
  ./scripts/download_cub
  ```
- You will find the downloaded data in `CUB_200_2011` and `class_attr_data_10`.
Please refer to the main file for instructions regarding the Derm7pt dataset.
- Please refer to the main file for instructions regarding the HAM10000 dataset due to the necessity of having a Kaggle API token (if you already have one set up in your `.kaggle` folder, you can ignore this step; otherwise, see the token setup sketch after this list).
- Run the download script below:
  ```bash
  ./scripts/download_ham
  ```
  (Note: If on Google Colab, you can run the cell in the main file after following the instructions there.)
- You will find the downloaded data in `HAM10K`.
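If you still need to set up a Kaggle API token, a minimal sketch (assuming you have already downloaded `kaggle.json` from your Kaggle account page; the source path below is just an example) looks like this:

```bash
# Put the Kaggle API token where the Kaggle client expects it
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/kaggle.json   # adjust the source path as needed
chmod 600 ~/.kaggle/kaggle.json                    # keep the token private
```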
- Activate your environment (conda or venv).
- Run the download script below:
  ```bash
  ./scripts/download_siim
  ```
- You will find the downloaded data in `SIIM_ISIC`.
- Run the download script below:
  ```bash
  ./scripts/download_metashift
  ```
- You will find the downloaded data in `metashift`.
The downloads below are not part of the original experiments and were done as an extension to the original paper. They amount to around 33 GB in total (with around 26 GB stemming from the AudioSet data).
- Run the download script below:
  ```bash
  ./scripts/download_esc
  ```
- You will find the downloaded data in `ESC_50`.
- Please refer to the main file for instructions regarding the UrbanSound8K dataset due to the necessity of having a Kaggle API token (if you already have one set up in your `.kaggle` folder, you can ignore this step).
- Run the download script below:
  ```bash
  ./scripts/download_us8k
  ```
  (Note: If on Google Colab, you can run the cell in the main file after following the instructions there.)
- You will find the downloaded data in `US8K`.
- Please refer to the main file for instructions regarding the AudioSet dataset due to the necessity of having a Kaggle API token (if you already have one set up in your `.kaggle` folder, you can ignore this step).
- Run the download script below:
  ```bash
  ./scripts/download_audioset
  ```
- You will find the downloaded data in `audioset`.

Important Note: This will only download the `.csv` files and the audio files for validation. For the other audio files, you would also need to have `ffmpeg` installed on your device in order to run the YouTube downloader script. This is how you can do so:

- Go here and download the file corresponding to your device (the link is present in the README).
- Follow the instructions listed here. You can ignore/adjust all the steps related to downloading the file.
- Restart your device.
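Alternatively, on most UNIX-like systems `ffmpeg` is available from the system package manager (a generic sketch, not specific to this repo):

```bash
# Debian/Ubuntu
sudo apt-get install ffmpeg
# macOS with Homebrew
brew install ffmpeg
```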
The script below will download the dependencies for AudioCLIP, as the original repository has already been integrated here in its entirety. Run it (either in the notebook or the terminal) if you would like to run the AudioCLIP experiments.
- Run the download script below:
  ```bash
  ./scripts/download_audioclip
  ```
- Instead of in `artifacts/data`, you will find the downloaded data in `Anonymous/models/AudioCLIP/assets`.
Please see `models/model_zoo.py` for the backbones used. Some of the original models rely on external dependencies (e.g. `pytorchcv` for the CUB backbone, the OpenAI repo for the CLIP backbone) or will be downloaded (e.g. the HAM10000 model from the DDI repo). Any additional models can be added by editing `models/model_zoo.py`.
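If these external dependencies are not already present in your environment, they can typically be installed with pip (a sketch using the standard public packages; this repo may handle or pin them differently):

```bash
pip install pytorchcv                                 # CUB backbone dependency
pip install git+https://github.com/openai/CLIP.git    # OpenAI CLIP backbone
```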
To replicate the original results, we have prepared a function where all the datasets can be evaluated using the parameters specified by the authors. This can be found here.
Please note: For some scripts (outlined in the above notebooks), you may need to add this snippet before the `python` command itself: `PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1`. This is due to how the repository is set up. Alternatively, you could install the AudioCLIP dependencies (the instructions for which can be found above) if you don't want to include this prefix for some scripts.
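For example, reusing the CUB training command shown further below purely as an illustration (whether a given script actually needs the prefix is outlined in the notebooks), the snippet goes right before the Python call:

```bash
# Environment variables are set only for this single invocation
PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python3 train_pcbm.py \
  --concept-bank="${OUTPUT_DIR}/cub_resnet18_cub_0.1_100.pkl" \
  --dataset="cub" --backbone-name="resnet18_cub" --out-dir=$OUTPUT_DIR --lam=2e-4
```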
In the original paper, two different ways to learn concept activation vectors were proposed to build the concept banks that are used here.
To learn concepts in this way, each concept dataset needs to have a set of positive and negative images per concept. For this, the original authors use the CAV methodology (Kim et al., 2018).
Concept Dataset Implementations: The code provided to extract concept data loaders in `data/concept_loaders.py` is the same as in the original implementation. There you will find the loaders for the `BRODEN`, `CUB`, and `derm7pt` datasets. If you'd like to use custom concept datasets, you can implement your own loader and place it there.
Obtaining concept vectors: Once you have the concept data loaders, you can use the `learn_concepts_dataset.py` script (also from the original implementation) to learn the concept vectors. As examples, you can run the following commands (once you obtain the corresponding datasets):
```bash
OUTPUT_DIR=/path/where/you/save/conceptbanks/

# Learning CUB Concepts
python3 learn_concepts_dataset.py --dataset-name="cub" --backbone-name="resnet18_cub" --C 0.001 0.01 0.1 1.0 10.0 --n-samples=100 --out-dir=$OUTPUT_DIR

# Learning Derm7pt Concepts
python3 learn_concepts_dataset.py --dataset-name="derm7pt" --backbone-name="ham10000_inception" --C 0.001 0.01 0.1 1.0 10.0 --n-samples=50 --out-dir=$OUTPUT_DIR

# Learning BRODEN Concepts
python3 learn_concepts_dataset.py --dataset-name="broden" --backbone-name="clip:RN50" --C 0.001 0.01 0.1 1.0 10.0 --n-samples=50 --out-dir=$OUTPUT_DIR
```
Alternatively, you can run example experiments in the main file.
Limitations:
- This approach relies on the existence of a concept dataset. These may be hard to get, depending on the application.
- Learning concepts via CAVs can inherit potential biases present in the concept datasets. One should be careful about how the concept dataset is constructed and what it means to learn that concept.
What if we don't have a concept dataset? We could leverage multimodal models, such as CLIP! In other words, we can simply prompt the text encoder with the concept name, and obtain the concept vector in the shared embedding space.
The code to do this can be found in `learn_concepts_multimodal.py`. You can run the following script to learn the concept vectors:
```bash
python3 learn_concepts_multimodal.py --backbone-name="clip:RN50" --classes=cifar10 --out-dir=$OUTPUT_DIR --recurse=1
```
Currently, CIFAR10 and CIFAR100 are supported for this approach. You can very easily add your own set of class names in the script and obtain a concept bank for your own purposes.
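For example, assuming the `--classes` flag accepts `cifar100` analogously to `cifar10` above, the CIFAR100 concept bank would be obtained with:

```bash
python3 learn_concepts_multimodal.py --backbone-name="clip:RN50" --classes=cifar100 --out-dir=$OUTPUT_DIR --recurse=1
```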
Limitations: This approach is limited to the multimodal models that have a shared embedding space. Existing multimodal models that are not specialized may not do very well with domain-specific concepts (e.g. healthcare concepts).
Once you have a concept bank and a backbone, you are ready to train your PCBM! We provide the code to train PCBMs in `train_pcbm.py`. You can run the following script to train a PCBM on CUB:
```bash
python3 train_pcbm.py --concept-bank="${OUTPUT_DIR}/cub_resnet18_cub_0.1_100.pkl" --dataset="cub" --backbone-name="resnet18_cub" --out-dir=$OUTPUT_DIR --lam=2e-4
```
Please see the `train_pcbm.py` file for the arguments / where the models are saved.
Limitations: There is a tradeoff between the regularization and how sparse/"interpretable" (yes, it is hard to define what exactly this means) the linear module is. This hyperparameter selection can be a bit tedious. You can play around with the `lam` and `alpha` parameters to observe the concept coefficients and understand what seems like a good tradeoff. The good thing is that we can simply monitor concept weights, and since concepts are more meaningful, we may have a better say here.
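If you want to explore this tradeoff more systematically, one possible sketch (reusing the CUB setup above; the particular `lam` values here are arbitrary) is a small sweep:

```bash
# Train one PCBM per regularization strength, then compare the concept coefficients
for lam in 2e-5 2e-4 2e-3; do
  python3 train_pcbm.py --concept-bank="${OUTPUT_DIR}/cub_resnet18_cub_0.1_100.pkl" \
    --dataset="cub" --backbone-name="resnet18_cub" --out-dir=$OUTPUT_DIR --lam=$lam
done
```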
Once you have the PCBM, you can train the PCBM-h model by running the following script:
```bash
pcbm_path="/path/to/pcbm_cub__resnet18_cub__cub_resnet18_cub_0__lam:0.0002__alpha:0.99__seed:42.ckpt"
python3 train_pcbm_h.py --concept-bank="${OUTPUT_DIR}/cub_resnet18_cub_0.1_100.pkl" --pcbm-path=$pcbm_path --out-dir=$OUTPUT_DIR --dataset="cub"
```
With our current implementation, we can evaluate the performance of model editing using one script (also present in `main.ipynb`), which is the following:
```bash
%%capture
# Suppress output with capture magic
PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python -m experiments.model_editing.make_table \
    --seed 0 \
    --device cpu \
    --base_config configs/model_editing/classifier/base_clip_resnet50.yaml
```
The above will perform the model editing experiments for the 6 scenarios and one seed.
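To cover additional seeds, one possible sketch (assuming `--seed` takes a single value per run, as above) is a simple shell loop:

```bash
# Repeat the model editing experiments for a few seeds; adjust the list as needed
for seed in 0 1 2; do
  PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python -m experiments.model_editing.make_table \
    --seed $seed \
    --device cpu \
    --base_config configs/model_editing/classifier/base_clip_resnet50.yaml
done
```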
For replicating the results of the user study, please see this notebook. The instructions for these experiments are not included here because they are much simpler to follow there, and because the dataset cannot be published due to GDPR.
If you find this code useful, please consider citing our paper (not out yet unfortunately):
```bibtex
@article{midavaine2024repcbm,
  title={[Re] On the Reproducibility of Post-Hoc Concept Bottleneck Models},
  author={Nesta Midavaine and Gregory Hok Tjoan Go and Diego Canez and Ioana Simion and Satchit Chatterji},
  journal={Transactions on Machine Learning Research},
  year={2024},
  volume={2024},
  url={https://openreview.net/pdf?id=8UfhCZjOV7}
}
```
In addition, we also recommend citing the original authors via the citation below:
```bibtex
@inproceedings{
  yuksekgonul2023posthoc,
  title={Post-hoc Concept Bottleneck Models},
  author={Mert Yuksekgonul and Maggie Wang and James Zou},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023},
  url={https://openreview.net/forum?id=nA5AZ8CEyow}
}
```