This is the official repository for our paper "Stop-RAG: Value-Based Retrieval Control for Iterative RAG".
We recommend using uv with Python 3.11 for dependency management. To install the necessary dependencies, run:
```
pip install uv
uv sync --no-dev
```

To run Stop-RAG training and evaluate the results, follow the steps below. All experiments were run on 4 H100 GPUs.
There are three types of variables to set:
- `DATASET`: the name of the dataset to use (`musique`, `hotpotqa`, or `2wikimultihopqa`)
- `RETRIEVER`: the type of retriever to use (`contriever` or `bm25`)
- `METHOD`: the base pipeline to use (`ours` or `corag`)
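For example, the three variables can be exported once in the shell and reused in every command below. The `musique`/`contriever`/`ours` combination here is just one illustrative choice, and `PROCESSED_DIR` is a hypothetical convenience variable for this sketch, not something the scripts require:

```shell
# One illustrative combination of the three settings.
DATASET=musique
RETRIEVER=contriever
METHOD=ours

# The processed-data directory produced by the dataset step follows this
# layout (PROCESSED_DIR is only a convenience variable for this example).
PROCESSED_DIR="data/processed/${DATASET}/${METHOD}/${RETRIEVER}"
echo "$PROCESSED_DIR"
# → data/processed/musique/ours/contriever
```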
First, download all datasets and build retrieval embeddings from the corpora.
```
./scripts/download.sh {RETRIEVER}
```

Datasets will be stored in `data/raw/{DATASET}`, and the retrieval corpora, embeddings, and indexes will be saved in `data/corpus/{DATASET}`.
Run the chosen pipeline to prepare the training and evaluation data.
```
./scripts/dataset.sh {DATASET} {RETRIEVER} {METHOD}
```

The processed data will be saved in `data/processed/{DATASET}/{METHOD}/{RETRIEVER}`.
Now run the training script. WandB logging is enabled by default, so the environment variables `WANDB_API_KEY` and `WANDB_ENTITY` must be set. To disable WandB logging, set `WANDB_DISABLED=true`.
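For example, the logging environment might be configured like this before launching training; the key and entity values below are placeholders, not real credentials:

```shell
# Placeholder WandB credentials; replace with your own account's values.
export WANDB_API_KEY="your-api-key"
export WANDB_ENTITY="your-entity"

# Or, to run without WandB logging entirely:
export WANDB_DISABLED=true
```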
```
./scripts/train.sh {DATASET} {RETRIEVER} {METHOD}
```

To run evaluation, first compute the scores for all trained checkpoints and find the best checkpoint and threshold.
```
./scripts/stop_rag_find_best.sh {DATASET} {RETRIEVER} {METHOD}
```

This script prints the best checkpoint and threshold. Substitute these values into the following command to run the final evaluation.
```
./scripts/stop_rag_test.sh {DATASET} {RETRIEVER} {METHOD} {CKPT} {THRESHOLD}
```

You can also evaluate the LLM-Stop baseline for comparison.
```
./scripts/llm_stop_test.sh {DATASET} {RETRIEVER} {METHOD}
```

The code for Contriever is adapted from EfficientRAG, and the dataset download scripts are adapted from IRCoT.