This project accompanies our research paper: Cross-document Event Coreference Search: Task, Dataset and Modeling
- Dataset Files
- Pre-trained Models
- Models Training
- Project Installation
- Retriever Training
- Reader Training
Download CoreSearch dataset files from:
https://huggingface.co/datasets/biu-nlp/CoreSearch
OR download the cleaner version CoreSearchV2 dataset files from:
https://huggingface.co/datasets/biu-nlp/CoreSearchV2
The following code snippet downloads the dataset to the Hugging Face cache folder:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="biu-nlp/CoreSearch", revision="main", repo_type="dataset")
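- Save the path returned by snapshot_download: it is the location of the downloaded CoreSearch snapshot folder, used in place of [replace_with_hubs_cache_path] in the commands below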
- CoreSearch/dpr: Files in DPR format used for training the retriever
- CoreSearch/squad: Files in SQuAD format used for training the reader
- CoreSearch/train: The train dataset files used for generating the dpr/squad files
- CoreSearch/clean: The clean dataset files used for evaluation
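As a minimal sketch, the folder layout above can be turned into paths for the commands used later in this README. The helper names `coresearch_dirs` and `download_coresearch` are hypothetical and not part of the project; the sketch assumes the four subfolders sit directly under the snapshot folder returned by `snapshot_download`.

```python
from pathlib import Path


def coresearch_dirs(snapshot_path):
    """Map each CoreSearch subfolder name to its path (hypothetical helper)."""
    root = Path(snapshot_path)
    # Assumption: dpr/, squad/, train/ and clean/ sit directly under the snapshot root.
    return {name: root / name for name in ("dpr", "squad", "train", "clean")}


def download_coresearch(repo_id="biu-nlp/CoreSearch"):
    """Download the dataset snapshot and return its local folder."""
    # Imported lazily so coresearch_dirs works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision="main", repo_type="dataset")
```

Under these assumptions, `coresearch_dirs(download_coresearch())["dpr"]` would give the folder to pass as `--doc_dir` when training the retriever.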
Below are links to models pre-trained on the CoreSearch dataset.
The instructions below explain how to train the retriever and reader models.
Install from source. A Python virtual environment or a Conda environment is recommended.
The project was tested with Python 3.9.
git clone https://github.com/AlonEirew/CoreSearch.git
cd CoreSearch
pip install -r requirements.txt
pip install -e .
# Add the path to the project to the PYTHONPATH environment variable
export PYTHONPATH="${PYTHONPATH}:/<replace_with_path>/CoreSearch"
Training the retriever model requires the CoreSearch data in DPR format (available at the Hugging Face dataset link above).
A full description of the arguments is available at the top of the train_retriever.py script.
python src/train/train_retriever.py \
--doc_dir [replace_with_hubs_cache_path]/dpr/ \
--train_filename Train.json \
--dev_filename Dev.json \
--checkpoint_dir data/checkpoints/ \
--output_model Retriever_SpanBERT \
--add_special_tokens true \
--n_epochs 5 \
--max_seq_len_query 64 \
--max_seq_len_passage 180 \
--batch_size 64 \
--query_model SpanBERT/spanbert-base-cased \
--passage_model SpanBERT/spanbert-base-cased \
--evaluate_every 500
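When trying several hyperparameter settings, it may be convenient to assemble the command above from a dictionary. The following is a minimal sketch, not part of the project: `build_command` and `launch` are hypothetical helpers, and the argument names and values are copied from the command above.

```python
import subprocess
import sys


def build_command(script, args):
    """Turn a dict of argument names and values into a command list."""
    command = [sys.executable, script]
    for name, value in args.items():
        # Every flag in the command above takes a value, so each pair becomes "--name value".
        command += [f"--{name}", str(value)]
    return command


def launch(command):
    """Run the command and raise an error if the script exits with a failure."""
    subprocess.run(command, check=True)


retriever_args = {
    "doc_dir": "[replace_with_hubs_cache_path]/dpr/",
    "train_filename": "Train.json",
    "dev_filename": "Dev.json",
    "checkpoint_dir": "data/checkpoints/",
    "output_model": "Retriever_SpanBERT",
    "add_special_tokens": "true",
    "n_epochs": 5,
    "max_seq_len_query": 64,
    "max_seq_len_passage": 180,
    "batch_size": 64,
    "query_model": "SpanBERT/spanbert-base-cased",
    "passage_model": "SpanBERT/spanbert-base-cased",
    "evaluate_every": 500,
}

command = build_command("src/train/train_retriever.py", retriever_args)
print(" ".join(command))
```

Calling `launch(command)` from the project root would then start the training, once the `doc_dir` placeholder is replaced with a real path.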
Training the reader model requires the CoreSearch data in SQuAD format.
A full description of the arguments is available at the top of the train_reader.py script.
python src/train/train_reader.py \
--doc_dir [replace_with_hubs_cache_path]/squad/ \
--train_filename Train_squad_format_1pos_23neg.json \
--dev_filename Dev_squad_format_1pos_23neg.json \
--checkpoint_dir data/checkpoints/ \
--output_model Reader-RoBERTa_base_Kenton \
--predicting_head kenton \
--num_processes 10 \
--add_special_tokens true \
--n_epochs 5 \
--max_seq_len 256 \
--max_seq_len_query 64 \
--batch_size 24 \
--reader_model roberta-base \
--evaluate_every 750
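Before training, it can help to check how many questions in a SQuAD-format file have an answer. The sketch below assumes the standard SQuAD 2.0 layout (`data`, `paragraphs`, `qas`, `is_impossible`); whether the CoreSearch files follow it exactly is an assumption, and `squad_question_stats` and `load_squad` are hypothetical helpers. The sample input is made up for illustration.

```python
import json


def load_squad(path):
    """Read a SQuAD-style JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def squad_question_stats(squad):
    """Count answerable and unanswerable questions in a SQuAD-style dict."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # SQuAD 2.0 marks questions without an answer with "is_impossible";
                # an empty "answers" list is treated the same way here.
                if qa.get("is_impossible", False) or not qa.get("answers"):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


# Tiny made-up sample, only to show the expected shape of the input.
sample = {
    "data": [
        {
            "title": "example",
            "paragraphs": [
                {
                    "context": "An earthquake struck the city on Monday.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "earthquake struck",
                            "answers": [{"text": "struck", "answer_start": 14}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "election held",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ]
}

print(squad_question_stats(sample))  # (1, 1)
```

For a real file, `squad_question_stats(load_squad(path))` would report the same two counts.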
This script evaluates the retriever model and generates a file index with the top-k results for every question.
Information on the parameters can be found at the top of the evaluate_retriever.py script.
python src/evaluation/evaluate_retriever.py \
--query_filename [replace_with_hubs_cache_path]/train/Dev_queries.json \
--passages_filename [replace_with_hubs_cache_path]/clean/Dev_all_passages.json \
--gold_cluster_filename [replace_with_hubs_cache_path]/clean/Dev_gold_clusters.json \
--query_model data/checkpoints/Retriever_SpanBERT_notoks_5it/0/query_encoder \
--passage_model data/checkpoints/Retriever_SpanBERT_notoks_5it/0/passage_encoder \
--out_index_file file_indexes/Dev_Retriever_spanbert_notoks_5it0_top500.json \
--out_results_file file_indexes/Dev_Retriever_spanbert_notoks_5it0_top500_results.txt \
--num_processes -1 \
--add_special_tokens true \
--max_seq_len_query 64 \
--max_seq_len_passage 180 \
--batch_size 240 \
--top_k 500
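To illustrate what a measure over top-k results looks like, here is a minimal sketch of recall@k for a single query. The function `recall_at_k` and the passage ids are hypothetical, and the metrics actually reported by evaluate_retriever.py may be computed differently.

```python
def recall_at_k(ranked_passage_ids, gold_passage_ids, k):
    """Fraction of gold passages found among the first k retrieved passages."""
    gold = set(gold_passage_ids)
    if not gold:
        # No gold passages for this query: define recall as 0.0 to avoid dividing by zero.
        return 0.0
    top_k = set(ranked_passage_ids[:k])
    return len(gold & top_k) / len(gold)


# Made-up ranking and gold cluster for one query.
ranked = ["p7", "p2", "p9", "p4", "p1"]
gold = ["p2", "p4", "p8"]

print(recall_at_k(ranked, gold, k=3))  # 1 of the 3 gold passages is in the top 3
print(recall_at_k(ranked, gold, k=5))  # 2 of the 3 gold passages are in the top 5
```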
Prerequisite: an index to retrieve from. The index can be generated by running evaluate_retriever.py or, for the BM25 retriever, by creating an Elasticsearch index using elastic_index.py (detailed below).
Run the end-to-end pipeline.
Information on the parameters can be found at the top of the run_e2e_pipeline.py script.
python src/pipeline/run_e2e_pipeline.py \
--predicting_head kenton \
--max_seq_len_query 64 \
--max_seq_len_passage 180 \
--add_special_tokens true \
--batch_size_retriever 24 \
--batch_size_reader 24 \
--top_k_retriever 500 \
--top_k_reader 50 \
--query_model data/checkpoints/Retriever_SpanBERT_5it/1/query_encoder \
--passage_model data/checkpoints/Retriever_SpanBERT_5it/1/passage_encoder \
--reader_model data/checkpoints/Reader-RoBERTa_base_Kenton_special/1 \
--query_filename [replace_with_hubs_cache_path]/train/Dev_queries.json \
--passages_filename [replace_with_hubs_cache_path]/clean/Dev_all_passages.json \
--gold_cluster_filename [replace_with_hubs_cache_path]/clean/Dev_gold_clusters.json \
--index_file file_indexes/Dev_Retriever_spanbert_5it1_top500.json \
--out_results_file results/Dev_End2End_5it1.txt \
--magnitude all
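The two top-k flags above suggest a two-stage flow: the retriever keeps its best candidates and the reader then narrows them down. The toy sketch below shows that general pattern only. It is not the project's implementation, all function names are hypothetical, and the word-overlap scores stand in for the neural retriever and reader.

```python
def two_stage_search(query, passages, retriever_score, reader_score,
                     top_k_retriever, top_k_reader):
    """Keep the best retriever candidates, then re-rank them with the reader score."""
    candidates = sorted(
        passages, key=lambda p: retriever_score(query, p), reverse=True
    )[:top_k_retriever]
    reranked = sorted(
        candidates, key=lambda p: reader_score(query, p), reverse=True
    )
    return reranked[:top_k_reader]


def word_overlap(query, passage):
    """Toy retriever score: number of distinct words shared by query and passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def length_normalized_overlap(query, passage):
    """Toy reader score: word overlap divided by (1 + passage length in words)."""
    return word_overlap(query, passage) / (1 + len(passage.split()))


passages = [
    "the earthquake struck the city",
    "an earthquake hit",
    "stock markets fell",
    "the city council met after the earthquake struck downtown",
]

results = two_stage_search(
    "earthquake struck city",
    passages,
    retriever_score=word_overlap,
    reader_score=length_normalized_overlap,
    top_k_retriever=3,
    top_k_reader=2,
)
print(results)
```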
This script creates a new Elasticsearch index containing documents generated from the input file. If the given index already exists, this process deletes and recreates it.
Prerequisite: pull the Elasticsearch image and run it:
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2
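The container can take a little while to start accepting requests. As a minimal sketch, a check like the following could confirm that the server answers before indexing. The helper `is_elasticsearch_up` is hypothetical, and the URL assumes the port mapping from the `docker run` command above.

```python
import urllib.request


def is_elasticsearch_up(url="http://localhost:9200", timeout=5.0):
    """Return True if an HTTP GET on the given URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError (for example connection refused), HTTPError and socket timeouts.
        return False
```

Calling `is_elasticsearch_up()` in a short retry loop is one way to wait for the server to become available.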
Once Elasticsearch is up and running:
python src/index/elastic_index.py \
--input=data/resources/clean/Train_all_passages.json \
--index=train
Generate DPR Files from CoreSearch files:
python scripts/to_dpr_format.py
Generate SQuAD Files from CoreSearch files:
python scripts/to_squad_format.py
- Run retriever training -- src/train/train_retriever.py
- Run the evaluation and index script on DEV to generate results and a passage index -- src/evaluation/evaluate_retriever.py
- Take the best model and generate the TRAIN and TEST indexes (using the evaluate_retriever.py script above)
- Generate SQuAD files using the script -- scripts/to_squad_format.py
- Run reader training -- src/train/train_reader.py
- Run the full pipeline on the DEV set of retriever/reader -- src/pipeline/run_e2e_pipeline.py
- Run the full pipeline with the best model on the TEST set