Leveraging Passage Retrieval With Generative Models For Open Domain Question Answering
[Figure (architecture diagram omitted): each question + passage pair is fed independently to the encoder.]
Table 1: Comparison to state-of-the-art. On TriviaQA, we report results on the open domain test set (left), and on the hidden test set (right, competitions.codalab.org/competitions/17208#results).
Retrieval. For the retrieval of support passages, we consider two methods: BM25 (Robertson et al., 1995) and DPR (Karpukhin et al., 2020). In BM25, passages are represented as bags of words, and the ranking function is based on term and inverse document frequencies. We use the implementation from Apache Lucene (lucene.apache.org) with default parameters, and tokenize questions and passages with SpaCy (spacy.io). In DPR, passages and questions are represented as dense vector representations, computed using two BERT networks. The ranking function is the dot product between the query and passage representations. Retrieval is performed using approximate nearest neighbors with the FAISS library (github.com/facebookresearch/faiss).
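To make the dense retrieval step concrete, the sketch below performs inner product search over precomputed passage embeddings with FAISS. It assumes question and passage vectors have already been produced by the two BERT encoders; the flat index type and the function names are illustrative choices, not the exact setup of this work, which relies on approximate search at scale.

```python
import numpy as np
import faiss  # github.com/facebookresearch/faiss

def build_index(passage_embeddings: np.ndarray) -> faiss.Index:
    """Index passage vectors for inner product search (the DPR ranking function)."""
    index = faiss.IndexFlatIP(passage_embeddings.shape[1])
    index.add(passage_embeddings.astype(np.float32))
    return index

def retrieve(index: faiss.Index, question_embedding: np.ndarray, k: int = 100):
    """Return ids and scores of the k passages with the highest dot product to the question."""
    scores, ids = index.search(question_embedding.astype(np.float32).reshape(1, -1), k)
    return ids[0], scores[0]
```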
Reading. Our generative model for open domain QA is based on a sequence-to-sequence network, pretrained on unsupervised data, such as T5 or BART (Raffel et al., 2019; Lewis et al., 2019). The model takes as input the question, as well as the support passages, and generates the answer. More precisely, each retrieved passage and its title are concatenated with the question, and processed independently from the other passages by the encoder. We add the special tokens question:, title: and context: before the question, title and text of each passage. Finally, the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages. The model thus performs evidence fusion in the decoder only, and we refer to it as Fusion-in-Decoder.

By processing passages independently in the encoder, but jointly in the decoder, this method differs from Min et al. (2020) and Lewis et al. (2020). Processing passages independently in the encoder allows the model to scale to a large number of contexts, as it only performs self-attention over one context at a time. This means that the computation time of the model grows linearly with the number of passages, instead of quadratically. On the other hand, processing passages jointly in the decoder allows the model to better aggregate evidence from multiple passages.
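The sketch below illustrates this input format and the fusion step with a pretrained T5 from the HuggingFace Transformers library: each (question, title, passage) triple is encoded independently, the encoder outputs are concatenated into a single sequence, and the decoder generates the answer while cross-attending over all passages at once. The question and passages are hypothetical, and this is a simplified illustration of the idea rather than the released implementation; in particular, passing precomputed encoder_outputs to generate may behave differently across Transformers versions.

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Hypothetical question and retrieved (title, text) passages.
question = "who wrote the novel dracula"
passages = [("Dracula", "Dracula is an 1897 Gothic horror novel by Irish author Bram Stoker ..."),
            ("Bram Stoker", "Abraham Stoker was an Irish author, best known for the novel Dracula ...")]

# Each passage and its title are concatenated with the question, with special prefixes.
texts = [f"question: {question} title: {title} context: {text}" for title, text in passages]
enc = tokenizer(texts, max_length=250, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    # Encode each (question, passage) pair independently: self-attention stays within one passage.
    hidden = model.encoder(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask).last_hidden_state

    # Fusion-in-Decoder: concatenate all passage representations into one long sequence,
    # so the decoder can jointly attend over every retrieved passage.
    fused = hidden.reshape(1, -1, hidden.size(-1))
    fused_mask = enc.attention_mask.reshape(1, -1)

    answer_ids = model.generate(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                                attention_mask=fused_mask, max_length=20)

print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```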
4 Experiments

In this section, we report empirical evaluations of Fusion-in-Decoder for open domain QA.

Datasets. We consider the following datasets, and use the same setting as Lee et al. (2019):

• NaturalQuestions (Kwiatkowski et al., 2019) contains questions corresponding to Google search queries. The open-domain version of this dataset is obtained by discarding answers with more than 5 tokens.
• TriviaQA (Joshi et al., 2017) contains questions gathered from trivia and quiz-league websites. The unfiltered version of TriviaQA is used for open-domain question answering.

• SQuAD v1.1 (Rajpurkar et al., 2016) is a reading comprehension dataset. Given a paragraph extracted from Wikipedia, annotators were asked to write questions, for which the answer is a span from the corresponding paragraph.

Following Lee et al. (2019) we use the validation set as the test set, and keep 10% of the training set for validation. We use the Wikipedia dumps from Dec. 20, 2018 for NQ and TriviaQA and from Dec. 21, 2016 for SQuAD. We apply the same preprocessing as Chen et al. (2017); Karpukhin et al. (2020), leading to non-overlapping passages of 100 words.

Evaluation. Predicted answers are evaluated with the standard exact match metric (EM), as introduced by Rajpurkar et al. (2016). A generated answer is considered correct if it matches any answer in the list of acceptable answers after normalization. This normalization step consists of lowercasing and removing articles, punctuation and duplicated whitespace.
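A minimal sketch of this metric, following the normalization convention introduced with SQuAD; the function names are ours, and the official evaluation scripts may differ in minor details.

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and collapse duplicated whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A generated answer is correct if it matches any acceptable answer after normalization."""
    return any(normalize_answer(prediction) == normalize_answer(gold) for gold in gold_answers)
```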
Technical details. We initialize our models with the pretrained T5 models (Raffel et al., 2019), available in the HuggingFace Transformers library (github.com/huggingface/transformers). We consider two model sizes, base and large, containing respectively 220M and 770M parameters. We fine-tune the models on each dataset independently, using Adam (Kingma and Ba, 2014) with a constant learning rate of 10^-4 and a dropout rate of 10%. We train the model for 10k gradient steps, with a batch size of 64, using 64 Tesla V100 32GB GPUs. We evaluate models every 500 steps and select the best one on the validation set based on the Exact Match score. During training on NaturalQuestions and SQuAD, we sample the target among the list of answers, while for TriviaQA, we use the unique human-generated answer. For TriviaQA, answers in uppercase are normalized by converting all letters to lowercase except the first letter of each word, using the title Python string method. For both training and testing, we retrieve 100 passages (unless stated otherwise), and truncate them to 250 word pieces. Following the results of Karpukhin et al. (2020), passages are retrieved with DPR for NQ and TriviaQA, and with BM25 for SQuAD. We generate answers using greedy decoding.
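As an illustration of the target selection and TriviaQA answer normalization described above, a small sketch; the dataset field layout and the function name are hypothetical, since the paper does not prescribe this exact interface.

```python
import random

def pick_target(answers: list[str], dataset: str) -> str:
    """Select the training target for one example."""
    if dataset == "triviaqa":
        # TriviaQA provides a unique human-generated answer; fully uppercase answers
        # are normalized with str.title() (only the first letter of each word kept uppercase).
        answer = answers[0]
        return answer.title() if answer.isupper() else answer
    # NaturalQuestions and SQuAD: sample the target among the list of acceptable answers.
    return random.choice(answers)
```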
Comparison to state-of-the-art. In Table 1, we compare the results obtained by Fusion-in-Decoder with existing approaches for open domain question answering. We observe that while conceptually simple, this method outperforms existing work on the NaturalQuestions and TriviaQA benchmarks. In particular, generative models seem to perform well when evidence from multiple passages needs to be aggregated, compared to extractive approaches. Our method also performs better than other generative models, showing that scaling to a large number of passages and processing them jointly leads to an improvement in accuracy. Second, we observe that using additional knowledge in generative models through retrieval leads to important performance gains. On NaturalQuestions, the closed book T5 model obtains 36.6% accuracy with 11B parameters, while our approach obtains 44.1% with 770M parameters plus Wikipedia with BM25 retrieval. Both methods use roughly the same amount of memory to store information, indicating that text based explicit memories are competitive for knowledge retrieval tasks.

Scaling with number of passages. In Figure 3, we report the performance with respect to the number of retrieved passages.

[Figure 3 plot omitted: three panels (NaturalQuestions, TriviaQA, SQuAD); x-axis: number of passages (5 to 100); y-axis: Exact Match.]
Figure 3: Performance of Fusion-in-Decoder (base) on validation sets as a function of the number of retrieved passages.
Training passages | NaturalQuestions (w/o finetuning) | NaturalQuestions (w/ finetuning) | TriviaQA (w/o finetuning) | TriviaQA (w/ finetuning)
5   | 37.8 | 45.0 | 58.1 | 64.2
10  | 42.3 | 45.3 | 61.1 | 63.6
25  | 45.3 | 46.0 | 63.2 | 64.2
50  | 45.7 | 46.0 | 64.2 | 64.3
100 | 46.5 | -    | 64.7 | -

Table 2: Performance depending on the number of passages used during training. Exact Match scores are reported on the dev sets.
Impact of the number of training passages. In the previous section, the model was trained and evaluated with the same number of passages. To reduce the training computational budget, a simple solution consists of training the model with fewer passages. In Table 2, we report the performance obtained by training with different numbers of passages, while testing with 100 passages. We observe that reducing the number of training passages leads to a decrease in accuracy. Further, we propose to finetune the previous models using 100 passages for 1000 steps. This reduces the accuracy gap, while using significantly fewer computational resources: we can reach 46.0 EM on NaturalQuestions, using 147 GPU hours, compared to 425 GPU hours when training on 100 passages.

5 Conclusion

In this paper, we study a simple approach to open domain question answering, which relies on retrieving support passages before processing them with a generative model. We show that while conceptually simple, this approach is competitive with existing methods, and that it scales well with the number of retrieved passages. In future work, we plan to make this model more efficient, in particular when scaling to a large number of support passages. We also plan to integrate retrieval into our model, and to learn the whole system end-to-end.

References

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proc. ACL.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proc. ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL.

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proc. ACL.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2019. How can we know what language models know? arXiv preprint arXiv:1911.12543.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. ACL.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. TACL.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. TACL.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering. In Proc. EMNLP.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proc. ACL.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.

Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. A discrete hard EM approach for weakly supervised question answering. In Proc. EMNLP-IJCNLP.

Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. NAACL.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proc. EMNLP-IJCNLP.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. EMNLP.

Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. TACL.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019. oLMpics: on what language model pre-training captures. arXiv preprint arXiv:1912.13283.

Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3: Reinforced ranker-reader for open-domain question answering. In Proc. AAAI.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2018b. Evidence aggregation for answer re-ranking in open-domain question answering. In Proc. ICLR.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proc. EMNLP-IJCNLP.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proc. NAACL (Demonstrations).