
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Gautier Izacard 1,2,3    Edouard Grave 1

1 Facebook AI Research, Paris
2 ENS, PSL University, Paris
3 Inria, Paris

gizacard|egrave@fb.com

arXiv:2007.01282v2 [cs.CL] 3 Feb 2021

Abstract

Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires the use of models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.

[Figure 1 shows an example: the question "Where was Alan Turing born?" and a retrieved passage ("Alan Turing was a British computer scientist. Born in Maida Vale, London…") are fed to a generative seq2seq model, which outputs "Maida Vale, London".]

Figure 1: A simple approach to open domain question answering. First, it retrieves support text passages from an external source of knowledge such as Wikipedia. Then, a generative encoder-decoder model produces the answer, conditioned on the question and the retrieved passages. This approach scales well with the number of retrieved passages, as the performance keeps improving when retrieving up to one hundred passages.
1 Introduction

Recently, several works have shown that factual information can be extracted from large scale language models trained on vast quantities of data (Radford et al., 2019; Petroni et al., 2019; Jiang et al., 2019; Talmor et al., 2019). Building on that observation and the advances in pretraining of natural language processing models, Roberts et al. (2020) introduced a generative model for open domain question answering. Without relying on external knowledge, this method obtained competitive results on several benchmarks. However, it requires models containing billions of parameters, since all the information needs to be stored in the weights. This makes models expensive to query and train. In this paper, we investigate how much this method could benefit from having access to an external source of knowledge, such as Wikipedia.

Retrieval based approaches were previously considered in the context of open domain question answering with extractive models (Chen et al., 2017). In that case, systems start by retrieving support documents, before extracting the answer from these documents. Different retrieval techniques have been considered, either using sparse representations based on TF/IDF or using dense embeddings (Guu et al., 2020; Karpukhin et al., 2020). The models which extract the answers are often based on contextualized word representations such as ELMo or BERT (Peters et al., 2018; Devlin et al., 2019), and predict a span as answer. Aggregating and combining evidence from multiple passages is not straightforward when using extractive models, and multiple techniques have been proposed to address this limitation (Clark and Gardner, 2018; Min et al., 2019a).

In this paper, we explore a simple approach having the best of both worlds, by building on the exciting developments in generative modeling and retrieval for open domain question answering. This method proceeds in two steps, by first retrieving supporting passages using either sparse or dense representations. Then, a sequence-to-sequence model generates the answer, taking as input the retrieved passages in addition to the question. While conceptually simple, this method sets new state-of-the-art results on the TriviaQA and NaturalQuestions benchmarks. In particular, we show that the performance of our method significantly improves when the number of retrieved passages increases. We believe that this is evidence that generative models are good at combining evidence from multiple passages, compared to extractive ones.
[Figure 2 diagram: each input block "Question + Passage 1" … "Question + Passage N" is processed by the encoder independently; the encoder outputs are concatenated and passed to the decoder, which generates the answer.]

Figure 2: Architecture of the Fusion-in-Decoder method.

2 Related work

Open domain question answering is the task of answering general domain questions, in which the evidence is not given as input to the system. While being a longstanding problem in natural language processing (Voorhees et al., 1999), this task has recently regained interest following the work by Chen et al. (2017). In that version of the problem, strong supervision is available to the learning system, in the form of spans corresponding to answers. Chen et al. (2017) proposed to solve the problem by first retrieving support documents from Wikipedia, before extracting the answer from the retrieved document. Different methods were proposed to tackle the setting where no gold spans are given to the system, but only the correct answer. Clark and Gardner (2018) proposed to use a global normalization over all the spans corresponding to the answer, which was later applied to BERT based models (Wang et al., 2019). Min et al. (2019a) introduced a method based on hard expectation-maximization to tackle noisy supervision from this setting. Wang et al. (2018b) described a technique to aggregate answers from different paragraphs, using confidence and coverage scores.

Passage retrieval is an important step in open domain question answering, and is an active area of research to improve QA systems. Initially, sparse representations based on TF/IDF were used to retrieve support documents (Chen et al., 2017). Lee et al. (2018) introduced a supervised learning method to rerank paragraphs based on BiLSTM, while Wang et al. (2018a) trained a ranking system with reinforcement learning. A second approach to improve the retrieval step of QA systems is to use additional information such as the Wikipedia or Wikidata graphs (Min et al., 2019b; Asai et al., 2020). Recently, multiple works have shown that retrieval systems entirely based on dense representations and approximate nearest neighbors are competitive with traditional approaches. Such models can be trained using weak supervision in the form of question-answer pairs (Karpukhin et al., 2020), or pretrained using a cloze task and finetuned end-to-end (Guu et al., 2020; Lee et al., 2019).

Generative question answering was mostly considered in previous work for datasets that require generating answers, such as NarrativeQA (Kočiskỳ et al., 2018), CoQA (Reddy et al., 2019) or ELI5 (Fan et al., 2019). These datasets were generated in a way that answers do not correspond to spans in support documents, thus requiring abstractive models. Raffel et al. (2019) showed that generative models are competitive for reading comprehension tasks such as SQuAD (Rajpurkar et al., 2016), where answers are spans. Roberts et al. (2020) proposed to use large pretrained generative models, without using additional knowledge, for open domain question answering. Closest to our work, Min et al. (2020) and Lewis et al. (2020) introduced retrieval augmented generative models for open domain question answering. Our approach differs from these works in how the generative model processes the retrieved passages, which allows it to scale to large numbers of documents and to benefit from this large amount of evidence.

3 Method

In this section, we describe our approach to open domain question answering. It proceeds in two steps, first retrieving support passages before processing them with a sequence-to-sequence model.
Model                                       NQ      TriviaQA        SQuAD Open
                                            EM      EM      EM      EM      F1
DrQA (Chen et al., 2017)                    -       -       -       29.8    -
Multi-Passage BERT (Wang et al., 2019)      -       -       -       53.0    60.9
Path Retriever (Asai et al., 2020)          31.7    -       -       56.5    63.8
Graph Retriever (Min et al., 2019b)         34.7    55.8    -       -       -
Hard EM (Min et al., 2019a)                 28.8    50.9    -       -       -
ORQA (Lee et al., 2019)                     31.3    45.1    -       20.2    -
REALM (Guu et al., 2020)                    40.4    -       -       -       -
DPR (Karpukhin et al., 2020)                41.5    57.9    -       36.7    -
SpanSeqGen (Min et al., 2020)               42.5    -       -       -       -
RAG (Lewis et al., 2020)                    44.5    56.1    68.0    -       -
T5 (Roberts et al., 2020)                   36.6    -       60.5    -       -
GPT-3 few shot (Brown et al., 2020)         29.9    -       71.2    -       -
Fusion-in-Decoder (base)                    48.2    65.0    77.1    53.4    60.6
Fusion-in-Decoder (large)                   51.4    67.6    80.1    56.7    63.2

Table 1: Comparison to state-of-the-art. On TriviaQA, we report results on the open domain test set (left) and on the hidden test set (right, competitions.codalab.org/competitions/17208#results).

Retrieval. For the retrieval of support passages, we consider two methods: BM25 (Robertson et al., 1995) and DPR (Karpukhin et al., 2020). In BM25, passages are represented as bags of words, and the ranking function is based on term and inverse document frequencies. We use the implementation from Apache Lucene (lucene.apache.org) with default parameters, and tokenize questions and passages with SpaCy (spacy.io). In DPR, passages and questions are represented as dense vector representations, computed using two BERT networks. The ranking function is the dot product between the query and passage representations. Retrieval is performed using approximate nearest neighbors with the FAISS library (github.com/facebookresearch/faiss).
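To make the dense retrieval step concrete, the sketch below indexes precomputed passage embeddings with FAISS and ranks passages by their dot product with the question embedding. It is a minimal illustration under simplifying assumptions, not the paper's pipeline: the random arrays stand in for the outputs of the two BERT encoders, and an exact inner-product index is used where a large-scale system would typically rely on an approximate one.

```python
# Minimal sketch of DPR-style retrieval with FAISS; the embeddings here are
# random placeholders for the outputs of the BERT question/passage encoders.
import faiss
import numpy as np

dim = 768                                                      # BERT hidden size
passage_embeddings = np.random.rand(1000, dim).astype("float32")
question_embedding = np.random.rand(1, dim).astype("float32")

index = faiss.IndexFlatIP(dim)          # exact inner-product (dot product) index
index.add(passage_embeddings)           # a real system would use an approximate index

scores, passage_ids = index.search(question_embedding, 100)    # retrieve top-100 passages
print(passage_ids[0][:5], scores[0][:5])
```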
Reading. Our generative model for open domain QA is based on a sequence-to-sequence network, pretrained on unsupervised data, such as T5 or BART (Raffel et al., 2019; Lewis et al., 2019). The model takes as input the question, as well as the support passages, and generates the answer. More precisely, each retrieved passage and its title are concatenated with the question, and processed independently from other passages by the encoder. We add special tokens question:, title: and context: before the question, title and text of each passage. Finally, the decoder performs attention over the concatenation of the resulting representations of all the retrieved passages. The model thus performs evidence fusion in the decoder only, and we refer to it as Fusion-in-Decoder.

By processing passages independently in the encoder, but jointly in the decoder, this method differs from Min et al. (2020) and Lewis et al. (2020). Processing passages independently in the encoder allows scaling to a large number of contexts, as the encoder only performs self attention over one context at a time. This means that the computation time of the model grows linearly with the number of passages, instead of quadratically. On the other hand, processing passages jointly in the decoder allows the model to better aggregate evidence from multiple passages.
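To make the input format and the decoder-side fusion concrete, here is a minimal sketch built on a T5 model from the HuggingFace Transformers library. It is an illustrative reconstruction under simplifying assumptions (two hard-coded passages, padding and masking details reduced to the basics), not the authors' released implementation: each question/title/context block is encoded independently, the encoder outputs are concatenated, and the decoder attends over the concatenation while greedily generating the answer.

```python
# Illustrative Fusion-in-Decoder sketch with a T5 backbone (not the official code).
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.eval()

question = "Where was Alan Turing born?"
passages = [
    ("Alan Turing", "Alan Turing was a British computer scientist. Born in Maida Vale, London ..."),
    ("Maida Vale", "Maida Vale is a residential district in West London ..."),
]

# One input block per passage: "question: ... title: ... context: ..."
blocks = [f"question: {question} title: {title} context: {text}" for title, text in passages]
enc = tokenizer(blocks, max_length=250, truncation=True,
                padding="max_length", return_tensors="pt")

with torch.no_grad():
    # Each block is encoded independently: self-attention never crosses passages,
    # so the encoder cost grows linearly with the number of retrieved passages.
    hidden = model.encoder(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask).last_hidden_state

    # Fusion in the decoder: concatenate all passage representations into one long
    # sequence that the decoder cross-attends to while generating the answer.
    fused = hidden.view(1, -1, hidden.size(-1))        # (1, n_passages * 250, d_model)
    fused_mask = enc.attention_mask.view(1, -1)

    # Simplified greedy decoding over the fused representation.
    decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(20):
        logits = model(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                       attention_mask=fused_mask,
                       decoder_input_ids=decoder_ids).logits
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
        if next_id.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```

Because the encoder never attends across passages, its cost is linear in the number of passages, while the decoder still sees all of them at once, which is the trade-off described above.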
4 Experiments

In this section, we report empirical evaluations of Fusion-in-Decoder for open domain QA.

Datasets. We consider the following datasets, and use the same setting as Lee et al. (2019):

• NaturalQuestions (Kwiatkowski et al., 2019) contains questions corresponding to Google search queries. The open-domain version of this dataset is obtained by discarding answers with more than 5 tokens.

• TriviaQA (Joshi et al., 2017) contains questions gathered from trivia and quiz-league websites. The unfiltered version of TriviaQA is used for open-domain question answering.
[Figure 3: three panels (NaturalQuestions, TriviaQA, SQuAD) plotting Exact Match against the number of retrieved passages, from 5 to 100.]

Figure 3: Performance of Fusion-in-Decoder (base) on valid sets as a function of the number of retrieved passages.

• SQuAD v1.1 (Rajpurkar et al., 2016) is a reading comprehension dataset. Given a paragraph extracted from Wikipedia, annotators were asked to write questions, for which the answer is a span from the corresponding paragraph.

Following Lee et al. (2019), we use the validation set as test set, and keep 10% of the training set for validation. We use the Wikipedia dumps from Dec. 20, 2018 for NQ and TriviaQA and from Dec. 21, 2016 for SQuAD. We apply the same preprocessing as Chen et al. (2017); Karpukhin et al. (2020), leading to passages of 100 words, which do not overlap.
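As a rough illustration of this preprocessing step, the snippet below splits an article into non-overlapping passages of 100 words. The whitespace tokenization and the stand-in article text are simplifying assumptions; the actual pipeline of Chen et al. (2017) and Karpukhin et al. (2020) includes additional cleaning of the Wikipedia dump.

```python
# Sketch: split an article into non-overlapping 100-word passages (simplified).
def split_into_passages(text, passage_len=100):
    words = text.split()                       # naive whitespace tokenization
    return [" ".join(words[i:i + passage_len])
            for i in range(0, len(words), passage_len)]

article = "Alan Turing was a British computer scientist . " * 40   # stand-in text
passages = split_into_passages(article)
print(len(passages), "passages,", len(passages[0].split()), "words in the first one")
```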
Evaluation. Predicted answers are evaluated with the standard exact match metric (EM), as introduced by Rajpurkar et al. (2016). A generated answer is considered correct if it matches any answer of the list of acceptable answers after normalization. This normalization step consists in lowercasing and removing articles, punctuation and duplicated whitespace.
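The exact match computation can be sketched as follows. It mirrors the SQuAD-style normalization described above (lowercasing, removing articles, punctuation and duplicated whitespace), but it is a reimplementation for illustration, not the exact evaluation script behind the reported numbers.

```python
# Sketch of the exact match (EM) metric with SQuAD-style answer normalization.
import re
import string

def normalize_answer(s):
    s = s.lower()                                                   # lowercase
    s = "".join(ch for ch in s if ch not in string.punctuation)     # strip punctuation
    s = re.sub(r"\b(a|an|the)\b", " ", s)                           # remove articles
    return " ".join(s.split())                                      # collapse whitespace

def exact_match(prediction, gold_answers):
    return any(normalize_answer(prediction) == normalize_answer(g) for g in gold_answers)

print(exact_match("Maida Vale, London", ["maida vale london", "the London"]))  # True
```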
Technical details. We initialize our models with the pretrained T5 models (Raffel et al., 2019), available in the HuggingFace Transformers library (github.com/huggingface/transformers). We consider two model sizes, base and large, containing respectively 220M and 770M parameters. We fine-tune the models on each dataset independently, using Adam (Kingma and Ba, 2014) with a constant learning rate of 10^-4 and a dropout rate of 10%. We train the model for 10k gradient steps, with a batch size of 64, using 64 Tesla V100 32GB GPUs. We evaluate models every 500 steps and select the best one on the validation set based on the Exact Match score. During training on NaturalQuestions and SQuAD, we sample the target among the list of answers, while for TriviaQA, we use the unique human-generated answer. For TriviaQA, answers in uppercase are normalized by converting all letters to lowercase except the first letter of each word, using the title Python string method. For both training and testing, we retrieve 100 passages (unless said otherwise), and truncate them to 250 word pieces. Following the results of Karpukhin et al. (2020), passages are retrieved with DPR for NQ and TriviaQA, and with BM25 for SQuAD. We generate answers using greedy decoding.
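A small sketch of how training targets could be prepared under this scheme: sample one answer from the gold list for NaturalQuestions and SQuAD, and title-case fully uppercase TriviaQA answers with Python's str.title(). The helper name and the surrounding data format are illustrative assumptions, not the released preprocessing code.

```python
# Illustrative sketch of training-target preparation (not the official preprocessing).
import random

def prepare_target(answers, dataset):
    if dataset in ("nq", "squad"):
        return random.choice(answers)          # sample the target among the gold answers
    # TriviaQA: unique human-generated answer; title-case it if fully uppercase
    answer = answers[0]
    return answer.title() if answer.isupper() else answer

print(prepare_target(["MAIDA VALE, LONDON"], "triviaqa"))   # Maida Vale, London
```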
Comparison to state-of-the-art. In Table 1, we compare the results obtained by Fusion-in-Decoder with existing approaches for open domain question answering. We observe that while conceptually simple, this method outperforms existing work on the NaturalQuestions and TriviaQA benchmarks. In particular, generative models seem to perform well when evidence from multiple passages needs to be aggregated, compared to extractive approaches. Our method also performs better than other generative models, showing that scaling to a large number of passages and processing them jointly leads to improvements in accuracy. Second, we observe that using additional knowledge in generative models through retrieval leads to important performance gains. On NaturalQuestions, the closed book T5 model obtains 36.6% accuracy with 11B parameters, while our approach obtains 44.1% with 770M parameters plus Wikipedia with BM25 retrieval. Both methods use roughly the same amount of memory to store information, indicating that text based explicit memories are competitive for knowledge retrieval tasks.
                       NaturalQuestions                  TriviaQA
Training Passages      w/o finetuning   w/ finetuning    w/o finetuning   w/ finetuning
5                      37.8             45.0             58.1             64.2
10                     42.3             45.3             61.1             63.6
25                     45.3             46.0             63.2             64.2
50                     45.7             46.0             64.2             64.3
100                    46.5             -                64.7             -

Table 2: Performance depending on the number of passages used during training. Exact Match scores are reported on dev sets.

Scaling with number of passages. In Figure 3, we report the performance with respect to the number of retrieved passages. In particular, we observe that increasing the number of passages from 10 to 100 leads to a 6% improvement on TriviaQA and a 3.5% improvement on NaturalQuestions. On the other hand, the performance of most extractive models seems to peak around 10 to 20 passages (Wang et al., 2019; Yang et al., 2019). We believe that this is evidence that sequence-to-sequence models are good at combining information from multiple passages.

Impact of the number of training passages. In the previous section, the model was trained and evaluated with the same number of passages. To reduce the training computational budget, a simple solution consists in training the model with fewer passages. In Table 2, we report the performance obtained by training with different numbers of passages, while testing with 100 passages. We observe that reducing the number of training passages leads to a decrease of accuracy. Further, we propose to finetune the previous models using 100 passages for 1000 steps. This reduces the accuracy gap, while using significantly less computational resources: we can reach 46.0 EM on NaturalQuestions, using 147 GPU hours, compared to 425 GPU hours when training on 100 passages.

5 Conclusion

In this paper, we study a simple approach to open domain question answering, which relies on retrieving support passages before processing them with a generative model. We show that while conceptually simple, this approach is competitive with existing methods, and that it scales well with the number of retrieved passages. In future work, we plan to make this model more efficient, in particular when scaling to large numbers of support passages. We also plan to integrate the retrieval into our model, and to learn the whole system end-to-end.

References

Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong. 2020. Learning to retrieve reasoning paths over Wikipedia graph for question answering. In Proc. ICLR.

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Proc. ACL.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Proc. ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL.

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proc. ACL.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.

Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2019. How can we know what language models know? arXiv preprint arXiv:1911.12543.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proc. ACL.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tomáš Kočiskỳ, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. TACL.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a benchmark for question answering research. TACL.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering. In Proc. EMNLP.

Kenton Lee, Ming-Wei Chang, and Kristina Toutanova. 2019. Latent retrieval for weakly supervised open domain question answering. In Proc. ACL.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Patrick Lewis, Ethan Perez, Aleksandara Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.

Sewon Min, Danqi Chen, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2019a. A discrete hard EM approach for weakly supervised question answering. In Proc. EMNLP-IJCNLP.

Sewon Min, Danqi Chen, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2019b. Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868.

Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. NAACL.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proc. EMNLP-IJCNLP.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Technical Report.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proc. EMNLP.

Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. CoQA: A conversational question answering challenge. TACL.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.

Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication Sp.

Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2019. oLMpics - on what language model pre-training captures. arXiv preprint arXiv:1912.13283.

Ellen M Voorhees et al. 1999. The TREC-8 question answering track report. In TREC.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R3: Reinforced ranker-reader for open-domain question answering. In Proc. AAAI.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2018b. Evidence aggregation for answer re-ranking in open-domain question answering. In Proc. ICLR.

Zhiguo Wang, Patrick Ng, Xiaofei Ma, Ramesh Nallapati, and Bing Xiang. 2019. Multi-passage BERT: A globally normalized BERT model for open-domain question answering. In Proc. EMNLP-IJCNLP.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. In Proc. NAACL (Demonstrations).
