Retrospective Reader for Machine Reading Comprehension: Zhuosheng Zhang, Junjie Yang, Hai Zhao
Abstract

Machine reading comprehension (MRC) is an AI challenge that requires machines to determine the correct answers to questions based on a given passage. MRC systems must not only answer questions when necessary but also tactfully abstain from answering when no answer is available according to the given passage. When unanswerable questions are involved in the MRC task, an essential verification module called a verifier is especially required in addition to the encoder, though the latest practice on MRC modeling still mostly benefits from adopting well pre-trained language models as the encoder block, focusing only on the "reading". This paper is devoted to exploring better verifier design for the MRC task with unanswerable questions. Inspired by how humans solve reading comprehension questions, we propose a retrospective reader (Retro-Reader) that integrates two stages of reading and verification strategies: 1) sketchy reading, which briefly investigates the overall interactions of passage and question and yields an initial judgment; 2) intensive reading, which verifies the answer and gives the final prediction. The proposed reader is evaluated on two benchmark MRC challenge datasets, SQuAD2.0 and NewsQA, achieving new state-of-the-art results. Significance tests show that our model is significantly better than strong baselines.

1 Introduction

Be certain of what you know and be aware of what you don't. That is wisdom.
— Confucius (551 BC - 479 BC)

Machine reading comprehension (MRC) aims to teach machines to answer questions after comprehending given passages (Hermann et al. 2015; Joshi et al. 2017; Rajpurkar, Jia, and Liang 2018), which is a fundamental and long-standing goal of natural language understanding (NLU) (Zhang, Zhao, and Wang 2020). It has significant application scenarios such as question answering and dialog systems (Zhang et al. 2018; Choi et al. 2018; Reddy, Chen, and Manning 2019; Zhang, Huang, and Zhao 2018; Xu, Zhao, and Zhang 2021; Zhu et al. 2018). The early MRC systems (Kadlec et al. 2016; Chen, Bolton, and Manning 2016; Dhingra et al. 2017; Wang et al. 2017; Seo et al. 2016) were designed on the latent hypothesis that all questions can be answered according to the given passage (Figure 1-[a]), which is not always true for real-world cases. Recent progress on the MRC task requires that the model be capable of distinguishing unanswerable questions to avoid giving merely plausible answers (Rajpurkar, Jia, and Liang 2018). The MRC task with unanswerable questions may be decomposed into two subtasks: 1) answerability verification and 2) reading comprehension. Determining unanswerable questions requires a deep understanding of the given text and more robust MRC models, making MRC much closer to real-world applications. Table 1 shows an unanswerable example from the SQuAD2.0 MRC task (Rajpurkar, Jia, and Liang 2018).

Passage: Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.
Question: What cannot be solved by mechanical application of mathematical steps?
Gold Answer: ⟨no answer⟩
Plausible answer: algorithm

Table 1: An unanswerable MRC example.

So far, a common reading system (reader) that solves the MRC problem generally consists of two modules or building steps, as shown in Figure 1-[a]: 1) building a robust language model (LM) as the encoder; 2) designing ingenious mechanisms as the decoder according to MRC task characteristics.

* Corresponding author. This paper was partially supported by the National Key Research and Development Program of China (No. 2017YFB0304100), Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011), the Huawei-SJTU long term AI project, and Cutting-edge Machine reading comprehension and language model.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
[Figure 1 here. Part I: Model Designs (models [a]-[e]); Part II: Retrospective Reader Architecture (encoder, sketchy reading module with external front verification E-FV yielding score_ext, intensive reading module with internal front verification I-FV yielding score_has and score_null, and a decoder with rear verification R-V for answer prediction).]

Figure 1: Reader overview. For the left part, models [a-c] summarize the instances in previous work, and model [d] is ours, with the implemented version [e]. In the names of models [a-e], "(·)" represents a module, "+" means the parallel module and "-" is the pipeline. The right part is the detailed architecture of our proposed Retro-Reader.
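The two-stage control flow on the right of Figure 1 can be sketched at a high level as follows. This is an illustrative sketch only: the function names are ours, the module bodies are trivial stand-ins for the learned sketchy and intensive readers, and the score combination (a plain sum compared against a threshold) is an assumption for exposition, not the paper's exact rear-verification rule.

```python
# High-level sketch of the Retro-Reader control flow (Figure 1-[d/e]).
# All internals here are hypothetical stand-ins, not the released model.

def sketchy_reading(question: str, passage: str) -> float:
    """External front verification (E-FV): return score_ext, the
    (unanswerable - answerable) logit difference. A real model would
    run a PrLM encoder and a classifier over the [CLS] representation;
    here a trivial lexical heuristic stands in."""
    first_word = question.split()[0].lower()
    return 0.0 if first_word in passage.lower() else 1.0

def intensive_reading(question: str, passage: str):
    """Return a candidate answer span and an internal verification
    score (score_null - score_has). Stubbed for illustration."""
    return passage.split()[0], 0.0

def retro_read(question: str, passage: str, threshold: float = 0.5) -> str:
    """Rear verification: aggregate both verification scores and answer
    only when the combined score stays under a threshold; otherwise
    abstain by returning the empty (null) string."""
    score_ext = sketchy_reading(question, passage)
    candidate, score_diff = intensive_reading(question, passage)
    final_score = score_ext + score_diff
    return candidate if final_score < threshold else ""
```

The key design point this sketch mirrors is that abstention is decided from both stages jointly, rather than by either verifier alone.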
Pre-trained language models (PrLMs) such as BERT (Devlin et al. 2019) and XLNet (Yang et al. 2019) have achieved success on various natural language processing tasks, broadly playing the role of a powerful encoder (Zhang et al. 2019; Li et al. 2020; Zhou, Zhang, and Zhao 2019). However, it is quite time-consuming and resource-demanding to impart massive amounts of general knowledge from external corpora into a deep language model via pre-training.

Recently, most MRC readers have kept their primary focus on the encoder side, i.e., the deep PrLMs (Devlin et al. 2019; Yang et al. 2019; Lan et al. 2020), as readers may simply and straightforwardly benefit from a strong enough encoder. Meanwhile, little attention is paid to the decoder side¹ of MRC models (Hu et al. 2019; Back et al. 2020; Reddy et al. 2020), though it has been shown that a better decoder, or a better manner of using the encoder, still has a significant impact on MRC performance, no matter how strong the encoder is (Zhang et al. 2020a; Liu et al. 2021; Li et al. 2019, 2018; Zhu, Zhao, and Li 2020).

For the concerned MRC challenge with unanswerable questions, a reader has to handle two aspects carefully: 1) give accurate answers for answerable questions; 2) effectively distinguish unanswerable questions, and then refuse to answer. Such requirements have led recent reader designs to introduce an extra verifier module or answer-verification mechanism. Most readers simply stack the verifier along with the encoder and decoder parts in a pipeline or concatenation way (Figure 1-[b-c]), which is shown to be suboptimal for installing a verifier.

As a natural practice of how humans solve complex reading comprehension (Zheng et al. 2019; Guthrie and Mosenthal 1987), the first step is to read through the full passage along with the question and grasp the general idea; then, people re-read the full text and verify the answer if not sure. Inspired by such a reading and comprehension pattern, we propose a retrospective reader (Retro-Reader, Figure 1-[d]) that integrates two stages of reading and verification strategies: 1) sketchy reading, which briefly touches on the relationship of passage and question and yields an initial judgment; 2) intensive reading, which verifies the answer and gives the final prediction. Our major contributions are threefold:²

1. We propose a new retrospective reader design which is capable of effectively performing answer verification instead of simply stacking a verifier on existing readers.
2. Experiments show that our reader can yield substantial improvements over strong baselines and achieve new state-of-the-art results on benchmark MRC tasks.
3. For the first time, we apply the significance test to the concerned MRC task and show that our models are significantly better than the baselines.

¹ We define the decoder here as the task-specific part in an MRC system, such as passage and question interaction and answer verification.
² Our source code is available at https://github.com/cooelf/AwesomeMRC.

2 Related Work

Research on machine reading comprehension has attracted great interest with the release of a variety of benchmark datasets (Hill et al. 2015; Hermann et al. 2015; Rajpurkar et al. 2016; Joshi et al. 2017; Rajpurkar, Jia, and Liang 2018; Lai et al. 2017). The early trend was a variety of attention-based interactions between passage and question, including Attention Sum (Kadlec et al. 2016), Gated Attention (Dhingra et al. 2017), Self-matching (Wang et al. 2017), Attention over Attention (Cui et al. 2017) and Bi-attention (Seo et al. 2016). Recently, PrLMs have come to dominate the encoder design for MRC and have achieved great success. These PrLMs include ELMo (Peters et al. 2018), GPT (Radford et al. 2018), BERT (Devlin et al. 2019), XLNet (Yang et al. 2019), RoBERTa (Liu et al. 2019), ALBERT (Lan et al. 2020), and ELECTRA (Clark et al. 2020). They show a strong capacity for capturing contextualized sentence-level language representations and greatly boost the benchmark performance of current MRC. Following this line, we take PrLMs as our backbone encoder.

In the meantime, the study of decoder mechanisms has come to a bottleneck due to the already powerful PrLM encoders. Thus this work focuses on the non-encoder part, such as passage and question attention interactions, and especially answer verification.

Though solving the MRC task with unanswerable questions is important, only a few studies have paid attention to this topic, mostly with straightforward solutions. A common treatment is to adopt an extra answer verification layer, where answer span prediction and answer verification are trained jointly with multi-task learning (Figure 1-[c]). Such a verification mechanism can also be as simple as an answerability threshold setting, broadly used by powerful enough PrLMs for quickly building readers (Devlin et al. 2019; Zhang et al. 2020b). Liu et al. (2018) appended an empty word token to the context and added a simple classification layer to the reader. Hu et al. (2019) used two types of auxiliary loss: an independent span loss to predict plausible answers and an independent no-answer loss to decide answerability of the question; further, an extra verifier is adopted to decide whether the predicted answer is entailed by the input snippets (Figure 1-[b]). Back et al. (2020) developed an attention-based satisfaction score to compare question embeddings with candidate answer embeddings (Figure 1-[c]). Zhang et al. (2020c) proposed a verifier layer, which is a linear layer applied to the context embedding, weighted by the start and end distributions over the context word representations, concatenated with the [CLS] token representation for BERT (Figure 1-[c]).

Different from these existing studies, which stack the verifier module in a simple way or just jointly learn answer location and no-answer losses, our Retro-Reader adopts a two-stage humanoid design (Zheng et al. 2019; Guthrie and Mosenthal 1987) based on a comprehensive survey over existing answer verification solutions.

3 Our Proposed Model

We focus on the span-based MRC task, which can be described as a triplet ⟨P, Q, A⟩, where P is a passage, Q is a query over P, and a span of P is the right answer A. Our system is supposed to not only predict the start and end positions in the passage P and extract the span as answer A, but also return a null string when the question is unanswerable.

Our retrospective reader is composed of two parallel modules: a sketchy reading module and an intensive reading module, which conduct a two-stage reading process. The intuition behind the design is that the sketchy reading makes a coarse judgment (external front verification) about the answerability, i.e., whether the question is answerable; then the intensive reading jointly predicts the candidate answers and combines its answerability confidence (internal front verification) with the sketchy judgment score to yield the final answer (rear verification).³

3.1 Sketchy Reading Module

Embedding We concatenate the question and passage texts as the input, which is first represented as embedding vectors to feed an encoder (i.e., a PrLM). In detail, the input texts are first tokenized into word pieces (subword tokens). Let T = {t1, ..., tn} denote a sequence of subword tokens of length n. For each token, the input embedding is the sum of its token embedding, position embedding, and token-type embedding. Let X = {x1, ..., xn} be the outputs of the encoder, which are the embedding features of the input tokens of length n. The input embeddings are then fed to the interaction layer to obtain the contextual representations.

Interaction Following Devlin et al. (2019), the encoded sequence X is processed by a multi-layer Transformer (Vaswani et al. 2017) to learn contextual representations. In the following, we use H = {h1, ..., hn} to denote the last-layer hidden states of the input sequence.

External Front Verification After reading, the sketchy reader makes a preliminary judgment of whether the question is answerable given the context. We implement this as an external front verifier (E-FV) to identify unanswerable questions. The pooled first-token (the special symbol [CLS]) representation h1 ∈ H, as the overall representation of the sequence, is passed to a fully connected layer to obtain classification logits ŷi composed of answerable (logit_ans) and unanswerable (logit_na) elements. We use cross entropy as the training objective:

L_ans = −(1/N) Σ_{i=1}^{N} [y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i)]   (1)

where ŷi ∝ SoftMax(FFN(h1)) denotes the prediction, yi is the target indicating whether the question is answerable or not, and N is the number of examples. We calculate the difference as the external verification score: score_ext = logit_na − logit_ans, which is used in the later rear verification as an effective indication factor.

³ Intuitively, our model is supposed to be designed as shown in Figure 1-[d]. In the implementation, we find that modeling the entire reading process as two parallel modules is both simple and practicable, with basically the same performance, which results in the parallel reading-module design shown in Figure 1-[e].
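The E-FV computation, Eq. (1), and score_ext can be made concrete with a small plain-Python sketch. The verification head `ffn` and its toy weights below are hypothetical stand-ins for the learned fully connected layer over the pooled [CLS] vector; this is not the released implementation.

```python
import math

def ffn(h_cls, weights, bias):
    """Hypothetical stand-in for the fully connected verification head:
    maps the pooled [CLS] vector to [logit_ans, logit_na]."""
    return [sum(w * x for w, x in zip(row, h_cls)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def efv_scores(h_cls, weights, bias):
    """Return (class probabilities, score_ext) for E-FV, where
    score_ext = logit_na - logit_ans: larger means 'more unanswerable'."""
    logit_ans, logit_na = ffn(h_cls, weights, bias)
    return softmax([logit_ans, logit_na]), logit_na - logit_ans

def cross_entropy(p_na, y):
    """Per-example loss in the form of Eq. (1); y = 1 iff unanswerable,
    p_na is the predicted probability of the unanswerable class."""
    return -(y * math.log(p_na) + (1 - y) * math.log(1 - p_na))
```

For example, with h_cls = [1.0, -1.0] and identity-like toy weights, logit_ans = 0.5 and logit_na = -0.5, so score_ext = -1.0, leaning toward "answerable"; at inference, score_ext is carried forward into rear verification rather than thresholded on its own.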
3.2 Intensive Reading Module

The objective of the intensive reader is to verify the answerability, produce candidate answer spans, and then give the final answer prediction. It employs the same encoding and interaction procedure as the sketchy reader to obtain the representation H. In previous studies (Devlin et al. 2019; Yang et al. 2019; Lan et al. 2020), H is directly fed to a linear layer to yield the prediction.

Question-aware Matching Inspired by the previous success of explicit attention matching between passage and question (Kadlec et al. 2016; Dhingra et al. 2017; Wang et al. 2017; Seo et al. 2016), we are interested in whether the advance still holds on top of strong PrLMs. Here we investigate two alternative question-aware matching mechanisms as an extra layer. Note that this part is only used for the ablation in Table 7 as a reference for interested readers. We do not use any extra matching part in our submission on test evaluations (e.g., in Tables 2-3) for the sake of simplicity, as our major focus is the verification.

To obtain the representations of the passage and the question, we split the last-layer hidden state H into H^Q and H^P as the representations of the question and passage, according to their position information. Both of the sequences are padded to the maximum length in a minibatch. Then, we investigate two potential question-aware matching mechanisms: 1) Transformer-style multi-head cross attention (CA) and 2) traditional matching attention (MA).
• Cross Attention We feed H^Q and H to a revised ...

... with a SoftMax operation and feed H′ as the input to obtain the start and end probabilities, s and e:

s, e ∝ SoftMax(FFN(H′)).   (3)

The training objective of answer span prediction is defined as the cross entropy loss over the start and end predictions,

L_span = −(1/N) Σ_i [log(p^s_{y_i^s}) + log(p^e_{y_i^e})]   (4)

where y_i^s and y_i^e are respectively the ground-truth start and end positions of example i, and N is the number of examples.

Internal Front Verification We adopt an internal front verifier (I-FV) such that the intensive reader can identify unanswerable questions as well. In general, the verifier's function can be implemented as a cross-entropy loss (I-FV-CE), a binary cross-entropy loss (I-FV-BE), or a regression-style mean square error loss (I-FV-MSE). The pooled representation h′_1 ∈ H′ is passed to a fully connected layer to obtain the classification logits or regression score. Let ŷ_i ∝ Linear(h′_1) denote the prediction and y_i the answerability target; the three alternative loss functions are defined as follows:
(1) We use cross entropy as the loss function for the classification verification: