QuestEval: Summarization Asks for Fact-based Evaluation

Thomas Scialom*‡, Paul-Alexis Dray*, Patrick Gallinari‡, Sylvain Lamprier‡,
Benjamin Piwowarski‡, Jacopo Staiano*, Alex Wang†

CNRS, France
‡ Sorbonne Université, CNRS, LIP6, F-75005 Paris, France
* reciTAL, Paris, France
† New York University

thomas@recital.ai

arXiv:2103.12693v2 [cs.CL] 9 Apr 2021

Abstract

Summarization evaluation remains an open research problem: current metrics such as ROUGE are known to be limited and to correlate poorly with human judgments. To alleviate this issue, recent work has proposed evaluation metrics which rely on question answering models to assess whether a summary contains all the relevant information in its source document. Though promising, the proposed approaches have so far failed to correlate better than ROUGE with human judgments.

In this paper, we extend previous approaches and propose a unified framework, named QuestEval. In contrast to established metrics such as ROUGE or BERTScore, QuestEval does not require any ground-truth reference. Nonetheless, QuestEval substantially improves the correlation with human judgments over four evaluation dimensions (consistency, coherence, fluency, and relevance), as shown in the extensive experiments we report.

1 Introduction

The reliability of automatic evaluation metrics is an important factor for progress in artificial intelligence tasks, enabling the comparison and improvement of proposed systems. The design of reliable metrics for natural language generation (NLG) systems is very challenging, and still an open research problem: Novikova et al. (2017) and Peyrard (2019) showed that current metrics do not correlate well with human judgments, and argued for the development of new evaluation metrics.

Among NLG tasks, summarization is one of the most difficult to evaluate automatically. For a given document, the number of possible correct outputs is much larger than for other tasks such as machine translation. Thus, when only a single reference summary is given – as is typically the case for large-scale summarization datasets – the correlation of standard automatic evaluation metrics with human judgments is low (Louis and Nenkova, 2013). Furthermore, since a summary must be shorter than the corresponding source document, information selection (Li et al., 2018) is critical so that the summary only contains the salient content from its source document. For these reasons, n-gram based metrics such as ROUGE (Lin, 2004) are known to poorly reflect human preference (Louis and Nenkova, 2013; Novikova et al., 2017; Paulus et al., 2017; Bhandari et al., 2020). Finally, it is crucial for reliable summarization to generate texts that are factually consistent with their source documents. However, this aspect is not measured by n-gram based metrics. Notably, while recent state-of-the-art generative models (Lewis et al., 2019; Zhang et al., 2019a) produce fluent summaries, they frequently contain false or unsupported information (Kryściński et al., 2019), a phenomenon also known as neural hallucination (Rohrbach et al., 2018; Zhao et al., 2020).

To overcome these limitations, a new approach to evaluate summarization systems has recently emerged, based on question generation (QG) and question answering (QA) (Chen et al., 2017; Scialom et al., 2019; Eyal et al., 2019). These metrics measure to what extent a summary provides sufficient information to answer questions posed on its corresponding source document. They can be used to assess the factual consistency (i.e. precision) (Durmus et al., 2020; Wang et al., 2020) or the relevance (i.e. recall) (Scialom et al., 2019) of the evaluated summary with respect to its source document. Although these works have introduced an interesting and novel method to evaluate summarization, with encouraging preliminary results, none of those metrics is found to perform better than ROUGE (Fabbri et al., 2020): the automatic evaluation of summarization systems remains an open research problem (Kryscinski et al., 2019).
[Figure 1: diagram – questions generated from the evaluated summary are answered on the source document to obtain precision scores (Eq. 1), while questions generated from the source document are weighted by W and answered on the summary to obtain weighted recall scores (Eq. 2); the two are combined into the QuestEval score.]

Figure 1: Illustration of the QuestEval framework: the blue area corresponds to the precision-oriented framework proposed by Wang et al. (2020). The orange area corresponds to the recall-oriented SummaQA (Scialom et al., 2019). We extend it with a weighter component for an improved recall (red area). The encompassing area corresponds to our proposed unified approach, QuestEval.

Inspired by these works, and motivated to take up the challenge of summarization evaluation, we propose QuestEval, a new reference-less evaluation metric, which is found to correlate dramatically better with human judgments. Our contributions are as follows:

• We show that, by unifying the precision and recall-based QA metrics, we obtain a more robust metric;

• We propose a method to learn the saliency of the generated queries, allowing us to integrate the notion of information selection;

• We evaluate QuestEval on two corpora containing annotated summaries from the CNN/Daily Mail (Nallapati et al., 2016) and XSUM (Narayan et al., 2018) datasets. The proposed metric obtains state-of-the-art results in terms of correlation with human judgments, over all the evaluated dimensions. Notably, QuestEval is effective at measuring factual consistency, a crucial yet challenging aspect for summarization.

2 Related Work

Summarization Metrics The most popular evaluation metric for summarization is ROUGE (Lin, 2004), which computes the recall of reference n-grams in the evaluated summary. Other n-gram based metrics have been proposed, such as CIDEr (Vedantam et al., 2015) and METEOR (Lavie and Agarwal, 2007), but none of them correlates better with humans according to SummEval, a recent large study conducted by Fabbri et al. (2020).

Recent works have leveraged the success of pretrained language models. Zhang et al. (2019b) proposed BERTScore, which uses BERT (Devlin et al., 2018) to compute a similarity score between the reference and the evaluated text. However, its performance is similar to that of ROUGE (Fabbri et al., 2020). Several works have explored using natural language inference (NLI) models to evaluate the factual consistency of summaries (Kryściński et al., 2019; Falke et al., 2019; Maynez et al., 2020), finding mixed results in using NLI models rather than QA models.
QA-Based Metrics QA-based approaches for summary evaluation were proposed a decade ago by Clarke and Lapata (2010) for human evaluation. Chen et al. (2017) and Eyal et al. (2019) proposed to automate this approach by automatically generating questions from the reference summary. Scialom et al. (2019) extended these works by generating the questions from the source document, which probes information recalled from the input text in the output summary, and is thus recall oriented. However, by weighting each question equally, their approach lacks a way to select questions that reflect the most important information of the input. Conversely, Wang et al. (2020) and Durmus et al. (2020) proposed to generate questions from the evaluated summary. These methods are precision oriented, since they measure the amount of information in the evaluated summary that is supported by the input text. We show in this paper that combining these recall and precision approaches leads to an improved metric.

3 A Question-Answering based Framework

This paper introduces the QuestEval framework for evaluating summarization systems, which accounts for both the factual consistency and the relevance of the generated text, without requiring any human reference. QuestEval consists of a QG component Q_G and a QA component Q_A, described in this section and depicted in Figure 1.

3.1 Question Answering

Recently, there has been significant progress on factoid question answering, with models obtaining human-level performance on benchmarks such as SQuAD (Rajpurkar et al., 2016). Leveraging these advancements, our QA component consists of a pretrained T5 model (Raffel et al., 2019), which extracts answers from a source document given the document and a question to answer. In the following, we refer to Q_A(r|T, q) as the probability of the answer r to question q on a text T, and to Q_A(T, q) as the answer greedily generated from the model.

When a summary is evaluated, there is no guarantee that it contains the answer. Therefore, it is crucial for the QA model to be able to predict when a question is unanswerable. Our QA component thus includes an unanswerable token, which we denote ε, among its possible outputs.

3.2 Question Generation

For the QG component, we draw on recent work on neural answer-conditional question generation (Zhou et al., 2017). This component also consists of a T5 model, finetuned to maximize the likelihood of human questions given the corresponding answer and source document.

At test time, given a source document or generated summary, we first select a set of answers from the text to condition the QG model on. Following Wang et al. (2020), we consider all the named entities and nouns from the source document as answers. Then, for each selected answer, we generate a question via beam search [1]. We filter out every question for which the QA model predicts an incorrect answer. Based on this, we denote Q_G(T) the set of question-answer pairs (q, r) for a text T such that Q_A(T, q) = r.

[1] We experimented with nucleus sampling (Holtzman et al., 2019) to increase the diversity of the questions, with no success.
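To make the question generation step concrete, here is a minimal sketch of one way to implement the answer selection and round-trip filtering described above. The spaCy pipeline name and the `qg_model` / `qa_model` callables are illustrative placeholders for the finetuned T5 components, not the released implementation.

```python
# Sketch of Q_G(T): select answer candidates (named entities and nouns),
# generate one question per candidate, and keep only (question, answer) pairs
# that the QA model answers back correctly on the same text.
from typing import Callable, List, Tuple

import spacy

# any spaCy English pipeline with NER and a POS tagger (must be installed separately)
nlp = spacy.load("en_core_web_sm")

def select_answer_candidates(text: str) -> List[str]:
    """All named entities and nouns of the text, used as conditioning answers."""
    doc = nlp(text)
    candidates = [ent.text for ent in doc.ents]
    candidates += [tok.text for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    return list(dict.fromkeys(candidates))  # deduplicate, preserve order

def build_question_answer_pairs(
    text: str,
    qg_model: Callable[[str, str], str],   # (answer, text) -> question (assumed interface)
    qa_model: Callable[[str, str], str],   # (text, question) -> answer (assumed interface)
) -> List[Tuple[str, str]]:
    """Return Q_G(text): pairs (q, r) such that Q_A(text, q) == r."""
    pairs = []
    for answer in select_answer_candidates(text):
        question = qg_model(answer, text)       # beam search happens inside the QG model
        if qa_model(text, question) == answer:  # round-trip filtering
            pairs.append((question, answer))
    return pairs
```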


4 The QuestEval metric

In the following, D and S are two sequences of tokens, with D denoting the source document and S the corresponding evaluated summary.

4.1 Precision

A summary is deemed inconsistent with respect to its source text if, given a question, the answer differs when conditioned on S or D. Therefore, we define the precision score for the evaluated summary as:

Prec(D, S) = \frac{1}{|Q_G(S)|} \sum_{(q,r) \in Q_G(S)} F_1(Q_A(D, q), r)    (1)

The F1 score is a standard metric for evaluating factoid question answering models, and measures the overlap between the predicted answer and the corresponding ground truth. It outputs 1 for an exact match between both answers and 0 if there is no common token. This definition of factual consistency corresponds to the frameworks concurrently proposed by Wang et al. (2020) and Durmus et al. (2020).
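A minimal sketch of Equation (1) follows, assuming the question-answer pairs Q_G(S) and a `qa_model` callable come from the components sketched earlier; it is an illustration, not the released implementation.

```python
# Prec(D, S): average token-level F1 between the answer the QA model extracts
# from the source D and the ground-truth answer r, over all pairs from Q_G(S).
from collections import Counter
from typing import Callable, List, Tuple

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token overlap F1 between two answer strings."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def questeval_precision(
    source: str,
    qa_model: Callable[[str, str], str],        # (text, question) -> answer (assumed)
    summary_qa_pairs: List[Tuple[str, str]],    # Q_G(S): (question, answer) pairs
) -> float:
    """Prec(D, S) from Equation (1)."""
    if not summary_qa_pairs:
        return 0.0
    scores = [token_f1(qa_model(source, q), r) for q, r in summary_qa_pairs]
    return sum(scores) / len(scores)
```

Note that, consistently with the discussion in the next section, `token_f1("ACL", "Association for Computational Linguistics")` returns 0.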
4.2 Recall

While a summary should contain only factual information (precision), it should also contain the most important information from its source text (recall). Extending Scialom et al. (2019) by introducing a query weighter W, we define recall as:

Rec(D, S) = \frac{\sum_{(q,r) \in Q_G(D)} W(q, D)\,(1 - Q_A(\varepsilon \mid S, q))}{\sum_{(q,r) \in Q_G(D)} W(q, D)}    (2)

where Q_G(D) is the set of all question-answer pairs for the source text D, and W(q, D) is the weight of query q for text D.

Answerability and F1 Factoid question answering models are commonly evaluated using the F1 score, measuring the overlap between the predicted answer and the corresponding ground truth (Rajpurkar et al., 2016). However, an answer could be correctly expressed in different ways, e.g. "ACL" and "Association for Computational Linguistics". Unfortunately, the F1 score is 0 in this example. To sidestep this issue, Scialom et al. (2019) use the QA confidence of answerability, i.e. 1 − Q_A(ε), rather than the F1 score. Defining recall this way allows us to measure answerability independently of the way the answer is expressed, but does not take into account possible model hallucinations, i.e. the summary could answer the question incorrectly. Conversely, when we assess factual consistency, it is not enough for a question from the summary to be answerable from the source document: the two answers to this question should also share the same meaning to be factually consistent. While using answerability allows for more true positives (e.g. "ACL"), for precision it is crucial to detect true negatives. This motivates our use of the F1 score in this case, similar to Wang et al. (2020).
Query Weighting In Scialom et al. (2019), all questions are considered equally important, i.e. the weight W(q, D) = 1 for every query q ∈ Q_G(D). However, since a summary necessarily has a constrained length, an effective summary should contain the most important information from the source. To account for this, we introduce a question weighter, which is trained to distinguish important questions from anecdotal ones. We leverage existing summarization datasets to create training data for the weighter: given a source document D, each question q ∈ Q_G(D) is labeled as important if the corresponding human summary contains the answer, as computed by the QA component applied on the summary (i.e. Q_A(S, q) ≠ ε). W(q, D) denotes the probability that q is important for D. Note that the question weighter only concerns recall, and is therefore not applied when computing precision.
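A minimal sketch of how training labels for the weighter can be derived from an existing summarization corpus, as described above; the `UNANSWERABLE` token and the `qa_model` callable are illustrative assumptions, not the released implementation.

```python
# Label a source question as important (1) when the QA model finds its answer
# in the human-written reference summary, and as anecdotal (0) otherwise.
from typing import Callable, List, Tuple

UNANSWERABLE = "<unanswerable>"  # stands in for the unanswerable token ε

def weighter_training_examples(
    source: str,
    reference_summary: str,
    source_qa_pairs: List[Tuple[str, str]],   # Q_G(D)
    qa_model: Callable[[str, str], str],      # (text, question) -> answer (assumed)
) -> List[Tuple[str, str, int]]:
    """(document, question, label) triples used to train the weighter W."""
    examples = []
    for question, _answer in source_qa_pairs:
        answered_in_summary = qa_model(reference_summary, question) != UNANSWERABLE
        examples.append((source, question, int(answered_in_summary)))
    return examples
```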
4.3 Unifying Precision and Recall

The final QuestEval score accounts for both precision and recall by computing their harmonic mean (i.e. the F-score):

QuestEval(D, S) = \frac{2 \cdot Prec(D, S) \cdot Rec(D, S)}{Prec(D, S) + Rec(D, S)}

The QuestEval score is thus directly comparable with existing evaluation metrics, such as ROUGE or BLEU, as it lies in the same numerical range.
5 Experiments

5.1 Summarization Datasets

To evaluate QuestEval, we measure its correlation with human judgments on different datasets:

SummEval Released by Fabbri et al. (2020), it is one of the largest human-annotated datasets for summarization. Derived from CNN/Daily Mail (Nallapati et al., 2016), it consists of 12,800 summary-level annotations. To ensure diversity, the summaries were generated from 16 different summarization models, including extractive and abstractive architectures. To ensure quality, three experts annotated four dimensions: i) Consistency: the proportion of facts in the summary corresponding to facts in the original text; ii) Coherence: how well-structured and well-organized the summary is; iii) Fluency: how fluent the summary is to read; and iv) Relevance: the ratio between important and excess information in the summary [2].

[2] See 4.3 Human Annotations in Fabbri et al. (2020) for more details.

QAGS-XSUM Wang et al. (2020) made available a subset of 239 outputs of BART (Lewis et al., 2019) fine-tuned on XSUM (Narayan et al., 2018) [3]. Three annotators measured the "correctness" of each summary, which corresponds to consistency in SummEval.

[3] Note that XSUM provides more abstractive summaries than those of CNN/Daily Mail.

5.2 Question Answering & Generation

To train our QG and QA models, we used two factoid question answering datasets: SQuAD-v2 (Rajpurkar et al., 2018) and NewsQA (Trischler et al., 2016). Such datasets are composed of (paragraph, question, answer) triplets. SQuAD-v2 provides unanswerable questions, while NewsQA is composed of news articles, corresponding to our summarization domain. Note that QG can be seen as the dual task of QA: any QA dataset can be reversed into a QG dataset by switching the generation target from the answer to the question.

Lastly, we found it helpful to train our QA model using additional synthetic unanswerable questions. This is done by considering a shuffled version of the dataset, where each question is randomly assigned to a paragraph from another triplet of the dataset. We consider these additional samples, with flipped contexts, as unanswerable. All experiments, except otherwise specified, use this additional negative sampling process to improve the identification of unanswerable queries.
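A minimal sketch of this negative sampling step, under the assumption that each training example is a dict with "paragraph", "question" and "answer" fields; the field names and the unanswerable marker are illustrative, not the released preprocessing code.

```python
# Re-pair each question with the paragraph of another randomly chosen triplet
# and mark the resulting example as unanswerable.
import random
from typing import Dict, List

UNANSWERABLE = "<unanswerable>"  # stands in for the unanswerable token ε

def add_negative_samples(
    triplets: List[Dict[str, str]],   # each: {"paragraph": ..., "question": ..., "answer": ...}
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Return the original triplets plus one shuffled, unanswerable copy of each."""
    rng = random.Random(seed)
    shuffled_paragraphs = [t["paragraph"] for t in triplets]
    rng.shuffle(shuffled_paragraphs)
    negatives = []
    for triplet, paragraph in zip(triplets, shuffled_paragraphs):
        if paragraph == triplet["paragraph"]:
            continue  # skip the rare case where a question keeps its own context
        negatives.append(
            {"paragraph": paragraph, "question": triplet["question"], "answer": UNANSWERABLE}
        )
    return triplets + negatives
```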

                          #Ref  Consistency  Coherence  Fluency  Relevance  Average
ROUGE-1                     11         18.1       20.1     14.9       35.6     22.2
ROUGE-L                     11         15.7       15.6     13.8       33.4     19.6
METEOR                      11          3.3        2.9      7.1       -0.5      3.2
BLEU                        11         17.5       22.0     13.7       35.6     22.2
BERTScore-f                 11         20.3       18.5     21.6       31.9     23.1
ROUGE-1                      1         11.0        9.8      7.5       18.9     11.8
ROUGE-L                      1          8.2        7.3      5.7       13.5      8.7
BLEU                         1          8.9        3.9      4.0       12.7      7.4
BERTScore-f                  1          8.7        9.8     10.6       17.9     11.8
SummaQA                      0          8.3        8.0     -2.9       26.2      9.9
QAGS (our impl.)             0         20.4        7.7     16.8        9.1     13.7
QuestEval (W = uniform)      0         43.7       22.9     28.2       37.5     33.1
  w/o QA neg. sampling       0         42.5       22.5     27.7       37.2     32.4
QuestEval (W = learned)      0         42.0       24.0     28.4       39.2     33.5
  Precision only             0         46.5       14.0     30.9       22.2     28.4
  Recall only                0         30.5       22.6     19.2       37.6     27.5

Table 1: Summary-level Pearson correlation coefficients for various dimensions between automatic metrics and human judgments on SummEval. The top section corresponds to correlations for metrics computed on 11 reference summaries, as reported in Fabbri et al. (2020). The second section corresponds to these metrics, but given only one reference. The third section corresponds to the QA-based baselines. The bottom section corresponds to the proposed QuestEval and its ablations.

5.3 Baseline Metrics

As baselines, we considered the following:

N-gram based ROUGE (Lin, 2004) is the most widely used evaluation metric in summarization. This metric measures the recall of reference n-grams in the evaluated summary. Conversely, BLEU (Papineni et al., 2002) computes the precision of summary n-grams in the references. METEOR (Lavie and Agarwal, 2007) is a variant that uses stemming, synonyms and paraphrastic matches.

Neural based Leveraging recent progress in language modeling, Zhang et al. (2019b) proposed BERTScore: for each token of the summary, the maximum cosine similarity is computed over contextualized token embeddings of the reference summary, and the overall mean is reported.

Question based SummaQA (Scialom et al., 2019) is a recall-oriented metric, with questions generated from the source document. QAGS (Wang et al., 2020) is a precision-oriented metric, with questions generated from the summary.

5.4 Results

In Tables 1 and 2 we report the results for QuestEval, along with several ablations. W = uniform corresponds to setting all question weights equal. Conversely, W = learned corresponds to the weights learned as detailed in Section 4.2. We also report the recall and precision components separately.

In Table 1, we observe that, amongst existing metrics, BERTScore achieves the best average Pearson correlation with human judgments (23.1), slightly above ROUGE-1 (22.2) and BLEU (22.2). These correlations are obtained when providing no less than 11 gold references, and averaging results. Given a single reference, all these correlations are halved. Most of the large-scale datasets provide only one reference per example in their test set (e.g. CNN/Daily Mail and XSUM), a fact that highlights the importance of searching for more reference-efficient alternatives.
With regards to sample efficiency, QA-based metrics do not require any references. We expect Relevance to be better measured by recall-oriented metrics, and less so Consistency. This is confirmed in the results, where SummaQA correlates better with Relevance than with Consistency (26.2 vs 8.3), and vice versa for QAGS (9.1 vs 20.4). By unifying and extending the two, QuestEval takes both dimensions into account, improving the average correlation by 18% (28.4 to 33.5).

The dimension that benefits the most from the learned question weighter is Relevance (+4%, from 37.5 to 39.2), indicating that our classifier learns which questions target important information. We discuss this aspect more in depth in the following section.

Finally, compared to the other metrics, the improvement is remarkable (33.5 vs 11.8 for BERTScore), and allows safer evaluations of the systems while not even requiring references.

                          Consistency
ROUGE-1                          13.2
ROUGE-L                           8.9
METEOR                           10.0
BLEU                              5.6
BERTScore                         2.5
SummaQA                             -
QAGS                             17.5
QuestEval (W = uniform)          30.4
  w/o QA neg. sampling           28.5
QuestEval (W = learned)          29.0
  Precision only                 32.7
  Recall only                    13.9

Table 2: Summary-level Pearson correlation coefficients for Correctness between various automatic metrics and human judgments on QAGS-XSUM. The top section corresponds to correlations for diverse metrics computed on one reference summary, as reported in Wang et al. (2020). The middle section corresponds to the QA-based baselines. The bottom section corresponds to this work.

important   answered   Relevance Corr.
    ✓           ✓             37.6
    ✓           ✗            -33.5
    ✗           ✓             -5.7

Table 3: Pearson correlation coefficients between human judgments (for Relevance) and the percentage of important and/or answered questions, on SummEval data.

5.5 Discussion

Reference-less One of the main limitations of current metrics is that they require gold references to compute similarity scores. However, many possible summaries are valid for one source document. We argue that the universe of correct outputs is much larger than in other generation tasks such as machine translation. This explains why the correlations with humans are largely reduced when computed with one reference instead of 11 (see Table 1: BERTScore-f drops from 23.1 to 11.8 on average, and other metrics likewise). Unfortunately, assuming the availability of as many as 11 gold references is not realistic in most scenarios, due to the cost of obtaining reference summaries.

To complement Table 1, we report in Figure 2 the correlations for the best baselines as we progressively decrease the number of available gold references from 11 to 1. We observe that for all four dimensions and all the baselines, the correlations decrease and the variance increases as the number of references decreases. However, QuestEval does not require any reference. Therefore, the improvement over the other metrics grows larger as the number of references used decreases. Furthermore, QuestEval enables the evaluation of systems on datasets even if no gold reference is available.

Query Weighter There is no unique answer to the question "What makes a good summary?": it depends on the reader's point of view, which makes summarization evaluation challenging. For instance, given a contract, the seller and the buyer could be interested in different information within the same document.

In this paper, to instantiate the weighter W, we propose to learn a specific dataset policy: "what kind of questions are likely answered in the CNN/Daily Mail training summaries?" This is a reasonable heuristic given that editors created the summaries following their specific policy.
[Figure 2: four panels (Consistency, Coherence, Fluency, Relevance) plotting Pearson correlation coefficients against the number of references (11 down to 1) for ROUGE-1, ROUGE-L, BLEU, BertScore-f and QuestEval (0 reference).]

Figure 2: Variation of the Pearson correlations between various metrics and humans, versus the number of references available. QuestEval is constant, since it is independent from the references.

To demonstrate the effectiveness of the weighter, we proceed as follows. We first consider that a question q ∈ Q_G(D), generated from the source document, is important if the probability given by the query weighter is above a threshold, i.e. if W(q, D) > 0.5. We then say that a question is answered if the probability of being unanswerable is below a threshold, i.e. Q_A(ε|S, q) < 0.5. Therefore, a question can belong to one of four folds, given the two above criteria (important and/or answered). In Table 3, we measure how the percentage of questions belonging to a specific fold correlates with the Relevance dimension for each generated summary on SummEval. We observe that the percentage of questions that are important and answered correlates positively with Relevance, as opposed to the percentage of questions that are important but not answered. Finally, the percentage of questions that are answered but not important does not correlate with Relevance. This indicates that our proposed approach is able to learn which questions should be asked and which should not.

We emphasize that W is a flexible component of our framework. It can be adapted to specific domains and applications. For instance, one could design a specific W to focus the evaluation on information about specific entities, such as people or events.
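For completeness, a minimal sketch of the fold assignment used in the analysis above; `weighter` and `qa_unanswerable_prob` are the same assumed callables as in the earlier sketches, and the 0.5 thresholds follow the description in the text.

```python
# Assign a question to one of the four folds (important/anecdotal x answered/unanswered).
from typing import Callable

def question_fold(
    question: str,
    source: str,
    summary: str,
    weighter: Callable[[str, str], float],               # (question, source) -> W(q, D)
    qa_unanswerable_prob: Callable[[str, str], float],   # (summary, question) -> Q_A(ε|S, q)
) -> str:
    important = weighter(question, source) > 0.5
    answered = qa_unanswerable_prob(summary, question) < 0.5
    return ("important" if important else "anecdotal") + \
           ("+answered" if answered else "+unanswered")
```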
Source Document: This is the embarrassing moment a Buckingham Palace guard slipped and fell on a manhole cover in front of hundreds of shocked tourists as he took up position in his sentry box. [...] The Guard comprises two detachments, one each for Buckingham Palace and St James's Palace, under the command of the Captain of The Queen's Guard.
Generated Question: Where was the Changing of the Guard held?
Weighter prediction: Important Question
Answer Span: Buckingham Palace
Correct Summary: The Queen's Guard slipped on a manhole cover during the Changing of the Guard at Buckingham Palace last week. [...]
Predicted Answer: Buckingham Palace ✓
Hallucinated Summary: The Queen's Guard slipped on a manhole cover during the Changing of the Guard at St James's Palace last week. [...]
Predicted Answer: St James's Palace ✗
Incomplete Summary: The Queen's Guard slipped on a manhole cover during the Changing of the Guard during an embarrassing moment. [...]
Predicted Answer: Unanswerable ✗

Table 4: Sample output from QuestEval: a generated question, its predicted importance given a source document, and the corresponding predicted answers to the question for three different summaries.

An Explainable Metric One important feature of QuestEval is its explainability. It is straightforward to investigate 1) which important points are not answered in the summary and 2) which inconsistencies exist between the source document and the summary. We illustrate this in Table 4, with a source document from which a question q is generated and answered. According to the weighter W, q is categorized as important. Three evaluated summaries are then shown.

The first summary S^correct is factually consistent with the source document: the predicted answer Q_A(S^correct, q) corresponds to the source document answer, Buckingham Palace. The second summary S^hallu is factually inconsistent with the source document: the predicted answer Q_A(S^hallu, q) does not correspond to Buckingham Palace. Finally, the third summary S^incomplete does not answer the question, i.e. Q_A(S^incomplete, q) = ε, and is thus incomplete.

Negative Sampling Effect In Tables 1 and 2, when QuestEval uses a QA model trained without negative sampling (see Section 5.2), we observe a decrease in performance, from 33.3 to 32.4 on SummEval and from 30.4 to 28.5 on QAGS-XSUM.

In Figure 3, we report the distribution of the log probabilities for the two QA models, trained with and without negative sampling. We can observe that the QA model exposed to negative sampling during training has learned to better separate the negatively sampled questions (for negatives, i.e. the red lines, the dashed line lies further to the left than the solid line).

Indeed, the unanswerable questions of SQuAD-v2 were written adversarially by crowd-workers, to look similar to answerable ones. However, in the context of QuestEval, unanswerable questions are not adversarial: it simply often happens that the summary does not contain the answer. Therefore, QuestEval sees unanswerable questions that look like the ones we built through our negative sampling method, rather than the adversarial ones. This may explain the improvement of QuestEval with a QA model trained with negative sampling.

[Figure 3]
Figure 3: Distribution of the log probabilities of answerability – i.e. log(1 − Q_A(ε|T, q)) – for two QA models. 1) Solid lines: a model trained on SQuAD-v2 without the negatively sampled examples. 2) Dashed lines: a model trained on SQuAD-v2 with the negatively sampled examples. The evaluated samples belong to three distinct categories: 1) answerable questions, 2) unanswerable questions (present in SQuAD-v2) and 3) the negatively sampled ones (as described in Section 5.2).

[Figure 4: Pearson correlation coefficient on SummEval versus QG beam size (1–20).]
Figure 4: Pearson correlation with humans on SummEval w.r.t. the QG beam size.

Computational Complexity Following Wang et al. (2020), we generate the questions with K = 20 beams during decoding and we keep all the different versions of the questions in the later steps, which improves correlations. However, the downside of this is the inference time, which increases linearly w.r.t. the beam size. To be widely adopted, a metric should not only correlate with human judgment, but also be computationally efficient. In Figure 4 we show the variation of the average correlation with respect to the beam size. The improvement from K = 1 to K = 20 is small (34.4 to 35.6), and the rank order of the different systems remains unchanged. Therefore, we believe that using QuestEval with K = 1 is a reasonable choice, allowing for fast computation while preserving the correlation with human judgments.
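As a rough illustration of keeping all beam candidates, the following sketch generates K question variants for one answer with a Hugging Face seq2seq model; the checkpoint name and the "answer </s> document" input format are placeholders, not the exact released QG model or its preprocessing.

```python
# Generate K question variants per answer by returning all beam candidates.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def generate_question_variants(answer: str, document: str, k: int = 20):
    tokenizer = AutoTokenizer.from_pretrained("t5-base")       # placeholder checkpoint
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")   # placeholder checkpoint
    inputs = tokenizer(f"{answer} </s> {document}", return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, num_beams=k, num_return_sequences=k, max_length=64)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```

With k = 1 the same call reduces to single-beam decoding, the faster setting recommended above.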
6 Conclusion

We proposed QuestEval, a new reference-less framework to evaluate summarization models, which unifies and extends previous QA-based approaches with question weighting and negative sampling, accounting for factual consistency, relevance and information selection.

We implement QuestEval leveraging state-of-the-art deep learning models. Compared to existing metrics, we find that QuestEval correlates dramatically better with human judgments, while at the same time not requiring any gold reference. This allows for more accurate comparisons between systems. Moreover, any progress in question answering and generation can directly be applied to our proposed approach, leading to further potential improvements. We make the code available [4] with the hope that it will contribute to further progress in the field.

[4] https://github.com/recitalAI/QuestEval

We are currently adapting QuestEval to other Natural Language Generation tasks that suffer from the same evaluation limitations, such as machine translation and text simplification. In future work, we plan to extend QuestEval to a multilingual version.

Acknowledgments

This work was partially performed using HPC resources from GENCI-IDRIS (Grant 2021-AD011011841).

References

Manik Bhandari, Pranav Gour, Atabak Ashfaq, and Pengfei Liu. 2020. Metrics also disagree in the low scoring range: Revisiting summarization evaluation metrics. In Proceedings of COLING 2020, the 30th International Conference on Computational Linguistics: Technical Papers. The COLING 2020 Organizing Committee.

Ping Chen, Fei Wu, Tong Wang, and Wei Ding. 2017. A semantic QA-based approach for text summarization evaluation. arXiv preprint arXiv:1704.06259.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. arXiv preprint arXiv:2005.03754.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. arXiv preprint arXiv:1906.00318.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2020. SummEval: Re-evaluating summarization evaluation. arXiv preprint arXiv:2007.12626.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231, Prague, Czech Republic. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo Wang. 2018. Improving neural abstractive document summarization with explicit information selection modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Brussels, Belgium. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. arXiv preprint arXiv:1707.06875.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. 2018. Object hallucination in image captioning. arXiv preprint arXiv:1809.02156.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. arXiv preprint arXiv:1909.01610.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. arXiv preprint arXiv:2004.04228.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2019a. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019b. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Zheng Zhao, Shay B. Cohen, and Bonnie Webber. 2020. Reducing quantity hallucinations in abstractive summarization. arXiv preprint arXiv:2009.13312.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662–671. Springer.