[Figure 1 diagram: QG and QA modules with a weighter W, producing weighted precision scores (eq. 1) and recall scores (eq. 2) that are combined into the QuestEval score.]
Figure 1: Illustration of the QuestEval framework: the blue area corresponds to the precision-oriented framework proposed by Wang et al. (2020). The orange area corresponds to the recall-oriented SummaQA (Scialom et al., 2019). We extend it with a weighter component for an improved recall (red area). The encompassing area corresponds to our proposed unified approach, QuestEval.
Table 1: Summary-level Pearson correlation coefficients for various dimensions between automatic metrics and human judgments on SummEval. The top section corresponds to correlations for metrics computed on 11 reference summaries, as reported in Fabbri et al. (2020). The second section corresponds to these metrics, but given only one reference. The third section corresponds to the QA-based baselines. The bottom section corresponds to the proposed QuestEval and its ablations.
Lastly, we found it helpful to train our QA model using additional synthetic unanswerable questions. This is done by considering a shuffled version of the dataset, where each question is randomly assigned to a paragraph from another triplet of the dataset. We consider these additional samples, with flipped contexts, as unanswerable. All experiments, except where otherwise specified, use this additional negative sampling process to improve the identification of unanswerable queries.
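To make this negative-sampling step concrete, below is a minimal sketch of how such unanswerable examples can be constructed by shuffling contexts across (question, answer, paragraph) triplets. The triplet field names and the `make_unanswerable_samples` helper are illustrative assumptions, not the paper's actual code.

```python
import random

def make_unanswerable_samples(triplets, seed=0):
    """Create synthetic unanswerable QA samples by pairing each question
    with a paragraph drawn from a *different* triplet (flipped context).

    `triplets` is assumed to be a list of dicts with keys
    "question", "answer", "paragraph" -- an illustrative schema.
    """
    rng = random.Random(seed)
    shuffled = triplets[:]
    rng.shuffle(shuffled)

    negatives = []
    for original, other in zip(triplets, shuffled):
        # Skip the rare case where the shuffle maps a triplet onto itself.
        if other["paragraph"] == original["paragraph"]:
            continue
        negatives.append({
            "question": original["question"],
            "paragraph": other["paragraph"],   # flipped context
            "answer": "<unanswerable>",        # special target label
        })
    return negatives

# The QA model is then trained on the union of the original (answerable)
# samples and these synthetic unanswerable ones.
```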
5.3 Baseline Metrics

As baselines, we considered the following:

N-gram based ROUGE (Lin, 2004) is the most widely used evaluation metric in summarization. It measures the recall of reference n-grams in the evaluated summary. Conversely, BLEU (Papineni et al., 2002) computes the precision of summary n-grams in the references. METEOR (Lavie and Agarwal, 2007) is a variant that uses stemming, synonyms and paraphrastic matches.
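As an illustration of the n-gram recall and precision these metrics are built on, here is a small sketch (unigram case, single reference, no stemming, smoothing, or brevity penalty); it is a simplification, not the full ROUGE or BLEU definition.

```python
from collections import Counter

def ngrams(tokens, n=1):
    """Return the multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(reference, candidate, n=1):
    """ROUGE-style: fraction of reference n-grams found in the candidate."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())   # clipped overlap counts
    return overlap / max(sum(ref.values()), 1)

def ngram_precision(reference, candidate, n=1):
    """BLEU-style (single reference, no brevity penalty): fraction of
    candidate n-grams found in the reference."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())
    return overlap / max(sum(cand.values()), 1)
```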
Neural based Leveraging recent progress in language modeling, Zhang et al. (2019b) proposed BERTScore: for each token of the summary, the maximum cosine similarity is computed over contextualized token embeddings of the reference summary, and the overall mean is reported.
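For intuition, the greedy matching at the core of BERTScore can be sketched as follows, assuming the token embeddings have already been produced by a pretrained encoder; the actual BERTScore implementation adds options such as importance weighting and baseline rescaling that are not shown here.

```python
import numpy as np

def greedy_cosine_score(summary_emb, reference_emb):
    """Simplified BERTScore-style matching.

    `summary_emb` and `reference_emb` are (n_tokens, dim) arrays of
    contextualized token embeddings. For each summary token, take the
    maximum cosine similarity over reference tokens, then report the mean.
    """
    # L2-normalize so that dot products are cosine similarities.
    s = summary_emb / np.linalg.norm(summary_emb, axis=1, keepdims=True)
    r = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    sim = s @ r.T                      # (n_summary, n_reference) similarity matrix
    return float(sim.max(axis=1).mean())
```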
Question based SummaQA (Scialom et al., 2019) is a recall-oriented metric, with questions generated from the source document. QAGS (Wang et al., 2020) is a precision-oriented metric, with questions generated from the summary.

5.4 Results

In Tables 1 and 2 we report the results for QuestEval, along with several ablations. W = uniform corresponds to setting all question weights equal. Conversely, W = learned corresponds to the weights learned as detailed in Section 4.2. We also report the recall and precision components separately.
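The contrast between W = uniform and W = learned amounts to how per-question scores are aggregated in the recall component. The sketch below shows only that aggregation step under assumed inputs (a list of per-question answerability scores and, optionally, learned question weights); it is a schematic illustration, not the paper's eq. 1 or eq. 2.

```python
def weighted_recall(question_scores, question_weights=None):
    """Aggregate per-question scores (e.g., how well each source-generated
    question is answered by the summary) into one recall-oriented value.

    With `question_weights=None` (W = uniform) every question counts
    equally; otherwise each score is weighted by the learned importance
    W(q, D) of its question.
    """
    if question_weights is None:                    # W = uniform
        question_weights = [1.0] * len(question_scores)
    total_weight = sum(question_weights)
    if total_weight == 0:
        return 0.0
    return sum(w * s for w, s in zip(question_weights, question_scores)) / total_weight
```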
In Table 1, we observe that, amongst existing metrics, BERTScore achieves the best average Pearson correlation with human judgements (23.1), slightly above ROUGE-1 (22.2) and BLEU (22.2). These correlations are obtained when providing no less than 11 gold references and averaging the results. Given a single reference, all these correlations are halved. Most large-scale datasets provide only one reference per example in their test set (e.g. CNN/Daily Mail and XSUM), a fact that highlights the importance of searching for more reference-efficient alternatives.

With regard to sample efficiency, QA-based metrics do not require any references. We expect Relevance to be better measured by recall-oriented metrics, and less so Consistency. This is confirmed in the results, where SummaQA correlates better with Relevance than with Consistency (26.2 vs 8.3), and vice versa for QAGS (9.1 vs 20.4). By unifying and extending the two, QuestEval takes both dimensions into account, improving the average correlation by 18% (28.4 to 33.5).

The dimension that benefits the most from the learned question weighter is Relevance (+4%, from 37.5 to 39.2), indicating that our classifier learns which questions target important information. We discuss this aspect in more depth in the following section.

Finally, compared to the other metrics, the improvement is remarkable (33.5 vs 11.8 for BERTScore), and allows safer evaluation of systems while not even requiring references.

Metric                      Consistency
ROUGE-1                     13.2
ROUGE-L                      8.9
METEOR                      10.0
BLEU                         5.6
BERTScore                    2.5
SummaQA                      -
QAGS                        17.5
QuestEval (W = uniform)     30.4
  w/o QA neg. sampling      28.5
QuestEval (W = learned)     29.0
  Precision only            32.7
  Recall only               13.9

Table 2: Summary-level Pearson correlation coefficients for Correctness between various automatic metrics and human judgments on QAGS-XSUM. The top section corresponds to correlations for diverse metrics computed on one reference summary, as reported in Wang et al. (2020). The middle section corresponds to QA-based baselines. The bottom section corresponds to this work.

important   answered   Relevance Corr.
✓           ✓           37.6
✓           ✗          -33.5
✗           ✓           -5.7

Table 3: Pearson correlation coefficients between human judgments (for Relevance) and the percentage of important and/or answered questions, on SummEval data.

5.5 Discussion

Reference-less One of the main limitations of current metrics is that they require gold references to compute similarity scores. However, many possible summaries are valid for one source document. We argue that the universe of correct outputs is much larger than in other generation tasks such as machine translation. This explains why the correlations with humans are largely reduced when computed with one reference instead of 11 (see Table 1: BERTScore-f drops from 23.1 to 11.8 on average, and other metrics likewise). Unfortunately, assuming the availability of as many as 11 gold references is not realistic in most scenarios, due to the cost of obtaining reference summaries.

To complement Table 1, we report in Figure 2 the correlations for the best baselines as we progressively decrease the number of available gold references from 11 to 1. We observe that for all four dimensions and all the baselines, the correlations decrease and the variance increases as the number of references decreases. However, QuestEval does not require any reference. Therefore, the improvement over the other metrics grows larger as the number of references used decreases. Furthermore, QuestEval enables the evaluation of systems on datasets even if no gold reference is available.

Query Weighter There is no unique answer to the question "What makes a good summary?": it depends on the reader's point of view, which makes summarization evaluation challenging. For instance, given a contract, the seller and the buyer could be interested in different information within the same document.

In this paper, to instantiate the weighter W, we propose to learn a specific dataset policy: "what kind of questions are likely answered in the CNN/Daily Mail training summaries?" This is a reasonable heuristic given that editors created the summaries following their specific policy.
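Read this way, the weighter is a classifier over questions, trained to predict whether a question is likely to be answered by a dataset's reference summaries. The sketch below shows one minimal way to set up such a classifier (a TF-IDF bag-of-words logistic regression); this is a stand-in illustration under that assumption, not the paper's actual weighter architecture or training procedure from Section 4.2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_question_weighter(questions, answered_labels):
    """Fit a classifier estimating P(question is answered by the reference
    summary). `questions` is a list of question strings generated from
    source documents; `answered_labels` marks whether each one was answered
    by the corresponding training reference summary (illustrative setup).
    """
    weighter = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                             LogisticRegression(max_iter=1000))
    weighter.fit(questions, answered_labels)
    return weighter

# At evaluation time, W(q, D) can be taken as the predicted probability:
#   weighter.predict_proba([question])[0, 1]
```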
To demonstrate the effectiveness of the weighter, we proceed as follows. We first consider that a question q ∈ QG(D), generated on the source document, is important if the probability given
[Figure 2 plots: four panels (Consistency, Coherence, Fluency, Relevance) showing Pearson correlation coefficients as a function of the number of references (11 down to 1) for ROUGE-1, ROUGE-L, BLEU, BertScore-f, and QuestEval (0 reference).]
Figure 2: Variation of the Pearson correlations between various metrics and humans, versus the number of references available. QuestEval is constant, since it is independent of the references.
Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. arXiv preprint arXiv:1707.06875.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019b. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Zheng Zhao, Shay B. Cohen, and Bonnie Webber. 2020. Reducing quantity hallucinations in abstractive summarization. arXiv preprint arXiv:2009.13312.

Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao, and Ming Zhou. 2017. Neural question generation from text: A preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 662–671. Springer.