Automatic Summary Evaluation without Human Models

Annie Louis
University of Pennsylvania, Philadelphia, PA 19104, USA
lannie@seas.upenn.edu

Ani Nenkova
University of Pennsylvania, Philadelphia, PA 19104, USA
nenkova@seas.upenn.edu

Abstract

We present a fully automatic approach for summarization evaluation that does not require the creation of human model summaries.1 Our work capitalizes on the fact that a summary contains the most representative information from the input, and so it is reasonable to expect that the distributions of terms in the input and in a good summary are similar to each other. To compare the term distributions, we use KL and Jensen-Shannon divergence, cosine similarity, as well as unigram and multinomial models of text. Our results on a large scale evaluation from the Text Analysis Conference show that input-summary comparisons can be very effective. They can be used to rank participating systems very similarly to manual model-based evaluations (pyramid evaluation) as well as to manual human judgments of summary quality without reference to a model. Our best feature, Jensen-Shannon divergence, leads to a correlation as high as 0.9 with manual evaluations.

1 This work was presented at the poster session in TAC on Nov. 18, 2008. We would like to thank Hoa Trang Dang and the TAC advisory board for giving us the opportunity to work on this project.

1 Introduction

Automatic text summarizers are expected to produce a condensed form of an input, retaining the most important content and presenting the information in a coherent fashion. The development of suitable and efficient evaluations of summary content and organization is crucial and necessary to guide system development. Ideally, summary quality would be measured through extrinsic evaluations where the usefulness of a system summary is assessed in specific task scenarios such as reading comprehension (Morris et al., 1992), relevance judgements in information retrieval (Mani et al., 2002; Jing et al., 1998; Brandow et al., 1995) and other tasks (McKeown et al., 2005; Sakai and Sparck-Jones, 2001; Mani et al., 2002; Tombros and Sanderson, 1998; Roussinov and Chen, 2001). However, organizing and carrying out such evaluations is difficult, and in practice intrinsic evaluations are the standard in summarization.

In intrinsic evaluations, system summaries are either compared with reference summaries produced by humans (model summaries), or directly assessed by human judges on a scale (most commonly 1 to 5) without reference to a model summary. The refinement and usability analysis of these evaluation techniques have been the focus of large scale evaluation efforts such as the Document Understanding Conferences (DUC) (Baldwin et al., 2000; Harman and Over, 2004; Over et al., 2007) and the TIPSTER SUMMAC reports (Mani et al., 2002). Still, in recent years by far the most popular evaluation method used during system development and for reporting results in publications has been the automatic evaluation tool ROUGE (Lin, 2004; Lin and Hovy, 2003). ROUGE compares system summaries against one or more model summaries automatically, by computing n-gram word overlaps between the two. The wide adoption of such automatic measures is understandable, as they are convenient and have greatly reduced the complexity of evaluations. They have also been shown to correlate well with manual evaluations of content based on comparison with a single model summary, as used in the early editions of DUC.
However, the creation of gold standard summaries for comparison is still time-consuming and expensive. In our work, we explore the feasibility of developing a fully automatic evaluation method that does not make use of human model summaries at all. Proposals for developing such fully automatic methods have been put forward in the past, but no substantial progress has been made so far in this research direction. For example, in (Radev et al., 2003), a large scale fully automatic evaluation of eight summarization systems on 18,000 documents was performed without any human effort, using an information retrieval scenario. A search engine was used to rank documents according to their relevance to a posed query. The summaries of each document were also ranked for relevance with respect to the same query. For good summarization systems, the relevance ranking of the summaries is expected to be similar to that of the full documents. Based on this intuition, the correlation between the query relevance ranking of a system's summaries and the ranking of the original documents was used to compare the different systems. Effectively, this approach is motivated by the assumption that the distribution of terms in a good summary is similar to the distribution of terms in the original input text.

Even earlier, (Donaway et al., 2000) suggested that there are considerable benefits to be had in adopting model-free methods of evaluation involving direct comparisons between input and summary. Their work was motivated by the well documented fact that there are multiple good summaries of the same text and that there is considerable variation in content selection choices in human summarization (Rath et al., 1961; Radev and Tam, 2003; van Halteren and Teufel, 2003; Nenkova and Passonneau, 2004). As a result, the identity of the model writer significantly affects summary evaluations (McKeown et al., 2001), and evaluations of the same systems can be rather different when two different models are used. In their experiments, Donaway et al. demonstrated that the correlation between a manual evaluation based on comparison with one model summary and (a) a manual evaluation based on comparison with a different model summary, and (b) an evaluation that directly compares input and summary, are the same. They used cosine similarity with singular value decomposition to perform the input-summary comparison. Their conclusion was that automatic methods for comparison between input and summary should be seriously considered as an alternative for evaluation.

In this paper, we present a comprehensive study of fully automatic summary evaluation, without human models. A summary's content is judged for quality by directly estimating its closeness to the input. We compare several probabilistic (information-theoretic) approaches for characterizing the similarity and differences between input and summary content. The utility of the various approaches varies widely, but a number of them lead to rankings of systems that correlate well with the manual evaluations performed in the recent Text Analysis Conference run by NIST. A simple information-theoretic measure, the Jensen-Shannon divergence between input and summary, emerges as the best feature. System rankings produced using this measure lead to correlations with human rankings as high as 0.9.
2 Data: TAC 2008

2.1 Topic focused and Update Summarization

Two types of summaries, query focused summaries and update summaries, were evaluated in the main task of the summarization track of the 2008 Text Analysis Conference (TAC), run by NIST.2 Query focused summaries are produced from the input documents in response to a stated user information need (query). The update summaries require more sophistication: two sets of articles on the same topic are provided. The first set of articles represents the background of a story, and users are assumed to be already familiar with the information contained in them. The task is to produce a multi-document summary from the second set of articles on the same topic that can serve as an update for the user. This task is reminiscent of the novelty detection task explored at TREC (Soboroff and Harman, 2005).

2 http://www.nist.gov/tac

2.2 Test set

The test set for the TAC 2008 update task contains 48 inputs. Each input consists of two sets of documents, A and B, of 10 documents each. A and B are on the same general topic, and B contains documents published later than those in A. In addition, for each input, the user's need is represented by a topic statement which consists of a title and a narrative. An example topic statement is given below.

Title: Airbus A380
Narrative: Describe developments in the production and launch of the Airbus A380.

The task for participating systems is to produce two summaries: a query focused summary of document set A and an update summary of document set B. The first summary is expected to summarize the documents in A using the topic statement to focus content selection. The second summary is expected to be a compilation of updates from document set B, assuming that the user has read all the documents in A. The maximum length for both types of summaries is 100 words. In order to allow for in-depth discussion, we will analyze our findings only for the query focused summaries. Similar results were obtained for the evaluation of update summaries and are reported in separate tables in Section 6.

2.3 Summarizers

There were 57 participating systems in TAC 2008. The baseline summary for both tasks, query focused and update, was created by choosing the first sentences of the most recent document in document sets A and B respectively. In addition, four human summaries were produced for each type to serve as model summaries for evaluation. Only the 57 systems were used for our evaluation experiments.

manual score            R-1 recall   R-2 recall
Query focused summaries
  pyramid score            0.859        0.905
  responsiveness           0.806        0.873
Update summaries
  pyramid score            0.912        0.941
  responsiveness           0.865        0.884

Table 1: Spearman correlation between system scores assigned by manual methods and those assigned by ROUGE-1 and ROUGE-2 recall (TAC 2008, 57 systems). All correlations are highly significant with p-value < 0.00001.

2.4 Evaluations

Both manual and automatic evaluations were conducted at NIST to assess the quality of the summaries produced by the systems. Summarizer performance is defined by two key aspects of summary quality: content selection (identification of important content in the input) and linguistic quality (structure and presentation of the selected content). Three methods of manual evaluation were used to assign scores to summaries.
Pyramid eval
The pyramid evaluation method (Nenkova and Passonneau, 2004) has been developed for reliable and diagnostic assessment of content selection quality in summarization and has been used in several large scale evaluations (Nenkova et al., 2007). It uses multiple human models from which annotators identify semantically defined Summary Content Units (SCUs). Each SCU is assigned a weight equal to the number of human model summaries that express it. An ideal maximally informative summary would express a subset of the most highly weighted SCUs, with multiple maximally informative summaries being possible. Four human summaries were used for the annotation. The pyramid score for a system summary is equal to the ratio between the sum of weights of the SCUs expressed in the summary (again identified manually) and the sum of weights of an ideally informative summary with the same number of SCUs.3

3 In addition, pyramids using all combinations of three models were constructed from the four-model pyramid. A human summary was scored against a pyramid comprising the SCUs from the other three model summaries. Jackknifing was implemented for system summaries by comparing them to each of the four three-model pyramids and obtaining a pyramid score from each comparison. The average of these scores is also reported together with the score from the four-model pyramid. The correlation between the two pyramid scores (using three and four models) is very high: 0.9997 for query focused and 0.9993 for update tasks. So we will not discuss the correlations of our features with the three-model pyramid scores.
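As a toy illustration of the pyramid score just defined, a minimal sketch is given below. This is not NIST's scoring code, and the SCU weights in the example are hypothetical.

```python
def pyramid_score(expressed_scu_weights, all_scu_weights):
    """Ratio of the weight of SCUs expressed in the summary to the weight of an
    ideally informative summary expressing the same number of SCUs."""
    n = len(expressed_scu_weights)
    ideal = sum(sorted(all_scu_weights, reverse=True)[:n])  # best possible with n SCUs
    return sum(expressed_scu_weights) / ideal

# A summary expressing SCUs of weight 4, 3 and 1, drawn from a pyramid whose
# heaviest three SCUs weigh 4, 4 and 3, scores (4+3+1)/(4+4+3), about 0.73.
print(pyramid_score([4, 3, 1], [4, 4, 3, 3, 2, 2, 1, 1]))
```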
Responsiveness eval
The responsiveness of a summary is defined as a measure of overall quality combining content selection and linguistic quality: summaries must present useful content in a structured fashion in order to better satisfy the user's need. Assessors directly assigned responsiveness scores on a scale of 1 (poor summary) to 5 (very good summary) to each summary. The assessments are done without reference to any model summaries.

ROUGE
NIST also evaluated the summaries automatically using ROUGE (Lin, 2004; Lin and Hovy, 2003). Comparison between a summary and a set of model summaries is computed using unigram (R1) and bigram (R2) overlaps. The scores were computed after stemming, but stop words were retained in the summaries. Table 1 shows that ROUGE obtains very good correlations with the manual scores for content selection. The correlation with pyramid scores is 0.90 and 0.94 for query focused and update summaries respectively, and 0.87 and 0.88 with responsiveness. Given these observations, ROUGE is a high performance automatic evaluation metric when human summaries are available and sets a high standard for comparison with other automatic evaluation methods.

Linguistic quality questions were used to assess readability and well-formedness of the produced summaries. Assessors scored the well-formedness of a summary on a scale from 1 (very poor) to 5 (very good). Grammaticality, non-redundancy, referential clarity, focus, structure and coherence were the factors to be considered during evaluation. But since our features are designed to capture content selection quality, the manual pyramid and responsiveness scores will be used for comparison with our automatic method.

The correlation between these two evaluations is overall high: 0.885 and 0.923 respectively for query focused and update summarization. Despite this, we use both measures as a reference for comparison with our fully automatic evaluation method because, albeit high, the correlation between them is not perfect (as was the correlation between the two alternative pyramid scores).

3 Features

We describe three classes of features used to compare input and summary content: distributional similarity, summary likelihood and use of topic signatures. Words in both input and summary were stemmed before feature computation.

3.1 Distributional Similarity

Measures of similarity between two probability distributions are a natural choice for the task at hand. We choose to experiment with three common such measures: KL and Jensen-Shannon divergence and cosine similarity. We expect that good summaries are characterized by low divergence between the probability distributions of words in the input and the summary, and by high similarity with the input. Moreover, these three metrics have already been used for summary evaluation, albeit in different contexts.

(Lin et al., 2006) compared the performance of ROUGE with KL and JS divergence for the evaluation of summaries using human models. The divergence between human and machine summary distributions was used as an estimate of summary score. The study found that JS divergence always outperformed KL divergence and that, using multiple human references, the performance of JS divergence was better than standard ROUGE scores for multi-document summarization. JS divergence has also been found useful in other NLP tasks as a good predictor of unseen events (Dagan et al., 1994; Lapata, 2000).

The use of cosine similarity in (Donaway et al., 2000) is more directly related to our work. In that study, it was shown that the difference between evaluations based on two different models is about the same as the difference between system rankings based on one model summary and rankings produced using input-summary comparisons. Cosine similarity with singular value decomposition was used to compare the input with summaries, and only this one approach for similarity comparison was used. In contrast, we explore a variety of features, and the experiments outlined in this paper enable us to compare the usefulness of different similarity measures for automatic evaluation.

Kullback-Leibler (KL) divergence
The KL divergence between two probability distributions P and Q is given by

    D(P || Q) = \sum_{w} p_P(w) \log_2 \frac{p_P(w)}{p_Q(w)}        (1)

It is defined as the average number of bits wasted by coding samples belonging to P using another distribution Q, an approximation of P. In our case, the two distributions are those of words in the input and the summary respectively. However, KL divergence is not symmetric, so the divergence computed both ways, input-summary and summary-input, is used as features. In addition, the divergence is undefined when p_P(w) > 0 but p_Q(w) = 0. We perform simple smoothing to overcome this problem:

    p(w) = \frac{C + \delta}{N + \delta \cdot B}        (2)

Here C is the count of word w and N is the number of tokens. A value of 1.5 times the input vocabulary size was used as an estimate of the number of outcomes (B) of the probability distribution, and δ was set to a small value of 0.0005 to avoid shifting too much probability mass to unseen events.

Jensen-Shannon (JS) divergence
The JS divergence is based on the idea that the distance between two distributions cannot be very different from the average of their distances from their mean distribution. It is formally defined as

    J(P || Q) = \frac{1}{2} \left[ D(P || A) + D(Q || A) \right],

where A = \frac{P + Q}{2} is the mean distribution of P and Q. In contrast to KL divergence, the JS distance is symmetric and always defined. We use both smoothed and unsmoothed versions of the divergence as features.

Similarity between input and summary
The third metric is the cosine overlap between the tf-idf vector representations of input and summary contents:

    \cos\theta = \frac{v_{inp} \cdot v_{summ}}{||v_{inp}|| \, ||v_{summ}||}        (3)

We compute two variants:
1. Cosine overlap between input words and summary words
2. Cosine overlap between topic signatures of the input and words of the summary

Topic signatures are words highly descriptive of the input, as determined by the application of the log-likelihood test (Lin and Hovy, 2000). Using only topic signatures from the input to represent the text is expected to be more accurate and to remove noise from peripherally related content. In addition, the refined input vector has a smaller dimension, suitable for comparison with a vector of summary words, which is typically small compared to a complete bag of words vector of the input.
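The divergence features above can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: it smooths both distributions with the constants given above, and all function and variable names (input_tokens, summary_tokens, and so on) are our own.

```python
import math
from collections import Counter

DELTA = 0.0005  # smoothing constant delta from Equation (2)

def smoothed_dist(tokens, vocab, b):
    """p(w) = (C + delta) / (N + delta * B) over a fixed vocabulary (Equation 2)."""
    counts = Counter(tokens)
    n = len(tokens)
    return {w: (counts[w] + DELTA) / (n + DELTA * b) for w in vocab}

def kl_divergence(p, q):
    """D(P||Q) = sum_w p(w) * log2(p(w)/q(w)); assumes q(w) > 0 everywhere (Equation 1)."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

def js_divergence(p, q):
    """J(P||Q) = 0.5 * [D(P||A) + D(Q||A)], with A the mean distribution of P and Q."""
    a = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * (kl_divergence(p, a) + kl_divergence(q, a))

def divergence_features(input_tokens, summary_tokens):
    """Input-summary divergence features for one (input, summary) pair."""
    vocab = set(input_tokens) | set(summary_tokens)
    b = 1.5 * len(set(input_tokens))  # number of outcomes B, per the paper
    p_inp = smoothed_dist(input_tokens, vocab, b)
    p_sum = smoothed_dist(summary_tokens, vocab, b)
    return {
        "kl_input_summary": kl_divergence(p_inp, p_sum),
        "kl_summary_input": kl_divergence(p_sum, p_inp),
        "js_divergence": js_divergence(p_inp, p_sum),
    }
```

The cosine variants would additionally require tf-idf weights and a precomputed topic signature set, so they are omitted from this sketch.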
3.2 Summary Probabilities

These features capture the log likelihood of a summary given its input. The probability of a word appearing in the summary is estimated from the input. We compute both the unigram bag of words probability and the probability of the summary under a multinomial model. The comparison with ROUGE in (Lin et al., 2006) (described in Section 3.1) also included unigram log likelihood alongside KL and JS divergences; however, JS divergence proved better than the other two.

Unigram summary probability

    (p^{inp}_{w_1})^{n_1} (p^{inp}_{w_2})^{n_2} \cdots (p^{inp}_{w_r})^{n_r}        (4)

where p^{inp}_{w_i} is the probability in the input of word w_i, n_i is the number of times w_i appears in the summary, and w_1 ... w_r are all words in the summary vocabulary.

Multinomial summary probability

    \frac{N!}{n_1! \, n_2! \cdots n_r!} (p^{inp}_{w_1})^{n_1} (p^{inp}_{w_2})^{n_2} \cdots (p^{inp}_{w_r})^{n_r}        (5)

where N = n_1 + n_2 + ... + n_r is the total number of words in the summary.

3.3 Use of input's topic words in summary

Summarizer systems that directly optimize for more topic signatures during content selection have fared very well in evaluations (Conroy et al., 2006). Hence the number of topic signatures from the input present in a summary might be a good indicator of summary content quality. We experiment with two features that quantify the presence of topic signatures in a summary:
1. Percentage of the summary composed of the input's topic signatures
2. Percentage of topic signature words from the input that also appear in the summary

While both features will obtain higher values for summaries containing many topic signature words, the first is guided simply by the presence of any topic word, while the second measures the diversity of topic words used in the summary.
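For the likelihood and topic word features, a similarly hedged sketch is given below. The summary probabilities are computed in log space to avoid numerical underflow, and topic_signatures is assumed to be a precomputed set of the input's topic words (Lin and Hovy, 2000); all names are illustrative, not the authors'.

```python
import math
from collections import Counter

def log_unigram_probability(summary_tokens, input_tokens):
    """Log of Equation (4): product over summary words of p_inp(w_i)^n_i."""
    input_counts = Counter(input_tokens)
    n_input = len(input_tokens)
    logp = 0.0
    for w, n in Counter(summary_tokens).items():
        p = input_counts.get(w, 0) / n_input
        if p == 0.0:
            return float("-inf")  # a summary word unseen in the input
        logp += n * math.log(p)
    return logp

def log_multinomial_probability(summary_tokens, input_tokens):
    """Log of Equation (5): adds the multinomial coefficient N! / (n_1! ... n_r!)."""
    counts = Counter(summary_tokens)
    log_coeff = math.lgamma(len(summary_tokens) + 1) - sum(
        math.lgamma(n + 1) for n in counts.values())
    return log_coeff + log_unigram_probability(summary_tokens, input_tokens)

def topic_word_features(summary_tokens, topic_signatures):
    """The two topic signature features of Section 3.3."""
    summary_types = set(summary_tokens)
    n_topic_tokens = sum(1 for w in summary_tokens if w in topic_signatures)
    return {
        # 1. share of summary tokens that are topic signature words
        "pct_summary_is_topic_word": n_topic_tokens / len(summary_tokens),
        # 2. share of the input's topic signature words covered by the summary
        "pct_topic_words_in_summary": len(summary_types & topic_signatures)
                                      / len(topic_signatures),
    }
```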
3.4 Feature combination using linear regression

We also evaluated the performance of a feature combining all of the above features into a single measure using linear regression. The value of the feature for each summary was obtained using a leave-one-out approach: for a particular input and system-summary combination, a linear regression model using the automatic features to predict the manual evaluation scores was trained. The training set consisted only of examples that included neither the same input nor the same system. Hence during training, no examples of either the test input or the test system were seen.
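A rough sketch of this leave-one-out combination follows, assuming scikit-learn and a list of records (rows) holding an input id, a system id, a feature vector and a manual score. This is our reading of the procedure, not the authors' code, and the field names are our own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def regression_feature(rows):
    """Predict each summary's manual score from a model trained only on
    examples that share neither its input nor its system."""
    predictions = {}
    for row in rows:
        train = [r for r in rows
                 if r["input_id"] != row["input_id"]
                 and r["system_id"] != row["system_id"]]
        X = np.array([r["features"] for r in train])
        y = np.array([r["manual_score"] for r in train])
        model = LinearRegression().fit(X, y)
        predictions[(row["input_id"], row["system_id"])] = float(
            model.predict(np.array([row["features"]]))[0])
    return predictions
```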
4 Comparison to manual evaluations

In this section, we report the correlations between system rankings using our automatic features and manual evaluations. More precisely, the value of each feature was computed for every summary submitted for evaluation. We studied the predictive power of the features in two scenarios:

Macro level, per system: the average feature value across all inputs was used to rank the systems. The average manual score (pyramid or responsiveness) was also computed for each system, and the correlations between the two rankings were analyzed.

Micro level, per input: the systems were ranked for each input separately, and correlations between the summary rankings for each input were computed.

The two levels of analysis address different questions: can we automatically identify system performance across all test inputs (macro level), and can we identify which summaries for a given input were good and which were bad (micro level)? In addition, we compare our results to model-based evaluations using ROUGE and analyze the effects of stemming the input and summary vocabularies.

4.1 Performance at macro level

Table 2 shows the Spearman correlation between the manual and automatic scores averaged across the 48 inputs.

Features                  pyramid score   responsiveness
JSD div                       -0.880          -0.736
JSD div smoothed              -0.874          -0.737
% of ip topic in summ          0.795           0.627
KL div summ-inp               -0.763          -0.694
cosine inp-summ                0.712           0.647
% of summ = topic wd           0.712           0.602
topic overlap                  0.699           0.629
KL div inp-summ               -0.688          -0.585
mult. summ prob                0.222           0.235
unigram summ prob             -0.188          -0.101
regression                     0.867           0.705

Table 2: Spearman correlation between fully automatically computed features and manually assigned system scores (avg. over all test inputs) for the query focused summarization subtask in TAC 2008. All results are highly significant with p-values < 0.000001, except unigram and multinomial summary probability, which are not significant.

We find that both the distributional similarity and the topic signature features obtain rankings very similar to those produced by humans, while the summary probabilities turn out to be unsuitable for the evaluation task. Notably, the linear regression combination of features does not lead to better results than the single best feature: the JS divergence. It outperforms the other features, including the regression metric, and obtains the best correlations with both types of manual scores, 0.88 with pyramid score and 0.74 with responsiveness. The correlation with pyramid score is in fact better than that obtained by ROUGE-1 recall (0.86). Similar results establishing that JS divergence is the most suitable measure for automatic evaluation were reported in a study of model-based evaluation metrics (Lin et al., 2006). In their study of generic multi-document summarization, JS divergence between system and model summaries obtained better correlations with manual rankings than ROUGE overlap scores. Our results provide further evidence that this divergence metric is indeed best suited for content comparison of two texts.

The best topic signature based feature, the percentage of the input's topic signatures that are present in the summary, ranks next only to JS divergence and regression. Given this result, systems optimizing for topic signatures would score well with respect to content, as was observed in previous large scale evaluations conducted by NIST. We also find that the feature simply reflecting the proportion of topic signatures in the summary performs worse as an evaluation metric. This observation leads us to the conclusion that a summary that contains many different topic signatures from the input seems to carry better content than one that contains topic signatures of fewer types.

The simplest comparison metric, cosine overlap of words, performs worse than the best divergence and topic signature features. The modified overlap score between input topic signatures and summary words also fails to obtain very high correlations. The rankings based on unigram and multinomial summary probabilities do not correlate with the manual scores. Almost all systems use frequency in some form to inform content selection, and this could be a reason why likelihood fails to distinguish between the system summaries.

4.2 Performance on micro level

As a more stringent assessment of the automatic evaluation, let us consider the rankings obtained on a per-input basis. These results are summarized in Table 3. The number of inputs for which correlations were significant is reported along with the minimum and maximum correlation values and the number of inputs for which correlations above 0.5 were observed. The results are less spectacular at the level of individual inputs: JS divergence rankings obtain significant correlations with pyramid scores for 73% of the inputs, and only 40% of inputs obtain correlations above 0.5. The results are worse for the other features and for comparison with responsiveness scores. Overall, the micro level results suggest that the fully automatic measures we examined will not be useful for providing information about summary quality for a single input. When averaged over many test sets, however, the fully automatic evaluation measures give more reliable and useful results, highly correlated with the rankings produced by manual evaluations.

4.3 Effects of stemming

So far, the analysis was based on feature values computed after stemming the input and summary words. We also computed the values of the same features without stemming and found that the divergence metrics benefit greatly when stemmed vocabularies are used. The biggest improvements in correlations are for JS and KL divergences with respect to responsiveness. For JS divergence, the correlation increases from 0.571 to 0.736, and for KL divergence (summary-input), from 0.528 to 0.694. Before stemming, the topic signature and bag of words overlap features are the best predictors of responsiveness (correlations of 0.630 and 0.642 respectively) but do not change much after stemming (topic overlap 0.629, bag of words 0.647). The divergences emerge as better metrics only after stemming. Stemming also proves beneficial for the likelihood features. Before stemming, their correlations are in the wrong direction; after stemming they improve to being either positive or closer to zero. However, these probabilities remain unable to produce human-like rankings.
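The stemming step referred to throughout this section could look as follows. This is only a sketch: the paper does not name the stemmer used, so NLTK's Porter stemmer is an assumption on our part.

```python
from nltk.stem.porter import PorterStemmer   # assumed choice of stemmer
from nltk.tokenize import word_tokenize

_stemmer = PorterStemmer()

def stemmed_tokens(text):
    """Lowercase, tokenize and stem a text before computing the features."""
    return [_stemmer.stem(tok) for tok in word_tokenize(text.lower())]
```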
4.4 Difference in correlations: pyramid and responsiveness scores

Overall, we find that correlations with pyramid scores are higher than correlations with responsiveness. Clearly, our features are designed to compare input and summary content only. On the other hand, higher order ROUGE n-gram scores can be expected to capture some aspects of fluency in addition to an estimate of content quality. Since responsiveness judgements were based on both the content and the linguistic quality of summaries, it is not surprising that these rankings are harder to replicate using our content based features.

5 Comparison with ROUGE

Although JS divergence outperforms ROUGE-1 recall for correlations with pyramid scores at the average level, ROUGE-2 recall is still better. Also, ROUGE obtains the best correlations with responsiveness judgements. At the per-input micro level, ROUGE clearly gives the best human-like rankings: ROUGE-1 recall obtains significant correlations for over 95% of inputs and correlations above 0.5 for at least 50% of inputs. The ROUGE results are shown in the last two rows of Table 3. However, when making these comparisons we need to keep in mind that ROUGE evaluates system summaries using four manual models for each input, while the evaluations using our features are fully automatic, with no human summaries at all.

For manual pyramid scores, the best correlation obtained with fully automatic evaluation is 0.88 (JS divergence), while the best correlation with ROUGE is 0.90 (R2). The difference is negligibly small for content-based evaluations. For manual responsiveness scores, which combine aspects of linguistic quality along with content selection evaluation, the best correlations are 0.73 (JS divergence) and 0.87 (R2). For this measure, the difference between ROUGE and the fully automatic comparisons is significant, indicating that our intuition was correct that the proposed metrics will not be suitable for linguistic quality evaluation. Other metrics for linguistic quality need to be explored for this task (Lapata and Barzilay, 2005).

                               pyramid                              responsiveness
features           max     min    sig  sig%  a0.5  a0.5%    max     min    sig  sig%  a0.5  a0.5%
JSD              -0.714  -0.271   35   72.9   19   39.6   -0.654  -0.262   35   72.9   17   35.4
JSD smoothed     -0.712  -0.269   35   72.9   18   37.5   -0.649  -0.279   33   68.8   17   35.4
KL summ-inp      -0.736  -0.276   35   72.9   17   35.4   -0.628  -0.261   35   72.9   13   27.1
% of input sign   0.701   0.286   31   64.6   16   33.3    0.693   0.279   29   60.4    9   18.8
cosine overlap    0.622   0.276   31   64.6    6   12.5    0.618   0.265   28   58.3    4    8.3
KL inp-summ      -0.628  -0.262   28   58.3    8   16.7   -0.577  -0.267   22   45.8    6   12.5
topic overlap     0.597   0.265   30   62.5    5   10.4    0.689   0.277   26   54.2    3    6.3
% summ sign       0.607   0.269   23   47.9    7   14.6    0.534   0.272   23   47.9    1    2.1
mult. summ prob   0.434   0.268    8   16.7    0    0      0.459   0.272   10   20.8    0    0
uni. summ prob    0.292   0.261    2    4.2    0    0      0.466   0.287    2    4.2    0    0
regression        0.736   0.281   37   77.1   14   29.2    0.642   0.262   32   66.7    6   12.5
Rouge-1 recall    0.833   0.264   47   97.9   32   66.7    0.754   0.266   46   95.8   25   52.1
Rouge-2 recall    0.875   0.316   48  100     33   68.8    0.742   0.299   44   91.7   22   45.8

Table 3: Spearman correlations between feature values and manual system scores on a per input basis (TAC 2008 Query Focused summarization). Only the minimum and maximum values of the significant correlations are reported, together with the number of significant correlations and the number of inputs for which correlations above 0.5 were obtained.

6 Evaluation of systems for TAC 2008 Update Summarization task

In the paper, we discussed only the results from evaluations of the query focused summaries produced at TAC 2008. The results for the update task are very similar and all conclusions hold for these data as well. For completeness, we give the correlations between fully automatic and manual evaluations in Table 4.

Macro level:
                               comparison with            avg. of comparisons with
                               update inputs only         update and background inputs
features                       pyramid   responsiveness   pyramid   responsiveness
JSD divergence                 -0.827       -0.764        -0.716       -0.669
JSD divergence smoothed        -0.825       -0.764        -0.713       -0.670
% of ip topic wds in summ       0.770        0.709         0.677        0.616
KL divergence summ-inp         -0.749       -0.709        -0.651       -0.624
KL divergence inp-summ         -0.741       -0.717        -0.644       -0.638
cosine inp-summ                 0.727        0.691         0.649        0.631
% of summary = topic wd         0.721        0.707         0.647        0.636
topic overlap inp-summ          0.707        0.674         0.645        0.619
multinomial summ prob           0.284        0.355         0.152        0.224
unigram summ prob              -0.093        0.038        -0.151       -0.053
regression                      0.789        0.605         0.699        0.522

Regression combining features comparing with background and update inputs (without averaging): correlations = 0.8058 with pyramid, 0.6729 with responsiveness.

Per-input level (update inputs only):
                               pyramid score                        responsiveness
features            max     min    sig  %sig  a0.5  %a0.5    max     min    sig  %sig  a0.5  %a0.5
JSD smoothed      -0.753  -0.269   41   85.4   23   47.9   -0.747  -0.266   36   75.0   16   33.3
JSD               -0.746  -0.291   41   85.4   22   45.8   -0.738  -0.263   36   75.0   16   33.3
KL summ-inp       -0.739  -0.293   41   85.4   20   41.7   -0.705  -0.275   37   77.1   15   31.3
% of sign from inp 0.778   0.277   38   79.2   17   35.4    0.706   0.297   29   60.4   13   27.1
cosine overlap     0.665   0.275   33   68.8   10   20.8    0.685   0.267   28   58.3    7   14.6
% summ sign terms  0.737   0.263   32   66.7   11   22.9    0.672   0.265   28   58.3    6   12.5
topic overlap      0.679   0.264   31   64.6    9   18.8    0.665   0.274   26   54.2    5   10.4
KL inp-summ       -0.663  -0.281   30   62.5    9   18.8   -0.600  -0.285   24   50.0    5   10.4
mult. summ prob    0.479   0.267   12   25.0    0    0.0    0.547   0.262   13   27.1    1    2.1
uni. summ prob     0.363   0.362    1    2.1    0    0.0    0.266   0.266    1    2.1    0    0.0
regression         0.765   0.284   40   83.3   19   39.6    0.659   0.285   29   60.4   10   20.8
ROUGE-1 recall     0.842   0.392   48  100     41   85.4    0.811   0.268   46   95.8   30   62.5
ROUGE-2 recall     0.913   0.355   47   97.9   39   81.3    0.816   0.286   47   97.9   28   58.3

Table 4: Spearman correlations between fully automatic evaluation and manually assigned system scores for update summarization. Results are reported separately for features comparing update summaries with the update input only or with both update and background inputs, averaging the two (macro level). At the per-input level, only results for features comparing with the update inputs are reported.
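All correlations reported in Tables 1 through 4 are Spearman rank correlations between score lists. As a minimal illustration of how such a correlation can be computed, assuming scipy is available and that the dictionaries mapping system ids to average scores are our own convention rather than the paper's:

```python
from scipy.stats import spearmanr

def system_ranking_correlation(manual, automatic):
    """Spearman correlation between per-system manual and automatic scores."""
    systems = sorted(set(manual) & set(automatic))
    rho, p_value = spearmanr([manual[s] for s in systems],
                             [automatic[s] for s in systems])
    return rho, p_value
```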
7 Summarization as optimization: is JSD enough?

We have demonstrated that comparison of input and summary content is predictive of summary quality. Further, our experiments show that a single best feature can approximate the comparison. A natural question arises in this situation: if a summarizer were built that globally optimizes JS divergence during summary creation, wouldn't this input-based evaluation method be voided?

It must be remembered that the goal of summarization is not the selection of good content alone. Summarizers must also reduce redundancy, improve coherence and adhere to a length restriction. Often these goals conflict during summary creation, and satisfying them simultaneously becomes a difficult problem. Studies of global inference algorithms for multi-document summarization (McDonald, 2007; Yih et al., 2007; Hassel and Sjöbergh, 2006) found that optimizing for content is NP-hard and equivalent to the knapsack problem. (McDonald, 2007) further showed that the intractability of a relevance maximization framework increases with the addition of redundancy constraints. Although exact solutions may be found using ILP formulations, they can be used only on small document sets; their huge runtimes make them prohibitively expensive for summarizing large collections of documents. Hence only approximate solutions to the problem are feasible in real world situations. Although some approximate solutions obtain very good results in (McDonald, 2007), we must note that coherence is not included in that framework, and coherence is in fact a multi-faceted constraint requiring consideration of anaphora, discourse relations, cohesion and ordering. Together with coherence constraints, the inference could only become harder. Hence our evaluation method might still be suitable for content evaluation of summaries, provided the overall summarizer scores include judgements of linguistic quality and redundancy as well.

8 Improving input-summary comparisons

Our experiments are clearly a starting point in understanding the role of inputs in summary evaluation. We demonstrated that a simple comparison of summary and input content using suitable features can capture perceptions of summary quality. These features can nevertheless be extended with other capabilities. In fact, our test data come from a query focused summarization task where a topic statement is also available for relevance assessment. We can expect better results by incorporating the query statement during evaluation: for example, we can select portions of the input that are relevant to the query and use only these for comparison with the summary content.

For the update summarization task, we experimented with different sets of features. Using averages of feature values comparing the summary with the background input and with the update input, we obtained lower correlations with manual scores than when features based only on the update input were used. The summary-update input features also outperform a linear regression metric which combines the individual features from comparisons with the background and update inputs (Table 4). This result is not intuitive given the task definition: the background input is an important factor affecting the decision to include a particular content unit from the update set of documents. Further analysis needs to be carried out to ascertain the relative importance of the two input sets and how best to combine their features. We also plan to expand our suite of features.
A handful of other distributional similarity functions remain unexplored for our task and will be a readily accessible set of features: Euclidean distance, Jaccard's coefficient, L1 norm, confusion probability and skew divergence (Lee, 1999).

9 Conclusion

Summarization evaluation has always included human effort, thereby limiting its scale and repeatability. In this paper, we have presented a successful framework for moving towards model-free evaluation, using the input as the reference. We have analyzed a variety of features for input-summary comparison and demonstrated that the strength of the different features varies, with certain features better suited for content comparison. Low divergence from the input and diverse use of topic signatures in the summary are highly indicative of good content. We also find that preprocessing such as stemming is useful in leveraging the capability of some features. Very good results were obtained from a correlation analysis with human judgements, showing that the input can indeed substitute for model summaries and manual effort in summary evaluation. The best correlations were obtained by a single feature, JS divergence (0.9 with pyramid scores and 0.7 with responsiveness). We have shown that the power of model-free evaluation generalizes across at least two summarization tasks: the input is found useful in evaluating both query focused and update summaries. We have also presented a discussion of interesting questions on optimization and evaluation that arise as a result of this work, along with some future directions for input-based evaluation.

References

Breck Baldwin, Robert Donaway, Eduard Hovy, Elizabeth Liddy, Inderjeet Mani, Daniel Marcu, Kathleen McKeown, Vibhu Mittal, Marc Moens, Dragomir Radev, Karen Sparck-Jones, Beth Sundheim, Simone Teufel, Ralph Weischedel, and Michael White. 2000. An Evaluation Road Map for Summarization Research. The Summarization Roadmap.

Ronald Brandow, Karl Mitze, and Lisa F. Rau. 1995. Automatic condensation of electronic publications by sentence selection. Information Processing and Management, 31(5):675-685.

John Conroy, Judith Schlesinger, and Dianne O'Leary. 2006. Topic-focused multi-document summarization using an approximate oracle score. In Proceedings of ACL, short paper.

Ido Dagan, Fernando Pereira, and Lillian Lee. 1994. Similarity-based estimation of word cooccurrence probabilities. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 272-278.

Robert L. Donaway, Kevin W. Drummey, and Laura A. Mather. 2000. A comparison of rankings produced by summarization evaluation measures. In NAACL-ANLP Workshop on Automatic Summarization.

Donna Harman and Paul Over. 2004. The effects of human variation in DUC summarization evaluation. In ACL Text Summarization Branches Out Workshop.

Martin Hassel and Jonas Sjöbergh. 2006. Towards holistic summarization: Selecting summaries, not sentences. In Proceedings of LREC 2006, Genoa, Italy.

Hongyan Jing, Regina Barzilay, Kathleen McKeown, and Michael Elhadad. 1998. Summarization evaluation methods: Experiments and analysis. In AAAI Symposium on Intelligent Summarization.

Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. In IJCAI'05.

Maria Lapata. 2000. The automatic interpretation of nominalizations. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 716-721.

Lillian Lee. 1999.
Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 25-32.

Chin-Yew Lin and Eduard Hovy. 2000. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495-501.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL 2003.

Chin-Yew Lin, Guihong Cao, Jianfeng Gao, and Jian-Yun Nie. 2006. An information-theoretic approach to automatic evaluation of summaries. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 463-470.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In ACL Text Summarization Workshop.

Inderjeet Mani, Gary Klein, David House, Lynette Hirschman, Therese Firmin, and Beth Sundheim. 2002. SUMMAC: A text summarization evaluation. Natural Language Engineering, 8(1):43-68.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In ECIR, pages 557-564.

Kathleen McKeown, Regina Barzilay, David Evans, Vasileios Hatzivassiloglou, Barry Schiffman, and Simone Teufel. 2001. Columbia multi-document summarization: Approach and evaluation. In DUC'01.
Kathleen McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, and Julia Hirschberg. 2005. Do summaries help? A task-based evaluation of multi-document summarization. In SIGIR.

Andrew H. Morris, George M. Kasper, and Dennis A. Adams. 1992. The effects and limitations of automatic text condensing on reading comprehension. Information System Research, 3(1):17-35.

Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In HLT/NAACL.

Ani Nenkova, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2):4.

Paul Over, Hoa Dang, and Donna Harman. 2007. DUC in context. Information Processing and Management, 43(6):1506-1520.

Dragomir Radev and Daniel Tam. 2003. Single-document and multi-document summary evaluation via relative utility. In Poster Session, International Conference on Information and Knowledge Management (CIKM'03).

Dragomir Radev, Simone Teufel, Horacio Saggion, Wai Lam, John Blitzer, Hong Qi, Arda Çelebi, Danyu Liu, and Elliott Drabek. 2003. Evaluation challenges in large-scale multi-document summarization: The MEAD project. In Proceedings of ACL 2003, Sapporo, Japan.

G. J. Rath, A. Resnick, and R. Savage. 1961. The formation of abstracts by the selection of sentences: Part 1: Sentence selection by man and machines. American Documentation, 2(12):139-208.

Dmitri G. Roussinov and Hsinchun Chen. 2001. Information navigation on the web by clustering and summarizing query results. Information Processing and Management, 37(6):789-816.

Tetsuya Sakai and Karen Sparck-Jones. 2001. Generic summaries for indexing in information retrieval. In SIGIR '01: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 190-198.

Ian Soboroff and Donna Harman. 2005. Novelty detection: The TREC experience. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 105-112.

Wen-tau Yih, Joshua Goodman, Lucy Vanderwende, and Hisami Suzuki. 2007. Multi-document summarization by maximizing informative content-words. In Proceedings of IJCAI 2007.

Anastasios Tombros and Mark Sanderson. 1998. Advantages of query biased summaries in information retrieval. In SIGIR '98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2-10.

Hans van Halteren and Simone Teufel. 2003. Examining the consensus between human summaries: Initial experiments with factoid analysis. In HLT-NAACL DUC Workshop.