Ouyang 2011

Information Processing and Management 47 (2011) 227–237
Contents lists available at ScienceDirect
Information Processing and Management

journal homepage: www.elsevier.com/locate/infoproman
Applying regression models to query-focused multi-document

summarization
You Ouyang a,*, Wenjie Li a, Sujian Li b, Qin Lu a
a
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
b
Key Laboratory of Computational Linguistics, Peking University, Ministry of Education, China
a r t i c l e i n f o a b s t r a c t
Article history: Most existing research on applying machine learning techniques to document summariza-
Received 6 January 2009 tion explores either classification models or learning-to-rank models. This paper presents
Received in revised form 8 March 2010 our recent study on how to apply a different kind of learning models, namely regression
Accepted 14 March 2010
models, to query-focused multi-document summarization. We choose to use Support Vec-
Available online 3 April 2010
tor Regression (SVR) to estimate the importance of a sentence in a document set to be sum-
marized through a set of pre-defined features. In order to learn the regression models, we
Keywords:
propose several methods to construct the ‘‘pseudo” training data by assigning each sen-
Query-focused summarization
Support Vector Regression
tence with a ‘‘nearly true” importance score calculated with the human summaries that
Training data construction have been provided for the corresponding document set. A series of evaluations on the
DUC data sets are conducted to examine the efficiency and the robustness of the proposed
approaches. When compared with classification models and ranking models, regression
models are consistently preferable.
Ó 2010 Elsevier Ltd. All rights reserved.
1. Introduction
Document summarization techniques are one way of helping people find information effectively and efficiently. There are
two main approaches to automatic summarization: abstractive approaches and extractive approaches. Abstractive ap-
proaches, which promise to produce summaries that are more like what a human might generate but are limited by the pro-
gress of natural language understanding and generation and the more widely used extractive approaches which rank
sentences by importance, extract salient sentences, and then compose the summary. Sentence ranking has for some time
now been carried out using machine learning techniques. Early techniques usually treated sentence ranking as a binary clas-
sification problem where classification models learnt from sets of ‘‘important” and ‘‘unimportant” sentences sets to build
classification models which were used to identify ‘‘key” sentences. Subsequent learning-to-rank approaches normally
learned from ordered sentence pairs or lists, which are easier to obtain than the training data for classification models. In
either case, most classification and ranking models transform binary or ordering information into continuous values which
are used in classification or ranking.
An alternative to these approaches is offered by regression models, which learn continuous functions which directly esti-
mate the importance of sentences, which can be better characterized as continuous than discrete. Another advantage of
regression models is that their training data uses continuous importance values, unlike the classification or ranking models
which use either discrete sentence labels or ranked sentence pairs. Thus, the learned regression functions should estimate
* Corresponding author. Tel.: +852 6857 0255.

E-mail addresses: csyouyang@comp.polyu.edu.hk (Y. Ouyang), cswjli@comp.polyu.edu.hk (W. Li), lisujian@pku.edu.cn (S. Li), csluqin@comp.polyu.
edu.hk (Q. Lu).
0306-4573/$ - see front matter Ó 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.ipm.2010.03.005
228 Y. Ouyang et al. / Information Processing and Management 47 (2011) 227–237
sentence importance more accurately, depending on the quality and the quantity of the training data, whether it is obtained
manually or automatically.
In this paper, we study how to apply regression models to the sentence ranking problem in query-focused multi-docu-
ment summarization. We implement the regression models using Support Vector Regression (SVR). SVR is the regression
type of Support Vector Machines (Vapnik, 1995) and is capable of building state-of-the-art optimum approximation func-
tions. As data for this study, we will construct ‘‘pseudo” training data automatically from human summaries and then use
these and their document sets to develop and compare several N-gram based methods that estimate ‘‘nearly true” sentence
importance scores. The training data is then used to learn a mapping function from a set of pre-defined sentence features to
these ‘‘nearly true” sentence importance scores. The learned function is then used to predict the importance of the sentences
in the test data. We carry out a series of experiments to evaluate the efficiency and robustness of our approaches.
The remainder of the paper is organized as follows. Section 2 briefly introduces the related work. Section 3 explains the
proposed approach. Section 4 presents experiments, evaluations and discussions. Section 5 concludes the paper.
2. Related work
The application of machine learning techniques in document summarization has a long history. Kupiec, Pedersen, and
Chen (1995) first proposed a trainable summarization approach which adopted word-based features and used a naïve Bayes-
ian classifier to learn feature weights according to a set of given summaries. The learning-based system that combined all the
features performed better than any other system using only one single feature. Many early studies followed this idea and
extended Kupiec et al.’s work by examining more extensive feature sets and/or classification models, such as decision trees
and neural networks. (Chuang & Yang, 2000; Mani & Bloedorn, 1998; Neto, Freitas, & Celso, 2002). Hirao and Isozaki (2002)
used a set of documents in which key sentences had been annotated manually to train a Support Vector Machine (SVM) clas-
sification model to learn how to extract important sentences. They reported that SVM outperformed other machine learning
models such as decision tree or boosting methods in the Japanese Text Summarization Challenge (TSC). Zhou and Hovy
(2003) proposed a Hidden Markov Model (HMM) based approach to estimate the extract desirability of an English sentence
and trained the parameters on the labeled data generated from the Yahoo Full Coverage Collection. The resulting system was
comparable to the best system in the Document Understanding Conference (DUC) 2001 generic summarization data set.
Zhao, Wu, and Huang (2005) applied the Conditional Maximum Entropy (ME) model to the DUC 2005 query-focused sum-
marization task but achieved only mediocre performance. Shen, Sun, Li, Yang, and Chen (2007) presented a Conditional Ran-
dom Fields (CRF) based framework for generic summarization and reported that CRF performed better than many existing
models, such as HMM and SVM. A common feature of all these work is that they all relied on classification models to rank
sentences.
More recently, learning-to-rank models have been examined. Amini, Usunier, and Gallinari (2005) investigated how to
use learning-to-rank models for query-focused single-document summarization and compared the proposed ranking algo-
rithm to a logistic classifier. The ranking algorithm outperformed the logistic classifier. Fisher and Roark (2006) implemented
a perceptron-based ranking system learned from automatically constructed training data. The system ranked eighth of 34
participating systems. At DUC 2007, Toutanova (2007) proposed the PYTHY system which learned a log-linear ranking func-
tion to combine more than 20 features. The PYTHY system was augmented with two sophisticated post-processing pro-
cesses, i.e. heuristic sentence simplification and dynamic sentence scoring. It performed very well and ranked second of
30 DUC participating systems. Learning-to-rank models were also applied to webpage summarization tasks and compared
to classification models (Metzler & Kanungo, 2008; Wang, Jing, Zhang, & Zhang, 2007). Amini and Usunier (2009) presented a
transductive approach that learned the ranking function with few labeled instances. This approach outperformed classifica-
tion models in sentence ranking.
An important requirement of learning-based sentence ranking approaches is that one should have sufficient training data.
Since document summarization tasks usually involve many factors, it takes a lot of time and effort to manually annotate the
training data. To reduce the expense of manual annotation, semi-automatic strategies which use other types of resources to
generate the training data have been widely adopted. The most common manually-generated resources are human summa-
ries which are primarily provided for automatic evaluations. The human summaries have been successfully used to judge
sentence importance and to generate the positive and negative sentence sets for classification models (Chuang, 2000; Kupiec
et al., 1995). They have also been used to judge the preference between sentences and to generate the training data for rank-
ing models (Fisher & Roark, 2006; Toutanova et al., 2007). The reason why human summaries can be used for generating the
training data is that these manually-written summaries are regarded as containing the most important concepts of the ori-
ginal data set and thus can be used to judge sentence importance. Experiment done by Conroy, Schlesinger, and O’Leary
(2006) supported this. They defined an ‘‘Oracle” score which was calculated from the probability distribution of the Uni-
grams in human summaries and found that the summaries generated by directly using the ‘‘Oracle” sentence scores to ex-
tract sentences are even comparable to human summaries on the DUC 2006 data set under the automatic evaluation method
ROUGE. This showed that human summaries can be used effectively in judging the importance of sentences and thus can be
used to generate the training data for the learning models.
During our participation in the DUC competition, we made an initial attempt at applying regression models to the query-
focused multi-document summarization task (Li, Ouyang, Wang, & Sun, 2007). We utilized the human summaries from DUC
Y. Ouyang et al. / Information Processing and Management 47 (2011) 227–237 229
2006 to generate the training data and used it to learn a sentence scoring function with Support Vector Regression. The
learned function was then applied on the DUC 2007 data set to score sentences. In this present study, we follow this regres-
sion-style summarization framework and present a further study on how to develop effective regression-based summariza-
tion approaches. First, we expand the training data construction scheme to four different methods for discovering more
accurate estimation of sentence importance in order to more effectively train the regression models. Second, we adjust
the learning framework to enable direct comparisons between regression models, classification models and learning-to-rank
models.
3. Regression models for query-focused multi-document summarization
Our summarization approach is built upon the typical feature-based extractive framework, which ranks and extracts sen-
tences according to a set of pre-defined sentence features and a composite scoring function. The learning models in our fea-
ture-based approach thus search for an optimum composite scoring function with the fixed feature set.
3.1. Feature design
Since sentences are scored according to their feature values, the features play an important role in sentence scoring and
ranking. In this paper, we design seven features to characterize various aspects of a sentence in query-focused multi-docu-
ment summarization, including three query-dependent and four query-independent features. The features are formulated as
follows.
3.1.1. Word matching feature (query-dependent)

In query-focused summarization tasks, queries directly reflect the information expected in the anticipated answers. So
the words in the queries are especially informative for sentence scoring. A corresponding query-dependent word feature
is then defined as the overlapping degree between the words in a sentence and the words in a query, i.e.,
XX
fword ðsÞ ¼ sameðwi ; wj Þ ð1Þ
wj 2s wi 2q
where f is the feature value, q is the query. The function same(wi, wj) = 1 if wi = wj, and 0 otherwise. This feature considers
exact matches between words in sentences and queries.
3.1.2. Semantic matching feature (query-dependent)

Words which do not explicitly appear in the query may also contain information that is semantically related to the query.
Therefore, we introduce a semantic feature to capture the overlapping semantic information between a sentence and a query
based on the semantic lexicon WordNet, as
XX
fwordnet ðsÞ ¼ similarityðwi ; wj Þ ð2Þ
wj 2s wi 2q
where the function similarity(wi, wj) is the lesk similarity function introduced in (Banerjee & Pedersen, 2002), which scales
the semantic relation between the two words.
3.1.3. Named entity matching feature (query-dependent)

When the query contains a question that explicitly asks for or about some entities, the sentences containing those entities
are more likely to contain the direct answers. Therefore, we believe that named entities are important in scaling the infor-
mation contained in a sentence. The entity-based query-dependent feature is defined as the number of the matched named
entities between a sentence and a query.
fentity ðsÞ ¼ jentityðsÞ \ entityðqÞj ð3Þ
where |entity(s) \ entity(q)| is the number of named entities in both s and q.
3.1.4. Word TF-IDF feature (query-independent)

It is well known that not all words in a text are of equal importance. Many criteria have been proposed for measuring
the information content in the text. In this paper, we use a traditional information retrieval (IR) word weighting scheme,
i.e. the TF-IDF (term frequency-inverse document frequency), to scale the information richness of the words and then in turn
the sentences.
X
ftfidf ðsÞ ¼ tfidf ðwi Þ ð4Þ
wi 2s
where tfidf(wi) is the tfidf score of the word wi in the data set.
3.1.5. Named entity feature (query-independent)

Similarly, we include a query-independent entity-based feature which is defined as the number of named entities in a
sentence.
fentityno ðsÞ ¼ jentityðsÞj ð5Þ
3.1.6. Stop-word penalty feature (query-independent)

As stop-words rarely carry useful information, sentences which contain many stop-words are unlikely to be informative.
We thus define a stop-word based feature as follows.
fstopword ðsÞ ¼ jstopwordðsÞj ð6Þ
where |stopword(s)| is the number of the stop-words in s.
3.1.7. Sentence position feature (query-independent)

On the assumption that authors introduce the main idea or briefly summarize the main content at the beginning of the
text, we further assume that opening sentences are more informative and more important to the document set. We thus de-
fine the position-based feature as follows:
i1
fposition ðsÞ ¼ 1 ð7Þ
n
where n is the total number of the sentences, and s is the ith sentence in the document.
3.2. Regression-style sentence ranking method
A composite function uses the defined features to synthesize the effects of all the features and compute the importance
scores for the sentences. In this paper, we adopt Support Vector Regression (SVR) to learn the scoring function with the fea-
ture set.
The regression model is trained from a set of known topics D which provide the importance score of every sentence. Here
the definition of a topic comes from the DUC competitions, in which each topic contains a given query and a set of relevant
documents. A sentence s in D (actually in the document set of D) is assigned1 with a score indicating its importance score
score(s) and an associated feature vector F(s). The training data is constructed by correlating the sentence’s score and its features
together, i.e., {(score(s), F(s))|s e D}. The target is to predict the score of a new sentence s0 in an unknown topic D0 through its
feature vector F(s0 ). This task can be cast as a typical linear regression problem, i.e., using the training data {(score(s), F(s))|s e D}
to learn the optimum regression function f: F(s) ? R from a candidate function set {f(x) = w x + b|w e Rn, b e R}. For this regres-
sion problem, linear SVR chooses the optimum function f0(x) = w0 x + b0 by minimizing the structure risk function
!
1 1 X
Uðw; bÞ ¼ jjwjj2 þ C Lðscoreðsi Þ ðw Fðsi Þ þ bÞÞ ð8Þ
2 jDj s 2D
i
where L(x) indicates the loss function, C indicates the weight to balance the factors and |D| indicates the number of sentences
in D.
Once the regression function f0 is learned, we use it to provide an estimation of the importance score of the new sentence
s0 by
^reðs0 Þ ¼ f0 ðFðs0 ÞÞ ¼ w0 Fðs0 Þ þ b0
Sco ð9Þ
In the practical tasks, the target summary is usually limited in length to a maximum number of words. To maximize the
information included in the summary, it is better to select the sentences that contain more information but fewer words.
Therefore, the final scoring function is obtained with an additional normalizing process as
1
^renorm ðs0 Þ ¼
Sco ^reðs0 Þ
sco ð10Þ
js0 j
where |s0 | is the number of the words in s0 .
3.3. Model comparison: classification, ranking and regression
Generally, a classification problem is depicted mathematically as follows: given the input as a vector x e Rn and the output
as a binary value y e { +1, 1}, the target function is a function g(x) learned from a labeled training data set D = {(x1, y1), (x-
l, yl)} to approximate the true relationship between input x and output y. A basic principle for finding the best approximation
P
function is to minimize the total classification error li¼1 Lðgðxi Þ; yi Þ on the training data set D, where L is a pre-defined loss
1
The issue of how to gain to assign the score to the sentence will be addressed in Section 3.4.
function (for example, the square loss function L(a, b) = (a b)2). There are many different specific classification models such
as SVM, decision tree, and neutral network. Here we will not look at the specific differences between the classification
models.
Classification-based summarization often categorizes sentences as either worthy of extraction into the summary or not.
The shortcoming of binary decisions is that most of the time too many or too few sentences are taken as summary sentences
relative to the summary length requirement. So it is normally necessary to convert binary decisions into real-valued scores in
some way so that scores can be used to rank sentences and top-ranked sentences can be selected to form a summary.
In contrast, a learning-to-rank problem is depicted mathematically as follows: given the input as a vector x e Rn, and the
output as a rank y e {1, 2, , r}, the target function is a function g(x) learned from a training data set consisted of ordered

output pairs D ¼ x11 ; x21 ; r 1 ; x12 ; x22 ; r 2 ; ; x1l ; x2l ; rl , where ri represents the relative preference between the two inputs x1i
2
Pl 1 2
and xi . The best function should minimize the total ranking error i¼1 L g xi ; xi ; ri on the training data D. When applied to
summarization, most learning-to-rank models first use real-valued scores to make a judgment between the two sentences
and then use pair-wise preference to obtain the full rank of the whole sentence set. This means that the learning-to-rank
models still rank sentences using real-valued scores.
A regression problem can be depicted mathematically as follows: given the input as a vector x e Rn, the output as a real
value y e R, the target function is a function g(x) learned from a labeled training data set D = {(x1, y1), (xl, yl)} to approxi-
mate the true relationship between input x and output y. The best approximation function must also minimize the total error
P
of the predicted value and the true value li¼1 Lðgðxi Þ; yi Þ on the training data D. Summarization methods based upon regres-
sion models attempt to directly construct a mapping function between the feature vectors and the importance scores.
3.4. Training data construction
Learning regression models requires a set of topics in which the sentence importance is known. However, there is no such
kind of training data available to us. Therefore, we should construct the training data in advance. However, it is impractical to
precisely annotate sentence importance manually. In this paper, we adopt an alternative strategy of semi-automatically
assigning ‘‘nearly true” importance scores to the sentences by using several N-gram-based methods and with reference to
human summaries. The basic assumption is that if human summaries are good, the sentences in the documents which
are more similar to those in human summaries are also more likely to be good would thus be likely to be assigned higher
scores.
Given a document set D and a human summary set H = {H1, , Hm} (Hi is the ith human summary), each sentence s in D
will be assigned an importance score score(s|H). The scores are calculated by the N-gram probabilities of s to be recognized as
a summary sentence given the human summaries.
Using the bag-of-words model, the probability of an N-gram t under the ith human summary Hi can be calculated by
pðtjHi Þ ¼ freqðtÞ=jHi j ð11Þ
where freq(t) is the frequency of t in Hi and |Hi| is the number of the words in Hi. To obtain the probability of t under all hu-
man summaries, we consider two strategies, i.e. the Maximum strategy
pMax ðtjHÞ ¼ MaxðpðtjHi ÞÞ ð12Þ

Hi 2H
and the Average strategy
1 X
pAv g ðtjHÞ ¼ pðtjHi Þ ð13Þ
jHj H 2H
i
where |H| is the total number of human summaries in H.

The overall score of a sentence s is calculated using the sum of the probabilities of all the N-grams it contains, as follows (p
can be either pmax or pavg),
X
scoreðsjHÞ ¼ pðt j jHÞ ð14Þ
t j 2s
Analogously, we have another two scoring methods based on N-gram statistics, i.e.
X
scoreMax ðsjHÞ ¼ Maxðfreqðtj Þ=jHi jÞ ð15Þ
Hi 2H
tj 2S
and
X X
scoreAv g ðsjHÞ ¼ ðfreqðt j Þ=jHi jÞ ð16Þ
t j 2S Hi 2H
More specifically, in our experiments we use Uni-grams and Bi-grams.

3.5. Redundancy removal
If the words in two sentences are very similar, the sentences will probably have similar feature values and consequently
also similar sentence scores. Thus it is inevitable for an extractive approach to select certain high scored sentences that con-
vey the same or quite similar information into a summary. Given the fixed length of the summary, a summary that includes
too much redundant information will be missing opportunities to include more relevant information. To alleviate this prob-
lem, the maximum marginal relevance (MMR) approach (Carbonell & Goldstein, 1998) is applied during the process of sen-
tence selection. First, all the sentences are ranked in descending order according to their estimated scores. Then, the
summary sentences are selected iteratively, each time with the current candidate sentence of interest being compared
against the other sentences already chosen. If the sentence is not significantly similar to any sentence already in the sum-
mary (the similarity value of the two sentences is smaller than a given threshold, 0.7 in this study), the sentence is added to
the summary. The iteration is repeated until the length of the summary reaches the upper limit.
4. Experiment and evaluation
4.1. Experiment set-up
We conduct a series of experiments on the query-focused multi-document summarization task initiated by DUC in 2005.
The task requires creating a brief, well-organized, and fluent summary from a set of documents related to a given query. This
task has been specified as the main evaluation task over 3 years (2005–2007) and thus provides a good benchmark for
researchers to exchange their ideas and experiences in this area. Each year, DUC assessors develop a total of about 50
DUC topics. Each topic includes 25–50 newswire documents and a topic description simulating a potential user’s query.
For each topic, four human summarizers are asked to use the related documents to write a 250-word summary that is sub-
sequently used in automatic evaluation.
In all the experiments, the documents and queries are pre-processed by removing the stop-words and stemming the
remaining words. Four types of named entities, including persons, organizations, locations and times, are automatically
tagged by GATE (Cunningham, Maynard, Bontcheva, & Tablan, 2002). According to the task definitions, system-generated
summaries are strictly limited to a length of 250 English words. After sentence scoring, the highest scored sentences are se-
lected from the original documents into the summary until the word (actually the sentence) limitation is reached. Consid-
ering the focus of this study, no post-processing such as sentence compression or fusion is carried out. SVMlight (Joachims,
1999) is used to implement SVR and the parameters of SVMlight are set to default values.
4.2. Evaluation metrics
In DUC, the system-generated summaries are evaluated against several manual and automatic evaluation metrics (Dang,
2005). In this paper, we use two of the DUC automatic evaluation criteria, namely ROUGE-2 and ROUGE-SU4,2 to compare our
systems (implemented using the proposed approaches) with human summarizers and top-performing DUC systems. ROUGE
(Recall Oriented Understudy for Gisting Evaluation) (Lin & Hovy, 2002) is a state-of-the-art automatic summarization evaluation
method that makes use of N-gram comparison. It evaluates the system summaries by comparing them with human summaries.
For example, ROUGE-2 evaluates a system summary by matching its Bi-grams against the human summaries, i.e.,
Ph P
j¼1 t 2S Countðt i jS; Hj Þ
Rn ðsÞ ¼ Ph Pi ð17Þ
j¼1 t i 2S Countðt i jHj Þ
where S is the summary to be evaluated, Hj (j = 1, 2, . . . , h) is the jth human summary regarded as the gold standard. ti indi-
cates the Bi-grams in the summary S, Count(ti |Hj) is the number of times the Bi-gram ti occurred in the jth human summary
Hj and Count(ti |S, Hj) is the number of times ti occurred in both S and Hj.
ROUGE-SU4 is very similar to ROUGE-2. It matches Uni-grams and skip-Bi-grams of a summary against human summa-
ries instead of Bi-grams. A skip-Bi-gram is a pair of words in their sentence order, allowing for gaps within a limited size. A
more detailed description of ROUGE can be found in (Lin & Hovy, 2002). Although ROUGE uses simple N-gram statistics, it
works well in DUC. For example, in the DUC 2005, ROUGE-2 had a Spearman correlation of 0.95 and a Pearson correlation of
0.97 compared with human evaluation.
4.3. Experiment 1: comparison of feature weights automatically learned and manually assigned on the DUC 2005
In the first experiment, we use all the features introduced in Section 3.2 in all the runs of the SVR-based summarization
system. Therefore, the influential factor to the results is the training data used for learning the regression functions. Recall
that in Section 3.4 we introduced four methods using two kinds of N-grams (including Uni-gram and Bi-gram) and two
2
We run ROUGE-1.5.5 with the parameters ‘‘n 2 x m 2 4 u c 95 r 1000 f A p 0.5 t 0 d”.
alternative scoring strategies (including Maximum and Average). In this experiment, we conduct a comparison of these
methods for model learning. The ‘‘oracle” scoring method (Conroy and Schlesinger, 2006) is also implemented for reference.
Here the ‘‘oracle” score is used to learn the regression function like the proposed methods in Section 3.4. Besides the SVR-
based systems, we also develop several baseline systems that use linear functions with manually assigned weights to com-
bine the features. Since manually assigned weights may happen to produce the worst or the best performance, four different
sets of weights are chosen. All the systems are evaluated on the DUC 2005 (i.e. the first year of query-focused multi-docu-
ment summarization) data set. The training data for the learning-based system is constructed from the DUC 2006 data set.
For simplicity and consistency, we use the DUC 2006 data set to construct the training data in most of the follow-up exper-
iments unless otherwise stated.
Table 1 presents the average ROUGE-2 and ROUGE-SU4 scores and the corresponding 95% confidential intervals of the
systems on the DUC 2005 data set. The systems with the same features but different composite functions performed differ-
ently. The best system outperformed the worst one by more than 30%. This means that the efficiency of feature combination
is indeed very important to the sentence scoring function. In general, the learned regression functions perform better than
the linear combinations with manually assigned weights. This clearly demonstrates the advantages of using regression mod-
els to supervise the combination process. Of the regression-based systems, the one learned by the ‘‘Uni+Max” training data
construction method was the best.
4.4. Experiment 2: different features on the DUC 2005
No matter which models are used, the features always have a direct influence on the efficiency of the composite scoring
function. In this second experiment, we test the systems with different feature sets. We gradually combined the seven fea-
tures introduced in Section 3.2 to form nine different feature sets and scoring functions. The training data is all constructed
using the ‘‘Uni+Max” method. Again the evaluations are carried out on the DUC 2005 data sets. Table 2 illustrates the ROUGE
scores of the system with different feature sets. The results suggest that SVR is effective in searching for the optimal weights.
When more appropriate features are involved, the SVR model is capable of tuning the weights of the incremental feature sets
to achieve better composite functions.
4.5. Experiment 3: comparison of different learning models
The purpose of the third experiment is to compare the effectiveness of different machine learning models in sentence
importance estimation, including regression models, classification models and learning-to-rank models. Here we consis-
tently use the Support Vector Machine to implement all the models to make the comparison more fair, i.e., Support Vector
Classification (Vapnik, 1995), Support Vector Regression (Vapnik, 1995) and ranking SVM (Joachims, 2002).
The ‘‘Uni+Max” method is used to construct the training data for all these models. After scoring all the sentences in a topic
against human summaries, we normalize all the estimated scores by the maximum one among them. The training data for
the regression models is constructed by pairing the normalized score and the feature vector for every sentence. The training
data for the classification models is constructed by using the sentences with scores above 0.7 as the positive sentences and
the sentences with scores below 0.3 as the negative sentences. The training data for the ranking models is constructed by
pairing any two sentences whose score gap is larger than 0.5. In our implementation, all the thresholds are obtained
experimentally.
Table 1
Results of learned and manually assigned weights on DUC 2005.
Submission Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Uni+Max 0.0757 0.1335
(0.0720, 0.0791) (0.1300, 0.1370)
Uni+Avg 0.0747 0.1319
(0.0711, 0.0783) (0.1284, 0.1355)
Oracle 0.0726 0.1298
(0.0691, 0.0759) (0.1263, 0.1331)
Bi+Avg 0.0713 0.1268
(0.0681, 0.0745) (0.1237, 0.1299)
Bi+Max 0.0712 0.1271
(0.0675, 0.0741) (0.1239, 0.1302)
Human Assigned Weight 1 0.0657 0.1221
(0.0623, 0.0691) (0.1183, 0.1259)
(0.0606, 0.0683) (0.1124, 0.1219)
(0.0586, 0.0651) (0.1135, 0.1208)
(0.0537, 0.0606) (0.1038, 0.1132)
Table 2
Result of combining different features with Uni+Max on DUC 2005.
fcentroid fword fposition fstopword fentity + fentityno fwordnet Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)
p
0.0603 0.1111
(0.0571, 0.0635) (0.1071, 0.1152)
p
0.0628 0.1222
(0.0598, 0.0656) (0.1190, 0.1252)
p p
0.0641 0.1171
(0.0612, 0.0670) (0.1142, 0.1199)
p p
0.0706 0.1276
(0.0673, 0.0738) (0.1243, 0.1310)
p p p
0.0709 0.1265
(0.0675, 0.0740) (0.1230, 0.1299)
p p p
0.0729 0.1321
(0.0694, 0.0761) (0.1288, 0.1356)
p p p p
0.0747 0.1331
(0.0711, 0.0781) (0.1295, 0.1367)
p p p p p
0.0751 0.1331
(0.0715, 0.0786) (0.1296, 0.1366)
p p p p p p
0.0757 0.1335
(0.0720, 0.0791) (0.1300, 0.1370)
Table 3
Results of different learning models on DUC2005.
Model Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Regression 0.0757 0.1335
(0.0720, 0.0791) (0.1300, 0.1370)
Ranking 0.0715 0.1287
(0.0682, 0.0748) (0.1253, 0.1322)
Classification 0.0641 0.1208
(0.0612, 0.0669) (0.1175, 0.1241)
Table 4
Results of different learning models on DUC2006.

(0.0883, 0.0969) (0.1443, 0.1525)
Ranking 0.0890 0.1443
(0.0852, 0.0928) (0.1403, 0.1477)
(0.793, 0.0876) (0.1344, 0.1428)
We first run the systems on the DUC 2005 data set. Table 3 provides the average ROUGE-2 and ROUGE-SU4 scores and the
corresponding 95% confidential intervals. As expected, regression models outperform both classification models and ranking
models. To further prove this result, we extend the experiment to the DUC 2006 and the 2007 data sets. When evaluated on
the DUC 2006 data set, we use 10% of the topics to construct training data in a 10-fold cross-validation process. The results
are presented in Tables 4 and 5 respectively. The evaluations on DUC 2006 and 2007 confirm the advantages of the regres-
sion-style learning approach in feature combination. Moreover, the success of utilizing more information shows that the
‘‘pseudo” scores are reliable estimations of the sentence importance.
4.6. Experiment 4: comparison with the DUC participating systems
Next, we compare our summarization systems to the state-of-the-art systems. Tables 6–8 present the ROUGE results of
the SVR-based systems (shaded grey) together with the top performing systems in DUC (labeled 15, 17 and so on), the worst
human summary (labeled as H, A and so on), and the NIST baseline that returns the first 250-words of the most recent doc-
ument for each document set. As showed in the results, our ‘‘Uni+Max” system is able to outperform all the participating
systems in the DUC 2005 (31 systems in all), stand at the second position in the DUC 2006 (34 systems in all) and the fifth
position in the DUC 2007 (30 systems in all). This shows that the proposed approaches are competitive when compared to
state-of-art summarization systems. As a matter of fact, the participating systems have improved a lot over the 2 years. This
is the reason why the rank of our system drops from 2005 to 2007.
Table 5
Results of different learning models on the DUC2007.

(0.1084, 0.1164) (0.1608, 0.1695)
Ranking 0.1075 0.1616
(0.1032, 0.1120) (0.1573, 0.1659)
(0.0967, 0.1055) (0.1519, 0.1606)
Table 6
Comparing with the DUC 2005 participating systems.

H 0.0880 0.1471
(0.0770, 0.0998) (0.1366, 0.1594)
Uni+Max 0.0757 0.1335
(0.0720, 0.0791) (0.1300, 0.1370)
15 0.0738 0.1326
(0.0711, 0.0764) (0.1300, 0.1354)
17 0.0726 0.1298
(0.0692, 0.0760) (0.1263, 0.1331)
8 0.0706 0.1285
(0.1256, 0.1313) (0.0677, 0.0735)
NIST baseline 0.0416 0.0885
(0.0386, 0.0446) (0.0842, 0.0924)
Table 7

A 0.1001 0.1648
(0.0898, 0.1123) (0.1574, 0.1734)
24 0.0950 0.1534
(0.0907, 0.0992) (0.1494, 0.1574)
Uni+Max 0.0926 0.1485
(0.0883, 0.0969) (0.1443, 0.1525)
15 0.0900 0.1448
(0.0858, 0.0942) (0.1411, 0.1488)
12 0.0892 0.1457
(0.0848, 0.0938) (0.1415, 0.1499)
(0.0451, 0.0534) (0.0918, 0.1006)
4.7. Experiment 5: comparison of training on the different DUC data sets
In all the preceding experiments, the models were learned on the same training data constructed from the DUC 2006 data
set. In the final experiment, we examine the performance of the SVR-based systems trained with different training data sets.
We use the ‘‘Uni+Max” method to generate three sets of training data from each of the DUC 2005, 2006 and 2007 data sets.
The regression functions are then trained on each training data set before being evaluated on the other data sets. When train-
ing and evaluation are carried out on the different data sets, we use entire data sets. When training and evaluation are car-
ried out on the same data sets, we use 10% of the topics in the data sets are used to construct the training data and 10-fold
cross-validations. Tables 9 and 10 present the 3 3 training-evaluation result matrices. The systems trained on the DUC
2006 data set performs better than those trained on DUC 2005 or DUC 2007. This shows that the source data for constructing
the training data also influences the efficiency of model learning. In future work, we would like to further investigate how to
adapt the learned models to a new data set with completely different topics or topic distributions, and how to choose the
training data among the known topics given the new topic to be summarized.
4.8. Discussion
Experiment 1 has shown that of the four training data construction methods, the training data generated by Uni-gram
methods are in general better for training regression functions than Bi-gram methods. This may be because in the original
document sets, Bi-grams are spread much more sparsely than Uni-grams. For example, many more sentences that receive
Table 8

H 0.1289 0.1840
(0.1154, 0.1422) (0.1737, 0.1931)
15 0.1239 0.1750
(0.1189, 0.1288) (0.1701, 0.1897)
29 0.1201 0.1694
(0.1152, 0.1249) (0.1648, 0.1740)
4 0.1181 0.1679
(0.1133, 0.1226) (0.1638,0.1718)
24 0.1176 0.1743
(0.1128, 0.1228) (0.1696, 0.1793)
Uni+Max 0.1133 0.1652
(0.1084, 0.1164) (0.1608, 0.1695)
13 0.1115 0.1630
(0.1067, 0.1164) (0.1581, 0.1678)
(0.0561, 0.0639) (0.0995, 0.1077)
Table 9
ROUGE-2 results of regression models built from different data sources.
Test 2005E 2006E 2007E

Train
2005T 0.0684 0.0860 0.1073
(0.0654, 0.0714) (0.0818, 0.0904) (0.1036, 0.1112)
2006T 0.0757 0.0926 0.1133
(0.0720, 0.0791) (0.0883, 0.0969) (0.1084, 0.1164)
2007T 0.0612 0.0843 0.1073
(0.0583, 0.0643) (0.0799, 0.0886) (0.1027, 0.1117)
Table 10
RougeSU-4 results of regression models built from different data sources.
Test 2005E 2006E 2007E

Train
2005T 0.1252 0.1411 0.1597
(0.1221, 0.1284) (0.1371, 0.1448) (0.1555, 0.1638)
2006T 0.1335 0.1485 0.1652
(0.1300, 0.1370) (0.1443, 0.1525) (0.1608, 0.1695)
2007T 0.1154 0.1398 0.1595
(0.1122, 0.1188) (0.1355, 0.1437) (0.1553, 0.1638)
zero scores in Bi-gram methods (which means the Bi-grams in these sentences never appear in human summaries) than in
Uni-gram methods, at about 75% vs. 20%. In other words, the data sparseness problem is more serious in Bi-gram methods
than in Uni-gram methods. This unavoidably influences the performance of machine learning approaches and accounts for
the better performance of Uni-gram methods. On the other hand, the ‘‘Maximum” and ‘‘Average” strategies differ very little.
In Experiment 5 we also compared the effects of different training data on model learning. The results show that the qual-
ity of the training data varies as the data resources change. This may be because the similarity scores are estimates of the
‘‘true” sentence importance scores. As the accuracy of the estimates varies on different topics, the quality of the generated
training data also varies. Therefore, when the source data changes, the quality of the generated training data using the same
construction method may also differ. The crucial factor here is how well the similarity measure based upon human summa-
ries can estimate ‘‘true” importance.
To sum up, a good training data set for learning regression models requires (1) a good topic set with well-written human
summaries, and (2) an appropriate method for sentence importance estimation.
5. Conclusion and future work
This paper has presented our studies of how to develop a regression-style sentence ranking scheme for query-focused
multi-document summarization. We examined different methods for constructing the training data based on human sum-
maries. We also presented what we have learned on how to construct good training data and compared the effectiveness of
different learning models for sentence ranking. Our experiments have shown that regression models are to be preferred over
classification models or learning-to-rank models for estimating the importance of the sentences. We also showed that the
resulting summarization system based on the proposed ranking approach is competitive on the DUC 2005–2007 data sets.
In future work we will examine the effectiveness of regression models in a wider variety of summarization tasks as well as
considering a greater number of features, such as features based on document structure or sentence relations, which would
allow us to model documents in a more sophisticated way.
Acknowledgements
The work described in this paper was partially supported by Hong Kong RGC Projects (PolyU5211/05E and PolyU5217/
07E), NSFC programs (60603093 and 60875042) and 973 National Basic Research Program of China (2004CB318102).
References
Amini, M. R., Usunier, N. (2009). Incorporating prior knowledge into a transductive ranking algorithm for multi-document summarization. In Proceedings of
the 32nd international ACM SIGIR conference on research and development in information retrieval, poster session (pp 704–705).
Amini, M. R., Usunier, N., & Gallinari, P. (2005). Automatic text summarization based on word-clusters and ranking algorithms, ECIR 2005. In D. E. Losada & J.
M. Fernández-Luna (Eds.). LNCS (Vol. 3408, pp. 142–156). Heidelberg: Springer.
Banerjee, S., & Pedersen, T. (2002). An adapted lesk algorithm for word sense disambiguation using WordNet. In Proceedings of the third international
conference on computational linguistics and intelligent text processing (CICLING-02) (pp. 136–145).
Carbonell, G. J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the
21st annual international ACM SIGIR conference on research and development in information retrieval (pp. 335–336). Melbourne, Australia.
Chuang, W. T., & Yang, J. (2000). Extracting sentence segments for text summarization: a machine learning approach. In Proceedings of the 23rd annual
international ACM SIGIR conference on research and development in information retrieval (pp. 152–159).
Conroy, J. M., Schlesinger, J. D., & O’Leary, D. P. (2006). Topic-focused multi-document summarization using an approximate oracle score. In Proceedings of
the COLING/ACL 2006 main conference poster sessions (pp. 152–159).
Cunningham, H., Maynard, D., Bontcheva, K., & Tablan, V. (2002) GATE: A framework and graphical development environment for robust NLP tools and
applications. In Proceedings of the 40th anniversary meeting of the association for computational linguistics (pp. 168–175).
Dang, H. T. (2005). Overview of DUC 2005. In Document understanding conference 2005. <http://duc.nist.gov>.
Fisher, S., & Roark, B. (2006). Query-focused summarization by supervised sentence ranking and skewed word distributions. In Document understanding
conference 2006. <http://duc.nist.gov>.
Hirao, T., & Isozaki, H. (2002). Extracting important sentences with support vector machines. In Proceedings of the 19th international conference on
computational linguistics (pp. 342–348).
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods – Support vector
learning. MIT-Press.
Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM conference on knowledge discovery and data mining (KDD).
Kupiec, J. M., Pedersen, J., & Chen, F. (1995). A trainable document summarizer. In Edward A. Fox, Peter Ingwersen, Raya Fidel (Eds.), Proceedings of the 18th
annual international ACM SIGIR conference on research and development in information retrieval (pp. 68–73).
Li, S., Ouyang, Y., Wang, W., & Sun, B. (2007). Multi-document summarization using support vector regression. In Document understanding conference 2007.
<http://duc.nist.gov>.
Lin, C. Y., & Hovy, E. (2002). Manual and automatic evaluation of summaries. In Document understanding conference 2002. <http://duc.nist.gov>.
Mani, I., & Bloedorn, E. (1998). Machine learning of generic and user-focused summarization. In Proceedings of the 15th national/10th conference on artificial
intelligence/innovative applications of artificial intelligence (pp. 820–826). Madison, WI, United States.
Metzler, D., Kanungo, T. (2008). Machine learned sentence selection strategies for query-biased summarization. In SIGIR 2008 workshop on learning to rank
for information retrieval.
Neto, J. L., Freitas, A. A., & Celso, A. A. (2002). Kaestner. Automatic text summarization using a machine learning approach. In Proceedings of the 16th Brazilian
symposium on artificial intelligence: Advances in artificial intelligence (pp. 205–215).
Shen, D., Sun, J., Li, H., Yang, Q., Chen, Z. (2007). Document summarization using conditional random fields. In Proceedings of the 20th international joint
conference on Artificial intelligence (pp. 2862–2867).
Toutanova, K. et al. (2007). The PYTHY summarization system: Microsoft research at DUC 2007. In Document understanding conference 2007. <http://
duc.nist.gov>.
Vapnik, V. N. (1995). The nature of statistical learning theory. Springer.
Wang, C., Jing, F., Zhang, L., & Zhang, H. (2007). Learning query-biased web page summarization. In Proceedings of the 16th ACM conference on conference on
information and knowledge management (pp. 552–562).
Zhao, L., Wu, L., & Huang, X. (2005). Fudan University at DUC 2005. In Document understanding conference 2005. <http://duc.nist.gov>.
Zhou, L., & Hovy, E. (2003). A web-trained extraction summarization system. In Proceedings of HLT-NAACL 2003 (pp. 205–211).
You Ouyang is currently a Ph.D. student in the Department of Computing, the Hong Kong Polytechnic University, Hong Kong. He received the B.Sc. and M.Sc.
degree in Peking University, Beijing, China, in 2004 and 2007 respectively. His main research interests include statistical natural language processing, text
mining and data mining.
Wenjie Li is currently an assistant professor in the Department of Computing, the Hong Kong Polytechnic University, Hong Kong. She received her Ph.D.
degree from department of systems engineering and engineering management in the Chinese University of Hong Kong, Hong Kong, in 1997. Her main
research topics include natural language processing, information extraction and temporal information processing.
Sujian Li is currently an assistant professor in the Key Laboratory of Computational Linguistics, Peking University, China. Her main research topics include
Information Extraction, Automatic Indexing, Computational Linguistics.
Qin Lu is currently a professor and associate head of the Department of Computing, the Hong Kong Polytechnic University, Hong Kong. Her research has
been on open systems especially in interoperability and internationalization, Chinese computing and natural language processing. She is currently the
rapporteur of the ISO/IEC/JTC1/SC2/WG2’s ideographic Rapporteur Group for the standardization of ideograph characters in the ISO/IEC 10646 standard.

Ouyang 2011

Uploaded by

Document Informationclick to expand document information

Copyright:

Available Formats

Ouyang 2011

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ouyang 2011

Uploaded by

Copyright:

Available Formats

Information Processing and Management 47 (2011) 227–237

Contents lists available at ScienceDirect

Information Processing and Management

Applying regression models to query-focused multi-document

* Corresponding author. Tel.: +852 6857 0255.

3. Regression models for query-focused multi-document summarization

3.1. Feature design

3.1.1. Word matching feature (query-dependent)

3.1.2. Semantic matching feature (query-dependent)

3.1.3. Named entity matching feature (query-dependent)

fentity ðsÞ ¼ jentityðsÞ \ entityðqÞj ð3Þ

where |entity(s) \ entity(q)| is the number of named entities in both s and q.

3.1.4. Word TF-IDF feature (query-independent)

3.1.5. Named entity feature (query-independent)

3.1.6. Stop-word penalty feature (query-independent)

3.1.7. Sentence position feature (query-independent)

3.2. Regression-style sentence ranking method

3.3. Model comparison: classiﬁcation, ranking and regression

3.4. Training data construction

pðtjHi Þ ¼ freqðtÞ=jHi j ð11Þ

pMax ðtjHÞ ¼ MaxðpðtjHi ÞÞ ð12Þ

and the Average strategy

where |H| is the total number of human summaries in H.

More speciﬁcally, in our experiments we use Uni-grams and Bi-grams.

3.5. Redundancy removal

4. Experiment and evaluation

4.1. Experiment set-up

4.2. Evaluation metrics

4.4. Experiment 2: different features on the DUC 2005

4.5. Experiment 3: comparison of different learning models

Submission Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Model Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Model Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

4.6. Experiment 4: comparison with the DUC participating systems

Model Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Submission Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Submission Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

4.7. Experiment 5: comparison of training on the different DUC data sets

Submission Average ROUGE-2 (CI) Average ROUGE-SU4 (CI)

Test 2005E 2006E 2007E

Test 2005E 2006E 2007E

5. Conclusion and future work

You might also like