
Yirdaw 2012


Topic-based Amharic Text Summarization with Probabilistic Latent Semantic Analysis

Eyob Delele Yirdaw
Addis Ababa University
School of Information Science
Addis Ababa, Ethiopia
eyob.delele@aau.edu.et

Dejene Ejigu
Addis Ababa University
Department of Computer Science
Addis Ababa, Ethiopia
dejene.ejigu@aau.edu.et

ABSTRACT
This paper investigates the problem of building a concept-based single-document Amharic text summarization system. Because local languages like Amharic lack extensive linguistic resources, we propose to use statistical approaches called topic modeling to create our text summarizer. The proposed algorithms are language and domain independent and hence can also be used for other local languages. More specifically, we propose to use the topic modeling approach of probabilistic latent semantic analysis (PLSA).

We show that a principled use of the term by concept matrix that results from a PLSA model can help produce summaries that capture the main topics of a document. We propose and test six algorithms to help explore the use of the term by concept matrix. All of the algorithms have two common steps. In the first step, keywords of the document are selected using the term by concept matrix. In the second step, sentences that best contain the keywords are selected for inclusion in the summary. To take advantage of the kind of texts we experiment with (news articles), the algorithms always select the first sentence of the document for inclusion in the summary. After experimenting with a corpus of news articles from different categories at different extraction rates, the results obtained are encouraging.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing - Language models, Text analysis.

General Terms
Algorithms, Experimentation.

Keywords
Amharic Text Summarization, Keyword Approach, Probabilistic Latent Semantic Analysis.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
MEDES'12, October 28-31, 2012, Addis Ababa, Ethiopia.
Copyright 2012 ACM 978-1-4503-1755-9/10/10...$10.00.

1. INTRODUCTION
People in various areas have become increasingly exposed to vast amounts of information. Their work usually involves filtering the main substance out of a large amount of potentially relevant information. However, given the limited time available to process this information and make the best of it, computer-assisted automatic text summarization is a sensible solution.

In Ethiopia, a huge amount of digital data is currently being made available in Amharic. Amharic is the second most spoken Semitic language in the world, after Arabic, and is the official working language of the federal government and of many regional states. Many organizations and individual users are in need of efficient Amharic text processing tools such as retrieval, classification and summarization systems.

A number of works apply concept-based approaches to the summarization of English documents. Some examples include [18] (latent semantic analysis (LSA) and anaphoric resolution), [9] (named-entity recognizer, phrasal parser and entity co-reference) and [10] (PLSA). These systems are shown to perform among the best in their respective evaluation tasks.

A number of research works have been carried out on the summarization of Amharic documents. Kamil [13] and Abraham [1] employ surface level statistical features to summarize Amharic text. Teferi [20] uses a Naive Bayes machine learning approach based on four surface level features. However, none of the three [13, 1, 20] is concept-based. Melese [15] has used an LSA approach to address this gap for single-document summarization. However, LSA has a drawback, namely its unsatisfactory statistical foundations [11]. The development of a high performance summarizer is quite a challenging task. Effective concept-based methods need to be applied to provide satisfactory solutions to this problem.

In this paper, we investigate the application of language independent statistical methods called topic modeling for concept-based single-document Amharic text summarization. The topic modeling approaches considered are LSA [7], PLSA [11] and latent Dirichlet allocation (LDA) [3]. Current research has shown that representing the content of a text with the language and domain-independent methods of topic modeling is a promising approach to text summarization. This is especially true for local languages such as Amharic, which lack extensive linguistic resources that can be used to identify important concepts of a document. We develop an extractive text summarization system that tries to identify those sentences that are considered most important to the user.

This paper is organized as follows. Section 2 presents discussion of the related work. Section 3 presents the proposed solution for single-document Amharic text summarization. In section 4, we

discuss the experiments carried out. Finally, section 5 concludes the work and suggests directions for further research.

2. RELATED WORK
Text summarization systems based on latent variable models have gained significant momentum recently. Central to all systems is the decomposition of documents and words into topics that are smaller in number than the sentences in the documents. After the topics of the documents have been identified, a variety of ranking algorithms can be applied to extract the most informative sentences.

Gong and Liu [8] create a term by sentence matrix, $A$, of a single document to be summarized. They use LSA to decompose the resulting matrix into $A = U \Sigma V^T$, where the columns of $U$ and $V$ are the left and right singular vectors respectively and $\Sigma$ is a diagonal matrix. The authors argue that a good generic text summary of a document consists of the main topics of the document with minimum redundancy. Thus, by taking one sentence from each of the right singular vectors with the largest singular values, they reach their objective of selecting the representative sentences of the document. Their approach however has the drawback of possibly selecting sentences from topics that may not be important with respect to the document.

Steinberger et al. [18] improve Gong and Liu's work by following a more elaborate approach. They create the matrix $\Sigma^2 V^T$ and get a new representation of the sentences different from the usual $V^T$. The authors then calculate the Euclidean distance of each sentence from the origin of the resulting reduced vector space representation. This distance is used as a score for each sentence. Sentences with the highest score are then selected for inclusion in the summary. This enables the algorithm to choose sentences with the largest combined weight across all topics, resulting in the possible inclusion of more than one sentence about an important topic.

Melese [15] has shown that previous LSA-based approaches for the summarization of single documents can be improved by taking advantage of the term by concept matrix that results from applying LSA to the document. He developed TopicLSA, an Amharic summarization system based on LSA. He starts by constructing the term by sentence matrix $A$. After applying LSA to matrix $A$, dimensionality reduction follows to give a reduced representation of $A$ given by $A \approx U \Sigma V^T$. Unlike previous works that use $V^T$ to rank sentences of the document, he proposes to use $U$ to extract important concepts of the document. He argues that the term by concept matrix $U$ has a richer representation of the concepts of the document than the concept by sentence matrix $V^T$. For each column of the matrix $U$, the top $m$ terms that have a high index value in the column are selected and all the resulting terms are concatenated to form a topic vector $P$. The topic vector is then folded into the latent space of $V^T$ for similarity comparison with the sentences of the document. Sentences of the document have three scores. The first one is the cosine similarity of a sentence with $P$ in the latent space. The second one is the cosine similarity of the sentence with the title of the document in the latent space. The third score is $1/n$, where $n$ denotes the position of the sentence from the beginning of the document. The three scores are then summed up to give the overall score for the sentence. The sentences with the highest score are then selected for inclusion in the summary.

Bhandari et al. [2] present a summarization system using PLSA for the creation of extractive generic summaries of single documents. The authors propose four different algorithms using PLSA: PROC1, PROC2, PROC3 and PROC4, where the best performing system is PROC3. PROC3 makes use of the fact that PLSA divides the document into a number of topics. The algorithm scores sentences so that sentences which have good influence ranging over several topics score better. The sentence score is given by $p(s) = \sum_z p(s|z)\,p(z)$.

As discussed above, TopicLSA first selects some important words of the document and uses them to rank sentences. Thus, the approach can be named a keyword approach. The other approaches discussed above use only the sentence features that result from the topic modeling procedure. We call them sentence-based approaches. In this work, we plan to extend the keyword approach for text summarization.

The term selection procedure of TopicLSA selects the top ranking terms from each column of $U$ and concatenates them to form a topic vector. The selected terms from a single column/topic best represent that topic, but these terms may not rank highly with respect to other topics/columns. Other topics are represented by selecting terms from other columns. The sentence ranking procedure of TopicLSA then folds the resulting topic vector into $V^T$ and compares it with the sentences of the document. This folding process is not theoretically sound with respect to the terms chosen previously. The folding process represents a word of the topic vector in the sense of all topics of the document rather than in the sense of the specific topic from which it was selected during the term selection procedure. Thus, all keywords will also be represented in $V^T$ in the sense of topics for which they were not selected. This gives additional points to sentences that contain these keywords in the sense of the topics for which they were not selected, negatively affecting the performance of the algorithm.

In contrast to this, we intend to create summarization algorithms whose keyword selection and sentence ranking algorithms are in tune.

3. KEYWORD APPROACH FOR SINGLE-DOCUMENT AMHARIC TEXT SUMMARIZATION
The main argument for considering the keyword approach, or in general for using words in our analysis and not just sentence scores, is that, to decide the relevance of a sentence for the summary, focusing on its smaller units (words) could give us a better insight into its importance. While the sentence features that result from topic modeling give us an overall judgment of the sentence, only a few selected constituent words could be effective to decide the relevance of sentences for inclusion in the summary. By following the keyword approach, we can switch to units such as words and phrases that have smaller granularity of statistical salience than whole sentences. Keywords selected based on some criteria will represent the document. Because the approach we follow is

extractive summarization, the summarization process will consist of choosing sentences that best contain the selected keywords. The summarization process consists of the following steps: text pre-processing, topic identification and sentence ranking.

3.1. Text Pre-processing
Text pre-processing prepares the text document in a format that is suitable for the text summarization process. Because our sentence ranking algorithms do not make use of the title of the document for ranking purposes, the titles are removed from the documents before the text pre-processing stage begins. The pre-processing stage consists of four steps that are executed sequentially in the given order: text segmentation, normalization, stop-word removal, and stemming. This work makes use of the pre-processing module of [15].

3.2. Topic Identification
We considered three of the most widely used topic modeling approaches: LSA, PLSA and LDA. We prefer the probabilistic models to LSA because of the problems that LSA has for discrete data such as text documents, which arise from its unsatisfactory statistical foundations.

LDA differs from PLSA in that the topic distributions are assumed to have a Dirichlet prior. This limits the number of parameters that are used to describe the resulting topic model to $km + k$, where $k$ is the number of topics and we have an $m \times n$ term by document matrix [3]. In the case of PLSA, because documents do not have a generative model, the number of parameters becomes $km + kn$.

In information retrieval applications, this gives an advantage to LDA in terms of overfitting issues. When we come to our task of text summarization of documents, the trained probabilistic topic model is not used to predict topic distributions for future sentences as in the case of information retrieval applications. Its purpose is simply to get the latent semantic property of the given or training document. In this regard, PLSA has an advantage over LDA. LDA's approach may result in a less accurate identification of the topics of the document because it has a smaller number of parameters. Due to this advantage of PLSA over LDA, we prefer to apply PLSA for our text summarization task. By making use of PLSA, we plan to overfit our model to get an accurate representation of the topics of the document.

We wrote Java code that implements the asymmetric model of PLSA proposed by [11], which takes as input the term by sentence matrix of the document to be summarized. The code implements the tempered Expectation Maximization (TEM) algorithm while generating the PLSA model to help avoid unfavorable local maxima.

The function that we actually maximize is the predictive log-likelihood, $\mathcal{L}_p$ ($\mathcal{L}_p = \sum_{s \in S} \sum_{w \in W} n(s, w) \log p(w|s)$, where $s$ is a sentence and $w$ a word of the document), rather than the joint log-likelihood. Maximizing either is equivalent [12]. To decide when to stop the Expectation Maximization (EM) iterations, we calculate a value called convergence criterion (%), ccp. It is given by equation 1 below

$$ccp = 100 \times \frac{\mathcal{L}_p(\text{previous}) - \mathcal{L}_p(\text{current})}{\mathcal{L}_p(\text{previous})} \tag{1}$$

We also introduce the requirement that a minimum of five iterations be made. This makes sure that the resulting PLSA model will not suffer from undertraining.

If we start with an $m \times n$ term by sentence matrix, then we arrive at the following topic-based features of the PLSA model:
- A $1 \times n$ normalized sentence length matrix, p(s).
- An $m \times k$ term by topic matrix, p(w|z), where $k$ is the number of topics of the PLSA model.
- A $k \times n$ topic by sentence matrix, p(z|s).

The parameter p(s) is the number of non-stop-words contained in a sentence divided by the total number of non-stop-words in the document. For a fixed latent topic z, p(w|z) represents a probability distribution that gives the degree to which words of the document represent that particular topic. For a fixed sentence s, p(z|s) represents a probability distribution that gives the degree to which a topic z is present in that sentence.

Other parameters of the PLSA model, p(s|z) and p(z), can be derived from the asymmetric model by using marginalization and Bayes' rule.

3.3. Sentence Ranking
We explore the keyword approach for sentence ranking by proposing a total of six algorithms: JWTS (joint word topic sentence), CWTS (conditional word topic sentence), JWS (joint word sentence), CWS (conditional word sentence), KITS (keywords in topic simplex) and KIVSM (keywords in vector space model). All six algorithms have four phases: keyword selection, sentence scoring, score averaging, and sentence selection.

During the keyword selection phase, words of the document that are judged to represent the most important points/topics are selected using a PLSA model of the document. After this, the degree to which each sentence contains the selected keywords is used to give scores to sentences. These two steps are then repeated a number of times, which results in different scores for the sentences of the document. These scores are then averaged to give the final score of each sentence. This procedure of score averaging is used to counteract poor models that might result from unfavorable local maxima [5, 10]. Finally, the sentence selection stage uses the average score to select the top ranking sentences. In addition, the sentence selection stage always selects the first sentence of the document, even if the average score assigned to it does not make it one of the top ranking sentences. This is done to take advantage of the property of news articles (the documents we used to evaluate our summarization algorithms) that the first few sentences of a document hold important information [16].

We have considered two ways of selecting keywords. In the first approach (adopted by JWTS and CWTS), the L top ranking keywords are selected from each topic using the probability distribution p(w|z). In the second approach (adopted by JWS, CWS, KITS and KIVSM), the L top ranking keywords of the document are selected that are dominant across all topics via $score(w) = \sum_z [p(z)]^{\alpha}\, p(w|z)$. p(w|z) gives us the score given by each topic for each word. We sum these scores by first weighting them with the importance of each topic, $[p(z)]^{\alpha}$. If $\alpha = 1$, then the score simply becomes p(w), which is the total number of appearances

of a word in a document divided by the total number of words in the document.

We have considered two ways of estimating the presence of keywords in sentences [11, 6, 19]. In the first approach (adopted by JWTS, CWTS, JWS, and CWS), sentences that maximize the joint/conditional probability of the keywords/topics given a sentence are selected. In the second approach (adopted by KITS and KIVSM), the keywords are concatenated to form a query and are represented in the same manner as the sentences in a space (the topic simplex in the case of KITS and the VSM in the case of KIVSM), and sentences that are most similar to the keywords (based on a similarity measure such as cosine) are selected.

3.3.1. JWTS Algorithm
For each keyword $w_g$ that is selected from a topic $z_g$, let us define the events $(w_g, z_g, s_n)$ for a sentence $s_n$, where $g = 1, 2, \ldots, L \times K$. If we consider the probability that all events happen for sentence $s_n$, then this gives us a measure of sentence $s_n$ containing all of the selected keywords in the sense of the topics from which they were selected, as indicated in equation 2 below

$$score(s_n) = p(AllEvents_n) = \prod_{g=1}^{L \times K} p(w_g, s_n, z_g) = \prod_{g=1}^{L \times K} p(z_g)\, p(w_g|z_g)\, p(s_n|z_g) \tag{2}$$

where we have used the symmetric model of PLSA to write $p(w, s, z) = p(z)\, p(w|z)\, p(s|z)$. The two terms $p(z_g)$ and $p(w_g|z_g)$ of equation 2 are common to all sentences for each topic $z_g$ and for each corresponding word $w_g$. This causes the score to degenerate to equation 3 below,

$$score(s_n) = \prod_{k=1}^{K} p(s_n|z_k) \tag{3}$$

3.3.2. CWTS Algorithm
For each keyword $w_g$ that is selected from a topic $z_g$, let us define the events $(w_g, z_g, s_n)$ for a sentence $s_n$ conditioned on the appearance of $s_n$, where $g = 1, 2, \ldots, L \times K$. If we consider the probability that all events happen for sentence $s_n$, then this gives us a normalized measure of sentence $s_n$ containing all of the selected keywords in the sense of the topics from which they were selected, as shown in equation 4 below

$$score(s_n) = p(AllEvents_n) = \prod_{g=1}^{L \times K} p(w_g, z_g|s_n) = \prod_{g=1}^{L \times K} p(w_g|z_g)\, p(z_g|s_n) \tag{4}$$

where we have used the asymmetric graphical model of PLSA to write

$$p(w, z|s) = \frac{p(w, z, s)}{p(s)} = \frac{p(s)\, p(z|s)\, p(w|z)}{p(s)} = p(z|s)\, p(w|z).$$

The term $p(w_g|z_g)$ of equation 4 is common to all sentences for each topic $z_g$ and for each corresponding word $w_g$. This causes the score to degenerate to equation 5 below

$$score(s_n) = \prod_{k=1}^{K} p(z_k|s_n) \tag{5}$$

3.3.3. JWS Algorithm
For each keyword $w_l$ that has been selected, let us define the events $(w_l, s_n)$ for a sentence $s_n$, where $l = 1, 2, \ldots, L$. The probability that all events happen for sentence $s_n$ gives us a measure of sentence $s_n$ containing all of the selected keywords, as shown in equation 6 below

$$score(s_n) = p(AllEvents_n) = \prod_{l=1}^{L} p(w_l, s_n) = \prod_{l=1}^{L} \left[ p(s_n)\, p(w_l|s_n) \right] \tag{6}$$

3.3.4. CWS Algorithm
For each keyword $w_l$ that has been selected, let us define the events $(w_l, s_n)$ for a sentence $s_n$ conditioned on the appearance of $s_n$, where $l = 1, 2, \ldots, L$. The probability that all events happen for sentence $s_n$ gives us a normalized measure of sentence $s_n$ containing all of the selected keywords, as indicated in equation 7 below

$$score(s_n) = p(AllEvents_n) = \prod_{l=1}^{L} p(w_l|s_n) \tag{7}$$

3.3.5. KITS Algorithm
Here, we represent the keywords in the topic simplex of the PLSA model by concatenating them as one sentence and treating them as a query, q. This query is then folded into the topic simplex to give a representation of the query, p(w|q). The score for a sentence is then its similarity with q as measured by any one of three similarity measures: KL (Kullback-Leibler) divergence, JS (Jensen-Shannon) divergence and cosine.

3.3.6. KIVSM Algorithm
KIVSM uses the vector space model (VSM) to score sentences. We construct a VSM from the original term by sentence matrix of the document, taking the terms as the axes of the space. Since the sentences and the keywords are both represented in the VSM, we calculate the similarity between a sentence and the keywords query using either the cosine (KIVSM-cos) or dot product (KIVSM-dot) similarity measure. This similarity between a sentence and the query serves as the sentence score.

4. EVALUATION

4.1. Experimental Settings
60 Amharic news articles from the Ethiopian Reporter website (http://www.ethiopianreporter.com) are used for the experiment. Table 1 below shows the statistics of the corpus.

Three independent human evaluators are used to generate independent ideal summaries for the three extraction rates of 30%, 25% and 20%. We have used the precision/recall/F-measure

metric. Because the number of sentences chosen by both humans and the system is always the same, precision, recall and F-measure all have identical values. A given summarization algorithm will have a single average score against all the 60 documents and the three evaluators.

Table 1: Corpus statistics.
Corpus Attribute                               Value
Number of documents                            60
Minimum number of sentences per document       10
Maximum number of sentences per document       45
Average number of sentences per document       27
Minimum number of words per document           382
Maximum number of words per document           933
Average number of words per document           585

4.1.1. Compared Systems
For the sake of comparison, we have implemented and tested some of the relevant summarizer algorithms using our Amharic data set.

4.1.1.1. Random Summarizer
Following the work of [17, 14], we report the average of 10 runs of the random summarizer.

4.1.1.2. First n Sentences
We chose this summarizer because it is considered a very strong baseline in the domain of news articles [16].

4.1.1.3. TopicLSA
Experiments by [15] have shown that TopicLSA outperforms other LSA based single-document summarization systems. We have run the algorithm for the 16 term weighting functions and six values for the number of terms (3, 5, 6, 8, 10, and 15) as specified in [15].

4.1.1.4. PROC3/PROC3+
This is the best reported system using PLSA for single-document summarization. We have also created a new algorithm by modifying the original algorithm in such a way that it always includes the first sentence of the document, and we named it PROC3+. Table 2 below gives the precision/recall value for each of the systems discussed above.

Table 2: Precision/Recall values for compared systems.
Compared System        Precision/Recall at Extraction Rate (%)
                       20         25         30
Random Summarizer      0.20345    0.26208    0.30973
First n Sentences      0.46689    0.45675    0.45770
TopicLSA               0.37202    0.41611    0.43946
PROC3                  0.30935    0.39045    0.44956
PROC3+                 0.42990    0.46147    0.51383

4.1.2. Setting System Parameters for the Experiments
Our summarization algorithms have a number of parameters. To account for this, we have designed our experiments so that they would be manageable in size.

When a given algorithm is run with a set of selected parameters, summaries at all extraction rates are produced at once by selecting the required number of sentences for each extraction rate.

4.1.2.1. Scoring Method
We define the scoring method to be the set of PLSA models, resulting from different runs of the EM algorithm, that are used to create the average score for a sentence. We have explored 13 different scoring methods (see Table 3 below). Top 10 of 20 restarts means that we generate 20 different PLSA models, rank them by their likelihood values, $\mathcal{L}$, and finally choose the top 10 models with the highest likelihood. 5 restarts means that all of the 5 generated PLSA models will be used. The 13 scoring methods help us to explore which PLSA models, with respect to $\mathcal{L}$, perform well on our summarization tasks.

Table 3: Scoring Methods
Position    Number of Models    Scoring Method
All         1                   1 random start
All         5                   5 restarts
All         10                  10 restarts
All         15                  15 restarts
All         20                  20 restarts
Top         1                   Top 1 of 20 restarts
Top         5                   Top 5 of 20 restarts
Top         10                  Top 10 of 20 restarts
Top         15                  Top 15 of 20 restarts
Top         20                  Top 20 of 20 restarts
Bottom      1                   Bottom 1 of 20 restarts
Bottom      5                   Bottom 5 of 20 restarts
Bottom      10                  Bottom 10 of 20 restarts
Bottom      15                  Bottom 15 of 20 restarts
Bottom      20                  Bottom 20 of 20 restarts

4.1.2.2. Tempering Parameter, $\beta$, for Generating the PLSA Model
According to [11], $\beta$ falls in the range of 0.6 to 1 (inclusive). The actual values of $\beta$ used in our experiments are 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, and 1.

4.1.2.3. Tempering Parameter, $\beta_f$, for Folding
Because the folding operation faces only one maximum, the value of $\beta_f$ has been fixed to 1, i.e. we will not use TEM during the folding operation.
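The scoring methods of section 4.1.2.1, together with the score averaging and sentence selection phases of section 3.3, can be illustrated with a minimal Python sketch. It is not the authors' Java implementation. The names `Run`, `select_runs`, `average_scores` and `select_sentences` are hypothetical, the likelihoods and sentence scores are made-up numbers standing in for the output of trained PLSA models, and replacing the weakest pick with the first sentence is an assumption: the paper only states that the first sentence is always selected and that the summary length is fixed.

```python
from typing import List, Sequence, Tuple

# One EM restart: (log-likelihood of the model, score of each sentence).
# Both values are hypothetical stand-ins for the output of a trained PLSA model.
Run = Tuple[float, Sequence[float]]


def select_runs(runs: Sequence[Run], position: str, count: int) -> List[Run]:
    """Pick the models used for averaging, following the rows of Table 3."""
    ranked = sorted(runs, key=lambda run: run[0], reverse=True)  # best first
    if position == "all":
        return list(runs)        # e.g. "5 restarts": every model is used
    if position == "top":
        return ranked[:count]    # e.g. "Top 10 of 20 restarts"
    if position == "bottom":
        return ranked[-count:]   # e.g. "Bottom 5 of 20 restarts" (count >= 1)
    raise ValueError(f"unknown position: {position}")


def average_scores(runs: Sequence[Run]) -> List[float]:
    """Average each sentence's score over the selected models."""
    n_sentences = len(runs[0][1])
    return [sum(run[1][i] for run in runs) / len(runs) for i in range(n_sentences)]


def select_sentences(avg: Sequence[float], k: int) -> List[int]:
    """Return the indices of k summary sentences in document order."""
    ranked = sorted(range(len(avg)), key=lambda i: avg[i], reverse=True)
    chosen = ranked[:k]
    if chosen and 0 not in chosen:
        chosen[-1] = 0  # assumption: the first sentence replaces the weakest pick
    return sorted(chosen)


# Three restarts over a three-sentence document (illustrative numbers only).
runs = [
    (-120.0, [0.1, 0.5, 0.4]),
    (-100.0, [0.2, 0.6, 0.2]),
    (-150.0, [0.9, 0.05, 0.05]),
]
best_two = select_runs(runs, "top", 2)
avg = average_scores(best_two)
summary = select_sentences(avg, 2)
print(summary)
```

Keeping the selection of models separate from the averaging step means that the same averaging code serves every row of Table 3.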

4.1.2.4. Convergence criterion (%), ccp, for Generating the PLSA Model
The iterative EM algorithm performs a maximization process while finding the parameters of the PLSA model. The question here is when we judge $\mathcal{L}$ to have attained a sufficiently high value.

We have referred to the literature and found only one work that discusses the value for the convergence criterion [5]. They report that 20 iterations of the EM algorithm usually suffice (they use a value of 1 for $\beta$ throughout their experiments).

We conducted our own experiments to take a closer look at the behavior of the EM algorithm while iterating. We have identified three points on the graph of $\mathcal{L}$ (characteristic points): the turning point, the stabilization point, and the overtraining point. The turning point indicates that the graph is beyond the undertraining point (a very small number of iterations) and is about to level off. When the stabilization point is reached, the graph has almost reached convergence. It is close to the value used by [5].

As discussed in section 3.2, the convergence criterion is measured in percentage. The main reason for adopting a % value is that, for different documents and different parameters of the PLSA model, the number of iterations is not a reliable indicator of when the characteristic points appear. For example, for low values of $\beta$ such as 0.65, the stabilization point of $\mathcal{L}$ may appear only when the number of iterations has reached the 40s.

Based on our experiments with three sample documents from our data sets, we have assigned the convergence criterion for the various points:
- 0.9% for the turning point
- 0.03% for the stabilization point
- 0.007% for the overtraining point.

4.1.2.5. Convergence criterion (Number of Iterations), cci, for Folding
Based on our own experiments on three documents, we concluded that the number of iterations, rather than a % value, as the measure of the convergence criterion gives a quite accurate identification of the properties of the log-likelihood used during the folding operation.

In the information retrieval [4] and document segmentation [5] papers, the authors write that only a very small number of iterations is sufficient for the folding operation. At about five iterations, we approximately reach the convergence value. But, from our experiments, we saw that reaching a very precise point of the local maximum usually requires a very large number of iterations, such as 100. The values we chose for experimentation are: 5, 10, 15, 20, 30, 40, 60, 100, and 200.

4.1.2.6. Number of Topics
For the case of single-document text summarization, where the term by sentence matrix is quite sparse and the number of topics discussed is usually small, we believe that 2, 3, 4, 5 and 6 topics give us a reasonable exploration of the effect of this parameter.

4.1.2.7. Term Ranking Parameter, $\alpha$
For this parameter we have used the values of 2, 3, 4, and 5. Higher values are still possible but would increase the number of our experiments by a big margin.

4.1.2.8. Number of Keywords (%)
This parameter tells us what percentage of the total number of unique words in a document to use as keywords. A detailed discussion of how these values are chosen based on our experiments with TopicLSA is found in [21]. The values we have used are 11, 14, 17, 20, 23, 26, 29, and 32.

4.1.2.9. Similarity Measure
This parameter is used for the KITS (JS, KL, and cosine) and KIVSM (dot product and cosine) algorithms during sentence scoring.

4.2. Experiments
System parameters are fixed in some of the algorithms in order to keep the number of experiments manageable.

We run JWTS and CWTS for all values of the scoring method, $\beta$, and ccp. We then select a maximum of three triples of these parameters from each algorithm that best represent the three extraction rates and use these as fixed parameter values for the rest of the algorithms. Our assumption here is that values of the scoring method, $\beta$, and ccp that have performed best for the two algorithms are likely to represent the creation of good PLSA models, which can also perform well with the other algorithms. That is, when compared to the other parameters, they are less likely to have a direct influence on the generation of sentence scores.

KITS creates sentence scores by measuring the similarity between the folded query and each sentence. To create sentence scores, we can either compare the word distributions, p(w|q) and p(w|s), or we can compare the topic distributions, p(z|q) and p(z|s). Both methods have shown almost identical performance. We will only report the former variant of KITS since it has the best performance for all three extraction rates. Figure 1 below gives the performance of the six algorithms for each of the three extraction rates. For a detailed discussion of the experimental setup and results, the reader is referred to [21].
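The KITS comparison of a folded query distribution with sentence distributions can be sketched as follows. This is a minimal illustration rather than the authors' code. The function names, the direction of the KL divergence (query relative to sentence), the small `eps` floor, and the use of a negated divergence as a similarity score are all assumptions, since the paper does not specify them; the distributions are made-up numbers standing in for p(w|q) and p(w|s).

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); q is floored at eps to avoid division by zero (an assumption)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric, built from two KL terms."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm = math.sqrt(sum(pi * pi for pi in p)) * math.sqrt(sum(qi * qi for qi in q))
    return dot / norm if norm else 0.0


def rank_sentences(query: Sequence[float],
                   sentences: Sequence[Sequence[float]],
                   measure: str = "js") -> List[int]:
    """Return sentence indices ordered from most to least similar to the query."""
    if measure == "cosine":
        scores = [cosine(query, s) for s in sentences]
    elif measure == "js":
        scores = [-js_divergence(query, s) for s in sentences]  # lower divergence = closer
    elif measure == "kl":
        scores = [-kl_divergence(query, s) for s in sentences]
    else:
        raise ValueError(f"unknown measure: {measure}")
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


# Hypothetical distributions over a three-word vocabulary.
query = [0.7, 0.2, 0.1]
sentences = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1], [0.3, 0.4, 0.3]]
order = rank_sentences(query, sentences, "js")
print(order)
```

The same functions apply unchanged to the topic distributions p(z|q) and p(z|s), which is the other KITS variant mentioned above.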

[Figure 1: bar chart. Y-axis: Precision/Recall (ticks from 0.4 to 0.54); x-axis: Summarizer (JWTS, CWTS, JWS, CWS, KITS, KIVSM, PROC3+, First n); one series per extraction rate (20%, 25%, 30%). Data labels shown in the chart: 0.52012, 0.48499, 0.46689.]

Figure 1: Precision/Recall values for the proposed algorithms at 20%, 25% and 30% extraction rates.

The various parameters of the summarization system play a crucial role in helping the algorithms achieve their full potential. In particular, a small number of iterations of the EM algorithm that generates the PLSA model (a convergence criterion of 0.9%) is found to be quite effective in creating high-quality PLSA models. Lower values of the tempering parameter (0.6, 0.65 and 0.7) have also produced quite stable and good performance. The scoring method of top 15 of 20 restarts coupled with a tempering parameter of 0.7 has performed quite well in most algorithms, especially at lower extraction rates. This shows that the problem of local maxima that affects the EM algorithm can be tackled consistently by a single scoring method. This is an improvement over previously proposed approaches that use the method of 5 restarts to solve the same problem [4, 10]. We have also found that large numbers of keywords are required in all algorithms to produce the best summaries.

4.2.1. Comparison at 30% Extraction Rate
While JWTS is the best performing system at the 30% rate, it can be seen that the systems with the best performance are the algorithms that favor long sentences: JWTS, JWS, and KIVSM-dot. This can be explained by the argument that the longer the sentence, the higher the chance that it contains useful concepts. PLSA gives a formal proof that the longest sentences are the likely candidates for inclusion in a summary. This has been shown by [2] in their experiment on the standard English data set of DUC 2002 (http://duc.nist.gov), where PROC3 is the best performing system.

Algorithms CWTS, CWS and KITS use some form of normalization of sentence length. This resulted in decreased precision/recall performance. All of the proposed algorithms that favor the longest sentences perform better than PROC3+. This shows us that PROC3+ can benefit from the inclusion of shorter sentences once in a while.

4.2.2. Comparison at 25% Extraction Rate
The situation at the 25% extraction rate is largely the same as that at 30%, except for a few differences:

• JWS now performs much better than JWTS, as expected from the fact that JWTS ignores the influence of the selected keywords.
• KIVSM-dot now performs better than JWTS. This means that the keyword counting procedure of KIVSM-dot (keyword counting because it uses the dot-product as a similarity measure and sentence scores are always integers) compensates for the algorithm's use of the vector space model.
• JWS also outperforms KIVSM-dot. The concept-based sentence ranking procedure of JWS seems too much for KIVSM-dot at the lower, and hence more demanding, extraction rate. Certainly, we expect that more thinking is needed to produce shorter summaries than longer ones.
• PROC3+ is now outperformed more significantly than before by all of the proposed approaches that favor long sentences. This is expected because, as the extraction rate goes lower, the chance that a longer sentence is important decreases. Hence, more analysis than just selecting the longest sentences is required to get better precision/recall results.
• The first n sentences summarizer now performs better than KITS, and its performance is now comparable to that of the algorithms that favor long sentences. But it is still outperformed by the algorithm that uses the longest sentences heuristic (PROC3+).

4.2.3. Comparison at 20% Extraction Rate
The way the different algorithms compare at the 20% extraction rate mirrors that at the 25% rate, the differences being:

• The first n sentences algorithm now performs better than all of the other algorithms. This confirms previous results in the literature that this heuristic is a strong baseline for text summarization in the news domain.
• CWS now outperforms CWTS by a bigger margin than at the 30% and 25% extraction rates. This shows that the lack of use of keywords by CWTS affects its performance strongly at lower extraction rates.

5. CONCLUSION AND FUTURE WORK
This paper investigated the application of topic modeling to the task of concept-based single-document Amharic text summarization. Text summarization involves the identification of the topics of a document and the use of these topics to select sentences that best summarize the document. The identification of topics using the topic modeling approach results in word-based features and sentence-based features. From the extensive literature review we carried out, all but one of the previous works made use of the sentence-based features to form sentence scores. Only one work used word-based features for sentence ranking, but the ranking algorithm it used is not quite sound from a theoretical
point of view. The research highlighted the advantages of using the word-based features (keyword approach) over the sentence-based features (sentence-based approach).

We proposed and investigated six new approaches that use the word-based features that result from a PLSA model. We carried out experiments to determine which of the proposed algorithms works best for single-document Amharic text summarization. Experiments were also carried out to compare the performance of the proposed algorithms with those proposed by previous works. The algorithms were evaluated for precision/recall at three different extraction rates: 20%, 25% and 30%. Of the proposed algorithms, the best performing are those that do not normalize sentence length when creating sentence scores.

When compared to previous summarization approaches using topic modeling, the best of the proposed systems gave improved results. This shows that the PLSA-based keyword approach to single-document text summarization represents a step forward in building better summarization systems using topic modeling.

Applying the proposed algorithms to multi-document, query-focused and update summarization is one possible direction for future work. It is quite common for LSA-based systems to make use of term weighting to help the summarization algorithms perform better. However, all of the summarization systems we reviewed that use probabilistic topic models, including ours, do not use term weighting. Thus, future work should investigate appropriate term weighting techniques for probabilistic topic models. Local languages like Amharic suffer from a lack of linguistic resources that can help in representing the various concepts of documents. Thus, it becomes necessary to maximize the use of the co-occurrence statistics available for documents. One way of achieving this is to consider larger units of co-occurrence counts such as bigrams and trigrams; their skip versions (e.g. skip bi-grams); and grams that are considered the same if the order of the words they contain is ignored. The establishment of a large-scale standard data set for the evaluation of Amharic text summarization systems should also be considered in the future.

6. ACKNOWLEDGEMENT
Special thanks to Addis Ababa University for the financial support to undertake this research.

7. REFERENCES
[1] Abraham Adefris. 2007. Automatic Multi-Source Amharic News Summarization. MSc thesis. Graduate School of Telecommunications & Information Technology, Addis Ababa, Ethiopia.
[2] Bhandari, H., M. Shimbo, T. Ito, and Y. Matsumoto. 2008. Generic text summarization using probabilistic latent semantic indexing. In Proc. of IJCNLP.
[3] Blei, D. M., A. Y. Ng, M. I. Jordan, and J. Lafferty (editor). 2003. Latent Dirichlet Allocation. In Journal of Machine Learning Research. 3, 993-1022.
[4] Brants, T. 2002. Test data likelihood for PLSA models. In ACM SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval. Tampere, Finland.
[5] Brants, T., F. Chen, and I. Tsochantaridis. 2002. Topic-based document segmentation with probabilistic latent semantic analysis. In Proc. of CIKM '02. 211-218.
[6] Buntine, W., Löfström, J., Perkiö, J., Perttu, S., Poroshin, V., Silander, T., Tirri, H., Tuominen, A., & Tuulos, V. 2004. A Scalable Topic-Based Open Source Search Engine. In Proceedings of the IEEE/WIC/ACM Conference on Web Intelligence. 228-234.
[7] Deerwester, S., S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. 1990. Indexing by Latent Semantic Analysis. In Journal of the American Society for Information Science. 41, 6, 391-407.
[8] Gong, Y., and X. Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proc. of SIGIR '01. 19-25.
[9] Harabagiu, S. M., and F. Lacatusu. 2002. Generating single and multi-document summaries with GISTEXTER. In Document Understanding Conference.
[10] Hennig, Leonhard. 2009. Topic-based multi-document summarization with probabilistic latent semantic analysis. In Recent Advances in Natural Language Processing, RANLP.
[11] Hofmann, T. 2001. Unsupervised Learning by Probabilistic Latent Semantic Analysis. In Machine Learning. 42, 177-196.
[12] Hofmann, T., and Jan Puzicha. 1998. Unsupervised learning from dyadic data. Technical Report TR-98-042. International Computer Science Institute, Berkeley, California.
[13] Kamil Nuru. 2004. Automatic Amharic News Text Summarization. MSc thesis. Addis Ababa University, Addis Ababa, Ethiopia.
[14] Ledeneva, M. en C. Yulia Nikolaevna. 2008. Automatic Language-Independent Detection of Multiword Descriptions for Text Summarization. PhD diss. National Polytechnic Institute Mexico.
[15] Melese Tamiru. 2009. Automatic Amharic Text Summarization Using Latent Semantic Analysis. MSc thesis. Addis Ababa University, Addis Ababa, Ethiopia.
[16] Nenkova, A. 2005. Automatic text summarization of newswire: Lessons learned from the document understanding conference. In Proceedings of AAAI 2005. Pittsburgh, USA.
[17] Steinberger, J., and K. Ježek. 2004. Using latent semantic analysis in text summarization and summary evaluation. In Proc. ISIM '04. 93-100.
[18] Steinberger, J., M. Poesio, M. A. Kabadjov, and K. Ježek. 2007. Two Uses of Anaphora Resolution in Summarization. In Information Processing and Management: an International Journal. 43, 6.
[19] Steyvers, M., and T. Griffiths. 2007. Probabilistic Topic Models. In Latent Semantic Analysis: A Road to Meaning, edited by T. Landauer, D. McNamara, S. Dennis, and W. Kintsch. Lawrence Erlbaum.
[20] Teferi Andargie. 2005. The Application of Machine Learning Technique (NAÏVE BAYES) for Automatic Text Summarization (The Case of Amharic News Texts). MSc thesis. Addis Ababa University, Addis Ababa, Ethiopia.
[21] Yirdaw, Eyob Delele. 2011. Topic-based Amharic Text Summarization. MSc thesis. Addis Ababa University, Addis Ababa, Ethiopia.
