Biomedical Text Summarization Using Conditional Generative Adversarial Network (CGAN)
Abstract
Text summarization in medicine can help physicians reduce the time needed to access important information in
countless documents. This paper presents a supervised extractive summarization method based on conditional
generative adversarial networks using convolutional neural networks. Unlike previous models, which often use
greedy methods to select sentences, we use a new approach for sentence selection. Moreover, we provide a
network for biomedical word embedding, which improves summarization. An essential contribution of the paper
is a new loss function for the discriminator, which makes the discriminator perform better. The proposed
model achieves results comparable to state-of-the-art approaches, as measured by the ROUGE metric.
Experiments on a medical dataset show that the proposed method performs on average 5% better than the
competing models and produces summaries more similar to the reference summaries.
Keywords: Medical text summaries, extractive summarization, conditional generative adversarial networks,
convolutional neural networks
1. Introduction
Today, the information required by physicians is provided by various sources. These resources include databases
of scientific articles, patient medical records storage systems, web-based documents, e-mail reports, and
multimedia documents [1, 2]. With the rapid growth of the Internet, the volume of biomedical articles is
increasing. PubMed, for example, contains more than 32 million citations to biomedical articles [3]. Text
summarization is an efficient tool for extracting the required information and managing large textual resources.
The purpose of text summarization is to produce a shorter version of a document which covers its essential
content. Summarization approaches are mainly divided into two categories: abstractive and extractive. In
extractive summarization, sentences extracted from the original text are put together and returned as the final
summary. In contrast, in abstractive summarization, natural language processing methods are utilized to examine
the input text, and a summary containing the important contents of the original text is generated; the generated
sentences are very similar to the sentences of the input text, but not identical to them [4]. From another
perspective, summarization can be single-document or multi-document. In single-document summarization, each
document is processed independently, and the output is a summary of the given document. In multi-document
summarization, a summary is produced from multiple documents on the same topic [5]. Summarization can also be
generic or query-based. In query-based summarization, the user specifies a query, and the system generates a
summary which tries to answer it [1]. This paper proposes a generic extractive single-document summarization
model.
Different methods have been proposed to address the challenges of summarization. These methods mainly use
graph-based [6-8] or machine learning [9-11] techniques. Machine learning methods consider summarization as a
classification problem that decides whether or not each sentence should be included in the summary. Graph-based methods
model the text as a graph and then summarize it by analyzing the nodes. Deep learning has attracted a lot of
attention in the fields of natural language processing in different applications, such as machine translation [12,
13], question answering [14, 15], and text classification [16, 17].
Recently, abstractive methods have used generative models to generate text. In [18], a generative model based on
reinforcement learning is presented for abstractive summarization. In [19], a method based on an auto-encoder and
a recurrent neural network is proposed. On the other hand, extractive summarization is more practical because it
can ensure the semantic relevance and grammatical correctness of the selected sentences [20]. The most important
challenge in extractive summarization is to select the most valuable sentences of the text so that they cover its
important contents.
There are two main components in extractive summarization techniques: sentence ranking and sentence
selection. In most of the proposed methods, including [20-25], previously selected sentences are not considered
when choosing the next summary sentence; in other words, these methods select sentences greedily.
The main contributions of this article are as follows:
1- We use a conditional generative adversarial network for summarizing the text with a novel training method.
2- We provide suitable features for use in text summarization applications.
3- One of the important contributions of the proposed model is the way the sentences are scored, in which each
sentence is scored by considering all the other sentences. Another advantage of the model is the production of
different summaries for the same text. Due to the randomness in the generative model, several summaries are
generated for each document, and this article uses a voting system over them to select the proper sentences.
4- In generative adversarial networks, the purpose of the generator is to produce data close to the real data, and
the purpose of the discriminator is to distinguish between real and fake data. Training is carried out as a
two-player game; after training, the generator can generate realistic data. In this paper, many fake summaries are
extracted for each document, leading to a new loss function for the discriminator.
The proposed model is evaluated on medical articles [3]. The articles cover biomedicine, health, and related
disciplines such as chemical sciences and bioengineering. Each article has a reference summary which is used
to evaluate the system. The first proposed method, GAN-Sum, uses hand-engineered features for summarization;
the second, E-GAN-Sum, uses embedding features for this purpose. Experiments show that the proposed methods
achieve better performance, based on the ROUGE metric, than the competing algorithms.
The rest of the article is organized as follows. Section 2 reviews related work. Section 3 presents the background
of the proposed framework. Section 4 introduces the proposed method. Section 5 presents the dataset, evaluation
metric, and results. Finally, Section 6 concludes the article.
2. Related Works
Recently, many deep learning-based methods have been presented for text summarization. In this section, we
discuss these methods, including those compared with our model.
Zhong et al. [21] considered the subject of the document for summarization. In this method, after learning the
features of the sentences by an auto-encoder, the importance matrix of document words is created. Then the score
of each sentence is calculated using the scores of its words. Finally, the sentences with the highest score are chosen
as the summary sentences. Yousefi-Azar and Hamey [22] used the same idea. The difference is that instead
of using the importance matrix, the cosine similarity between the sentences and the subject is used to score the
sentences.
Wu and Hu [20] used reinforcement learning (RL) to summarize documents. They considered coherence as a
reward; their method was the first to consider the coherence between sentences. The policy function
is implemented by a multilayer perceptron (MLP) to select the most valuable sentences for the summary.
Cao et al. [23] suggested recursive neural networks to score sentences. Each sentence is represented as a
tree in which the words form the leaves. The representation of each sentence (the tree root) is obtained by a
recursive, non-linear process from the leaves of the tree, and the score of each sentence is determined based
on its leaves.
Gonzalez et al. [24] proposed a text summarization method based on a Siamese neural network (SNN). They used
word- and sentence-level attention mechanisms to score words and sentences. In this method, a recurrent neural
network extracts the features of each word, and the importance of the word in the sentence is estimated from the
obtained features. The features of the sentence are then extracted using the attention weights and the word
features. The same process is repeated to obtain the features of the document and the reference summary. Finally,
a classifier measures the similarity between the summary and the document. Only the sentence-level attention
is used to select the summary sentences.
Nallapati et al. [25] proposed a supervised method for single-document summarization using recurrent neural
networks. They considered summarization as a classification problem. First, the embedded vectors of the words
are given to a recurrent neural network, and the hidden vectors of the words are extracted. The average of these
vectors is taken as the sentence feature vector. Finally, logistic regression is used for the binary classification
of sentences.
3. Background
3.1. Generative Adversarial Network (GAN)
Generative adversarial networks were first proposed by Goodfellow et al. [26]. These networks are one way
to create generative models in which two networks are trained simultaneously: a generator which produces fake
data and a discriminator which separates real data from fake data. The generator is trained to deceive the
discriminator by producing data which resembles real data.
To learn the generator distribution p_g, a mapping function G(z, θ_g) is used, where z is a noise vector from the
probability distribution p_z, and θ_g denotes the generator parameters which must be trained. The function
G(z, θ_g) maps the distribution p_z to the data space. In addition, the discriminator is equivalent to a function
D(x, θ_d), where θ_d denotes the learning parameters and x is the input data. D(x) indicates the probability that
the data x comes from the real data distribution rather than from p_g, and takes a value between zero and one. The
output of the discriminator should be one for real data and zero for fake data. Mathematically, D and G play a
two-player min-max game with the following value V(G, D):

min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x)] + E_{z~p_z(z)}[log(1 − D(G(z)))]   (1)

where G and D are the generator and discriminator, respectively; p_data(x) and p_z(z) represent the real data
distribution and the input noise distribution, respectively; and E is the mathematical expectation.
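To make the min-max game of Eq. 1 concrete, the following is a minimal PyTorch sketch of one adversarial training step. The networks, dimensions, and optimizer settings are illustrative placeholders, not the architecture used in this paper.

```python
# Illustrative sketch of one training step of the GAN min-max game in Eq. 1
# (PyTorch; networks and sizes are placeholders, not the paper's model).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 100))  # z -> fake sample
D = nn.Sequential(nn.Linear(100, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, 100)   # stand-in for a batch of real data x ~ p_data
z = torch.randn(32, 64)       # noise z ~ p_z

# Discriminator step: maximize log D(x) + log(1 - D(G(z)))
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: the common non-saturating form maximizes log D(G(z))
g_loss = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```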
3.2. Conditional Generative Adversarial Network (CGAN)
Generative adversarial networks can be extended to a conditional model in which the generator and
discriminator are conditioned on additional information. The additional information is any kind of auxiliary
information given as input to both the discriminator and the generator [27]. Fig. 1 demonstrates an example of a
conditional generative adversarial network. Suppose we want to generate handwritten digits. In this case, the
generator receives noise as one input, and the other input is the digit that the generator must produce. The output
of the generator is an image of the digit given at the input as the condition.
In the proposed model, the loss function for the generator network is defined over the training documents as
follows:

Loss_G = E_{i~Data}[E_{z~p_z(z)}[log(1 − D(G(z|y_i)|y_i))]]   (2)

where Data is the set of documents at training time, y_i is the feature matrix of each document, and E is the
mathematical expectation. The loss function for the discriminator network is calculated from the generator
output and the real and fake vectors as follows:
Loss_D = E_{i~Data}[E_{z~p_z(z)}[log(1 − D(G(z|y_i)|y_i))] + E_{k~p_Fake_i}[log(1 − D(k|y_i))] + E_{l~p_Real_i}[log(D(l|y_i))]]   (3)

where p_Real_i and p_Fake_i represent the distributions of real and fake vectors for the i-th document,
respectively.
Equation 3 causes the discriminator to learn the best and worst summaries for each document and forces the
generator to produce the best summary; the generator tries to produce vectors similar to the sentences given to
the discriminator as real vectors.
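The following is a minimal sketch of how the three-term loss in Eq. 3 could be computed for one document; D, G, and the real/fake selection vectors are placeholders, and this is one interpretation of the equation rather than the paper's exact implementation.

```python
# Sketch of the discriminator loss in Eq. 3 for a single document i
# (PyTorch; D, G, and the vectors are placeholders, not the paper's code).
import torch

def discriminator_loss(D, G, y_i, real_vecs, fake_vecs, z_dim=64, eps=1e-8):
    z = torch.randn(1, z_dim)
    # log(1 - D(G(z|y_i) | y_i)): penalize believing the generated summary
    gen_term = torch.log(1 - D(G(z, y_i), y_i) + eps).mean()
    # log(1 - D(k|y_i)) averaged over the fake summaries of document i
    fake_term = torch.stack([torch.log(1 - D(k, y_i) + eps) for k in fake_vecs]).mean()
    # log(D(l|y_i)) averaged over the real summaries of document i
    real_term = torch.stack([torch.log(D(l, y_i) + eps) for l in real_vecs]).mean()
    # the discriminator maximizes this sum, i.e. minimizes its negative
    return -(gen_term + fake_term + real_term)
```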
4.3. Summarization
After training, only the generator is used to select sentences. The selection of sentences in a document is based
on voting. First, the feature matrix introduced in the previous sections is extracted for each document and given
as the input to the generative network. The probability vector of the sentences is obtained for different noise
vectors. Each time the generator computes this vector, the sentences whose probability is higher than the average
are selected. Finally, the sentences are ranked by the number of times they were selected, and the sentences with
the highest rank are chosen as summary sentences.
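A minimal sketch of this voting procedure, assuming a trained generator which maps a noise vector and a document feature matrix to a per-sentence probability vector (names and dimensions are illustrative):

```python
# Sketch of the voting-based sentence selection described above: the generator
# is run with several noise vectors, above-average sentences receive a vote,
# and the most-voted sentences form the summary. `generator` and `features`
# are placeholders for the trained network and the document feature matrix.
import torch

def summarize(generator, features, n_votes=20, summary_len=3, z_dim=64):
    votes = torch.zeros(features.shape[0])           # one counter per sentence
    for _ in range(n_votes):
        z = torch.randn(1, z_dim)
        probs = generator(z, features).squeeze()     # probability per sentence
        votes += (probs > probs.mean()).float()      # vote for above-average ones
    ranked = torch.argsort(votes, descending=True)   # most-voted sentences first
    return sorted(ranked[:summary_len].tolist())     # keep original document order
```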
5. Results and discussion
5.1. Preprocessing
Preprocessing improves the results, reduces computation, and increases speed and accuracy. In this article, two
preprocessing techniques are used: (1) stop-word removal and (2) stemming.
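The paper does not name the tools used; as one possible realization, the sketch below applies NLTK's stop-word list and Porter stemmer:

```python
# Minimal sketch of the two preprocessing steps (NLTK; one possible
# realization, since the exact tools are not specified in the paper).
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("stopwords"); nltk.download("punkt")

stop_set = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(sentence):
    tokens = word_tokenize(sentence.lower())
    return [stemmer.stem(t) for t in tokens if t.isalnum() and t not in stop_set]

print(preprocess("The patients were treated with the new drugs."))
# -> ['patient', 'treat', 'new', 'drug']
```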
Table 1
Extracted features in the proposed model.
Feature: Description
Common Word: The number of occurrences (regardless of repetition) of the N most common words in the dataset,
divided by the sentence length.
Position: The position of the sentence. Supposing there are N sentences in the document, for the j-th sentence
the position is computed as 1-(j-1)/(N-1).
Length: The number of words in the sentence, divided by the length of the longest sentence.
Number Ratio: The number of digits, divided by the sentence length.
Named Entity Ratio: The number of named entities, divided by the sentence length.
Tf/Isf: The term frequency over the sentence, divided by the largest term frequency.
Sentence Similarity: The number of occurrences (regardless of repetition) of the words with the highest Tf/Isf
in the sentence, divided by the sentence length.
Stop Word: The number of stop words in the sentence. Supposing there are N stop words, for sentence s the value
is computed as 1-N/length(s).
Noun Phrase: The number of noun phrases, divided by the sentence length.
Pos Ratio: A 4-dimensional vector containing the numbers of nouns, verbs, adjectives, and adverbs, each divided
by the sentence length.
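As an illustration, the sketch below computes three of the simpler features from Table 1 (position, length, and number ratio) with a simplified whitespace tokenization:

```python
# Sketch of a few of the hand-engineered features in Table 1 (position,
# length, and number ratio); tokenization is simplified for illustration.
def sentence_features(sentences):
    n = len(sentences)
    max_len = max(len(s.split()) for s in sentences)
    feats = []
    for j, s in enumerate(sentences, start=1):
        words = s.split()
        feats.append({
            "position": 1 - (j - 1) / (n - 1) if n > 1 else 1.0,  # 1-(j-1)/(N-1)
            "length": len(words) / max_len,
            "number_ratio": sum(w.isdigit() for w in words) / len(words),
        })
    return feats
```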
SweSum [31] is based on a statistical method. This system considers the frequency of keywords in the document,
the position of sentences, numerical values, and first-paragraph tags. SweSum is available for many languages,
including English, Persian, and Spanish.
BGSumm [32] employs the Unified Medical Language System (UMLS) [33] to identify the main textual concepts.
In this method, concepts are used to build a graph in which the sentences form the graph nodes. The model scores
the nodes based on their centrality, and the sentences with the highest score are selected.
TextRank [34] is a graph-based summarizer in which sentences form the graph nodes. TextRank considers the words
shared between sentences as a measure of similarity. The obtained similarity values are utilized as the weights of
the connections between the nodes. Finally, using graph node rankings, the strongest nodes are selected as
summary sentences.
LexRank [35] is another graph-based method. It models sentences as nodes, uses Term Frequency-Inverse
Document Frequency (TF-IDF) and cosine similarity as a measure of similarity between sentences, and assigns the
resulting values as edge weights. Finally, the sentences with the highest centrality are put together as summary
sentences.
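As a brief illustration of how such edge weights can be obtained, the sketch below uses scikit-learn's TF-IDF vectorizer and cosine similarity (illustrative, not LexRank's original code):

```python
# Sketch of LexRank-style edge weights: one TF-IDF vector per sentence and
# cosine similarity between every pair (scikit-learn; illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["The drug reduced blood pressure.",
             "Blood pressure dropped after treatment.",
             "The trial enrolled two hundred patients."]

tfidf = TfidfVectorizer().fit_transform(sentences)  # one TF-IDF vector per sentence
weights = cosine_similarity(tfidf)                  # symmetric similarity matrix
# weights[i][j] becomes the weight of the edge between sentence i and sentence j
```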
TexLexAn [36] is an open-source summarizer which utilizes features such as keywords to rate sentences.
CNN is a baseline model which uses only the generator. The purpose of this model is to investigate the
importance of the discriminator network in our model. This baseline network is trained directly on the target of
each document.
GAN-Sum and E-GAN-Sum are our proposed models. GAN-Sum uses handcrafted features, and E-GAN-Sum uses
embedding features for summarization.
Table 2 shows that the deep learning methods performed better than the graph-based approaches. In addition, our
models work better than the other deep learning methods. One reason for the superiority of the proposed method
is the way the sentences are scored, where each sentence is scored considering all the other sentences. Another
reason is the use of the voting system, since voting increases the similarity between the generated summary and
the reference summary. Furthermore, increasing the number of features and using embeddings improve the
proposed model. Fig. 6 shows three extracted sentences and the score assigned to each by the generator; bold
words are shared with the reference summary. As can be seen, the sentences with the highest scores contain the
most information from the reference summary.
Table 2
Evaluation of the proposed model, according to ROUGE.