NLP Sentence Simplification Guide

The document describes a paper presented at the 2019 4th International Conference on Information Systems and Computer Networks held in Mathura, India from November 21-22, 2019. The paper was presented by Avishek Garain, Arpan Basu, Rudrajit Dawn, and Sudip Kumar Naskar from Jadavpur University. It proposes a rule-based approach to simplify English sentences from complex and compound sentences using syntactic parse trees. The approach consists of two separate algorithms to simplify complex and compound sentences into simple forms.

2019 4th International Conference on Information Systems and Computer Networks (ISCON)

GLA University, Mathura, UP, India. Nov 21-22, 2019

Sentence Simplification using Syntactic Parse trees

Avishek Garain, Computer Science and Engineering, Jadavpur University, avishekgarain@[Link]
Arpan Basu, Computer Science and Engineering, Jadavpur University, arpan0123@[Link]
Rudrajit Dawn, Computer Science and Engineering, Jadavpur University, rudrajitdawn@[Link]
Sudip Kumar Naskar, Computer Science and Engineering, Jadavpur University, [Link]@[Link]

Abstract—Text simplification is a domain of Natural Language Processing that offers great promise for exploration. Simplifying sentences gives better results than dealing directly with complex or compound sentences in many language processing applications. Recently, neural networks have been used to simplify text, whether through state-of-the-art LSTM and GRU cells or through reinforcement learning models. In contrast, in this work we present a classical approach consisting of two separate algorithms for simplifying complex and compound sentences to their corresponding simple forms.

Keywords- Syntactic Parse tree, SBAR, BLEU, Anytree

I. INTRODUCTION

At the most basic level, a complex or compound sentence can be differentiated from a simple sentence by the presence of conjunctions (coordinating or subordinating).

A compound sentence can be defined as an amalgamation of two or more independent clauses based on a similar idea. The independent clauses can be joined by a coordinating conjunction (for, and, nor, but, or, yet, so) or by a semicolon, as in the compound sentence examples below.

• She did not cheat on the test, for it was the wrong thing to do.
• I really need to go to work, but I am too sick to drive.
• The sky is clear; the stars are twinkling.

A complex sentence, on the other hand, is similar to a compound sentence but differs in that it consists of at least one dependent clause related to an independent clause. An independent clause can stand alone as a sentence and always expresses a complete thought. A dependent clause cannot stand alone, even though it has a subject and a verb. Let us look at some examples of complex sentences.

• I was snippy with him because I was running late for work.
• Because I was running late for work, I was snippy with him.

Sentence simplification is necessary for dealing with problems related to long sentences, such as embedded clauses (for example, relative clauses). To overcome these problems, a long sentence is simplified into smaller sentences. Sentence simplification can be used as a pre-processing tool in several applications such as Natural Language Processing, Query Processing and Speech Processing. In Text Summarization, sentence simplification is used to shorten the original text without losing the meaning of the content. It has also been shown that training Machine Translation systems with simple sentences only leads to better translation outputs [7].

Recently, many splitting techniques have been developed for the English language, mainly using neural networks, whether through state-of-the-art LSTM and GRU cells or through reinforcement learning models. In this work, we have developed a classical, rule-based approach that can simplify complex and compound English sentences. Since the work is based on simple syntactic rules, the system is fast and modular as well.

The rest of the paper is organized as follows. Section II reviews the state of the art that has been developed so far in this domain. Section III describes the data on which the work has been done. Section IV presents the methodology of developing our prototype. The paper ends with the results and concluding remarks in Sections V and VI, respectively.

II. RELATED WORK

Zhang and Lapata [14] proposed a model called Deep Reinforcement Sentence Simplification, in which an encoder-decoder model is used along with a deep reinforcement learning framework to carry out the task of simplification, and reported good accuracy scores.

Vu et al. [13] used an architecture with augmented memory capabilities, named Neural Semantic Encoders, and reported BLEU scores as high as 92.02. The model gives unrestricted access to the entire source sequence (the complex sentence) stored in memory and thus, through an attention model, gives importance to each and every word of the sentence while also keeping priorities for arrangement.

Guo et al. [3] introduced a sequence-to-sequence learning model improved with paraphrasing capabilities via multi-task learning. They proposed a multi-level layered soft sharing approach in which each auxiliary task shares different layers of the sentence simplification model, depending on the task's semantic versus lexico-syntactic nature.

Poornima et al. [10] used rule-based sentence simplification in their English-to-Tamil machine translation system.

978-1-7281-3651-6/19/$31.00 ©2019 IEEE 672

Authorized licensed use limited to: North Eastern Hill University. Downloaded on April 25,2023 at [Link] UTC from IEEE Xplore. Restrictions apply.
They made use of dependency parse trees for simplification. It is interesting to note that they reported a significant increase in accuracy, and also a decrease in error rates, due to the simplification of complex and compound sentences to simple sentences. This report is promising for the immediate use of our algorithm in several other fields of Natural Language Processing and for validating the variation in performance.

Jonnalagadda et al. [4] presented an effective summarization system based on text simplification, and the results were quite promising.

V et al. [11] used strategies such as a task-focused extractive summarization system, simplification of sentences, and lexical expansion of topic words to increase content responsiveness.

III. DATA

No state-of-the-art corpus of complex and compound sentences aligned with their corresponding simple sentences is available for testing our algorithm. We therefore developed our own dataset by collecting compound and complex sentences from various digital resources. In this way 10,000 sentences were collected, of which 5,200 were complex sentences and 4,800 were compound sentences.

IV. METHODOLOGY

Our sentence simplification algorithm consists of two parts: the first simplifies compound sentences, in which clauses are connected by coordinating conjunctions, and the second simplifies complex sentences, in which clauses are connected by subordinating conjunctions. The collected sentences, after POS tagging with the NLTK [6] POS tagger, were shallow parsed beforehand using the Stanford Parser [8] to obtain the parse tree structure. For both simplification modules, the algorithms work on the same parse tree structure obtained earlier. The parse tree structures for sample complex and compound sentences are shown in Figures 1 and 2, respectively.

[Fig. 1: Complex sentence parse tree sample, for the sentence "I was snippy on him because I was running late for work".]

[Fig. 2: Compound sentence parse tree sample, for the sentence "I really need to go to work , but I am too sick to drive".]

A. Simplification of Compound Sentences

1) Outline: The first part of the algorithm splits a sentence on the coordinating conjunctions present in the parse tree. Sentences of this type generally contain a CC tag. The algorithm takes advantage of this tag and the corresponding syntactic parse tree to split the constructions joined together by the conjunction. The algorithm is applied recursively to split sentences into simple sentences until no CC tag remains.

2) Algorithm: For the input sentence, a syntactic parse tree is first generated. The node corresponding to the CC tag is then found; let it be called X. Next, we find the parent of the parent of X; let it be called Y. Each node that is a sibling of X is attached to node Y, in turn, to obtain distinct parse trees. The CC tag and the related punctuation (mainly commas) are removed.

Each generated parse tree is a simplified sentence devoid of the conjunction under consideration. We note that a generated parse tree is not necessarily a correct parse; we use the trees only to construct the simplified sentences. The above approach is then applied recursively to each sentence thus produced. The algorithm stops when no conjunctions remain.

[Fig. 3: Figure representing the conjunction removal algorithm.]

Figure 3 pictorially represents the conjunction removal algorithm. The blue links are the newly formed links, and each blue circle forms an independent sentence. The red circles denote the parts being discarded.

TABLE I: Simple sentences generated for an example input

Input sentence:
    Melanie bought a Batman game, a strategy game and a Superman game.
Output simple sentences:
    Melanie bought a Batman game .
    Melanie bought a strategy game .
    Melanie bought a Superman game .
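The conjunction removal procedure can be sketched as follows, using the input sentence of Table I. This is a minimal sketch under stated assumptions rather than the authors' implementation: trees are plain nested tuples instead of parser or Anytree objects, the parse of the example sentence is written by hand, and the names `split_on_cc`, `path_to_cc_parent`, `get_at` and `replace_at` are hypothetical. Attaching a sibling of X to Y is modelled here as replacing X's parent with that sibling.

```python
SKIP = {"CC", ",", ";", ":"}  # the conjunction itself and related punctuation


def leaves(tree):
    """Words under a (label, child, ...) tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


def path_to_cc_parent(tree, path=()):
    """Index path from the root to the first node that has a CC child."""
    if isinstance(tree, str):
        return None
    if any(not isinstance(c, str) and c[0] == "CC" for c in tree[1:]):
        return path
    for index, child in enumerate(tree[1:], start=1):
        found = path_to_cc_parent(child, path + (index,))
        if found is not None:
            return found
    return None


def get_at(tree, path):
    for index in path:
        tree = tree[index]
    return tree


def replace_at(tree, path, new):
    """Copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace_at(tree[i], path[1:], new),) + tree[i + 1:]


def split_on_cc(tree):
    """Recursively split until no CC tag remains; returns a list of trees."""
    path = path_to_cc_parent(tree)
    if path is None:
        return [tree]
    results = []
    for sibling in get_at(tree, path)[1:]:
        if sibling[0] in SKIP:
            continue  # drop the CC tag and commas
        results.extend(split_on_cc(replace_at(tree, path, sibling)))
    return results


# Hand-written parse of the Table I input (an assumption, not parser output).
TREE = (
    "S",
    ("NP", ("NNP", "Melanie")),
    ("VP", ("VBD", "bought"),
     ("NP",
      ("NP", ("DT", "a"), ("NNP", "Batman"), ("NN", "game")),
      (",", ","),
      ("NP", ("DT", "a"), ("NN", "strategy"), ("NN", "game")),
      ("CC", "and"),
      ("NP", ("DT", "a"), ("NNP", "Superman"), ("NN", "game")))),
    (".", "."),
)

sentences = [" ".join(leaves(t)) for t in split_on_cc(TREE)]
for sentence in sentences:
    print(sentence)
```

With this hand-written tree, the three printed sentences match the output rows of Table I. Because each split re-runs the search on the new tree, nested or repeated conjunctions are handled by the same recursion.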

B. Simplification of Complex Sentences

1) Outline: A complex sentence is fed to our algorithm as input, and the algorithm returns a list of sentences, none of which contains an SBAR tag. Not all of the returned sentences are necessarily simple: only the SBAR part is removed, and equivalent simpler sentences are returned.

2) Algorithm: First, the shallow parsed tree is converted to an ANYTREE (footnote 1) tree. This is done because ANYTREE trees are easy to handle in code. An example of an ANYTREE parse tree is shown in Figure 4.

Footnote 1: [Link]

[Fig. 4: Example of Anytree parse tree]

Initially, the subtree rooted at the SBAR node is detected. If the SBAR part contains an NP, then sentences can be made inside the SBAR part. An example of this is shown below.

Example:
Because I was late, I became angry.

[Fig. 5: Syntactic parse tree]

If the SBAR part does not contain an NP, then the NP and VBZ from the non-SBAR part are copied to the SBAR part to make sentences inside the SBAR part. This is shown in Figure 5.

Example:
While eating food Ram is singing a song.

[Fig. 6: Syntactic parse tree]

If a CC is inside the SBAR, then the different sub-sentences (S) inside the SBAR are considered as different SBARs, and the previous points are applied to each of them. This is shown in Figure 6.

Example:
While eating food and drinking water Ram is singing a song .

[Fig. 7: Syntactic parse tree]

One sentence can be made from the non-SBAR part. This is shown in Figure 7.

Example:
If he comes I will go. Parsing of this is shown in Figure 8.
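The first two SBAR rules above can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' code: trees are nested tuples rather than Anytree nodes, both parses are written by hand, "the SBAR part contains NP" is interpreted as the clause inside SBAR having its own subject NP as a direct child, and the function names are hypothetical. The rule for a CC inside SBAR is not covered.

```python
def leaves(tree):
    """Words under a (label, child, ...) tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]


def child_with_label(tree, label):
    """First direct child of `tree` carrying `label`, or None."""
    for child in tree[1:]:
        if not isinstance(child, str) and child[0] == label:
            return child
    return None


def split_sbar(sentence):
    """Return [non-SBAR sentence, sentence built from the SBAR clause]."""
    sbar = child_with_label(sentence, "SBAR")
    if sbar is None:
        return [" ".join(leaves(sentence))]
    # Non-SBAR part: every top-level child except the SBAR and commas.
    main = ("S", *[c for c in sentence[1:] if c is not sbar and c[0] != ","])
    clause = child_with_label(sbar, "S")
    clause_words = leaves(clause)
    if child_with_label(clause, "NP") is None:
        # No subject inside SBAR: copy the NP and VBZ from the non-SBAR part.
        subject = child_with_label(main, "NP")
        aux = child_with_label(child_with_label(main, "VP"), "VBZ")
        borrowed = leaves(subject) + (leaves(aux) if aux else [])
        clause_words = borrowed + clause_words
    return [" ".join(leaves(main)), " ".join(clause_words + ["."])]


# Hand-written parses of the two examples (assumptions, not parser output).
WITH_NP = (
    "S",
    ("SBAR", ("IN", "Because"),
     ("S", ("NP", ("PRP", "I")), ("VP", ("VBD", "was"), ("ADJP", ("JJ", "late"))))),
    (",", ","),
    ("NP", ("PRP", "I")),
    ("VP", ("VBD", "became"), ("ADJP", ("JJ", "angry"))),
    (".", "."),
)
WITHOUT_NP = (
    "S",
    ("SBAR", ("IN", "While"),
     ("S", ("VP", ("VBG", "eating"), ("NP", ("NN", "food"))))),
    ("NP", ("NNP", "Ram")),
    ("VP", ("VBZ", "is"),
     ("VP", ("VBG", "singing"), ("NP", ("DT", "a"), ("NN", "song")))),
    (".", "."),
)

print(split_sbar(WITH_NP))
print(split_sbar(WITHOUT_NP))
```

The subordinating conjunction (IN) is dropped in both cases, so neither returned sentence carries an SBAR tag, in line with the outline.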

[Fig. 8: Syntactic parse tree]

Some more parse trees that fit our algorithm perfectly are as follows:

Complex sentence: After Shyam came , Ram went .
Simple sentences: ['Ram went .', 'Shyam came .']

[Fig. 9: Syntactic parse tree]

Complex sentence: After she ate the cake , Emma visited Tony in his room .
Simple sentences: ['Emma visited Tony in his room .', 'she ate the cake .']

[Fig. 10: Syntactic parse tree]

V. RESULTS

Our system was tested using manual and automatic evaluation metrics. For the manual evaluation, two human annotators, A and B, both fluent in English, were asked to classify the results of our algorithm into two classes: correct and incorrect. The inter-annotator agreement for complex and compound sentences is shown in Tables II and III, respectively. The Standard Error (SE) for the agreement ranges between 0.009 and 0.010, which is very promising.

TABLE II: Inter-annotator agreement for complex sentences

                          Annotator B
                          Correct    Incorrect
Annotator A   Correct     4798       20
              Incorrect   30         352
Kappa: 0.929

TABLE III: Inter-annotator agreement for compound sentences

                          Annotator B
                          Correct    Incorrect
Annotator A   Correct     4296       26
              Incorrect   32         446
Kappa: 0.932

For the automatic evaluation, the annotators manually converted 100 complex sentences and 100 compound sentences to simple sentences. The same sets of complex and compound sentences were also converted by our algorithm. The simple sentences produced by our algorithm were then compared with the manually converted ones. The automatic evaluation metric used was BLEU [9]. BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate text to one or more reference texts. The BLEU score was 0.89 for complex sentences and 0.91 for compound sentences. This, again, is highly promising.

Statistics of the complex and compound sentences converted to simple sentences are given in Table IV.

TABLE IV: Statistics of converted sentences

             Complex    Compound
Input        5,200      4,800
Simple       9,800      10,200

VI. CONCLUSION

Our approach differs from the current practice of using deep learning, and hence has large scope for improvement and modification as part of future work. Since the algorithm depends mainly on the generated syntactic parse tree, wrong parse trees may hinder its efficiency. Correct parse trees, as seen above, give perfect results. In the course of this work, we came to identify simple sentences as following the rule NP → VP. These simple sentences are easier for other language processing systems to handle, and may help increase their efficiency in feature extraction and their accuracy. This is left to be explored as part of future work.

ACKNOWLEDGMENT

The authors would like to thank Mr. Sainik Kumar Mahata for his help and suggestions in organizing the paper. We would also like to thank Mr. Sourav Kumar Mandal, who inspired us with his mathematical prowess to carry on this work and give it a proper direction.

REFERENCES

[1] C P, Dhanalakshmi V, Kumar M, Kp S (2011) Rule based sentence simplification for english to tamil machine translation system. International Journal of Computer Applications 25, DOI 10.5120/3050-4147
[2] Coster W, Kauchak D (2011) Simple english wikipedia: A new text simplification task. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, Association for Computational Linguistics, Stroudsburg, PA, USA, HLT ’11, pp 665–669, URL [Link]
[3] Guo H, Pasunuru R, Bansal M (2018) Dynamic multi-level multi-task learning for sentence simplification. CoRR abs/1806.07304, URL [Link]07304, 1806.07304
[4] Jonnalagadda S, Tari L, Hakenberg J, Baral C, Gonzalez G (2009) Towards effective sentence simplification for automatic processing of biomedical text. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Association for Computational Linguistics, Stroudsburg, PA, USA, NAACL-Short ’09, pp 177–180, URL [Link]
[5] Li T, Li Y, Qiang J, Yuan YH (2018) Text Simplification with Self-Attention-Based Pointer-Generator Networks: 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, December 13–16, 2018, Proceedings, Part V, pp 537–545. DOI 10.1007/978-3-030-04221-9_48
[6] Loper E, Bird S (2002) Nltk: The natural language toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, Association for Computational Linguistics, Stroudsburg, PA, USA, ETMTNLP ’02, pp 63–70, DOI 10.3115/1118108.1118117, URL [Link]1118108.1118117
[7] Mahata SK, Mandal S, Das D, Bandyopadhyay S (2018) Smt vs nmt: A comparison over hindi & bengali simple sentences. arXiv preprint arXiv:1812.04898
[8] Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations, pp 55–60, URL [Link]P14-5010
[9] Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics, pp 311–318
[10] Poornima C, Dhanalakshmi V, Anand K, Soman K (2011) Rule based sentence simplification for english to tamil machine translation system. International Journal of Computer Applications 25(8):38–42
[11] V L, Suzuki H, Brockett C (2006) Microsoft research at duc2006: Task-focused summarization with sentence simplification and lexical expansion
[12] Vickrey D, Koller D (2008) Sentence simplification for semantic role labeling. In: Proceedings of ACL-08: HLT, Association for Computational Linguistics, Columbus, Ohio, pp 344–352, URL [Link]anthology/P08-1040
[13] Vu T, Hu B, Munkhdalai T, Yu H (2018) Sentence simplification with memory-augmented neural networks. CoRR abs/1804.07445, URL [Link]07445, 1804.07445
[14] Zhang X, Lapata M (2017) Sentence simplification with deep reinforcement learning. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, pp 584–594, DOI 10.18653/v1/D17-1062, URL [Link]org/anthology/D17-1062
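As a cross-check on the manual evaluation in Section V, the kappa values reported in Tables II and III can be re-derived from the annotator counts. The sketch below assumes that the reported "Kappa" is Cohen's kappa for two annotators, which the paper does not state explicitly; the function and argument names are hypothetical.

```python
def cohen_kappa(both_correct, a_only_correct, b_only_correct, both_incorrect):
    """Cohen's kappa for a 2x2 agreement table between annotators A and B."""
    n = both_correct + a_only_correct + b_only_correct + both_incorrect
    observed = (both_correct + both_incorrect) / n
    a_correct = (both_correct + a_only_correct) / n   # A's rate of "correct"
    b_correct = (both_correct + b_only_correct) / n   # B's rate of "correct"
    # Agreement expected by chance if A and B labelled independently.
    expected = a_correct * b_correct + (1 - a_correct) * (1 - b_correct)
    return (observed - expected) / (1 - expected)


# Counts taken from Table II (complex) and Table III (compound).
kappa_complex = cohen_kappa(4798, 20, 30, 352)
kappa_compound = cohen_kappa(4296, 26, 32, 446)
print(round(kappa_complex, 3), round(kappa_compound, 3))
```

Under this assumption the computed values agree with the reported 0.929 and 0.932 to three decimal places, and the counts in each table sum to the 5,200 complex and 4,800 compound sentences described in Section III.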
