Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation, pages 39–44
Language Resources and Evaluation Conference (LREC 2020), Marseille, 11–16 May 2020
© European Language Resources Association (ELRA), licensed under CC-BY-NC

A Fully Expanded Dependency Treebank for Telugu


Sneha Nallani, Manish Shrivastava, Dipti Misra Sharma
Kohli Center on Intelligent Systems (KCIS),
International Institute of Information Technology, Hyderabad (IIIT-H)
Gachibowli, Hyderabad, Telangana-500032, India
sneha.nallani@research.iiit.ac.in, {m.shrivastava, dipti}@iiit.ac.in

Abstract
Treebanks are an essential resource for syntactic parsing. The available Paninian dependency treebanks for Telugu are annotated only
with inter-chunk dependency relations, so not all words of a sentence are part of the parse tree. In this paper, we automatically annotate
the intra-chunk dependencies in the treebank using a shift-reduce parser based on Context Free Grammar rules for Telugu chunks.
We also propose a few additional intra-chunk dependency relations for Telugu apart from the ones used in the Hindi treebank. Annotating
intra-chunk dependencies provides a complete parse tree for every sentence in the treebank. Having a fully expanded treebank is
crucial for developing end-to-end parsers that produce complete trees. We present a fully expanded dependency treebank for Telugu
consisting of 3220 sentences. In this paper, we also convert the treebank from the Anncorra part-of-speech tagset to the latest
BIS tagset. The BIS tagset is a hierarchical tagset adopted as a unified part-of-speech standard across all Indian languages. The final
treebank is made publicly available.

Keywords: Dependency Treebank, Intra-chunk dependencies, Low resource Language, Telugu

1. Introduction
Treebanks play a crucial role in developing parsers as well as in investigating other linguistic phenomena, which is why there has been a targeted effort to create treebanks in several languages. Notable efforts include the Penn Treebank (Marcus et al., 1993) and the Prague Dependency Treebank (Hajičová, 1998). A treebank is annotated with a grammar. The grammars used for annotating treebanks can be broadly categorized into two types: Context Free Grammars and dependency grammars. A Context Free Grammar consists of a set of rules that determine how the words and symbols of a language can be grouped together, and a lexicon consisting of words and symbols. Dependency grammars, on the other hand, model the syntactic relationship between the words of a sentence directly using head-dependent relations. Dependency grammars are useful in modeling free word order languages, and Indian languages are primarily free word order languages. A few different dependency formalisms have been developed for different languages. In recent years, Universal Dependencies (Nivre et al., 2016) have been developed to arrive at a common dependency formalism for all languages. Paninian dependency grammar (Bharati et al., 1995) is specifically developed for Indian languages, which are morphologically rich and free word order languages. Case markers and post-positions play crucial roles in these languages, and word order is considered only at a surface level when required.
Most Indian languages are also low resource languages. The ICON-2009 and 2010 tools contests made available the initial dependency treebanks for Hindi, Telugu and Bangla. These treebanks are small in size and are annotated using the Paninian dependency grammar. Further efforts are being taken to build dependency annotated treebanks for Indian languages. Hindi and Urdu multi-layered and multi-representational (Bhatt et al., 2009) treebanks have been developed. Treebanks are also being developed for Bengali, Kannada, Hindi, Malayalam and Marathi as part of the Indian Language Treebanking project. These treebanks are annotated in Shakti Standard Format (SSF) (Bharati et al., 2007). Each sentence is annotated at word level with part-of-speech tags and at morphological level with root, gender, number, person, TAM, vibhakti and case features, while the dependency relations are annotated at chunk level. The dependency relations within a chunk are left unannotated. Intra-chunk dependency annotation has previously been done on the Hindi (Kosaraju et al., 2012) and Urdu (Bhat, 2017) treebanks. Annotating intra-chunk dependencies leads to a complete parse tree for every sentence in the treebank. Having completely annotated parse trees is essential for building robust end-to-end dependency parsers, or for making the treebanks available in CoNLL (Buchholz and Marsi, 2006) format and thereby making use of readily available parsers. In this paper, we extend one of those approaches to the Telugu treebank to annotate intra-chunk dependency relations. Telugu is a highly inflected, morphologically rich language and has a few constructions, such as classifiers, that do not occur in Hindi, which makes the expansion task challenging. The fully expanded Telugu treebank is made publicly available¹.
The part-of-speech and chunk annotation of the Telugu treebank is done following the Anncorra (Bharati et al., 2009a) tagset developed for Indian languages. In recent years, there has been a co-ordinated effort to develop a Unified Parts-of-Speech (POS) Standard that can be adopted across all Indian languages. This tagset is commonly referred to as the BIS² (Bureau of Indian Standards) tagset. All the latest part-of-speech annotation of Indian languages is done using the BIS tagset. In this paper, we convert the existing Telugu treebank from the Anncorra to the BIS standard. The BIS tagset is a fine-grained hierarchical tagset, and many Anncorra tags diverge into finer-grained BIS categories. This makes the conversion task challenging.
The rest of the paper is organised as follows. Section 2 describes the Telugu Dependency Treebank, section 3 describes the part-of-speech conversion from the Anncorra to the BIS standard, section 4 describes the annotation of intra-chunk dependency relations for Telugu, and section 5 concludes the paper.

¹ https://github.com/ltrc/telugu_treebank
² The BIS tagset is made available at http://tdil-dc.in/tdildcMain/articles/134692Draft%20POS%20Tag%20standard.pdf

2. Telugu Treebank
An initial Telugu treebank consisting of around 1600 sentences was made available in the ICON 2009 tools contest. This treebank is combined with the HCU Telugu treebank, containing approximately 2000 similarly annotated sentences, and another 200 sentences annotated at IIIT Hyderabad. We clean up the treebank by removing sentences with a wrong format or incomplete parse trees. The final treebank consists of 3220 sentences. Details about the treebank are listed in Table 1.

No. of sentences             3222
Avg. sent length             5.5 words
Avg. no of chunks in sent    4.2
Avg. length of a chunk       1.3 words
Table 1: Telugu treebank stats

The treebank is annotated using Paninian dependency grammar (Bharati et al., 1995). The Paninian dependency relations are created around the notion of karakas, the various participants in an action. These dependency relations are syntactico-semantic in nature. There are 40 different dependency labels specified in the Paninian dependency grammar. These relations are hierarchical, and certain relations can be under-specified in cases where a finer analysis is not required or where the decision is more difficult for the annotators (Bharati et al., 2009b). Begum et al. (2008) describe the guidelines for annotating dependency relations for Indian languages using Paninian dependencies. The treebank is annotated at word level with part-of-speech tags and morphological information such as root, gender, number, person, TAM, and vibhakti or case markers. The dependency relations are annotated at chunk level. The treebank is made available in SSF format (Bharati et al., 2007). An example is shown in Figure 1. The dependency tree for the sentence is shown in Figure 2.

Figure 1: Inter-chunk dependency annotation in SSF format
Figure 2: Inter-chunk dependency tree.

In the example sentence, the intra-chunk dependencies, i.e., the dependency labels for cAlA (many) and I (this), are not annotated. Only the chunk heads, xeSAllo (countries-in) and parisWiwi (situation), are annotated as the children of lexu (is-not-there).
The dependency treebanks are manually annotated, which is a time-consuming process. In the AnnCorra formalism for Indian languages, a chunk is defined as a minimal, non-recursive phrase consisting of correlated, inseparable words or entities (Bharati et al., 2009a). Since the dependencies within a chunk can be easily and accurately identified based on a few language-specific rules, these dependencies were not annotated in the initial phase. But inter-chunk annotation alone does not provide a fully constructed parse tree for the sentence. Hence it is important to determine and annotate intra-chunk relations accurately. In this paper, we expand the Telugu treebank by annotating the intra-chunk dependency relations.

3. Part-of-Speech Conversion
The newly annotated 200 sentences in the treebank are annotated with the BIS tagset, while the rest are annotated using the Anncorra tagset. We convert the sentences with Anncorra POS tags to BIS tags so that the treebank is uniformly annotated and adheres to the latest standards.
Anncorra tagset: Bharati et al. (2009a) propose the POS standard for annotating Indian languages. This standard was developed as part of the guidelines for annotating corpora in Indian languages for the Indian Language Machine Translation (ILMT) project and is commonly referred to as the Anncorra POS tagset. The tagset consists of a total of 26 tags.
BIS tagset: The BIS (Bureau of Indian Standards) tagset is a unified POS standard developed to standardize the POS tagging of all the Indian languages. This tagset is hierarchical, and at the topmost level consists of 11 POS categories. Most of these categories are further divided into several fine-grained POS tags. The annotators can choose the level of coarseness required: they can use the highest-level tags for a coarse-grained tagset or go deeper down the hierarchy for more fine-grained tags. The fine-grained tags automatically contain the information of the parent tags. For example, the tag V_VM_VF specifies that the word is a verb (V), a main verb (V_VM) and a finite main verb (V_VM_VF).

3.1. Converting Anncorra to BIS
For most tags present in the Anncorra tagset, there is a direct one-to-one mapping to a BIS tag. However, there
are a few tags in Anncorra that diverge into many fine-grained BIS categories. Those tags are shown in Table 2. Note that the one-to-many mapping exists only with fine-grained tags. There is still a one-to-one mapping between the Anncorra tag and the corresponding parent BIS tag in all cases except question words.

Anncorra POS tag        BIS POS tag
PRP (Pronoun)           PR_PRP, PR_PRF, PR_PRL, PR_PRC, PR_PRQ
DEM (Demonstrative)     DM_DMD, DM_DMR, DM_DMQ
VM (Main verb)          V_VM_VF, V_VM_VNF, V_VM_VINF, V_VM_VNG, N_NNV
CC (Conjunct)           CC_CCD, CC_CCS
WQ (Question word)      DM_DMQ, PR_PRQ
SYM (Symbol)            RD_SYM, RD_PUNC
RDP (Reduplicative)     -
*C (Compound)           -
Table 2: Fine-grained BIS tags corresponding to Anncorra tags.

During conversion, we aim to annotate with the most fine-grained BIS tag. When the fine-grained tag cannot be determined, we fall back to the parent tag. We use a tagset converter that maps the tags in the Anncorra schema to the tags in the BIS schema. For tags with multiple possibilities, a list-based approach is used. Most Anncorra tags that diverge into fine-grained BIS tags are for function words, which are limited in number. Separate lists consisting of words belonging to the fine-grained BIS categories are created. A word is annotated with a fine-grained BIS tag if it is present in the corresponding word list; otherwise it is annotated with the parent tag.
Pronouns: One of the main distinctions between the two tagsets is in the annotation of pronouns. In Anncorra, all pronouns are annotated with a single tag, PRP. The BIS schema contains separate tags for personal (PR_PRP), reflexive (PR_PRF), relative (PR_PRL) and reciprocal (PR_PRC) pronouns and for question words (PR_PRQ). Pronouns in a language are generally limited in number. In Telugu, however, pronouns can be inflected with case markers, so there can be a huge number of them. When a pronoun is not found in any word list, it is annotated with the parent tag PR.
Demonstratives: In Anncorra, there is a single tag for annotating demonstratives, whereas the BIS tagset distinguishes between deictic, relative and question-word demonstratives. Demonstratives are limited in number, and the same list-based approach used for pronouns is applied here.
Symbols: Symbols are separated into symbols and punctuation.
Question words: These are separated into pronoun question words and demonstrative question words in the BIS tagset. Demonstrative question words are always followed by a noun. While resolving question words (WQ), if the word is followed by a noun it is marked as DM_DMQ; otherwise it is marked as PR_PRQ.
Verbs: Another distinction between the two tagsets lies in the annotation of verb finiteness. In Anncorra, it is annotated only at chunk level. In the BIS schema, finiteness can be annotated at word level. While resolving verbs (V_VM), we look at the verb chunk. There is a one-to-one mapping between Anncorra chunk types and the fine-grained BIS verb categories.
Compounds and reduplicatives: In the Anncorra schema, there are separate tags for identifying reduplicatives (RDP) and parts of compounds (*C). For example, a noun compound consisting of two words is tagged as NNC and NN. Examples of reduplicative and noun compound constructions in Telugu are shown below.

Anncorra: maMci (good) JJ  maMci (good) RDP  cIralu (sarees) NN
BIS:      maMci JJ  maMci JJ  cIralu N_NN

Anncorra: boVppAyi (papaya) NNC  kAya (fruit) NN
BIS:      boVppAyi N_NN  kAya N_NN

These two tags are done away with in the BIS schema. A reduplicative (RDP) is marked with the POS tag of the word preceding it, and a compound part (*C) is marked with the POS tag of the word following it.

4. Annotating Intra-chunk Dependencies
The intra-chunk annotation in SSF format for the sentence in Figure 1 is shown in Figure 4, and the fully expanded dependency tree is shown in Figure 3.

Figure 3: Intra-chunk dependency tree.

It can be seen that, in this case, unlike in Figure 2, cAlA (many) is attached to its chunk head, xeSAllo (countries-in), and I (this) is attached to its chunk head, parisWiwi (situation). The parse tree for the sentence is now complete. Complete parse trees are useful for creating end-to-end parsers which do not require intermediate pipeline tools like POS taggers, morphological analyzers and shallow parsers. This is a huge advantage, especially for low resource languages like Telugu.
Kosaraju et al. (2012) first proposed the guidelines for annotating intra-chunk dependency relations in SSF format for Hindi. They propose a total of 12 intra-chunk dependency labels, listed in Table 3. lwg refers to local word group and pof refers to part of.
They also propose two approaches, one rule-based and another statistical, for automatically annotating intra-chunk dependencies in Hindi. In the rule-based approach, several rules are created, constrained upon the POS, the chunk name or type, and the position of the chunk head with respect to the child node. The intra-chunk dependencies are
marked based on these rules. In the statistical approach, MaltParser (Nivre et al., 2006) is used to identify the intra-chunk dependencies. A model is trained with MaltParser on a few manually annotated chunks, and the same model is used to predict the intra-chunk dependencies for the rest of the treebank.

Figure 4: Intra-chunk dependency annotation in SSF format.

nmod adj       adjectives modifying nouns or pronouns
lwg psp        post-positions
lwg neg        negation
lwg vaux       verb auxiliaries
lwg rp         particles
lwg uh         interjection
lwg cont       continuation
pof redup      reduplication
pof cn         compound nouns
pof cv         compound verbs
jjmod intf     adjectival intensifier
rsym           symbols
Table 3: Intra-chunk dependencies proposed for Hindi

Bhat (2017) proposes a different approach for annotating intra-chunk dependencies for Hindi and Urdu that combines rule-based and statistical approaches. Instead of a completely rule-based system, a Context Free Grammar (CFG) is created for identifying intra-chunk dependencies. The dependencies within a chunk are annotated based on the CFG using a shift-reduce parser.

4.1. Intra-chunk dependency annotation for Telugu treebank
In addition to the twelve dependency labels proposed for Hindi, we introduce a few more labels, nmod, nmod wq, adv and intf, for annotating intra-chunk dependencies in the Telugu treebank. nmod and adv are already present in the inter-chunk dependency labels (Bharati et al., 2009b).
nmod: This dependency relation is used when demonstratives, proper nouns, pronouns and quantifiers modify a noun or pronoun.
intf: Intensifiers (RP_INTF) can modify both adjectives and adverbs. So we replace jjmod intf with intf and use the same dependency label when an intensifier modifies an adverb or an adjective.
nmod wq: This dependency relation is used when question words modify nouns inside a chunk.
adv: This dependency relation is used when adverbs modify a verb inside a chunk.
pof cv: Compound verbs are combined together in Telugu, so this dependency relation is not seen in Telugu. An example of a compound verb is kOsEswAnu, a compound of kOsi and vEs-wAnu. In cases like ceyyAlsi vaccindi, vaccindi is annotated as an auxiliary verb.
lwg rp: This dependency label is used to annotate particles like gAru, kUdA etc. It is also used for classifiers. Telugu contains classifiers, and a commonly used classifier is maMxi. It specifies that the noun following maMxi is human. Sometimes the following noun can be dropped, and in those cases maMxi is treated as a noun. Classifiers are categorized under particles. So, maMxi is marked as a child of koVMwa using the label lwg rp in the above example.
lwg psp: In Telugu most post-positions occur as inflections of content words, but a few of them also occur separately. The ones occurring separately are marked as
lwg psp. Sometimes, spatio-temporal nouns (N_NST) also act as post-positions when occurring alongside nouns. In these cases, they are annotated as lwg psp.

In this paper, we follow the approach proposed by Bhat (2017), which makes use of a Context Free Grammar (CFG) and a shift-reduce parser for automatically annotating intra-chunk dependencies. We use the treebank expander code made available by Bhat (2017)³ and write the Context Free Grammar for Telugu. The Context Free Grammar is generated using the POS tags and creates a mapping between head and child POS tags and dependency labels.
The intra-chunk annotation is done using a shift-reduce parser which internally uses the Arc-Standard (Nivre, 2004) transition system. The parser predicts a sequence of transitions starting from an initial configuration to a terminal configuration, and annotates the chunk dependencies in the process. A configuration consists of a stack, a buffer, and a set of dependency arcs. In the initial configuration, the stack is empty, the buffer contains all the words in the chunk, and the set of intra-chunk dependencies is empty. In the terminal configuration, the buffer is empty and the stack contains only one element, the chunk head; the chunk sub-tree is given by the set of dependency arcs. The next transition is predicted based on the Context Free Grammar and the current configuration.

4.1.1. Results
We evaluate the intra-chunk dependency relations annotated by the parser for 106 sentences. The test set evaluation results are shown in Table 4.

Test sentences    LAS     UAS
106               93.7    95.8
Table 4: Intra-chunk dependency annotation accuracies.

Almost all of the wrongly annotated chunks are due to POS errors or chunk boundary errors. Since the Context Free Grammar rules are written using POS tags, errors in the annotation of POS tags automatically lead to errors in intra-chunk dependency annotation. The dependency relations are annotated within the chunk boundaries, so any errors in chunk boundary identification also lead to errors in intra-chunk dependency annotation.
Telugu is an agglutinative language and the chunk size rarely exceeds three words. The CFG-based approach works accurately provided there are no errors in POS or chunk annotation.

³ https://github.com/ltrc/Shift-Reduce-Chunk-Expander

5. Conclusion
In this paper, we automatically annotate the Telugu dependency treebank with intra-chunk dependency relations, thus providing complete parse trees for every sentence in the treebank. We also convert the Telugu treebank from the AnnCorra part-of-speech tagset to the latest BIS tagset. We make the fully expanded Telugu treebank publicly available to facilitate further research.

6. Acknowledgements
We would like to thank Himanshu Sharma for making the Hindi tagset converter code available and Parameshwari Krishnamurthy and Pruthwik Mishra for providing relevant input. We also thank all the reviewers for their insightful comments.

7. Bibliographical References
Begum, R., Husain, S., Dhwaj, A., Sharma, D. M., Bai, L., and Sangal, R. (2008). Dependency annotation scheme for Indian languages. In Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II.
Bharati, A., Chaitanya, V., Sangal, R., and Ramakrishnamacharyulu, K. (1995). Natural language processing: a Paninian perspective. Prentice-Hall of India New Delhi.
Bharati, A., Sangal, R., and Sharma, D. M. (2007). SSF: Shakti standard format guide. Language Technologies Research Centre, International Institute of Information Technology, Hyderabad, India, pages 1–25.
Bharati, A., Sharma, D. M., Bai, L., and Sangal, R. (2009a). Anncorra: Annotating corpora guidelines for POS and chunk annotation for Indian languages. LTRC, IIIT Hyderabad.
Bharati, A., Sharma, D. M., Husain, S., Bai, L., Begam, R., and Sangal, R. (2009b). Anncorra: Treebanks for Indian languages, guidelines for annotating Hindi treebank. LTRC, IIIT Hyderabad.
Bhat, R. A. (2017). Exploiting linguistic knowledge to address representation and sparsity issues in dependency parsing of Indian languages. PhD thesis, IIIT Hyderabad.
Bhatt, R., Narasimhan, B., Palmer, M., Rambow, O., Sharma, D., and Xia, F. (2009). A multi-representational and multi-layered treebank for Hindi/Urdu. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), pages 186–189, Suntec, Singapore, August. Association for Computational Linguistics.
Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pages 149–164, New York City, June. Association for Computational Linguistics.
Hajičová, E. (1998). Prague dependency treebank: From analytic to tectogrammatical annotations. Proceedings of 2nd TST, Brno, Springer-Verlag Berlin Heidelberg New York, pages 45–50.
Kosaraju, P., Ambati, B. R., Husain, S., Sharma, D. M., and Sangal, R. (2012). Intra-chunk dependency annotation: Expanding Hindi inter-chunk annotated treebank.
In Proceedings of the Sixth Linguistic Annotation Workshop, pages 49–56, Jeju, Republic of Korea, July. Association for Computational Linguistics.
Marcus, M. P., Santorini, B., and Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Nivre, J., Hall, J., and Nilsson, J. (2006). MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy, May. European Language Resources Association (ELRA).
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. (2016). Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portorož, Slovenia, May. European Language Resources Association (ELRA).
Nivre, J. (2004). Incrementality in deterministic dependency parsing. In Proceedings of the Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain, July. Association for Computational Linguistics.
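
The list-based conversion described in Section 3.1 can be illustrated with a minimal sketch. It is not the converter used for the treebank. The two direct mappings (NN to N_NN, JJ to JJ) and the two test sentences are taken from the examples in Section 3.1; the word-list contents, the NOUN_TAGS and COMPOUND_TAGS sets, and all function names are assumptions made for illustration. Verb finiteness, which depends on the chunk type, is left out.

```python
from typing import Dict, List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word in WX notation, POS tag)

# Direct one-to-one mappings; only these two appear in the paper's examples.
DIRECT: Dict[str, str] = {"NN": "N_NN", "JJ": "JJ"}
# Parent BIS tag used when no fine-grained tag can be determined.
PARENT: Dict[str, str] = {"PRP": "PR", "DEM": "DM"}
# Hypothetical word lists per fine-grained tag (placeholders, not the real lists).
WORD_LISTS: Dict[str, Dict[str, Set[str]]] = {
    "PRP": {"PR_PRP": {"nenu"}},
    "DEM": {"DM_DMD": {"I"}},
}
NOUN_TAGS: Set[str] = {"NN", "NNC", "NNP"}   # assumed set of Anncorra noun tags
COMPOUND_TAGS: Set[str] = {"NNC"}            # extend with the other *C tags


def _base_tag(sentence: List[Token], i: int) -> str:
    """Resolve every tag except RDP and *C, which depend on a neighbour."""
    word, tag = sentence[i]
    if tag in DIRECT:
        return DIRECT[tag]
    if tag in WORD_LISTS:
        for fine_tag, words in WORD_LISTS[tag].items():
            if word in words:
                return fine_tag
        return PARENT[tag]  # word not in any list: fall back to the parent tag
    if tag == "WQ":
        next_tag = sentence[i + 1][1] if i + 1 < len(sentence) else None
        return "DM_DMQ" if next_tag in NOUN_TAGS else "PR_PRQ"
    return tag  # tags not covered by this sketch are left unchanged


def convert(sentence: List[Token]) -> List[Token]:
    """Convert (word, Anncorra tag) pairs into (word, BIS tag) pairs."""
    n = len(sentence)
    bis: List[Optional[str]] = [None] * n
    for i, (_, tag) in enumerate(sentence):
        if tag != "RDP" and tag not in COMPOUND_TAGS:
            bis[i] = _base_tag(sentence, i)
    # A compound part takes the tag of the word that follows it.
    for i in range(n - 2, -1, -1):
        if sentence[i][1] in COMPOUND_TAGS:
            bis[i] = bis[i + 1]
    # A reduplicative takes the tag of the word that precedes it.
    for i in range(1, n):
        if sentence[i][1] == "RDP":
            bis[i] = bis[i - 1]
    # Anything still unresolved keeps its original tag.
    return [(w, b if b is not None else t) for (w, t), b in zip(sentence, bis)]
```

Calling `convert` on the two Anncorra examples from Section 3.1 reproduces the BIS lines shown there; a pronoun that is missing from every list comes back with the parent tag PR.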

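
The arc-standard procedure of Section 4.1 (stack, buffer, set of arcs, terminal configuration with only the chunk head on the stack) can be sketched as follows. This is an illustration only and not the treebank expander code of Bhat (2017): the grammar is reduced to a lookup table from (head POS, child POS) to a label, the one-word lookahead used to delay RIGHT-ARC is a simplification introduced here, and the example rules, POS assignments and label spellings are hypothetical.

```python
from typing import Dict, List, Tuple

# (head POS, child POS) -> dependency label
Rules = Dict[Tuple[str, str], str]
Arc = Tuple[int, int, str]  # (head index, child index, label)

# Hypothetical rules for illustration; not the grammar written for the treebank.
EXAMPLE_RULES: Rules = {
    ("N_NN", "DM_DMD"): "nmod",    # demonstrative modifying a noun
    ("N_NN", "QT_QTF"): "nmod",    # quantifier modifying a noun
    ("N_NN", "PSP"): "lwg_psp",    # separately occurring post-position
}


def parse_chunk(tags: List[str], rules: Rules) -> Tuple[int, List[Arc]]:
    """Reduce one chunk to a single head using arc-standard transitions."""
    if not tags:
        raise ValueError("empty chunk")
    stack: List[int] = []
    buffer: List[int] = list(range(len(tags)))  # initial configuration
    arcs: List[Arc] = []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            # LEFT-ARC: the top of the stack heads the item below it.
            label = rules.get((tags[top], tags[below]))
            if label is not None:
                arcs.append((top, below, label))
                del stack[-2]
                continue
            # RIGHT-ARC: the item below heads the top. It is delayed while the
            # top could still take the next buffer word as its own child.
            label = rules.get((tags[below], tags[top]))
            top_can_grow = bool(buffer) and (tags[top], tags[buffer[0]]) in rules
            if label is not None and not top_can_grow:
                arcs.append((below, top, label))
                stack.pop()
                continue
        if not buffer:
            raise ValueError("no rule applies; chunk cannot be reduced to one head")
        stack.append(buffer.pop(0))  # SHIFT
    # Terminal configuration: empty buffer, only the chunk head on the stack.
    return stack[0], arcs
```

For a two-word chunk tagged `["DM_DMD", "N_NN"]`, `parse_chunk` returns head index 1 with a single nmod arc from the noun to the demonstrative. A chunk for which no rule applies raises an error instead of producing a partial tree, which mirrors the observation in Section 4.1.1 that POS errors lead directly to annotation errors.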

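
Table 4 reports labeled and unlabeled attachment scores (LAS and UAS). The sketch below uses the standard definitions: UAS is the percentage of tokens whose predicted head is correct, and LAS additionally requires the correct label. Whether the paper scores only intra-chunk tokens or all tokens is not stated, so the token selection, the data layout and the function name are assumptions.

```python
from typing import List, Tuple

Attachment = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Attachment], pred: List[Attachment]) -> Tuple[float, float]:
    """Return (LAS, UAS) as percentages over aligned gold and predicted tokens."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    # UAS counts tokens with the correct head, whatever the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens with both the correct head and the correct label.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * las_hits / total, 100.0 * uas_hits / total
```

With four tokens of which two are fully correct, one has the right head but the wrong label, and one has the wrong head, the function returns an LAS of 50.0 and a UAS of 75.0.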