Noun Group and Verb Group Identification For Hindi: Smriti Singh, Om P. Damani, Vaijayanthi M. Sarma
focus of our work is on Hindi, the analysis and implementation methods discussed here can be
applied straightforwardly to other Indian languages. The linguistic features exploited here are
drawn from a range of well-understood grammatical features and are not peculiar to Hindi alone.
b) us ko roko
He-obl ACC stop
‘Stop him’
The first word in 1a should be tagged as Dem while the tagger incorrectly tags it as Pronoun
because of a lack of representative training data. Similarly, sentences 2a and 2b are ambiguous
for the system. In 1a-b and 2a-b, ‘us’ and ‘ve’ are valid candidates for both DEM and PRON tags.
The ambiguity arises for the system because it seeks to resolve it by looking at the words in the
immediate vicinity. Note that these sentences are not ambiguous for a native speaker.
2) a) vo kāl-e ghod-e ko rok rəh-ā hai
he black-obl horse-obl ACC stop prog-masc,sg be-pres
‘He is stopping the black horse’
The TAM (tense, aspect and modality) information of the verbs in the two sentences can also
help in resolving the ambiguity but requires subject-object information in the sentence along with
a syntactic analysis of the sentence.
b) Adjective-Noun ambiguity: An adjective may function as a head noun if the noun is
dropped, and bears the same inflection as the nominal head, as may be seen in 3 and 4.
4) əcch-e kā nətijā əcchā nikəl-t-ā hai
good-obl of result good turn-hab,masc,sg be-pres
‘Do good have good’
If the case marker or postposition immediately follows the adjective, it is treated as a nominal
head. Since the occurrence of əcche (or any other adjective) as an adjective is more likely than its
occurrence as a noun in any learning corpus, this will result in incorrect learning and
consequently, in incorrect tagging. NG identification rules help in resolving such ambiguity by
using the featural information of the NG constituents.
d) Noun-Verb ambiguity: Many nouns may appear as verbs (even when inflected) and
vice versa in Hindi1. Verbs may appear as verbal nouns in their infinitival form and may function
as nouns. Nevertheless, a verbal noun retains many of its verbal properties. While functioning as
a noun, it appears only in the ‘singular, oblique’ and the ‘singular, direct’ cases and inflects like
other /ā/ ending masculine nouns in the language.
5) tair-nā bəhut lābhkārī hai
swim-Inf very beneficial be-pres
‘Swimming is very beneficial’
1 For example, in the sentence ‘mere kəī khāte haĩ’ the token ‘khāte’ is ambiguous for NOUN (pl, direct) and VERB
by the words that belong to pre-nominal categories. NGs are formed around a noun or a pronoun
Both items may appear together - vo tumhārī mīthī bātẽ or tumhārī vo mīthī bātẽ
One item appears without the other - vo mīthī bātẽ or tumhārī mīthī bātẽ
Set 2 includes intensifiers and numerals. A numeral may be of the type - approximate, fractional,
universal quantifier, indefinite quantifier, multiplicative, aggregative, ordinal, cardinal and
measure word. Kachru (2006:133) provides the ordering among Hindi quantifiers as in 10.
10) approximate-cardinal-collective-ordinal-multiplicative/fractional-measure
We modified this ordering to capture the arrangement of numerals in a more elaborated way (in
Figure 1 below).
Figure 1: Ordering of quantifiers in a Hindi NG
The categories within curly braces are mutually exclusive while those separated by a hyphen ‘−’
can appear one after another in a sequence. The ordering suggests that:
• Approximate quantifier and ordinal (e.g., *ləgbhəg dūsrā vyəkti ‘around second man’),
universal and indefinite quantifier (e.g., *səbhī kuch log ‘all few people’), cardinal and
aggregative (e.g., *do donõ log), aggregative and multiplicative (e.g., *donõ dugunā), and
fractional and aggregative/multiplicative quantifier (e.g., *ādhā donõ, *ādhā dugunā) are
mutually exclusive.
• An intensifier may precede an indefinite quantifier but not a universal quantifier (e.g., bəhut
kəm log ‘very few people’ (intensifier-indefinite quantifier), bəhut səbhī log ‘very all people’
(intensifier-universal quantifier)).
• Fractionals do not appear with aggregative or multiplicative quantifiers (e.g., *ādhā donõ,
*ādhā dugunā).
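The co-occurrence restrictions listed above can be written down as a small lookup table. The following is only a sketch: the category labels and the function name `quantifier_violations` are illustrative assumptions, and only the restrictions stated in the text are encoded, not the full ordering of Figure 1.

```python
from itertools import combinations

# Pairs of quantifier categories that the text describes as mutually exclusive.
# The label strings are hypothetical names chosen for this sketch.
EXCLUSIVE_PAIRS = {
    frozenset(pair)
    for pair in [
        ("approximate", "ordinal"),
        ("universal", "indefinite"),
        ("cardinal", "aggregative"),
        ("aggregative", "multiplicative"),
        ("fractional", "aggregative"),
        ("fractional", "multiplicative"),
    ]
}


def quantifier_violations(categories):
    """Return the category pairs in a left-to-right sequence that break a restriction."""
    found = []
    # Mutually exclusive categories may not co-occur anywhere in the same NG.
    for a, b in combinations(categories, 2):
        if frozenset((a, b)) in EXCLUSIVE_PAIRS:
            found.append((a, b))
    # An intensifier may precede an indefinite but not a universal quantifier.
    for left, right in zip(categories, categories[1:]):
        if left == "intensifier" and right == "universal":
            found.append((left, right))
    return found


# *ləgbhəg dūsrā vyəkti: approximate + ordinal is flagged.
print(quantifier_violations(["approximate", "ordinal"]))
# bəhut kəm log: intensifier + indefinite is accepted.
print(quantifier_violations(["intensifier", "indefinite"]))
```

A sequence is accepted when the returned list is empty.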
Set 3 includes adjectives (including imperfective or perfective verbal adjectives), which are optional
in an NG. Several adjectives may appear inside an NG recursively. Examples include bhāgtā huā
kālā ghoɽā (running black horse), bhāgtā huā ghoɽā (running horse), kālā ghoɽā (black horse),
thəkā huā kālā ghoɽā (tired black horse), thəkā huā ghoɽā (tired horse).
3.1 Procedure for Noun Group Identification
The Noun Group identification module attempts to isolate the basic non-recursive NG that
includes only one head and its specifiers and modifiers. The input to the algorithm is the output
of a morphological analyser. For each word, the morphological analyser gives the stem and the set
of suffixes along with the associated morphological properties. A look-up is performed in a
lexicon to retrieve the set of possible POS tags for each stem. The NG is built from right to left in
a given sentence. As discussed in the previous section, we formulated five sets of constituents
that contain different lexical categories that combine in various ways to form NGs in Hindi. Set 4,
the head, marks the right end of an NG (neglecting any postpositions and particles).
Sets 1, 2 and 3 contain categories which mark the left end of an NG. Processing from right to left,
once the system encounters a Set 4 element, it starts to look for Set 3, Set 2 and Set 1 elements
appearing to the left of the head in that order. By ‘finding a Set X element’, we mean ‘finding a
stem whose potential POS tag list in the lexicon contains a POS tag belonging to Set X.’ The
potential candidates are considered to be members of the NG. As soon as any word of a lexical
category other than those mentioned in Sets 1, 2 and 3 is encountered, the NG is considered
closed. The previous word marks the left end of the NG in such a case. The number, gender and
case information for nouns, demonstratives and pronouns are required at each step to select or
reject a potential POS tag. This information is extracted from the output of the morphological
analyser. The pseudo code for NG identification is given below.
Steps for NG Identification
1. For all tokens, processing goes from right to left
1a. Look for a post-position or a Set 4 element to start an NG
1b. If Set 5 member, i.e., a postposition is found
1b (i) Oblique NG has started
1c. If Set 4 element is found
1c (i) Direct NG has started
1d. If a Demonstrative pronoun is found
1d (i) Consider it as a Pronoun (head)
2. If oblique NG has just started with a Set 5 element, i.e., with a postposition
2a. Look for a Set 4 element
2b. If Set 4 element is not found; find the list of possible POS tags for the
current word
2c. If a POS Tag appears in the possible POS Tags’ list and also in Set 4
2c (i) Assign the tag which is common to both.
2d. If there is no common element in the list and Set 4
2d (i) Assign the tag other than PP to the next word using
the list of possible tags for it.
3. If any NG has started
3a. Look for a Set 3 and/or Set 2 and/or Set 1 element
3b. If Set 3, 2 and 1 elements are found
3b (i) The NG includes the current word
3c. If Set 3, 2 and/or 1 elements are not found
3c (i) The NG has already ended with the previous word
4. If any NG is completely identified
4a. Apply rules to check the agreement between modifiers/qualifiers and their head and
do corrections if necessary
5. Start looking for the next NG
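The right-to-left scan in the steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the tag names (`DEM`, `POSS`, `INTF`, `QUANT`, `ADJ`, `NN`, `PRON`, `PP`), the assignment of tags to Sets 1 to 5, and the function `find_noun_groups` are assumptions, and the agreement check of step 4 is left out.

```python
# Hypothetical tag inventory for Sets 1-5 (an assumption of this sketch).
SET1 = {"DEM", "POSS"}      # specifiers
SET2 = {"INTF", "QUANT"}    # intensifiers and numerals
SET3 = {"ADJ"}              # adjectives
SET4 = {"NN", "PRON"}       # heads
SET5 = {"PP"}               # postpositions
MODIFIER_SETS = [SET3, SET2, SET1]  # order met when moving left from the head


def find_noun_groups(tokens):
    """tokens: list of (word, set of candidate POS tags).

    Returns (left, right, kind) triples with inclusive token indices.
    """
    groups = []
    i = len(tokens) - 1
    while i >= 0:
        right, kind = i, "direct"
        # Step 1b: a postposition with a head to its left opens an oblique NG.
        if tokens[i][1] & SET5 and i > 0 and tokens[i - 1][1] & SET4:
            kind = "oblique"
            i -= 1
        if not tokens[i][1] & SET4:
            i -= 1              # no head here, keep scanning leftwards
            continue
        left, stage = i, 0      # i is the head; now collect Set 3, 2, 1 elements
        j = i - 1
        while j >= 0:
            tags = tokens[j][1]
            nxt = next((s for s in range(stage, 3) if tags & MODIFIER_SETS[s]), None)
            if nxt is None:     # Step 3c: the NG ended with the previous word
                break
            stage, left = nxt, j
            j -= 1
        groups.append((left, right, kind))
        i = left - 1            # Step 5: look for the next NG
    groups.reverse()
    return groups


sentence = [
    ("ve", {"DEM", "PRON"}), ("pūre", {"ADJ"}), ("māmle", {"NN"}), ("ko", {"PP"}),
    ("suljhānā", {"VM"}), ("cāhte", {"VM"}), ("haĩ", {"VAUX"}),
]
print(find_noun_groups(sentence))
```

For the sample sentence the sketch returns one oblique group spanning tokens 0 to 3, i.e. ve pūre māmle ko.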
In what follows, we give an example of how the NGI helps correct a POS Tag error.
13) ve pūr-e māml-e ko suljhā-nā cāh-te haĩ
They whole-obl matter-obl ACC solve-Inf want be-pres-pl
‘they want to solve the whole matter’
For 13 the tagger produces the output as [DEM ADJ NN PP VM VAUX]. ‘ve’ is tagged as a
DEM instead of PRON. Scanning right to left, the NG identified is (ve pūre māmle ko). Now the
computational rules are applied to make any POS corrections required. By the first rule for
oblique NG, reading the rule from right to left, we find a PP ‘ko’ followed by a noun ‘māmle’ in
the oblique case. ‘pūre’ is allowed in the NG as its category and features warrant its being a Set
2 member. ‘ve’ may be a Set 1 member and may mark the left end of the NG and may be a
demonstrative or a pronoun. The tagger may tag ‘ve’ as DEM ‘demonstrative’ but, as a
demonstrative, it does not concord with the head noun for the relevant case feature. Thus, the tag
DEM is rejected and PRON is selected.
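The correction step in this example amounts to a feature comparison between the candidate demonstrative and the head noun. A minimal sketch follows, assuming that the morphological analyser supplies `number` and `case` features as plain strings; the function name `resolve_dem_pron` and the feature values given for ‘ve’ and ‘us’ are assumptions made for illustration.

```python
def resolve_dem_pron(word_feats, head_feats, candidates):
    """Keep DEM only if the word agrees with the head noun in number and case."""
    agrees = all(word_feats[f] == head_feats[f] for f in ("number", "case"))
    if "DEM" in candidates and agrees:
        return "DEM"
    if "PRON" in candidates:
        return "PRON"
    return None


# ‘ve’ (assumed plural, direct) against the head ‘māmle’ (singular, oblique).
ve = {"number": "pl", "case": "direct"}
mamle = {"number": "sg", "case": "oblique"}
print(resolve_dem_pron(ve, mamle, {"DEM", "PRON"}))
```

With these assumed features the demonstrative reading fails the case check, so the sketch falls back to PRON, mirroring the outcome described for 13.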
Hindi verb group must always begin with a main verb root with or without a suffix. Once the
main verb is identified, the verb group is assumed to have begun. Scanning from left to right, the
main verb may be followed by a string of intermediate verbal suffixes and auxiliaries until a
must-end VG marker is encountered. These elements broadly follow the linear order in 14,
though with co-occurrence constraints that are listed towards the end of this section.
14) Verb Root−Infinitive/Passive−Modal Auxiliary−Aspect−Tense−Mood
The three kinds of morphemes are called Start markers, Intermediate markers and Must-end
Markers and are shown in Figure 3. Particles and negation markers are also allowed to appear
inside a VG.
b) Intermediate markers
These markers include two kinds of morphemes, 1) possible-end markers and 2) ‘must continue’
markers. Possible end markers are those which may end a VG such as the perfective marker or
the modal auxiliary for necessity (preceded by an infinitive-gender, number sequence). These
morphemes, however, may be followed by other morphemes to further extend the VG. For
example, the perfective marker may be followed by the past or the present tense auxiliary as in vo
āyā ‘he came’, vo āyā hai ‘he has come’ and vo āyā thā ‘he had come’. Similarly, the modal
auxiliary for necessity may be followed by the past tense auxiliary, such as usko ānā cāhiye (thā)
‘he should (have) come’ and the subjunctive marker may be followed by a future-person, number
marker, such as khā-ũ-g-ā ‘eat-subjunctive-will-person, number’. The ‘must-continue’ markers,
however, must be followed by other verbal morphemes in order to complete the VG. Details of
such markers are given below in Table 1 along with their inflections.
Possible End-Markers
  Modal Auxiliary: चाहिए (cāhie) ‘should’
  Aspect (Perf+gen-num): -या (-yā), -ा (-ā), -आ (-ā), -ी (-ī), -े (-e), -ए (-e), -ई (-ī), -ीं (-ĩ), -ईं (-ĩ)
  Subjunctive: -ूँ (-ũ), -ऊँ (-ũ), -े (-e), -ए (-e), -ा (-ŋ), -ें (-ẽ), -एँ (-ẽ), -ो (-o), -ओ (-o)
Must-Continue Markers
  Aspect: Habitual -त (-t), Progressive रह (rəh), Completive चुक (cuk)
  Modal Auxiliaries: Ability/probability: सक (sək), ability: पा (pā), obligation: पड़ (pəɽ), permission: दे (de)
  Passive: या (-yā)/यी (-yī)/ये (-ye)/जा (-jā)
Must-End Markers
  Future+gen-num: -गा (-gā), -गी (-gī), -गे (-ge)
  Mood (Imperative): null, -ो (-o), -ओ (-o), -िए (-ie), -इए (-ie), -जिए (-jie), -ना (-nā)
  Tense Auxiliary: Present: है (hai), हैं (haĩ); Past: था (thā), थे (the), थी (thī), थीं (thĩ)
  Mood (Conditional): -त- (-t-)
Table 1: Intermediate Markers
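The three marker classes suggest a simple left-to-right scan. The sketch below is one possible reading of that procedure, not the authors' code: each morpheme is assumed to arrive already labelled as `start`, `possible_end`, `must_continue`, `must_end`, `particle` or `other`, gender-number suffixes are assumed to be fused with the marker they follow, and a bare root is assumed to be a complete group (the null imperative).

```python
VERBAL = {"possible_end", "must_continue", "must_end"}


def scan_verb_group(morphs):
    """morphs: list of (form, kind). Returns (start, end) inclusive, or None."""
    start = last = None
    complete = False
    for i, (_form, kind) in enumerate(morphs):
        if start is None:
            if kind == "start":
                start = last = i
                complete = True     # assumption: bare root = null imperative
            continue
        if kind == "particle":      # particles and negation may sit inside a VG
            continue
        if kind not in VERBAL:      # anything else closes the group
            break
        last = i
        complete = kind != "must_continue"
        if kind == "must_end":
            break
    if start is None or not complete:
        return None                 # no VG, or one left dangling on a must-continue marker
    return (start, last)


# vo āyā thā ‘he had come’
print(scan_verb_group([("vo", "other"), ("ā", "start"), ("-yā", "possible_end"), ("thā", "must_end")]))
```

A group that stops on a must-continue marker, such as khā sək with nothing after it, is rejected by returning `None`.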
As shown in 14, the verbal elements appear in a specific order. This ordering is subject to a
number of constraints as listed below:
Specific Constraints within a Hindi VG
a) The modal auxiliary cāhie must be preceded by an infinitive (with gender-number) marker,
such as khā-nā cāhie (खा-ना चाहिए). It may be followed neither by an aspect marker (17) nor by a
present tense or future tense marker (18). It may only be followed by a past tense auxiliary (19).
17) *cāhie rəh/cuk (aspect)
18) *cāhie hai/cāhie-gā (pres, future)
19) cāhie thā (past)
c) The modal auxiliary sək cannot be followed by the perfective marker, the progressive auxiliary
rəh or the completive auxiliary cuk, as shown below in 23-25. It can only be followed by a
habitual aspect marker or by a subjunctive marker as in 26 and 27.
23) *khā sək rəhā hai (progressive)
24) *khā sək-ā hai (perfective)
25) *khā sək cukā hai (completive)
26) khā sək-tā hai (habitual)
27) khā sək-e (subjunctive)
e) The infinitive marker -न- (n), all aspectual markers, the past tense auxiliary, the conditional mood
marker -त- (t) and the future marker -ग- (g) must be followed by a gender-number marker, as in ता (tā), ती (tī), ते
(te), रहा (rəhā), रही (rəhī), रहे (rəhe), चुका (cukā), चुकी (cukī), चुके (cuke), ना (nā), नी (nī), ने (ne),
गा (gā), गी (gī), गे (ge).
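Constraints (a) and (c) above lend themselves to a table-driven check. The following sketch assumes that each element of a candidate VG is given as a (form, category) pair; the category labels and the names `ALLOWED_NEXT`, `REQUIRED_PREV` and `vg_violations` are hypothetical and cover only these two constraints.

```python
# What may directly follow a given auxiliary (constraints a and c).
ALLOWED_NEXT = {
    "cāhie": {"past_aux"},
    "sək": {"habitual", "subjunctive"},
}
# What must directly precede a given auxiliary (constraint a).
REQUIRED_PREV = {"cāhie": {"infinitive"}}


def vg_violations(seq):
    """seq: list of (form, category). Returns a list of (form, problem, category)."""
    problems = []
    for i, (form, _category) in enumerate(seq):
        prev_cat = seq[i - 1][1] if i > 0 else None
        next_cat = seq[i + 1][1] if i + 1 < len(seq) else None
        if form in REQUIRED_PREV and prev_cat not in REQUIRED_PREV[form]:
            problems.append((form, "bad predecessor", prev_cat))
        if form in ALLOWED_NEXT and next_cat is not None and next_cat not in ALLOWED_NEXT[form]:
            problems.append((form, "bad successor", next_cat))
    return problems


# 26) khā sək-tā hai is accepted; 23) *khā sək rəhā hai is flagged.
ok = [("khā", "root"), ("sək", "modal"), ("tā", "habitual"), ("hai", "present_aux")]
bad = [("khā", "root"), ("sək", "modal"), ("rəhā", "progressive"), ("hai", "present_aux")]
print(vg_violations(ok), vg_violations(bad))
```

Keeping the constraints in data rather than in code makes it straightforward to add further entries for other auxiliaries.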
33) kər [cuk-ā de-g-ā]
tax pay-masc,sg give-fut-masc,sg
‘will pay the tax’
In 31, rəh appears as the progressive aspectual auxiliary as well as a main verb (‘live’). Often a
POS tagger is unable to resolve this ambiguity in the absence of contextual information. In 32, kər
is ambiguous between being a verb and a noun. As a main verb, it means ‘do’ and as a noun, it
means ‘tax’. In order to resolve this POS ambiguity, the system requires the information that
when cuk appears as an auxiliary and is followed by a tense auxiliary, it requires a preceding main
verb. This information rules out the possible tag Noun and leaves Main Verb as the correct one.
However, this information may yield a faulty analysis for the expression in 33. The system will consider
kər to be a part of the VG and will output the VG as kər cukā degā. We require a morphotactical
constraint that prevents the completive aspectual auxiliary cuk from being followed by the modal
auxiliary de. We must note that these constraints are ad hoc and may not always produce correct
POS tags.
Secondly, suffixes too may be ambiguous. For example, ‘-t’ attached to the stem ā (come) may
indicate either the habitual aspect or the conditional mood. This ambiguity may be resolved by
using the regular expression and by looking at the next morpheme. For example, in 34, the suffix
-t- is rejected as being a conditional mood marker as it belongs to the category of must-end
markers and cannot be followed by any other verb morpheme (except for the gender-number
marker). On the other hand, the habitual -t- may be followed by a tense auxiliary.
34) bādəl roz [ā-te the]
Clouds everyday come-hab be-past
‘Clouds used to form everyday’
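The lookahead used to disambiguate -t- can be sketched as follows. The function `classify_t` and the category labels are assumptions of this sketch; it follows the classification in Table 1, where the conditional -t- is a must-end marker and the habitual -t- is a must-continue marker.

```python
def classify_t(following):
    """following: categories of the morphemes after -t-, e.g. ["gen_num", "tense_aux"]."""
    # The gender-number marker may follow either reading, so it is ignored.
    rest = [m for m in following if m != "gen_num"]
    # Any further verbal morpheme rules out the must-end (conditional) reading.
    return "habitual" if rest else "conditional"


# 34) ā-te the: -t- is followed by a gender-number marker and a tense auxiliary.
print(classify_t(["gen_num", "tense_aux"]))
```

When nothing but the gender-number marker follows, the sketch returns the conditional reading, since a habitual -t- would leave the group incomplete.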
During the process of VG identification, feature agreement among elements of the group is also
checked. Many invalid sequences are rejected using feature combination rules. For example,
‘bhāī thā’ in 36 (unlike in 35) cannot be a verb group. It is instead a noun-verb sequence, since the
masculine gender of the tense auxiliary thā does not agree with the main verb (bhā
‘like’) marked for feminine gender using -ī. On the other hand, bhāī (brother) may be a noun with
which the gender of the verb (masculine) agrees. The VG identifier thus rejects the Verb tag for
the word bhāī and retags it as a Noun.
35) vo merā bhāī thā
he my brother be-past-masc
‘He was my brother’
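The agreement check that rejects the verb reading of bhāī can be illustrated with a short sketch. The feature dictionaries and the function `choose_tag` are hypothetical; the analyser is assumed to return one feature bundle per candidate tag.

```python
def choose_tag(analyses, aux_feats):
    """analyses: dict mapping a candidate tag to its features.

    The VERB reading survives only if it agrees with the following tense
    auxiliary in gender and number; otherwise the NOUN reading is chosen.
    """
    verb = analyses.get("VERB")
    if verb is not None and all(verb[f] == aux_feats[f] for f in ("gender", "number")):
        return "VERB"
    return "NOUN" if "NOUN" in analyses else None


# bhāī: either bhā ‘like’ + feminine -ī, or the masculine noun ‘brother’.
bhai = {
    "VERB": {"gender": "fem", "number": "sg"},
    "NOUN": {"gender": "masc", "number": "sg"},
}
tha = {"gender": "masc", "number": "sg"}
print(choose_tag(bhai, tha))
```

Because the feminine verb reading clashes with masculine thā, the sketch selects NOUN for bhāī.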
37) un-kī yojnā shāntipūrnə uddeshy-õ ke liye hai
their plan peaceful aims-obl for be-pres
‘Their plan is for peaceful aims’
5 Performance Evaluation
We use a CRF based POS Tagger. Without NGI/VGI, the features used for the POS-Tagger
include (a) Tag ambiguity scheme from the dictionary, (b) suffix given by the stemmer, (c) prefix
and suffix character streams of size one and two, (d) previous word’s suffix and (e) tag ambiguity
scheme for previous and next word. We tried NG and VG identification at two different places,
before and after CRF. When the NGI/VGI module is run before CRF, its output is used as
features supplied to CRF and the tags assigned by CRF are considered final. The tag ambiguity
scheme of the NG/VG members is simply replaced by the tags given by NGI/VGI modules. On
the other hand, when NGI/VGI follows CRF, then NGI/VGI overwrites the tags assigned by CRF.
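The two placements differ only in where the group tags enter the pipeline. The sketch below illustrates that difference under stated assumptions: `group_tags` is a hypothetical mapping from token index to the tag chosen by NGI/VGI, the ambiguity scheme is represented as a `|`-joined string of candidate tags, and no actual CRF library is involved.

```python
def ambiguity_features_before_crf(candidates, group_tags):
    """Pre-CRF: replace the tag ambiguity scheme of NG/VG members by the group tag."""
    return [
        group_tags.get(i, "|".join(sorted(tags)))
        for i, tags in enumerate(candidates)
    ]


def overwrite_after_crf(crf_tags, group_tags):
    """Post-CRF: tags assigned by NGI/VGI overwrite the CRF output."""
    return [group_tags.get(i, tag) for i, tag in enumerate(crf_tags)]


candidates = [{"DEM", "PRON"}, {"ADJ"}, {"NN"}, {"PP"}]
group_tags = {0: "PRON"}
print(ambiguity_features_before_crf(candidates, group_tags))
print(overwrite_after_crf(["DEM", "ADJ", "NN", "PP"], group_tags))
```

In the first case the CRF still has the final say, since the group tag is only a feature; in the second case the rule-based tag is final.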
The Hindi POS Tagger was tested on a corpus of 66,990 words, which is a subset of the BBC
Hindi news corpus (downloaded from http://www.bbc.co.uk/hindi) and the IIIT Hyderabad
corpus. We partitioned the corpus into four testing folds. The accuracy of the CRF based POS
Tag system using Verb Group and Noun Group Identification rules, averaged over the four folds, is as
follows:
Experiment Average Accuracy of 4 folds
CRF 95.18%
Here, the verb group is identified as (diyā gəyā hai). (kər) is marked as a verb whereas in 40, it
appears as a noun.
40) mætʃ kā kər diyā gəyā hai
match of tax give-past has been
‘Tax has been given/paid for the match’
The verb group is identified as (kehnā hai), whereas (kehnā) is a noun which should co-occur with
the preceding possessive. But the possessive pronoun (unkā) is not adjacent to (kehnā).
42) tīm ne spænish līg lā līg kā khitāb jītā
Team-ERG Spanish League Lā Liga of prize win-past
‘The team won the Spanish League La Liga title’
In 42, (La Liga) is a proper name, but as per the morphological analysis (lā) only qualifies to be a
verb.
In summary, even with detailed rules for NG and VG identification, there is little improvement in
the accuracy of the tagger because (1) our Morphological Analyzer (MA) is not able to analyze compounds
(both verbs and nouns) and conjunct verbs as single units unless they are stored in the
lexicon, (2) some of the tags show real ambiguity in a given sentence, and (3) the
MA fails to recognize and analyse unknown or foreign words that are not listed in the lexicon.
Our results compare favourably with the 93.45% accuracy reported in Singh et al. (2006) for a
CN2 based tagger for the Hindi BBC news corpus. Gune et al. (2010) report 94% accuracy for
CRF on Marathi using a corpus of size 20K. They did not implement NGI but only VGI. They
found that use of VGI did not improve the accuracy since not much VM-VAUX ambiguity (their
main focus) remained after applying CRF.
6 Conclusions
We have presented algorithms to identify Hindi Noun and Verb Groups by using morphotactical
information and the constraints that apply to the constituents of these groups. We also provided
the list of grammatical categories and their markers that may appear inside a group and discussed
ways in which these markers may be arranged. Group Identification enabled the resolution of
major POS ambiguities. The identified groups may also be used at a later stage, i.e., in parsing or
in language generation. We cannot handle all the POS ambiguous cases (that involve scrambling
or those that are structurally ambiguous) where immediate contextual rules do not help. However,
using the ordering among the major categories and their possible combinations, we have tried to
present ways that can be applied to other languages equally well. The methods are especially
beneficial for languages with meagre corpora or other NLP resources. Since a system will not be
able to learn patterns that might be absent in small training corpora, with the use of morphological
patterns that govern the ordering of the elements inside a group, a large number of ambiguities
and errors may be avoided at a first pass.
Acknowledgements: We would like to thank Nikhilesh Sharma and Neha Gupta for
implementing the group identification system discussed here.
References
Abney, S. (1994). “Parsing by Chunks.” In Principle-Based Parsing, eds. B. Berwick, S. Abney,
and C. Tenny, 257-278. Dordrecht: Kluwer Academic Publishers.
Baskaran S. (2006). “Hindi POS Tagging and Chunking.” In the Proceedings of NLPAI Machine
Learning Contest. Mumbai, India, June.
Begum, R., Jindal K., Jain A., Husain S., Sharma D. (2011). “Identification of Conjunct verbs in
Hindi and its effect on Parsing Accuracy.” 12th International Conference on Intelligent Text
Processing and Computational Linguistics (CICLing).
Bharati, A., Chaitanya V. and Sangal R. (1995). “Natural Language Processing: A Paninian
Perspective.” New Delhi: Prentice-Hall of India.
Chakrabarti, D., Mandalia H., Priya R., Sarma V. and Bhattacharyya P. (2008). “Hindi
Compound Verbs and their Automatic Extraction”, In Proc. of Computational Linguistics
Conference (COLING), Manchester, UK.
Chakrabarty, D., Sarma V. and Bhattacharyya P. (2007). “Complex Predicates in Indian Language
Wordnets.” Lexical Resources and Evaluation Journal, 40 (3-4).
Dalal, A., Nagaraj K., Sawant U. and Shelke S. (2006). “Hindi Part-of-Speech Tagging and
Chunking: A Maximum Entropy Approach.” In the Proceedings of the NLPAI Machine Learning
Workshop on Part Of Speech and Chunking for Indian Languages . Mumbai, India.
Grover, C. and Tobin R. (2006). “Rule-based Chunking and Reusability.” In the Proceedings of
LREC 2006 , 873-878. Genoa, Italy.
Gune H., Bapat M., Khapra M. and Bhattacharyya P. (2010). “Verbs are where all the Action
Lies: Experiences of Shallow Parsing of a Morphologically Rich Language”, Computational
Linguistics Conference (COLING), Beijing, China.
Halliday, M. A. K. (1977). “Text as Semantic Choice in Social Contexts.” In Grammar and
Descriptions (Studies in Text Theory and Text Analysis), eds. T. A. van Dijk and J. Petofi, 176-
225. New York: Walter de Gruyter.
Vijay S. and Sobha D. (2010). “Noun Phrase Chunker Using Finite State Automata for an
Agglutinative Language.” In the Proceedings of the Tamil Internet Conference.