Urdu Grammar
Proceedings of the 8th Workshop on Asian Language Resources, pages 153–160,
Beijing, China, 21-22 August 2010.
© 2010 Asian Federation for Natural Language Processing
makes it possible to parse text in one language and translate it to multiple languages.

Grammars in GF can be roughly classified into two kinds: resource grammars and application grammars. Resource grammars are general-purpose grammars (Ranta, 2009a) that try to cover the general aspects of a language linguistically and whose abstract syntax encodes syntactic structures. Application grammars, on the other hand, encode semantic structures, but in order to be accurate they are typically limited to specific domains. However, they are not written from scratch for each domain; they use resource grammars as libraries (Ranta, 2009b).

Previously GF had resource grammars for 16 languages: English, Italian, Spanish, French, Catalan, Swedish, Norwegian, Danish, Finnish, Russian, Bulgarian, German, Interlingua (an artificial language), Polish, Romanian and Dutch. Most of these are European languages. We developed a resource grammar for Urdu, making it the 17th in total and the first South Asian language. Resource grammars for several other languages (e.g. Arabic, Turkish, Persian, Maltese and Swahili) are under construction.

3. Morphology

syntax shows how these words (parts of speech) are grouped together to build well-formed phrases. In this section we show how this works and is implemented for Urdu.

4.1 Noun Phrases (NP)

When nouns are used in sentences as parts of speech, several linguistic details need to be considered. For example, other words can modify a noun, and nouns have characteristics such as gender and number. When all such required details are grouped together with the noun, the resulting structure is known as a noun phrase (NP). The basic structure of the Urdu noun phrase is “(M) H (M)” according to (Butt M., 1995), where (M) is a modifier and (H) is the head of the NP. The head is compulsory, while modifiers are optional. In Urdu, modifiers are of two types: pre-modifiers, which come before the head, for instance (کالی بلی kali: bli: “black cat”), and post-modifiers, which come after the head, for instance (تم سب tm sb “you all”). In the GF resource library we represent NP as a record

  lincat NP : Type = {s : NPCase => Str ; a : Agr} ;
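As a rough illustration of the NP record, the following Python sketch models the field ‘s’ as a dictionary from case to string and the field ‘a’ as an agreement tuple. It is not GF code and not the grammar's implementation. The case names and markers are the ones listed in this section, and the noun forms lɽka and lɽke: are taken from the examples in section 4.2. The function name make_np, the "Dir"/"Obl" labels, the shape of the agreement tuple and the rule that every marker follows the oblique form are assumptions made for the sketch.

```python
# Toy model of {s : NPCase => Str ; a : Agr}; a dict stands in for the table.
# Case markers are copied in romanized form from the case list in this section.
CASE_MARKERS = {
    "NPErg": "ne:",
    "NPAbl": "se:",
    "NPIns": "se:",
    "NPLoc1": "mi:ɳ",
    "NPLoc2": "pr",
    "NPDat": "kʋ",
    "NPAcc": "kʋ",
}


def make_np(direct, oblique, agr):
    """Build an NP record from a noun's direct and oblique forms."""
    # NPC keeps the noun's own case (hypothetical labels "Dir" and "Obl").
    s = {("NPC", "Dir"): direct, ("NPC", "Obl"): oblique}
    for case, marker in CASE_MARKERS.items():
        # Assumption: the marker is a separate word after the oblique form.
        s[case] = f"{oblique} {marker}"
    return {"s": s, "a": agr}


boy = make_np("lɽka", "lɽke:", ("Masc", "Sg", "Pers3"))
print(boy["s"]["NPErg"])  # lɽke: ne:
```

Selecting a form is then a plain lookup, which mirrors the role of the selection operator in the GF record.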
through morphological suffixes and are thus handled at syntactic level (Butt et al., 2002). Here we create different forms of a noun phrase to handle case markers for Urdu nouns. Here is a short description of the different cases of NP:

• NPC Case: this is used to retain the original case of the noun
• NPErg: Ergative case with case marker ‘ne:’ نے
• NPAbl: Ablative case with case marker ‘se:’ سے
• NPIns: Instrumental case with case marker ‘se:’ سے
• NPLoc1: Locative case with case marker ‘mi:ɳ’ میں
• NPLoc2: Locative case with case marker ‘pr’ پر
• NPDat: Dative case with case marker ‘kʋ’ کو
• NPAcc: Accusative case with case marker ‘kʋ’ کو

And ‘a’ (Agr in the NP record type given above) is the agreement feature of the noun, which is used for selecting the appropriate form of other categories that agree with nouns.

A noun is converted to an intermediate category, common noun (CN; also known as N-Bar), which is then converted to the NP category. CN deals with nouns and their modifiers. As an example consider adjectival modification:

  fun AdjCN : AP -> CN -> CN ;

  lin AdjCN ap cn = {
    s = \\n,c =>
      ap.s ! n ! cn.g ! c ! Posit ++ cn.s ! n ! c ;
    g = cn.g
  } ;

The linearization of AdjCN gives us common nouns such as (ٹھنڈا پانی ʈȹn ɖa pani: “cold water”), where a CN (پانی pani: “water”) is modified by an AP (ٹھنڈا ʈȹn ɖa “cold”). Since Urdu adjectives also inflect in number, gender, case and degree, we need to concatenate the appropriate form of the adjective that agrees with the common noun. This is ensured by selecting the corresponding forms of the adjective and the common noun from their inflection tables using the selection operator (‘!’). Since CN does not inflect in degree but the adjective does, we fix the degree to be positive (Posit) in this construction. Other possible modifiers include adverbs, relative clauses, and appositional attributes.

A CN can be converted to an NP using different functions: common nouns with determiners; proper names; pronouns; and bare nouns as mass terms:

  fun DetCN : Det -> CN -> NP ;     -- e.g. the boy
  fun UsePN : PN -> NP ;            -- e.g. John
  fun UsePron : Pron -> NP ;        -- e.g. he
  fun MassNP : CN -> NP ;           -- e.g. milk

These different ways of building NPs, which are common to different languages, are defined in the abstract syntax of the resource grammar, but the linearization of these functions is language dependent and is therefore defined in the concrete syntaxes.

4.2 Verb Phrases (VP)

A verb phrase is a single word or a group of words that acts as a predicate. In our construction the Urdu verb phrase has the following structure

  lincat VP = {
    s : VPHForm => {fin, inf : Str} ;
    obj : {s : Str ; a : Agr} ;
    vType : VType ;
    comp : Agr => Str ;
    embComp : Str ;
    ad : Str } ;

where

  VPHForm =
      VPTense VPPTense Agr
    | VPReq HLevel | VPStem

and

  VPPTense = VPPres | VPPast | VPFutr ;
  HLevel = Tu | Tum | Ap | Neutr
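To make the shape of the ‘s’ table of a VP concrete, the sketch below enumerates the values of VPHForm in Python and stores one cell of the table. It is only an illustration, not the grammar's code. The constructor names come from the VPHForm, VPPTense and HLevel definitions in this section; the gender, number and person values used for Agr are assumptions (the grammar's actual Agr type may be richer), and the single filled cell uses the form dʋɽti: he: from the example sentence “she runs” in this section.

```python
from itertools import product

VPPTENSE = ["VPPres", "VPPast", "VPFutr"]
HLEVEL = ["Tu", "Tum", "Ap", "Neutr"]

# Assumed agreement space: 2 genders x 2 numbers x 3 persons.
GENDER = ["Masc", "Fem"]
NUMBER = ["Sg", "Pl"]
PERSON = ["Pers1", "Pers2", "Pers3"]
AGR = list(product(GENDER, NUMBER, PERSON))


def vph_forms():
    """List every VPHForm value: VPTense VPPTense Agr | VPReq HLevel | VPStem."""
    forms = [("VPTense", tense, agr) for tense in VPPTENSE for agr in AGR]
    forms += [("VPReq", level) for level in HLEVEL]
    forms.append(("VPStem",))
    return forms


FORMS = vph_forms()

# One cell of the table: 'fin' holds the copula, 'inf' the verb form.
s_table = {
    ("VPTense", "VPPres", ("Fem", "Sg", "Pers3")): {"fin": "he:", "inf": "dʋɽti:"},
}

print(len(FORMS))  # 41
```

Under these assumptions the table has 3 × 12 + 4 + 1 = 41 rows per verb phrase, each VPTense row carrying a {fin, inf} pair.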
In the GF representation a VP is a record with different fields. The most important field is ‘s’, which is an inflection table and stores the different forms of the verb.

At VP level we define Urdu tenses by using a simplified tense system, which has only three tenses, named VPPres, VPPast and VPFutr. In the case of VPTense, for every possible combination of VPPTense and agreement (gender, number, person) a tuple of two string values {fin, inf : Str} is created. ‘fin’ stores the copula (auxiliary verb), and ‘inf’ stores the corresponding form of the verb. VPStem is a special tense which stores the root form of the verb. This form is used to create the full set of Urdu tenses at clause level (tenses in which the root form of the verb is used, i.e. perfective and progressive tenses). Handling tenses at clause level rather than at verb phrase level simplifies the VP and results in a more efficient grammar.

The resource grammar has a common API which has a much simplified tense system, close to that of Germanic languages. It is divided into tense and anteriority. There are only four tenses, named present, past, future and conditional, and two possibilities of anteriority (Simul, Anter). This gives 8 combinations. This abstract tense system does not cover all the tenses in Urdu. We have covered the rest of the tenses at clause level; even though these tenses are not accessible through the common API, they can still be used in language-specific modules.

Other forms for verb phrases include the request form (VPReq) and the imperative form (VPImp). There are four levels of requests in Urdu. Three of them correspond to the (تو tʋ, تم tm, آپ a:p) honor levels and the fourth is neutral with respect to honorific levels.

The Urdu VP is a complex structure that has different parts: the main part is a verb, and then there are other auxiliaries attached to the verb. For example, an adverb can be attached to a verb as a modifier. We have a special field ‘ad’ in our VP representation. It is a simple string that can be attached to the verb to build a modified verb.

In Urdu the complement of a verb precedes the actual verb, e.g. (وہ دوڑنا چاہتی ہے ʋo dʋɽna tʃahti: he: “she wants to run”); here (دوڑنا dʋɽna “run”) is the complement of the verb (چاہنا tʃahna “want”). The exception is the case where a sentence or a question is the complement of the verb. In that case the complement of the verb comes at the very end of the clause, e.g. (وہ کہتا ہے کہ وہ دوڑتی ہے ʋo khta he: kh ʋo dʋɽti: he: “he says that she runs”). We have two different fields named ‘comp’ and ‘embComp’ in the VP to deal with these different situations.

The ‘vType’ field is used to store information about the type of a verb. In Urdu a verb can be transitive, intransitive or double-transitive (Schmidt R. L., 1999). This information is important when dealing with ergativity in verb agreement. The information about the object of the verb is stored in the ‘obj’ field. All this information that a VP carries is used when the VP is used in the construction of a clause.

A distinguishing feature of Urdu verb agreement is ‘ergativity’. Urdu is one of those languages that show split ergativity at verb level. Final verb agreement is with the direct subject except in the transitive perfective tense. In the transitive perfective tense verb agreement is with the direct object. In this case the subject takes the ergative construction (subject with the addition of the ergative case marker نے ne:).

However, while the verb shows ergative behavior in the simple past tense, for the other perfective tenses (e.g. immediate past, remote past etc.) there are two different approaches. In the first one the auxiliary verb (چکا tʃka) is used to make clauses. If (tʃka) is used, the verb does not show ergative behavior and final verb agreement is with the direct subject. Consider the following example

  لڑکا کتاب خرید چکا ہے
  lɽka[Direct] ktab[Direct] xri:d[Root] tʃka[aux_verb] he:
  The boy has bought a book.

The second way to make the same clause is

  لڑکے نے کتاب خریدی ہے
  lɽke: ne:[Erg] ktab[Direct_Fem] xri:di:[Direct_Fem] he:
  The boy has bought a book.

In the first case the subject (لڑکا lɽka “boy”) is in the direct case and the auxiliary verb agrees with the subject, but in the second case the verb is in agreement with the object and the ergative case of the subject is used. However, in the current implementation we follow the first approach.
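The agreement rule described above can be summarized as a small decision function. The Python sketch below is an illustration of the prose only, not the GF code of the grammar. The function name, the tense labels and the use_chuka flag are hypothetical; the default use_chuka=True corresponds to the first approach (auxiliary tʃka), which the text says the current implementation follows.

```python
# Hypothetical labels for the "other perfective tenses" named in the text.
OTHER_PERFECTIVE = {"immediate_past", "remote_past"}


def agreement(transitive, tense, use_chuka=True):
    """Return (subject_case, verb_agrees_with) for a clause.

    transitive: whether the verb is transitive (cf. the 'vType' field).
    tense: "simple_past", one of OTHER_PERFECTIVE, or any other label.
    use_chuka: build other perfective tenses with the auxiliary tʃka.
    """
    if transitive and tense == "simple_past":
        # Simple past of a transitive verb: ergative subject, object agreement.
        return ("NPErg", "object")
    if transitive and tense in OTHER_PERFECTIVE and not use_chuka:
        # Second approach: ergative subject, verb agrees with the object.
        return ("NPErg", "object")
    # Everything else, including perfectives built with tʃka.
    return ("NPC Dir", "subject")


# lɽka ktab xri:d tʃka he:  (first approach)
print(agreement(True, "immediate_past"))
# lɽke: ne: ktab xri:di: he:  (second approach)
print(agreement(True, "immediate_past", use_chuka=False))
```

The first call yields a direct-case subject with subject agreement, the second an ergative subject with object agreement, matching the two example clauses.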
In the concrete syntax we ensure this ergative behavior through the following code segment in GF. However, the code given here is just the segment of the code that is relevant.

where V is the morphological category and VP is the syntactic category. There are other ways to make a VP from other categories, or combinations of categories. For example

  fun AdvVP : VP -> Adv -> VP ;

An adverb can be attached to a VP to make an adverbially modified VP. For example (یہاں i:haɳ)

4.3 Adjective Phrases (AP)

Adjectives (A) are converted into the much richer category adjectival phrases (AP) at syntax level. The simplest function to convert is

  fun PositA : A -> AP ;

Its linearization is very simple, since in our case AP is similar to A, e.g.

  lin PositA a = a ;

There are other ways of making an AP, for example

  fun ComparA : A -> NP -> AP ;

When a comparative AP is created from an adjective and an NP, the constant “se:” سے is used between the oblique form of the noun and the adjective. For example linearization of above function is

  lincat Clause : Type = {s : VPHTense => Polarity => Order => Str} ;

Here VPHTense represents the different tenses in Urdu. Even though the current abstract level of the common API does not cover all tenses of Urdu, we cover them at clause level, and they can be accessed through a language-specific module. So, VPHTense is of the following type

  VPHTense = VPGenPres | VPPastSimple
    | VPFut | VPContPres
    | VPContPast | VPContFut
    | VPPerfPres | VPPerfPast
    | VPPerfFut | VPPerfPresCont
    | VPPerfPastCont
    | VPPerfFutCont | VPSubj

Polarity is used to make positive and negative sentences; Order is used to make simple and interrogative sentences. These parameters are of the following forms

  Polarity = Pos | Neg
  Order = ODir | OQuest

The PredVP function will create clauses with variable tense, polarity and order, which are fixed at sentence level by different functions, one is.

Here Temp is a syntactic category in the form of a record having fields for Tense and Anteriority. Tense in the Temp category refers to the abstract-level Tense, and we just map it to Urdu tenses by selecting the appropriate clause. This will create a simple declarative sentence; other forms of sentences (e.g. question sentences) are handled in the Question categories of GF, which follow next.

4.5 Question Clauses and Question Sentences

  fun IdetQuant : IQuant -> Num -> IDet ;
  fun PrepIP : Prep -> IP -> IAdv ;

As an example consider the translation of the following sentence from English to Urdu, to see how our proposed system works at different levels.

  He drinks hot milk.

Figure 1 shows the parse tree for this sentence. As a resource grammar developer our goal is to provide the correct concrete-level linearization of this tree for Urdu.
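Figure 1 is not reproduced here. As a rough sketch of the idea, the Python code below encodes one plausible abstract tree for “He drinks hot milk” as nested tuples and linearizes it with two toy concrete syntaxes. The constructors PredVP, UsePron, ComplSlash, MassNP, AdjCN and PositA are the functions named in this paper; SlashV2 and UseN, the lexical identifiers, the choice of MassNP and the whole linearize function are assumptions made for the illustration. The Urdu word forms are the romanized forms used in this section, stored as fixed strings, so the sketch ignores agreement and inflection entirely.

```python
# Abstract tree as nested tuples: (function, argument, ...).
TREE = (
    "PredVP",
    ("UsePron", "he_Pron"),
    ("ComplSlash",
     ("SlashV2", "drink_V2"),
     ("MassNP", ("AdjCN", ("PositA", "hot_A"), ("UseN", "milk_N")))),
)

# Two toy lexicons sharing the same abstract identifiers.
LEXICON = {
    "Eng": {"he_Pron": "he", "drink_V2": "drinks",
            "hot_A": "hot", "milk_N": "milk"},
    "Urd": {"he_Pron": "ʋh", "drink_V2": "pi:ta he:",
            "hot_A": "grm", "milk_N": "dʋdȺ"},
}


def linearize(tree, lang):
    """Turn an abstract tree into a string for the given toy language."""
    if isinstance(tree, str):
        return LEXICON[lang][tree]
    fun, *args = tree
    parts = [linearize(arg, lang) for arg in args]
    if fun == "ComplSlash" and lang == "Urd":
        # Urdu word order: the object precedes the verb.
        verb, obj = parts
        return f"{obj} {verb}"
    return " ".join(parts)


print(linearize(TREE, "Eng"))  # he drinks hot milk
print(linearize(TREE, "Urd"))  # ʋh grm dʋdȺ pi:ta he:
```

The same tree yields both strings; only the per-language rule for ComplSlash and the lexicon differ, which is the division of labor between abstract and concrete syntax that the resource grammar relies on.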
A NP is constructed from this CN by one of the NP construction rules (see section 4.1 for details). A VPSlash (object-missing VP) is built from a two-place verb (پیتا pi:ta “drinks”). This VPSlash is then converted to a VP through the function

  fun ComplSlash : VPSlash -> NP -> VP ;

The resulting VPSlash and NP are grouped together to make a VP (گرم دودھ پیتا ہے grm dʋdȺ pi:ta he: “drinks hot milk”). Finally the clause (وہ گرم دودھ پیتا ہے ʋh grm dʋdȺ pi:ta he: “he drinks hot milk”) is built from the NP (وہ ʋh “he”), which is built from the pronoun (وہ ʋh “he”), and the VP (گرم دودھ پیتا ہے grm dʋdȺ pi:ta he: “drinks hot milk”). The language-dependent concrete syntax ensures that the correct forms of words are selected from the lexicon and that word order follows the rules of that specific language, while morphology makes sure that the correct forms of words are built during lexicon development.

6. An application: Attempto

implemented in a different but related formalism.

Like the GF resource library, the Pargram project (Butt and King, 2007) aims at building a set of parallel grammars including Urdu. The grammars in Pargram are connected with each other by transfer functions, rather than a common representation. Further, the Urdu grammar is still one of the least implemented grammars in Pargram at the moment. This project is based on the theoretical framework of lexical functional grammar (LFG).

Other than Pargram, most work is based on LFG, and translation is unidirectional, i.e. from English to Urdu only. For instance, an English to Urdu MT system is developed under the Urdu Localization Project (Hussain, 2004), (Sarfraz and Naseem, 2007) and (Khalid et al., 2009). Similarly, (Zafar and Masood, 2009) reports another English-Urdu MT system developed with an example-based approach. On the other hand, (Sinha and Mahesh, 2009) presents a strategy for deriving Urdu sentences from an English-Hindi MT system. However, it seems to be a partial solution to the problem.
3 http://www.grammaticalframework.org/lib/doc/synopsis.html
building domain specific application grammars including multilingual dialogue systems, controlled language translation, software localization etc. Since a common API for multiple languages is provided, this grammar is useful in applications where we need to parse and translate text from one language to many others.

However, our approach of a common abstract syntax has its limitations and does not cover all aspects of the Urdu language. This is why it is not possible to use our grammar for arbitrary text parsing and generation.

10. References

Angelov K. and Ranta A. 2010. Implementing controlled Languages in GF. Controlled Natural Language (CNL) 2009, LNCS/LNAI Vol. 5972 (To appear).

Attempto 2008. Project Homepage. attempto.ifi.uzh.ch/site/

Butt M. 1995. The Structures of Complex Predicate in Hindi. Stanford: CSLI Publications.

Butt M., Dyvik H., King T. H., Masuichi H., and Rohrer C. 2002. The Parallel Grammar Project. In Proceedings of COLING-2002 Workshop on Grammar Engineering and Evaluation. pp. 1-7.

Butt M. and King T. H. 2007. Urdu in a Parallel Grammar Development Environment. In T. Takenobu and C.-R. Huang (eds.) Language Resources and Evaluation: Special Issue on Asian Language Processing: State of the Art Resources and Processing 41:191-207.

Forsberg M. and Ranta A. 2004. Functional Morphology. Proceedings of the Ninth ACM SIGPLAN International Conference of Functional Programming, Snowbird, Utah.

Humayoun M., Hammarström H., and Ranta A. 2007. Urdu Morphology, Orthography and Lexicon Extraction. CAASL-2: The Second Workshop on Computational Approaches to Arabic Script-based Languages, July 21-22, 2007, LSA 2007 Linguistic Institute, Stanford University.

Hussain S. 2004. Urdu Localization Project. COLING: Workshop on Computational Approaches to Arabic Script-based Languages, Geneva. pp. 80-81.

Khalid U., Karamat N., Iqbal S. and Hussain S. 2009. Semi-Automatic Lexical Functional Grammar Development. Proceedings of the Conference on Language & Technology 2009.

Masica C. 1991. The Indo-Aryan Languages. Cambridge: Cambridge University Press, ISBN 9780521299442.

Ranta A. 2004. Grammatical Framework: A Type-Theoretical Grammar Formalism. The Journal of Functional Programming 14(2):145–189.

Ranta A. 2009a. The GF Resource Grammar Library: A systematic presentation of the library from the linguistic point of view. To appear in the on-line journal Linguistics in Language Technology.

Ranta A. 2009b. Grammars as Software Libraries. From Semantics to Computer Science, Cambridge University Press, Cambridge, pp. 281-308.

Rizvi S. M. J. 2007. Development of Algorithms and Computational Grammar of Urdu. Department of Computer & Information Sciences, Pakistan Institute of Engineering and Applied Sciences, Nilore, Islamabad, Pakistan.

Sarfraz H. and Naseem T. 2007. Sentence Segmentation and Segment Re-Ordering for English to Urdu Machine Translation. In Proceedings of the Conference on Language and Technology, August 07-11, 2007, University of Peshawar, Pakistan.

Schmidt R. L. 1999. Urdu: An Essential Grammar. Routledge Grammars.

Sinha R. and Mahesh K. 2009. Developing English-Urdu Machine Translation via Hindi. Third Workshop on Computational Approaches to Arabic Script-based Languages (CAASL3), in conjunction with the Twelfth Machine Translation Summit, Ottawa, Ontario, Canada.

Zafar M. and Masood A. 2009. Interactive English to Urdu Machine Translation using Example-Based Approach. International Journal on Computer Science and Engineering Vol. 1(3), pp. 275-282.