A morphological analyzer is a fundamental tool in Natural Language Processing (NLP) that generates the morphological analyses of a given word-form. It can be used to enhance the accuracy of POS tagging, chunking, syntactic parsing, Word Sense Disambiguation (WSD), Information Retrieval (IR) and Machine Translation (MT) systems. This paper describes an ongoing effort to develop a Nepali morphological analyzer using the open-source platform Apertium (LT-Toolbox). Since this is the initial stage of the project, we have confined our work to inflectional morphology. So far, we have covered all the possible categories as per the LDC-IL POS tag-set for Nepali. Currently, the Nepali morphological analyzer covers 20,000 words, classified into 219 paradigms.
The present paper explores passives in Kashmiri, a Northwestern Dardic language of the Indo-Aryan family. Although Kashmiri has some special features, such as the V-2 phenomenon and pronominal clitics, it has an analytic passive construction like its Indo-Aryan counterparts. The internal argument surfaces as the subject of the passive, where the participial/infinitival form -nI is added to the verb root, followed by the periphrastic auxiliary yun ‘to come’ in the perfective form. The agent of the action is marked by athi or zaryi (‘by/through’) and is preferably omitted. This optionality casts doubt on its status: is it an adjunct or an argument? The promotion of the internal argument to the subject position is another key issue. The present paper investigates these issues and claims that the Kashmiri passive construction is also a kind of ACTIVE-Passive and not a true passive as in English. It is argued that in Kashmiri passives the underlying subject remains an active subject and the underlying object does not become the surface subject. To support this claim, tests based on anaphora binding, pronominal co-reference, control, etc. are applied.
A treebank is a basic language resource for training and testing syntactic parsers, which form a key module in various NLP systems such as machine translation systems. This paper reports ongoing research on building a dependency treebank for Kashmiri (KashTreeBank) and discusses some of the main annotation issues. The paper is based on a pilot annotation of 500 sentences.
POS-tagging is the process of labeling words in a running corpus with their grammatical categories and, optionally, with their associated grammatical features. It is essentially a classification problem, but for languages with split-orthography it is also a mapping problem, which involves mapping arrays of tokens (words, chunks or sentences) onto arrays of tags in proper agreement with the syntactic structure of the language. While POS-tagging is an established technology for European languages and even for some Asian languages like Arabic and Chinese, it is an emerging field in Indian languages, where little work has been done so far, particularly in those languages which use the Perso-Arabic script (e.g. Urdu, Kashmiri, Shina, Balti, and Purki). It has been argued that, due to their split-orthography, such languages pose a real challenge to already complex NLP tasks like tokenization, POS-tagging and chunking. The problem of script needs to be addressed carefully so that such languages do not lag behind in the advancing field of Indian language technology. Since Kashmiri is one such language with severe split-orthography, this paper is an initiative to put the problem in the right perspective and to develop a versatile, fine-grained, hierarchical tag-set for Kashmiri that can handle script-related issues as well as other linguistic issues. The tag-set is also designed so that POS-tagging facilitates parsing as far as possible. The tag-set will be strictly morpho-syntactic in nature, as per the guidelines of the Expert Advisory Group on Language Engineering Standards (henceforth EAGLES) for morpho-syntactic annotation (Leech and Wilson, 1999). Therefore, the morpho-syntactic availability of grammatical features is the governing principle for the present tag-set. Capturing semantically or lexically available grammatical features is outside the scope of the present tag-set and will be handled in future work.
The author, while working on a corpus linguistics project, realized the importance of language resources, mainly syntactically annotated text corpora, but found a severe lack of such resources and related research in his native language, Kashmiri. To address this problem, he started from scratch in his PhD research and developed a small-scale dependency treebank for Kashmiri (KashTreebank). This book is based on his PhD dissertation. It provides the necessary information about the theoretical and practical issues of developing a treebank, particularly in a resource-poor scenario. Since treebanks are now in high demand, not only for training and testing syntactic parsers but also for promoting empirical research in linguistics, this book can serve as a basic source of information for developing large-scale treebanks. It will be especially helpful for resource creation projects, aspiring computational linguists and language engineers, particularly those interested in syntactic parsing, corpus linguistics and experimental syntax.