ACM Transactions on Asian and Low-Resource Language Information Processing, Mar 31, 2021
Pāṇini’s grammar is an important milestone in the Indian grammatical tradition. Unlike grammars of other languages, it is almost exhaustive, and together with the theories of śābdabodha (verbal cognition) it provides a system for language analysis as well as generation. The theories of śābdabodha describe three conditions necessary for verbal cognition: ākāṅkṣā (expectancy), yogyatā (meaning congruity), and sannidhi (proximity). We examine them from a computational viewpoint and provide appropriate computational models for their representation. We then describe the design of a parser following the theories of śābdabodha and present three algorithms for solving the constraints these theories impose. The first algorithm is modeled as a constraint satisfaction problem, the second as a vertex-centric graph traversal, and the third as an edge-centric binary join, each an improvement over the previous one.
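The constraint-satisfaction formulation can be illustrated with a toy sketch (our own illustration, not the paper's implementation; the words and candidate relations below are made-up romanized examples). Each word other than the verb must be assigned a (head, relation) pair; candidate edges come from expectancy, and a congruity-style constraint prunes assignments in which two words would fill the same role of the same head:

```python
from itertools import product

# Toy sentence: "ramah gramam gacchati" ("Rama goes to the village").
# Candidate (head, relation) pairs per dependent, licensed by expectancy;
# the second candidate for "gramam" is a spurious one to be pruned.
candidates = {
    "ramah":  [("gacchati", "karta")],
    "gramam": [("gacchati", "karma"), ("gacchati", "karta")],
}

def consistent(assignment):
    # Congruity-style constraint: no two dependents may fill the
    # same role of the same head.
    seen = set()
    for dep, (head, rel) in assignment.items():
        if (head, rel) in seen:
            return False
        seen.add((head, rel))
    return True

def solve():
    # Brute-force CSP search: enumerate every combination of candidate
    # edges and keep only the consistent ones.
    deps = list(candidates)
    for combo in product(*(candidates[d] for d in deps)):
        assignment = dict(zip(deps, combo))
        if consistent(assignment):
            yield assignment

for parse in solve():
    print(parse)
```

A real parser would of course derive the candidate edges from morphological analysis rather than hard-coding them, and would use constraint propagation instead of brute-force enumeration.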
Sanskrit processing has seen a surge in the use of data-driven approaches over the past decade. Various tasks such as segmentation, morphological parsing, and dependency analysis have been tackled through the development of state-of-the-art models, despite relatively limited datasets compared to other languages. However, a significant challenge lies in the availability of annotated datasets that are lexically, morphologically, syntactically, and semantically tagged. While syntactic and semantic tags are preferable for later stages of processing such as sentential parsing and disambiguation, lexical and morphological tags are crucial for the low-level tasks of word segmentation and morphological parsing. The Digital Corpus of Sanskrit (DCS) is one notable effort that hosts over 650,000 lexically and morphologically tagged sentences from around 250 texts, but it also has limitations at different levels of a sentence, such as chunking, segmentation, stem identification, and morphological analysis. ...
Edited volume featuring the proceedings of the Computational Sanskrit & Digital Humanities section of the 17th World Sanskrit Conference. Contents:
1. Preface / Gérard Huet and Amba Kulkarni
2. A Functional Core for the Computational Aṣṭādhyāyī / Samir Sohoni and Malhar A. Kulkarni
3. PAIAS: Pāṇini Aṣṭādhyāyī Interpreter As a Service / Sarada Susarla, Tilak M. Rao and Sai Susarla
4. Yogyatā as an absence of non-congruity / Sanjeev Panchal and Amba Kulkarni
5. An 'Ekalavya' Approach to Learning Context Free Grammar Rules for Sanskrit Using Adaptor Grammar / Amrith Krishna, Bodhisattwa Prasad Majumder, Anil Kumar Boga and Pawan Goyal
6. A user-friendly tool for metrical analysis of Sanskrit verse / Shreevatsa Rajagopalan
7. Improving the learnability of classifiers for Sanskrit OCR corrections / Devaraja Adiga, Rohit Saluja, Vaibhav Agrawal, Ganesh Ramakrishnan, Parag Chaudhuri, K. Ramasubramanian and Malhar Kulkarni
8. A Tool for Transliteration of Bilingual Texts involving S...
In this paper we describe a sentence generator for Sanskrit. Pāṇini’s grammar provides the essential grammatical rules to generate a sentence from its meaning structure. The meaning structure is an abstract representation of the verbal import. It is the intermediate representation from which, using Pāṇini’s rules and without appealing to world knowledge, the desired sentence can be generated. At the same time, this meaning structure also represents the dependency parse of the generated sentence.
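The idea of generating a sentence from a meaning structure can be sketched in miniature (a toy of our own devising, not the paper's generator; the example words are romanized, and the hardcoded lookup table merely stands in for Pāṇinian morphological rules):

```python
# A meaning structure: the action with its features, plus semantic
# relations (karta = agent, karma = goal) mapping to nominal stems.
meaning = {
    "action": ("gam", "present", "3sg"),   # the verbal root "to go"
    "karta":  "rama",
    "karma":  "grama",
}

# Stand-in for Paninian inflection: (stem + role/features) -> surface form.
# A real generator would derive these forms by applying grammar rules.
inflect = {
    ("rama", "karta"): "ramah",            # nominative singular
    ("grama", "karma"): "gramam",          # accusative singular
    ("gam", "present", "3sg"): "gacchati",
}

def generate(meaning):
    # Inflect each argument according to its relation, then linearize
    # with the verb in final position (a common prose order).
    verb = inflect[meaning["action"]]
    args = [inflect[(stem, rel)]
            for rel, stem in meaning.items() if rel != "action"]
    return " ".join(args + [verb])

print(generate(meaning))  # -> "ramah gramam gacchati"
```

Note that the same meaning structure, read as edges from the arguments to the action, is exactly a dependency parse of the generated sentence.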
A sentence parser is an essential component in the mechanical analysis of natural language texts. Building a parser for Sanskrit text is a challenging task because of its free word order and the dominance of verse over prose in Sanskrit literature. In this paper, we describe our efforts to build a parser which handles both prose and verse texts. It employs an Edge-Centric Binary Join method using various constraints following traditional rules of verbal cognition. We also propose a Daṇḍa-anvaya-janaka which converts the parsed verse form to its canonical prose order.
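An edge-centric binary join can be sketched as follows (a simplified illustration under our own assumptions, not the paper's algorithm; the candidate edges are made-up): partial parses are sets of edges, and at each step a partial parse is joined with one more candidate edge, keeping only joins that satisfy the constraints.

```python
# Candidate (dependent, head, relation) edges for a three-word sentence;
# the third edge is a spurious candidate that the constraints prune.
edges = [
    ("ramah", "gacchati", "karta"),
    ("gramam", "gacchati", "karma"),
    ("gramam", "gacchati", "karta"),
]

def compatible(parse, edge):
    dep, head, rel = edge
    for d, h, r in parse:
        if d == dep:                   # a word takes exactly one head
            return False
        if h == head and r == rel:     # a role is filled at most once
            return False
    return True

def join_all(edges, n_words):
    # Grow partial parses one edge at a time: binary join of each
    # partial parse with each compatible candidate edge. A tree over
    # n words has n - 1 edges, hence n - 1 join rounds.
    partials = {frozenset()}
    for _ in range(n_words - 1):
        partials = {p | {e} for p in partials for e in edges
                    if e not in p and compatible(p, e)}
    return partials

for parse in join_all(edges, 3):
    print(sorted(parse))
```

Working edge-by-edge rather than word-by-word lets inconsistent combinations be discarded as early as the first join, which is what makes the method an improvement over vertex-centric traversal.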
The last decade has seen vigorous activity in the field of Sanskrit computational linguistics pertaining to word-level and sentence-level analysis. In this paper we point out the need for special treatment of Sanskrit at the discourse level, owing to specific trends in the production of Sanskrit literature over two millennia. We present a tagset for inter-sentential analysis, followed by a brief account of discourse-level relations covering sub-topic and topic level analysis, as discussed in the Indian literature ...
Sanskrit, the classical language of India, presents specific challenges for computational linguistics: exact phonetic transcription in writing that obscures word boundaries, rich morphology, and an enormous corpus, among others. Recent international cooperation has developed innovative solutions to these problems and significant resources for linguistic research. Solutions include efficient segmenting and tagging algorithms and dependency parsers based on constraint programming. The integration of lexical ...
The Digital Corpus of Sanskrit records around 650,000 sentences along with their morphological and lexical tagging. However, inconsistencies in the morphological analysis, and in the provision of crucial information such as the segmented word, call for standardization and validation of this corpus. Automating the validation process requires efficient analyzers which also provide the missing information. The Sanskrit Heritage Engine's Reader produces all possible segmentations with morphological and lexical analyses. Aligning these systems would help us record the linguistic differences, which can be used to update both systems to produce standardized results, and would also provide a gold corpus tagged with complete morphological and lexical information along with the segmented words. Krishna et al. (2017) aligned 115,000 sentences, considering some of the linguistic differences. As both systems have evolved significantly, the alignment is done again considering all the remain...
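The alignment step can be pictured with a minimal sketch (our own simplified illustration with made-up romanized data, not the actual alignment procedure): for each DCS record, look for a Heritage Reader segmentation whose segment sequence matches it, and log mismatches for manual standardization.

```python
# A DCS record's segment sequence for one sentence.
dcs_segments = ["ramah", "gramam", "gacchati"]

# Candidate segmentations produced by the Heritage Reader; the second
# is an implausible alternative split used here for illustration.
heritage_candidates = [
    ["ramah", "gramam", "gacchati"],
    ["rama", "hgramam", "gacchati"],
]

def align(dcs, candidates):
    # Exact-match alignment: return the matching candidate, or record
    # every (dcs, candidate) difference for later validation.
    matches = [c for c in candidates if c == dcs]
    if matches:
        return matches[0], []
    diffs = [(dcs, c) for c in candidates]
    return None, diffs

best, diffs = align(dcs_segments, heritage_candidates)
print(best, diffs)
```

In practice the comparison must tolerate systematic linguistic differences between the two systems (e.g. in sandhi splitting conventions), which is precisely what the recorded differences are meant to capture.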
The last decade has seen the introduction of several parsers for English, ranging from rule-based to statistics-based. In recent years there has also been a growing trend toward producing dependency output in addition to constituency trees. The dependency format is preferred over the ...
The anusaaraka system makes text in one Indian language accessible in another Indian language. In the anusaaraka approach, the load is divided between human and computer: the language load is taken by the machine, while the interpretation of the text is left to the human. The machine presents an image of the source text in a language close to the target language. In the image, some constructions of the source language (which do not have equivalents) spill over into the output, and some special notation is also devised. After some training, the user learns to read and understand the output. Because the Indian languages are close, the learning time for the output language is short, expected to be around two weeks. The output can also be post-edited by a trained user to make it grammatically correct in the target language, and the style can be changed if necessary. Thus, in this scenario, it can function as a human-assisted translation system. Currently, anusaarakas are being built from Te...