The synsets in the Assamese WordNet play a significant role in enriching the Assamese language. These synsets are built on the intuition of native speakers of the language, and there is no fixed rule for arranging the positions of each synset. This paper makes a quantitative comparison of every synset position in the WordNet against the occurrences of these synsets in an Assamese corpus of approximately 1.5 million words. The experimental results of this comparison are presented as diagrams. The paper also attempts to trace the timeline of each synset of the WordNet based on the corpus, addressing how synonymous word forms change over time.
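The comparison described above can be sketched in a few lines. This is a minimal illustration only: it assumes that "synset position" means the order in which words are listed inside one synset, and the function name `compare_synset_order` and the placeholder tokens are hypothetical, not taken from the paper.

```python
from collections import Counter


def compare_synset_order(synset, corpus_tokens):
    """Compare the listed order of a synset with its corpus-frequency order.

    synset: words in the order they are listed (assumed WordNet order).
    corpus_tokens: the tokenized corpus as a list of strings.
    """
    counts = Counter(corpus_tokens)
    freq = {word: counts[word] for word in synset}
    # sorted() is stable, so words with equal counts keep their listed order.
    by_frequency = sorted(synset, key=lambda w: -freq[w])
    return list(synset), by_frequency, freq


# Toy data: placeholder tokens, not real Assamese words.
corpus = "b a b c b a d".split()
synset = ["a", "b", "c"]
wordnet_order, frequency_order, freq = compare_synset_order(synset, corpus)
print("listed order:   ", wordnet_order)
print("frequency order:", frequency_order)
print("counts:         ", freq)
```

Where the two orders disagree, the corpus suggests that a word listed later in the synset is in fact the more common form.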
Machine Translation is the task of automatically translating text from a source language to a target language. Here, we describe a system that translates English text to Assamese using a phrase-based statistical translation technique. To overcome translation problems with highly open word classes such as proper nouns, and with out-of-vocabulary words, we develop a transliteration system that is embedded in our translation system. We enhance the translation output by replacing words with their most appropriate synonyms for the particular context, with the help of Assamese WordNet synsets. The system produces reasonable translation output, as judged by a linguist, for Assamese, which has received less computational attention than other Indian languages.
ADBU Journal of Engineering Technology (AJET), 2019
This paper covers the activities involved in developing a monolingual Information Retrieval (IR) system for an Indo-Aryan language, Assamese. In a multilingual country like India, where 23 official languages exist, the task of digitizing local-language content is growing tremendously. To meet each individual's need for relevant information, monolingual information retrieval in one's own language is essential. The work aims to develop a search engine that retrieves relevant information for a query issued in the user's own language. Various linguists and researchers collaborated on the work, provided valuable information, and developed important resources. Many informational resources, language resources, tools, and technologies were researched, analyzed, developed, and applied in implementing the overall pipeline. The search engine is built on the open-source search platforms Solr and Nutch, with NLP applications embedded in it. Computational Linguistics, or Natural Language Processing (NLP), enhances the performance of the IR system. Each phase of the system is described in detail and explained step by step. This work is a remarkable contribution to Assamese language technology and an important application of NLP.
Stemming is a technique that reduces an inflected word to its root form. Assamese is a morphologically rich, scheduled Indian language in which various suffixes are attached to a word in different contexts. Normalizing such inflected words helps improve the performance of various Natural Language Processing applications. This paper develops a look-up and rule-based suffix-stripping approach for Assamese using WordNet. The authors prepare a dictionary of root words extracted from the Assamese WordNet and from named entities. Stemming rules for inflected nouns and verbs are set in the rule engine, and the stemmed output is then tested against the morphological root words of the Assamese WordNet and the named entities by computing the Hamming distance. The resulting stemmer achieves an accuracy of 85%. The authors also report the IR system's performance when the Assamese stemmer is applied and show its effectiveness in retrieving sense-oriented results for the issued query. The morphological analyzer thus opens the way for developing various Assamese NLP applications.
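A look-up plus suffix-stripping stemmer of the kind described above might look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the two-word root dictionary, the three suffixes, the names `ROOTS`, `SUFFIXES`, `stem` and `hamming`, and the choice to add the length difference when two strings differ in length. The paper's actual dictionary, rules, and evaluation procedure are not given in the abstract.

```python
# Toy look-up dictionary of roots (assumed; the paper draws roots from
# the Assamese WordNet and named entities).
ROOTS = {"মানুহ", "ঘৰ"}

# Toy suffix rules (assumed): a plural marker, a locative, and a genitive.
SUFFIXES = ["বোৰ", "ত", "ৰ"]


def stem(word):
    """Return the root of `word` using look-up first, then suffix stripping."""
    if word in ROOTS:  # look-up step: known roots are never stripped
        return word
    # Try longer suffixes first so a short suffix cannot pre-empt a long one.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        candidate = word[: -len(suffix)]
        if word.endswith(suffix) and candidate in ROOTS:
            return candidate
    return word  # unknown words are left unchanged


def hamming(a, b):
    """Count differing positions; the length difference is added (an assumption)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Evaluate against hand-written gold roots: distance 0 counts as correct.
gold = [("মানুহবোৰ", "মানুহ"), ("ঘৰত", "ঘৰ"), ("ঘৰ", "ঘৰ")]
correct = sum(hamming(stem(word), root) == 0 for word, root in gold)
accuracy = correct / len(gold)
print(f"accuracy on toy data: {accuracy:.2f}")
```

Doing the look-up before any stripping matters here: the root ঘৰ itself ends in a character that is also a suffix, so a purely rule-based pass would over-stem it.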
Word Sense Disambiguation (WSD) aims to automatically disambiguate words that have multiple senses in a context. A sense denotes a meaning of a word, and words that have various meanings in a context are referred to as ambiguous words. WSD is vital in many important Natural Language Processing tasks such as MT, IR, TC, and SP. This paper proposes a supervised machine learning approach, the decision tree, for the WSD task in Assamese. A decision tree is a flow-chart-like tree structure in which each internal node denotes a test, each branch represents the result of a test, and each leaf holds a sense label. J48, a Java implementation of the C4.5 decision tree algorithm, is used for the experiments. A few polysemous words with different real occurrences in Assamese text, manually sense-annotated, were collected as the training and test dataset. The DT algorithm produces an average F-measure of .611 when 10-fold cross-validation was performed on 10 Assamese
Word Sense Disambiguation (WSD) is the process of identifying the proper sense of an ambiguous word in a particular context, that is, finding the correct sense s_i among the set of senses {s_1, s_2, …, s_n}. The task is motivated by its use in various Natural Language Processing (NLP) applications such as IR, MT, QA, TC, and SP. In this paper, a machine learning technique, the Naïve Bayes classifier, is used for automatic disambiguation. Training data were prepared with sense-annotated features, with the help of a sense inventory. Currently, about 160 ambiguous words are present in the sense inventory, derived from 18K and 25K words from the Assamese Corpus and WordNet. The system is implemented in two phases. In the first phase, 2.7K sense-annotated training instances and 800 test instances were used, giving an accuracy of 71%. Analysis of the results shows that accuracy improves as the training data grows and as the model learned in the previous iteration is reused. In the second phase, we manually validate the first-phase output and add the clean sense-tagged data to the previous training set. We then train the system on the enlarged training data (3.5K), which improves accuracy. The system thus adopts iterative learning and gains a further 7% in accuracy. The paper implements an Assamese WSD system with an NB classifier using lexical features, and the enhancement of the baseline method raises the classifier accuracy to 78%.
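To make the classification step concrete, here is a minimal Naïve Bayes sense classifier over bag-of-words context features. It is a sketch under stated assumptions, not the authors' system: the add-one (Laplace) smoothing, the function names `train` and `predict`, and the English toy data for an ambiguous word with "shore" and "finance" senses are all illustrative choices that do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Build counts from (context_words, sense) pairs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def predict(model, words):
    """Return the sense maximizing log P(sense) + sum of log P(word | sense)."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        score = math.log(n / total)
        # Add-one smoothing (an assumption) keeps unseen words from zeroing a sense.
        denom = sum(word_counts[sense].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


# Toy sense-annotated contexts (English stand-ins, not Assamese data).
training = [
    (["river", "water"], "shore"),
    (["money", "loan"], "finance"),
    (["water", "boat"], "shore"),
    (["loan", "interest"], "finance"),
]
model = train(training)
print(predict(model, ["water"]))
print(predict(model, ["loan", "money"]))
```

The second phase described in the abstract would correspond to appending manually validated predictions to `training` and calling `train` again on the enlarged set.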
Resolution of lexical ambiguity, commonly known as Word Sense Disambiguation (WSD), is the task of automatically identifying the correct sense of an ambiguous term, among its set of senses, for a particular context. It plays a vital role as an intermediate phase in many Natural Language Processing (NLP) applications such as machine translation, information retrieval, speech processing, hypertext navigation, and part-of-speech tagging. The existing literature describes various approaches to lexical ambiguity resolution, including knowledge-based and corpus-based ones. In recent years, many WSD systems have been developed for Indian languages such as Hindi, Malayalam, Manipuri, Nepali, and Kannada, but no such automated system has yet emerged for the Indo-Aryan language Assamese. Our future work aims to develop a model for the WSD problem that is fast, optimal, and efficient in terms of accuracy and scalability. This paper presents a survey of the research topic, discussing the WSD problem and the various approaches along with their algorithms. It also lists the NLP applications that would benefit from an integrated disambiguation system. Evaluation measures used to determine WSD performance are also discussed.
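The abstract does not name the evaluation measures it surveys. As an assumption, the sketch below uses the precision, recall, and F-measure definitions commonly applied to WSD, where a system may decline to tag some instances; the function name and the example counts are hypothetical.

```python
def precision_recall_f1(correct, attempted, total):
    """Compute WSD-style precision, recall, and F1.

    correct:   instances tagged with the right sense
    attempted: instances the system tagged at all
    total:     all instances in the test set
    """
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 10 test instances, 8 tagged, 6 of those tagged correctly.
p, r, f = precision_recall_f1(correct=6, attempted=8, total=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

When a system tags every instance, `attempted` equals `total` and all three values coincide with plain accuracy.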