-
Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages
Authors:
Sankalp Bahad,
Pruthwik Mishra,
Karunesh Arora,
Rakesh Chandra Balabantaray,
Dipti Misra Sharma,
Parameswari Krishnamurthy
Abstract:
Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges an…
▽ More
Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. The research on NER is centered around English and some other major languages, whereas limited attention has been given to Indian languages. We analyze the challenges and propose techniques that can be tailored for Multilingual Named Entity Recognition for Indian Languages. We present a human annotated named entity corpora of 40K sentences for 4 Indian languages from two of the major Indian language families. Additionally,we present a multilingual model fine-tuned on our dataset, which achieves an F1 score of 0.80 on our dataset on average. We achieve comparable performance on completely unseen benchmark datasets for Indian languages which affirms the usability of our model.
△ Less
Submitted 10 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Language Identification of Devanagari Poems
Authors:
Priyankit Acharya,
Aditya Ku. Pathak,
Rakesh Ch. Balabantaray,
Anil Ku. Singh
Abstract:
Language Identification is a very important part of several text processing pipelines. Extensive research has been done in this field. This paper proposes a procedure for automatic language identification of poems for poem analysis task, consisting of 10 Devanagari based languages of India i.e. Angika, Awadhi, Braj, Bhojpuri, Chhattisgarhi, Garhwali, Haryanvi, Hindi, Magahi, and Maithili. We colla…
▽ More
Language Identification is a very important part of several text processing pipelines. Extensive research has been done in this field. This paper proposes a procedure for automatic language identification of poems for poem analysis task, consisting of 10 Devanagari based languages of India i.e. Angika, Awadhi, Braj, Bhojpuri, Chhattisgarhi, Garhwali, Haryanvi, Hindi, Magahi, and Maithili. We collated corpora of poems of varying length and studied the similarity of poems among the 10 languages at the lexical level. Finally, various language identification systems based on supervised machine learning and deep learning techniques are applied and evaluated.
△ Less
Submitted 29 December, 2020;
originally announced December 2020.
-
Automatic Parallel Corpus Creation for Hindi-English News Translation Task
Authors:
Aditya Kumar Pathak,
Priyankit Acharya,
Dilpreet Kaur,
Rakesh Chandra Balabantaray
Abstract:
The parallel corpus for multilingual NLP tasks, deep learning applications like Statistical Machine Translation Systems is very important. The parallel corpus of Hindi-English language pair available for news translation task till date is of very limited size as per the requirement of the systems are concerned. In this work we have developed an automatic parallel corpus generation system prototype…
▽ More
The parallel corpus for multilingual NLP tasks, deep learning applications like Statistical Machine Translation Systems is very important. The parallel corpus of Hindi-English language pair available for news translation task till date is of very limited size as per the requirement of the systems are concerned. In this work we have developed an automatic parallel corpus generation system prototype, which creates Hindi-English parallel corpus for news translation task. Further to verify the quality of generated parallel corpus we have experimented by taking various performance metrics and the results are quite interesting.
△ Less
Submitted 24 January, 2019;
originally announced January 2019.
-
Document Clustering using K-Means and K-Medoids
Authors:
Rakesh Chandra Balabantaray,
Chandrali Sarma,
Monica Jha
Abstract:
With the huge upsurge of information in day-to-days life, it has become difficult to assemble relevant information in nick of time. But people, always are in dearth of time, they need everything quick. Hence clustering was introduced to gather the relevant information in a cluster. There are several algorithms for clustering information out of which in this paper, we accomplish K-means and K-Medoi…
▽ More
With the huge upsurge of information in day-to-days life, it has become difficult to assemble relevant information in nick of time. But people, always are in dearth of time, they need everything quick. Hence clustering was introduced to gather the relevant information in a cluster. There are several algorithms for clustering information out of which in this paper, we accomplish K-means and K-Medoids clustering algorithm and a comparison is carried out to find which algorithm is best for clustering. On the best clusters formed, document summarization is executed based on sentence weight to focus on key point of the whole document, which makes it easier for people to ascertain the information they want and thus read only those documents which is relevant in their point of view.
△ Less
Submitted 27 February, 2015;
originally announced February 2015.