The present paper brings out the issues and challenges underlying the collection and annotation of a voluminous corpus for Sambalpuri, a lesser-known language spoken in the eastern region of India (Kushal, 2015). It is an Indo-Aryan (IA) language, otherwise known as Dom, Kosali, Koshal, Koshali or Western Odia. It is spoken in ten districts of western and south-western Odisha. The corpus collected for it amounts to 121k tokens covering five domains: literature, sports, tourism, entertainment and miscellaneous. Two issues, corpus collection and annotation, are taken up here. The unavailability of a corpus for a low-density language adversely affects its future NLP development. Where data is available, it is either in image form, in a non-Unicode encoding, or otherwise not machine-readable. Since these languages are not technologically empowered, there are not enough tools to convert such texts to Unicode. Because Sambalpuri is a lesser-known language and the annotation guidelines developed for Indian languages have been devised only for the scheduled languages, one faces significant issues and challenges in annotating a voluminous corpus. These issues mostly stem from the fact that the unique linguistic features of such languages have gone unnoticed and are not incorporated into the uniform guideline devised for the scheduled languages. The whole corpus has been annotated using ILCIANN App. 2.0 (Kumar et al., 2012), following the BIS-ILCI tagset devised for Odia, since no tagset is available for Sambalpuri. In addition, Sambalpuri was earlier considered to be one of the dialects of Odia, and both are closely related morpho-syntactically but not syntactically. The BIS tagset (Baskaran et al., 2006) is a hierarchical tagset designed by the POS Standardization Committee appointed by the Department of Information Technology, Government of India.
The reciprocal pronoun and foreign word categories did not occur anywhere in the corpus during annotation. The issues and challenges pertaining to corpus collection and annotation for Sambalpuri are outlined below, and solutions are proposed.
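The observation that some tagset categories never occur in the corpus can be checked mechanically once the annotation is exported. The following is only a minimal sketch under stated assumptions, not the procedure used with ILCIANN: it assumes the annotated text is available as plain lines of `token/TAG` items, and the tag labels in `TAGSET` are a small illustrative subset written in the BIS style (the labels for reciprocal pronoun and foreign word are assumed here to be `PR_PRC` and `RD_RDF`). The function names and the placeholder tokens are hypothetical.

```python
from collections import Counter

# Illustrative subset of BIS-style tag labels (an assumption, not the full tagset).
TAGSET = ["N_NN", "V_VM", "PR_PRP", "PR_PRC", "RD_RDF", "RD_PUNC"]


def parse_tagged_line(line, sep="/"):
    """Split one line of token<sep>TAG items into (token, tag) pairs.

    The separator is an assumption about the export format; rpartition is
    used so that a token containing the separator is still split on the
    last occurrence only.
    """
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition(sep)
        if token and tag:  # skip items that carry no tag
            pairs.append((token, tag))
    return pairs


def unused_tags(tagged_lines, tagset=TAGSET):
    """Return the tags from `tagset` that never occur, plus all tag counts."""
    counts = Counter(
        tag for line in tagged_lines for _, tag in parse_tagged_line(line)
    )
    missing = [tag for tag in tagset if counts[tag] == 0]
    return missing, counts


if __name__ == "__main__":
    # Placeholder tokens stand in for real corpus text.
    sample = ["tok1/N_NN tok2/V_VM ./RD_PUNC", "tok3/PR_PRP tok4/V_VM"]
    missing, counts = unused_tags(sample)
    print("Unused categories:", missing)
    print("Tag frequencies:", dict(counts))
```

Run over the full annotated corpus, a check of this kind would list every category of the tagset with zero occurrences, which is one way to confirm which categories were left unused.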