Abstract
Offensive language identification is essential to making social media a safe and clean place to share one's views. In this work, a model is proposed to automatically classify tweets in Marathi, a low-resource language spoken in Western India, as offensive or not offensive. Because Marathi is a low-resource language, it lacks a comprehensive list of stopwords and a proper stemmer. To fill this gap, we created a list of stopwords for stopword removal and a list of suffixes for identifying the root word in Marathi. Two feature extraction methods, Label Vectorizer and term frequency-inverse document frequency (TF-IDF) Vectorizer, are applied to the text, and the resulting features are used with six conventional machine learning classifiers to classify a Marathi tweet as offensive or not offensive.
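The preprocessing and feature extraction steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword set, the suffix list, the `min_stem_len` guard, and the function names `strip_suffix`, `preprocess`, and `tfidf` are all hypothetical, and the real lists from the paper are not reproduced here. The TF-IDF weighting uses a common smoothed form, which may differ from the variant used in the paper.

```python
import math
from collections import Counter

# Illustrative placeholders only: NOT the stopword or suffix lists built in the paper.
STOPWORDS = {"आणि", "आहे", "हा", "ही", "तो", "ते"}
SUFFIXES = ["मध्ये", "साठी", "ला", "ने", "चा", "ची", "चे"]


def strip_suffix(token, min_stem_len=3):
    """Remove the longest matching suffix, if the remaining stem is long enough.

    min_stem_len is an assumed guard against over-stemming short words.
    """
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_len:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Whitespace-tokenize, drop stopwords, then strip suffixes."""
    return [strip_suffix(t) for t in text.split() if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term count * (ln((1 + n) / (1 + df)) + 1), a smoothed IDF.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


tweets = ["तो शाळेला गेला आणि घरामध्ये आला", "ही शाळा आहे"]
features = tfidf([preprocess(t) for t in tweets])
```

The resulting sparse term-weight dictionaries would then be converted to fixed-length vectors and passed to a conventional classifier; that step is omitted here because the abstract does not name the six classifiers.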
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Kumari, A., Garge, A., Raj, P., Kumar, G., Singh, J.P., Alryalat, M. (2024). Classification of Offensive Tweet in Marathi Language Using Machine Learning Models. In: Dasgupta, K., Mukhopadhyay, S., Mandal, J.K., Dutta, P. (eds) Computational Intelligence in Communications and Business Analytics. CICBA 2023. Communications in Computer and Information Science, vol 1955. Springer, Cham. https://doi.org/10.1007/978-3-031-48876-4_20
DOI: https://doi.org/10.1007/978-3-031-48876-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48875-7
Online ISBN: 978-3-031-48876-4
eBook Packages: Computer Science (R0)