Abstract
This research addresses Arabic dialect identification, a challenging problem in Arabic NLP. The increasing use of written Arabic dialects, especially on social media, creates new needs in Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several machine learning approaches: a classification method based on the symmetric Kullback-Leibler divergence, classical classifiers such as Naive Bayes, and more sophisticated methods such as Word2Vec and Long Short-Term Memory (LSTM) neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to Modern Standard Arabic (MSA).
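To make the symmetric Kullback-Leibler idea concrete, here is a minimal sketch of how such a classifier could look. It is not the chapter's implementation: the unigram features, the additive smoothing, the function names, and the toy tokens and dialect labels are all assumptions made for illustration. A sentence is assigned to the dialect whose smoothed word distribution has the smallest symmetric divergence from the sentence's own distribution.

```python
import math
from collections import Counter


def unigram_distribution(tokens, vocabulary, alpha=1.0):
    """Additively smoothed unigram distribution over a fixed vocabulary.

    Smoothing (an assumption of this sketch) keeps every probability
    strictly positive, so the logarithms below are always defined.
    """
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / total for w in vocabulary}


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p||q) + KL(q||p).

    The two sums combine into sum_w (p(w) - q(w)) * log(p(w) / q(w)).
    """
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def classify(sentence, dialect_corpora, alpha=1.0):
    """Return the label with the lowest divergence, plus all scores."""
    tokens = sentence.split()
    # Shared vocabulary: every word seen in the sentence or any corpus.
    vocabulary = set(tokens)
    for corpus_tokens in dialect_corpora.values():
        vocabulary.update(corpus_tokens)
    vocabulary = sorted(vocabulary)

    p = unigram_distribution(tokens, vocabulary, alpha)
    scores = {
        label: symmetric_kl(p, unigram_distribution(corpus_tokens, vocabulary, alpha))
        for label, corpus_tokens in dialect_corpora.items()
    }
    return min(scores, key=scores.get), scores


# Hypothetical toy data: placeholder tokens and labels, not real corpus entries.
toy_corpora = {
    "dialect_A": "wesh rak wesh khoya".split(),
    "dialect_B": "ezayak amel eh ezayak".split(),
}
label, scores = classify("wesh rak", toy_corpora)
print(label, scores)
```

On real data the per-dialect distributions would be estimated once from training text rather than rebuilt per query, and character n-grams could replace word unigrams; both choices are left open here.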
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Harrat, S., Meftouh, K., Abidi, K., Smaïli, K. (2019). Automatic Identification Methods on a Corpus of Twenty Five Fine-Grained Arabic Dialects. In: Smaïli, K. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2019. Communications in Computer and Information Science, vol 1108. Springer, Cham. https://doi.org/10.1007/978-3-030-32959-4_6
DOI: https://doi.org/10.1007/978-3-030-32959-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32958-7
Online ISBN: 978-3-030-32959-4
eBook Packages: Computer Science, Computer Science (R0)