Automatic Identification Methods on a Corpus of Twenty Five Fine-Grained Arabic Dialects

  • Conference paper
Arabic Language Processing: From Theory to Practice (ICALP 2019)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1108)

Abstract

This research addresses Arabic dialect identification, a challenging problem in Arabic NLP. The increasing use of Arabic dialects in written form, especially on social media, creates new needs in the area of Arabic dialect processing. To discriminate between dialects in a multi-dialect context, we explored several approaches based on machine learning techniques: a classification method based on symmetric Kullback-Leibler divergence, classical classification methods such as Naive Bayes classifiers, and more sophisticated methods such as Word2Vec and Long Short-Term Memory (LSTM) neural networks. We tested our approaches on a large database of 25 Arabic dialects in addition to Modern Standard Arabic (MSA).
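The abstract does not give the exact formulation of the symmetric Kullback-Leibler classifier, so the following is only a minimal sketch of the general idea: each dialect is represented by a smoothed word distribution, and a sentence is assigned to the dialect whose distribution is closest under the symmetric divergence KL(p||q) + KL(q||p). The word-unigram features, the add-one smoothing, the function names, and the toy placeholder tokens are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_dist(texts, vocab, alpha=1.0):
    """Add-alpha smoothed word distribution over a fixed vocabulary (assumed feature choice)."""
    counts = Counter(w for t in texts for w in t.split() if w in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def symmetric_kl(p, q):
    """Symmetric KL divergence KL(p||q) + KL(q||p), written as sum (p - q) * log(p / q)."""
    return sum((p[w] - q[w]) * math.log(p[w] / q[w]) for w in p)


def classify(sentence, dialect_dists, vocab):
    """Return the label whose distribution is closest to the sentence's distribution."""
    s = unigram_dist([sentence], vocab)
    return min(dialect_dists, key=lambda d: symmetric_kl(s, dialect_dists[d]))


# Toy placeholder data: the tokens and labels are made up, not real dialect text.
training = {
    "A": ["aa bb cc", "aa bb dd"],
    "B": ["xx yy zz", "xx yy cc"],
}
vocab = {w for texts in training.values() for t in texts for w in t.split()}
dists = {label: unigram_dist(texts, vocab) for label, texts in training.items()}

print(classify("aa bb", dists, vocab))
print(classify("xx yy zz", dists, vocab))
```

Smoothing matters here: without it, a word missing from one distribution would give a zero probability and make the logarithm undefined.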

Author information

Corresponding author

Correspondence to Salima Harrat.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Harrat, S., Meftouh, K., Abidi, K., Smaïli, K. (2019). Automatic Identification Methods on a Corpus of Twenty Five Fine-Grained Arabic Dialects. In: Smaïli, K. (eds) Arabic Language Processing: From Theory to Practice. ICALP 2019. Communications in Computer and Information Science, vol 1108. Springer, Cham. https://doi.org/10.1007/978-3-030-32959-4_6

  • DOI: https://doi.org/10.1007/978-3-030-32959-4_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32958-7

  • Online ISBN: 978-3-030-32959-4

  • eBook Packages: Computer Science, Computer Science (R0)
