Enhancing Arabic Dialect Detection on Social Media: A Hybrid Model with an Attention Mechanism
Abstract
1. Introduction
- A novel hybrid machine learning and deep learning model combining LSTM, BiLSTM, and logistic regression is proposed. The model is designed specifically for detecting and classifying Arabic dialects;
- A new dataset comprising four Arabic dialects, namely Egyptian, Gulf, Jordanian, and Yemeni, is introduced. The dataset was collected and curated to train and evaluate the proposed model;
- The performance of the proposed model is examined by training it on the new dataset and measuring metrics such as accuracy, precision, recall, and F1-score to assess how well it identifies and classifies the different Arabic dialects;
- The proposed model’s performance is compared across different word representations, namely TF-IDF, Word2Vec, and GloVe, on the introduced Arabic dialect dataset.
2. Related Studies
3. Problem Formulation of Arabic Dialect Identification
4. Methods and Materials
4.1. Data Collection Phase
4.2. Data Cleaning
4.3. Data Annotation
4.4. Pre-Processing
- Removal of Special Characters: Special characters like “#” and “@” commonly used in Tweets are removed from the text;
- Elimination of English Words or Characters: Any English words or characters, such as mentions or references to others, are removed from the text;
- Exclusion of English Numbers: Numerical values in English are eliminated from the text;
- Exclusion of Arabic Numbers: Arabic numerical values are taken away from the text;
- Elimination of Tweets with No Words: Tweets that do not contain any words, such as those consisting only of images, mentions, or characters, are dropped from the dataset;
- Augmentation with Arabic Stop Words Removal: To enhance the analysis, the data is processed twice. The first pass follows the previous steps, and the second pass adds the removal of Arabic stop words. The stop words are taken from the NLTK Python library and include words that do not contribute significantly to sentiment analysis or dialect classification, such as “و” (and), “أو” (or), “إلا” (except), “لكن” (but), and so on.
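The pre-processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: the function names `clean_tweet` and `preprocess` are hypothetical, the stop-word set contains only the four examples quoted in the text (the full list would come from NLTK), the underscore is treated as a special character by assumption, and a Tweet is assumed to have "no words" when no Arabic letter survives cleaning.

```python
import re

# Hypothetical subset: only the four stop words quoted in the text.
# The full Arabic list would come from NLTK's stop-word corpus.
ARABIC_STOP_WORDS = {"و", "أو", "إلا", "لكن"}

SPECIAL_CHARS = re.compile(r"[#@_]")          # "_" is an assumption (handles, hashtags)
LATIN_LETTERS = re.compile(r"[A-Za-z]+")      # English words and characters
DIGITS = re.compile(r"[0-9\u0660-\u0669]+")   # Western and Arabic-Indic digits
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def clean_tweet(text, remove_stop_words=False):
    """Strip special characters, Latin letters, and digits from one Tweet."""
    text = SPECIAL_CHARS.sub(" ", text)
    text = LATIN_LETTERS.sub(" ", text)
    text = DIGITS.sub(" ", text)
    tokens = text.split()
    if remove_stop_words:
        tokens = [tok for tok in tokens if tok not in ARABIC_STOP_WORDS]
    return " ".join(tokens)


def preprocess(tweets, remove_stop_words=False):
    """Clean every Tweet and drop those left without any Arabic letter."""
    cleaned = (clean_tweet(t, remove_stop_words) for t in tweets)
    return [t for t in cleaned if ARABIC_LETTER.search(t)]
```

Calling `preprocess(tweets)` and then `preprocess(tweets, remove_stop_words=True)` would yield the two versions of the corpus described in the last step.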
4.5. Word Representation
4.6. Machine and Deep Learning Models
4.7. Model Evaluation
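The metrics named in the contributions (accuracy, precision, recall, and F1-score) can be computed from true and predicted dialect labels as in the sketch below. The function names are hypothetical, and macro-averaging over the four dialects is an assumption, since the averaging scheme is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted dialect matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    results = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (macro-averaging is an assumption)."""
    scores = per_class_metrics(y_true, y_pred, labels)
    return sum(f1 for _, _, f1 in scores.values()) / len(labels)


# Toy labels for illustration only; these are not results from the paper.
labels = ["Egyptian", "Gulf", "Jordanian", "Yemeni"]
y_true = ["Egyptian", "Egyptian", "Gulf", "Gulf", "Jordanian", "Yemeni"]
y_pred = ["Egyptian", "Gulf", "Gulf", "Gulf", "Jordanian", "Yemeni"]
print(accuracy(y_true, y_pred))
print(per_class_metrics(y_true, y_pred, labels)["Gulf"])
```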
5. Results and Discussion
6. Conclusions and Future Work
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kanan, T.; Sadaqa, O.; Aldajeh, A.; Alshwabka, H.; AL-dolime, W.; AlZu’bi, S.; Elbes, M.; Hawashin, B.; Alia, M.A. A review of natural language processing and machine learning tools used to analyze arabic social media. In Proceedings of the 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), Amman, Jordan, 9–11 April 2019; IEEE: Piscataway, NJ, USA; pp. 622–628. [Google Scholar]
- Alhejaili, R.; Alhazmi, E.S.; Alsaeedi, A.; Yafooz, W.M. Sentiment analysis of the COVID-19 vaccine for Arabic tweets using machine learning. In Proceedings of the 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 3–4 September 2021; pp. 1–5. [Google Scholar]
- Alnawas, A.; Arici, N. The corpus based approach to sentiment analysis in modern standard Arabic and Arabic dialects: A literature review. Politek. Derg. 2018, 21, 461–470. [Google Scholar] [CrossRef]
- Al Shamsi, A.A.; Abdallah, S. Text mining techniques for sentiment analysis of Arabic dialects: Literature review. Adv. Sci. Technol. Eng. Syst. J. 2021, 6, 1012–1023. [Google Scholar] [CrossRef]
- Kwaik, K.A.; Saad, M.; Chatzikyriakidis, S.; Dobnik, S. Shami: A corpus of levantine arabic dialects. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Elnagar, A.; Al-Debsi, R.; Einea, O. Arabic text classification using deep learning models. Inf. Process. Manag. 2020, 57, 102121. [Google Scholar] [CrossRef]
- Huang, F. Improved arabic dialect classification with social media data. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2118–2126. [Google Scholar]
- AlYami, R.; AlZaidy, R. Arabic dialect identification in social media. In Proceedings of the 2020 3rd International Conference on Computer Applications & Information Security (ICCAIS), Riyadh, Saudi Arabia, 19–21 March 2020; IEEE: Piscataway, NJ, USA; pp. 1–2. [Google Scholar]
- Dunn, J. Modeling global syntactic variation in English using dialect classification. arXiv 2019, arXiv:1904.05527. [Google Scholar]
- Elfardy, H.; Diab, M. Sentence level dialect identification in Arabic. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Short Papers. Sofia, Bulgaria, 4–9 August 2013; Volume 2, pp. 456–461. [Google Scholar]
- Ali, A.; Dehak, N.; Cardinal, P.; Khurana, S.; Yella, S.H.; Glass, J.; Bell, P.; Renals, S. Automatic dialect detection in arabic broadcast speech. arXiv 2015, arXiv:1509.06928. [Google Scholar]
- Boujou, E.; Chataoui, H.; Mekki, A.E.; Benjelloun, S.; Chairi, I.; Berrada, I. An open access nlp dataset for arabic dialects: Data collection, labeling, and model construction. arXiv 2021, arXiv:2102.11000. [Google Scholar]
- Sobhy, M.; El-Atta AH, A.; El-Sawy, A.A.; Nayel, H. Word Representation Models for Arabic Dialect Identification. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 474–478. [Google Scholar]
- El-Haj, M.; Rayson, P.; Aboelezz, M. Arabic dialect identification in the context of bivalency and code-switching. In Proceedings of the 11th International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018; European Language Resources Association: Paris, France, 2018; pp. 3622–3627. [Google Scholar]
- Malmasi, S.; Refaee, E.; Dras, M. Arabic dialect identification using a parallel multidialectal corpus. In Proceedings of the International Conference of the Pacific Association for Computational Linguistics, PACLING 2015, Bali, Indonesia, 19–21 May 2015; Springer: Singapore, 2015; pp. 35–53. [Google Scholar]
- Butnaru, A.M.; Ionescu, R.T. Unibuckernel reloaded: First place in arabic dialect identification for the second year in a row. arXiv 2018, arXiv:1805.04876. [Google Scholar]
- Johnson, A.; Everson, K.; Ravi, V.; Gladney, A.; Ostendorf, M.; Alwan, A. Automatic dialect density estimation for african american english. arXiv 2022, arXiv:2204.00967. [Google Scholar]
- Hassani, H.; Medjedovic, D. Automatic Kurdish dialects identification. Comput. Sci. Inf. Technol. 2016, 6, 61–78. [Google Scholar]
- Nayel, H.; Hassan, A.; Sobhi, M.; El-Sawy, A. Machine learning-based approach for Arabic dialect identification. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kiev, Ukraine, 19 April 2021; pp. 287–290. [Google Scholar]
- Mishra, P.; Mujadia, V. Arabic dialect identification for travel and twitter text. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 28 July–2 August 2019; pp. 234–238. [Google Scholar]
- Chittaragi, N.B.; Limaye, A.; Chandana, N.T.; Annappa, B.; Koolagudi, S.G. Automatic text-independent Kannada dialect identification system. In Information Systems Design and Intelligent Applications: Proceedings of Fifth International Conference INDIA 2018 Volume 2; Springer: Singapore, 2019; pp. 79–87. [Google Scholar]
- Doostmohammadi, E.; Nassajian, M. Investigating machine learning methods for language and dialect identification of cuneiform texts. arXiv 2020, arXiv:2009.10794. [Google Scholar]
- AlShenaifi, N.; Azmi, A. Arabic dialect identification using machine learning and transformer-based models: Submission to the NADI 2022 Shared Task. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 464–467. [Google Scholar]
- Talafha, B.; Farhan, W.; Altakrouri, A.; Al-Natsheh, H. Mawdoo3 AI at MADAR shared task: Arabic tweet dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 28 July–2 August 2019; pp. 239–243. [Google Scholar]
- Mohammed, A.; Jiangbin, Z.; Murtadha, A. A three-stage neural model for Arabic Dialect Identification. Comput. Speech Lang. 2023, 80, 101488. [Google Scholar] [CrossRef]
- Sundus, K.; Al-Haj, F.; Hammo, B. A deep learning approach for arabic text classification. In Proceedings of the 2019 2nd International Conference on New Trends in Computing Sciences (ICTCS), Amman, Jordan, 9–11 October 2019; IEEE: Piscataway, NJ, USA; pp. 1–7. [Google Scholar]
- Alqurashi, T. Applying a Character-Level Model to a Short Arabic Dialect Sentence: A Saudi Dialect as a Case Study. Appl. Sci. 2022, 12, 12435. [Google Scholar] [CrossRef]
- Abdelazim, M.; Hussein, W.; Badr, N. Automatic Dialect identification of Spoken Arabic Speech using Deep Neural Networks. Int. J. Intell. Comput. Inf. Sci. 2022, 22, 25–34. [Google Scholar] [CrossRef]
- Fares, Y.; El-Zanaty, Z.; Abdel-Salam, K.; Ezzeldin, M.; Mohamed, A.; El-Awaad, K.; Torki, M. Arabic dialect identification with deep learning and hybrid frequency based features. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 28 July–2 August 2019; pp. 224–228. [Google Scholar]
- Elaraby, M.; Abdul-Mageed, M. Deep models for arabic dialect identification on benchmarked data. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), Santa Fe, NM, USA, 20 August 2018; pp. 263–274. [Google Scholar]
- Mekki, A.E.; Mahdaouy, A.E.; Essefar, K.; Mamoun, N.E.; Berrada, I.; Khoumsi, A. BERT-based Multi-Task Model for Country and Province Level Modern Standard Arabic and Dialectal Arabic Identification. arXiv 2021, arXiv:2106.12495. [Google Scholar]
- Wang, J.H.; Liu, T.W.; Luo, X.; Wang, L. An LSTM approach to short text sentiment classification with word embeddings. In Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018), Hsinchu, Taiwan, 4–5 October 2018; pp. 214–223. [Google Scholar]
- Nowak, J.; Taspinar, A.; Scherer, R. LSTM recurrent neural networks for short text and sentiment classification. In Proceedings of the Artificial Intelligence and Soft Computing: 16th International Conference, ICAISC 2017, Zakopane, Poland, 11–15 June 2017; Proceedings, Part II 16; Springer International Publishing: Cham, Switzerland, 2017; pp. 553–562. [Google Scholar]
- Elaraby, M.; Zahran, A. A Character Level Convolutional BiLSTM for Arabic Dialect Identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 28 July–2 August 2019; pp. 274–278. [Google Scholar]
- Alhazzani, N.Z.; Al-Turaiki, I.M.; Alkhodair, S.A. Text Classification of Patient Experience Comments in Saudi Dialect Using Deep Learning Techniques. Appl. Sci. 2023, 13, 10305. [Google Scholar] [CrossRef]
- De Francony, G.; Guichard, V.; Joshi, P.; Afli, H.; Bouchekif, A. Hierarchical deep learning for Arabic dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, Florence, Italy, 28 July–2 August 2019; pp. 249–253. [Google Scholar]
- Lulu, L.; Elnagar, A. Automatic Arabic dialect classification using deep learning models. Procedia Comput. Sci. 2018, 142, 262–269. [Google Scholar] [CrossRef]
- Althobaiti, M.J. Country-level Arabic dialect identification using small datasets with integrated machine learning techniques and deep learning models. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kiev, Ukraine, 19 April 2021; pp. 265–270. [Google Scholar]
- Mansour, M.; Tohamy, M.; Ezzat, Z.; Torki, M. Arabic dialect identification using BERT fine-tuning. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, Barcelona, Spain, 12 December 2020; pp. 308–312. [Google Scholar]
- Yahya, A.E.; Gharbi, A.; Yafooz, W.M.; Al-Dhaqm, A. A Novel Hybrid Deep Learning Model for Detecting and Classifying Non-Functional Requirements of Mobile Apps Issues. Electronics 2023, 12, 1258. [Google Scholar] [CrossRef]
- Abdul-Mageed, M.; Zhang, C.; Elmadany, A.; Bouamor, H.; Habash, N. NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task. arXiv 2022, arXiv:2210.09582. [Google Scholar]
- Abdelali, A.; Mubarak, H.; Samih, Y.; Hassan, S.; Darwish, K. Arabic dialect identification in the wild. arXiv 2020, arXiv:2005.06557. [Google Scholar]
- Alghamdi, A.; Alshutayri, A.; Alharbi, B. Deep Bidirectional Transformers for Arabic Dialect Identification. In Proceedings of the 6th International Conference on Future Networks & Distributed Systems, Tashkent, Uzbekistan, 15 December 2022; pp. 265–272. [Google Scholar]
- Attieh, J.; Hassan, F. Arabic Dialect Identification and Sentiment Classification using Transformer-based Models. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 485–490. [Google Scholar]
- Fsih, E.; Kchaou, S.; Boujelbane, R.; Belguith, L.H. Benchmarking transfer learning approaches for sentiment analysis of Arabic dialect. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 431–435. [Google Scholar]
- Messaoudi, A.; Fourati, C.; Haddad, H.; BenHajhmida, M. iCompass Working Notes for the Nuanced Arabic Dialect Identification Shared task. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 415–419. [Google Scholar]
- Talafha, B.; Ali, M.; Za’ter, M.E.; Seelawi, H.; Tuffaha, I.; Samir, M.; Farhan, W.; Al-Natsheh, H.T. Multi-dialect arabic bert for country-level dialect identification. arXiv 2020, arXiv:2007.05612. [Google Scholar]
- Bayrak, G.; Issifu, A.M. Domain-Adapted BERT-based Models for Nuanced Arabic Dialect Identification and Tweet Sentiment Analysis. In Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP), Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 425–430. [Google Scholar]
- Beltagy, A.; Wael, A.; ElSherief, O. Arabic dialect identification using bert-based domain adaptation. arXiv 2020, arXiv:2011.06977. [Google Scholar]
- Bouamor, H.; Habash, N.; Salameh, M.; Zaghouani, W.; Rambow, O.; Abdulrahim, D.; Obeid, O.; Khalifa, S.; Eryani, F.; Erdmann, A.; et al. The madar arabic dialect corpus and lexicon. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Abdul-Mageed, M.; Zhang, C.; Bouamor, H.; Habash, N. NADI 2020: The first nuanced Arabic dialect identification shared task. arXiv 2020, arXiv:2010.11334. [Google Scholar]
- Abdul-Mageed, M.; Zhang, C.; Elmadany, A.; Bouamor, H.; Habash, N. NADI 2021: The second nuanced Arabic dialect identification shared task. arXiv 2021, arXiv:2103.08466. [Google Scholar]
- Abdul-Mageed, M.; Elmadany, A.; Zhang, C.; Nagoudi, E.M.B.; Bouamor, H.; Habash, N. NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task. arXiv 2023, arXiv:2310.16117. [Google Scholar]
- Abdelali, A.; Mubarak, H.; Samih, Y.; Hassan, S.; Darwish, K. QADI: Arabic dialect identification in the wild. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kiev, Ukraine, 19 April 2021; pp. 1–10. [Google Scholar]
- Bouamor, H.; Habash, N.; Oflazer, K. A Multidialectal Parallel Corpus of Arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 1240–1245. [Google Scholar]
- Alsarsour, I.; Mohamed, E.; Suwaileh, R.; Elsayed, T. Dart: A large dataset of dialectal arabic tweets. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Althobaiti, M.J. Automatic Arabic dialect identification systems for written texts: A survey. arXiv 2020, arXiv:2009.12622. [Google Scholar]
- Etman, A.; Beex, A.L. Language and dialect identification: A survey. In Proceedings of the 2015 SAI intelligent systems conference (IntelliSys), London, UK, 10–11 November 2015; IEEE: Piscataway, NJ, USA; pp. 220–231. [Google Scholar]
- Harrat, S.; Meftouh, K.; Smaïli, K. Maghrebi Arabic dialect processing: An overview. J. Int. Sci. Gen. Appl. 2018, 1, 38. [Google Scholar]
- Harrat, S.; Meftouh, K.; Smaili, K. Machine translation for Arabic dialects (survey). Inf. Process. Manag. 2019, 56, 262–273. [Google Scholar] [CrossRef]
- Elnagar, A.; Yagi, S.M.; Nassif, A.B.; Shahin, I.; Salloum, S.A. Systematic literature review of dialectal Arabic: Identification and detection. IEEE Access 2021, 9, 31010–31042. [Google Scholar] [CrossRef]
- Issa, E.; AlShakhori, M.; Al-Bahrani, R.; Hahn-Powell, G. Country-level Arabic dialect identification using RNNs with and without linguistic features. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, Kiev, Ukraine, 19 April 2021; pp. 276–281. [Google Scholar]
- Baimukan, N.; Bouamor, H.; Habash, N. Hierarchical aggregation of dialectal data for Arabic dialect identification. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; pp. 4586–4596. [Google Scholar]
- Obeid, O.; Inoue, G.; Habash, N. Camelira: An Arabic multi-dialect morphological disambiguator. arXiv 2022, arXiv:2211.16807. [Google Scholar]
- Tzudir, M.; Baghel, S.; Sarmah, P.; Prasanna, S.R.M. Analyzing RMFCC Feature for Dialect Identification in Ao, an Under-Resourced Language. In Proceedings of the 2022 National Conference on Communications (NCC), Mumbai, India, 24–27 May 2022; IEEE: Piscataway, NJ, USA; pp. 308–313. [Google Scholar]
- Shon, S.; Ali, A.; Samih, Y.; Mubarak, H.; Glass, J. ADI17: A fine-grained Arabic dialect identification dataset. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA; pp. 8244–8248. [Google Scholar]
- Rong, X. word2vec parameter learning explained. arXiv 2014, arXiv:1411.2738. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D Nonlinear Phenom. 2020, 404, 132306. [Google Scholar] [CrossRef]
- Zhang, S.; Zheng, D.; Hu, X.; Yang, M. Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, Shanghai, China, 30 October–1 November 2015; pp. 73–78. [Google Scholar]
- Liu, G.; Guo, J. Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing 2019, 337, 325–338. [Google Scholar] [CrossRef]
- Jang, B.; Kim, M.; Harerimana, G.; Kang, S.U.; Kim, J.W. Bi-LSTM model to increase accuracy in text classification: Combining Word2vec CNN and attention mechanism. Appl. Sci. 2020, 10, 5841. [Google Scholar] [CrossRef]
- Bae, K.; Ryu, H.; Shin, H. Does Adam optimizer keep close to the optimal point? arXiv 2019, arXiv:1911.00289. [Google Scholar]
- Şen, S.Y.; Özkurt, N. Convolutional neural network hyperparameter tuning with adam optimizer for ECG classification. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; IEEE: Piscataway, NJ, USA; pp. 1–6. [Google Scholar]
- Aghaebrahimian, A.; Cieliebak, M. Hyperparameter tuning for deep learning in natural language processing. In Proceedings of the 4th Swiss Text Analytics Conference (Swisstext 2019), Winterthur, Switzerland, 18–19 June 2019. [Google Scholar]
- Yafooz, W.; Alsaeedi, A. Leveraging User-Generated Comments and Fused BiLSTM Models to Detect and Predict Issues with Mobile Apps. Comput. Mater. Contin. 2024, 79, 735–759. [Google Scholar] [CrossRef]
- Sari, W.K.; Rini, D.P.; Malik, R.F. Text Classification Using Long Short-Term Memory with GloVe. J. Ilm. Tek. Elektro Komput. Dan Inform. (JITEKI) 2019, 5, 85–100. [Google Scholar] [CrossRef]
- Ruby, U.; Yendapalli, V. Binary cross entropy with deep learning technique for image classification. Int. J. Adv. Trends Comput. Sci. Eng. 2020, 9, 5393–5397. [Google Scholar]
- Zhang, C.; Woodland, P.C. Parameterised sigmoid and ReLU hidden activation functions for DNN acoustic modelling. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015. [Google Scholar]
- Abdul-Mageed, M.; Elmadany, A.; Nagoudi, E.M.B. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. arXiv 2020, arXiv:2101.01785. [Google Scholar]
- Antoun, W.; Baly, F.; Hajj, H. Arabert: Transformer-based model for arabic language understanding. arXiv 2020, arXiv:2003.00104. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
- Pires, T.; Schlinger, E.; Garrette, D. How multilingual is multilingual BERT? arXiv 2019, arXiv:1906.01502. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Author(s) | Techniques | Dataset Size | Dialect | Platform | Classifiers |
---|---|---|---|---|---|
[11] | Machine Learning | Voice recordings | Egyptian, Gulf, Levantine, and North African. | Al-Jazeera | SVM |
[12] | Machine Learning | 50k Tweets | Algeria, Egypt, Lebanon, Tunisia and Morocco. | | SGD Classifier |
[13] | Machine Learning | 20K Tweets | | | TF/IDF, MNB, CNB, SVM, KNN, DT, RF, MLP |
[14] | Machine Learning | 16,494 Sentences | Egypt, North Africa, Gulf and Levant, MSA | AOC dataset | KNN, NB, SVM |
[15] | Machine learning | 2000 Sentences | MSA, Egyptian, Syrian, Jordanian, Palestinian, Tunisian | Multidialectal Parallel Corpus of Arabic (MPCA) | N/A |
[16] | Machine Learning | Voice Recordings | EGY, GLF, LAV, and North African or NOR | Arabic broadcast speech | KRR |
[17] | Deep Learning | Audio segments | African American English | CORAAL database | XGBoost model |
[18] | Machine Learning | 7000 words | Sorani, Kurmanji (Kurdish Dialects) | vocabulary words | Adaptation of SVM |
[31] | Deep Learning | 31k Tweets | Maghreb, Egypt, Gulf, and Levant | | MTL |
[25] | Deep Learning | 2000 MSA sentences and 540k Tweets | Algerian, Bahraini, Djiboutian, Egyptian, Iraqi, Jordanian, Kuwaiti, Lebanese, Libyan, Mauritanian, Moroccan, Omani, Palestinian, Qatari, Saudi, Somali, Sudanese, Syrian, Tunisian, Emirati, and Yemeni. | | LSTM |
[26] | Deep Learning | 7135 Documents | Arabic | Khaleej-2004 Corpus Dataset and newspaper articles | feed-forward DL neural network model |
[27] | Machine Learning and Deep Learning | 3768 Sentences | Hijazi, Najdi, Janobi, Hasawi (Saudi Dialects) | blogs, discussion forums, and reader commentaries | SVM, LR, SGDC, CNN |
[28] | Deep learning | 34 h | Egyptian, Gulf, and Levantine | 52 Volunteer Participants | Gaussian NB, SVM, RNN and CNN-RNN |
[41] | Transform Learning | 10 M Tweets | Egyptian, Iraqi, Jordanian, Saudi, Kuwaiti, Omani, Palestinian, Qatar, UAE, Yemen | | MARBERT |
[42] | Transform | 540k Tweets | Emirati, Bahraini, Djiboutian, Egyptian, Iraqi, Jordanian, Kuwaiti, Lebanese, Libyan, Mauritanian, Omani, Palestinian, Qatari, Saudi, Sudanese, Syrian, Tunisian, and Yemeni. | | AraBERT |
[43] | Transform and deep learning | 100k Records, 1.4 M Records | Gulf, Iraqi, Egyptian, Levantine, and North Africa dialects. | AOC dataset, SMADC dataset | MARBERT, ARBERT |
[46] | Deep Learning and transform learning | 25,269 Sentences | Arab World | NADI | Arabic BERT, MARBERT |
[58] | Language model/feature extraction | 6300 sentences of speech | American English Dialects (Southern and South midlands) | Voice Recordings | backend logistic classifier |
[63] | Hierarchical aggregation | N/A | Levantine, Gulf, Iraqi, Omani, Egyptian, North African, Yemeni | Twitter and voice speech MADAR | LM |
[65] | Statistical analysis | 36 Human Speakers | Chungli, Changki, Mongsen (Nagaland) | Speech and Written text for Chungli dialect only | GMM |
[7] | semi-supervised learning | 11.8k Sentences | Egyptian, Gulf, Iraqi, Levantine, Maghrebi | AOC Corpus and Facebook | N/A |
[66] | Semi/Unsupervised | 3000 h of speech | Algerian, Egyptian, Iraqi, Jordanian, Saudi, Kuwaiti, Lebanese, Libyan, Mauritanian, Moroccan, Omani, Palestinian, Qatari, Sudanese, Syrian, Emirati, Yemeni | YouTube | N/A |
Dialect | Size | Min (Words) | Max (Words) |
---|---|---|---|
Egypt | 9461 | 1 | 174 |
Jordan | 7705 | 1 | 496 |
Yemen | 7238 | 1 | 42 |
Gulf | 10,501 | 1 | 74 |
Total | 34,905 |
Dialect | Example in Arabic | Example in English |
---|---|---|
Egypt | يا عم ازيك و انت ليه اساسا تكلمني انجليزي و احنا مصريين زي بعض | Uncle, how are you, and why do you even speak English to me when we are Egyptians like each other? |
Jordan | يا زلمة والله إنك ولد | Oh man, by God, you are a boy |
Yemen | اشتى اعرف انتوا فين بتروح اليوم | I want to know where you go today |
Gulf | وايش تبغا منه | What do you want from him? |
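Parameters | LSTM | BiLSTM | Proposed Model |
---|---|---|---|
Cost function | categorical_crossentropy | categorical_crossentropy | categorical_crossentropy |
Optimizer | adam | adam | adam |
Input shape | 10,000 for TF-IDF and 100 for others | 10,000 for TF-IDF and 100 for others | 10,000 for TF-IDF and 100 for others |
Batch size | 32 | 16 | 16 |
Epochs | 70 | 70 | 50 |
Activation function (Hidden layer) | ReLU [79] | ReLU [79] | ReLU [79] |
Activation function (Output layer) | Softmax | Softmax | Softmax |
Dropout | 10–15% | 10–15% | 10–15% |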
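To make the output-layer settings in the table above concrete, the following sketch computes a softmax over four dialect scores and the categorical cross-entropy against a one-hot label. It is a plain-Python illustration of the two formulas, not the training code used in the paper; the function names and the `eps` clipping constant are assumptions.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot, probs, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Four output units, one per dialect (Egyptian, Gulf, Jordanian, Yemeni).
probs = softmax([2.0, 1.0, 0.5, 0.1])
loss = categorical_crossentropy([1, 0, 0, 0], probs)
print(probs, loss)
```

With equal scores for all four dialects, each probability is 0.25 and the loss equals ln 4, which is the value an untrained four-class model would start near.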
Model | TF-IDF | Word2Vec | GloVe |
---|---|---|---|
LR | 82.95% | 77.89% | 77.89% |
LSTM | 79.76% | 79.30% | 79.55% |
BiLSTM | 80.61% | 79.46% | 80.79% |
Hybrid | 82.96% | 81.67% | 80.18% |
Hybrid with attention | 83.31% | 81.31% | 81.22% |
Word Representation | Word2Vec | GloVe |
---|---|---|
LSTM (CBOW) | 77.2% | - |
LSTM (Skip-gram) | 78.12% | - |
BiLSTM (CBOW) | 79.01% | - |
BiLSTM (Skip-gram) | 80.67% | - |
LSTM | - | 81.23% |
BiLSTM | - | 82.45% |
Hybrid (CBOW) | 84.12% | - |
Hybrid (Skip-gram) | 86.01% | - |
Hybrid with attention (CBOW) | 85.11% | - |
Hybrid with attention (Skip-gram) | 88.73% | - |
Hybrid | - | 86.12% |
Hybrid with attention | - | 88.21% |
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Yafooz, W.M.S. Enhancing Arabic Dialect Detection on Social Media: A Hybrid Model with an Attention Mechanism. Information 2024, 15, 316. https://doi.org/10.3390/info15060316