arXiv:2111.13138v1 [cs.CL] 25 Nov 2021

TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

Abir Messaoudi, Hatem Haddad, Moez BenHajhmida, Malek Naski (iCompass)
Ahmed Cheikhrouhou, Nourchene Ferchichi, Abir Korched, Faten Ghriss, Amine Kerkeni (InstaDeep)

Abstract

Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine understandable. After the breakthrough of the attention mechanism, a new generation of pretrained models has been proposed, achieving good performance since the introduction of the Transformer. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite this success, most of the available models have been trained on Indo-European languages, and similar research for under-represented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for under-represented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on three downstream tasks: sentiment analysis, dialect identification and reading comprehension question-answering. We show that using noisy web-crawled data instead of structured data (Wikipedia, articles, etc.) is better suited to such a non-standardized language. Moreover, results indicate that a relatively small web-crawled dataset leads to performance as good as that obtained with larger datasets. Finally, our best performing TunBERT model reaches or improves the state of the art in all three downstream tasks. We release the TunBERT pretrained model and the datasets used for fine-tuning1.

1 To preserve anonymity, a link to the Github repository will be added to the camera-ready version if the paper is accepted.

1 Introduction

In the last decade, natural language understanding has gained interest owing to the available hardware and data resources and to the evolution of pretrained contextualized text representation models. These models learn an effective representation of a natural language to make it machine understandable. Word2Vec (Mikolov et al., 2013) was one of the first proposed approaches, in which words were represented according to their semantic properties. ELMo (Peters et al., 2018) then combined word embeddings with a BiLSTM in order to deal with the polysemy problem. Afterwards, the pretrain-then-fine-tune paradigm was first proposed with ULMFiT (Howard and Ruder, 2018), where pretrained models were fine-tuned for downstream tasks. These models achieved good performance but did not capture long-range and multiple contexts of words. After the breakthrough of the attention mechanism (Vaswani et al., 2017), a new generation of pretrained models appeared and achieved remarkable performance since the introduction of the Transformer (Radford, 2018). In particular, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has become the state-of-the-art model for language understanding and gave new inspiration to further development in the Natural Language Processing (NLP) field. Accordingly, most languages have their own BERT-based language models. Specifically, the Arabic language has multiple language models: AraBERT (Antoun et al., 2020), GigaBERT (Wuwei et al., 2020), and the multilingual cased BERT model (hereafter mBERT) (Pires et al., 2019), which was simultaneously pretrained on 104 languages.
The Arabic language has more than 300 million native speakers around the world and is used as a native language in 26 countries. The official form of Arabic is called Modern Standard Arabic (MSA). However, each country has one or more locally spoken Arabic varieties, called dialects. People in Tunisia use the Tunisian dialect (Fourati et al., 2020) in their daily communications, in most of their media (TV, radio, songs, etc.), and on the internet (social media, forums). Yet, this dialect is not standardized: there is no single convention for writing and speaking it. In addition, it has its own lexicon, phonetics, and morphological structure, as shown in Table 1. A robust language model for the Tunisian dialect has therefore become crucial for developing NLP-based applications (translation, information retrieval, sentiment analysis, etc.). To the best of our knowledge, no such model has been proposed in the literature yet.

In this paper, we describe the process of pretraining a PyTorch implementation of the NVIDIA BERT language model2, called TunBERT (Tunisian BERT), trained on only a 67.2 MB web-scraped dataset. We systematically compare our pretrained model on three NLP downstream tasks of different nature, (i) Sentiment Analysis (SA), (ii) Tunisian Dialect Identification (TDI), and (iii) Reading Comprehension Question-Answering (RCQA), against mBERT (Devlin et al., 2019), AraBERT (Antoun et al., 2020), GigaBERT (Wuwei et al., 2020) and the state-of-the-art performance when available.

Our contributions can be summarized as follows:

• First release of a pretrained BERT model for the Tunisian dialect using a Tunisian large-scale web-scraped dataset.

• Application of TunBERT to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA).

• Empirical evaluations illustrating that a small and diverse Tunisian training dataset can achieve performance comparable to several baselines, including previous multilingual and single-language approaches trained on large-scale corpora.

• Public release of TunBERT and the used datasets on popular NLP libraries3.

The rest of the paper is structured as follows. Section 2 provides a concise literature review of previous work on monolingual and multilingual language representation. Section 3 describes the methodology used to develop TunBERT. Section 4 describes the downstream tasks and benchmark datasets that were used for evaluation. Section 5 presents the experimental setup and discusses the results. Finally, Section 6 concludes and points to possible directions for future work.

2 https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
3 To preserve anonymity, a link to the Github repository will be added to the camera-ready version if the paper is accepted.

2 Related Works

Contextualized word representations, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and more recently ALBERT (Lan et al., 2020), improved the representational power of word embeddings such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) by taking context into account. Following their success, large pretrained language models were extended to the multilingual setting, such as mBERT (Pires et al., 2019).
In (Conneau and Lample, 2019), the authors showed that multilingual models can obtain results competitive with monolingual models by leveraging higher-quality data from other languages on specific downstream tasks. Nevertheless, these models use large-scale pretraining corpora and consequently require high computational cost. Recently, non-English monolingual models have been released: RobBERT for Dutch (Delobelle et al., 2020), FlauBERT (Le et al., 2020) and CamemBERT (Martin et al., 2020) for French, (Canete et al., 2020) for Spanish and (Virtanen et al., 2019) for Finnish. In (Martin et al., 2020), the authors showed that their French model trained on 4 GB of data performed similarly to the same model trained on 138 GB. They also concluded that a model trained on a CommonCrawl-based corpus performed consistently better than one trained on the French Wikipedia. They suggested that a 4 GB dataset that is heterogeneous in genre and style is large enough as a pretraining dataset to reach state-of-the-art results with the BASE architecture, better than those obtained with mBERT (pretrained on 60 GB of text). In (Virtanen et al., 2019), a Finnish BERT model trained from scratch outperformed mBERT on three reference tasks (part-of-speech tagging, named entity recognition, and dependency parsing). The authors suggested that language-specific deep transfer learning models for lower-resourced languages can outperform multilingual BERT models.

Table 1: Examples of Tunisian sentences with their MSA and English translations.
Tunisian | MSA | English
محلاها هالغناية | ما أحلى هذه الأغنية | How nice is this song
ماتعجبنيش كيفاش تتصرف | لا تعجبني تصرفاتها | I don't like how she behaves
وقتاه يبدا الماتش | متى تبدأ المباراة | When does the match start

Compared to the increasing number of studies of contextualized word representations in Indo-European languages, similar research for the Arabic language is still very limited. AraBERT (Antoun et al., 2020), a BERT-based model, was released using a pretraining dataset of 70 million sentences, corresponding to 24 GB of text covering news from different Arab media. AraBERT was pretrained on a TPUv2-8 pod for 1,250K steps. It achieved state-of-the-art performance on three Arabic tasks: Sentiment Analysis, Named Entity Recognition, and Question Answering. Nevertheless, its pretraining dataset is mostly MSA-based. The authors concluded that there is a need for pretrained models that can tackle a variety of Arabic dialects. Lately, GigaBERT (Wuwei et al., 2020), a customized bilingual language model for English and Arabic, has outperformed AraBERT in several downstream tasks.

3 TunBERT

In this section, we describe the training setup and the pretraining data used for TunBERT.

3.1 Training Setup

The TunBERT model is based on the PyTorch implementation of NVIDIA NeMo BERT4. The model was pretrained using 4 NVIDIA Tesla V100 GPUs for 1280K steps. The pretrained model characteristics are shown in Table 3. The Adam optimizer was used, with a learning rate of 1e-4, a batch size of 128, a maximum sequence length of 128 and a masking probability of 15%. Cosine annealing was used for learning rate scheduling with a warm-up ratio of 0.01. Training took 122 hours and 25 minutes, corresponding to 330 epochs over all the tokens.

4 https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
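For illustration, this configuration can also be expressed with the Hugging Face transformers API. This is only a rough sketch: the authors used the NVIDIA NeMo BERT implementation, and the tokenizer path, output directory and per-device batch split below are assumptions made for the example, not details reported in the paper.

# Illustrative sketch of the pretraining configuration (not the authors' NeMo code).
from transformers import (BertConfig, BertForPreTraining, BertTokenizerFast,
                          DataCollatorForLanguageModeling, TrainingArguments)

config = BertConfig(
    num_hidden_layers=12,        # Table 3: 12 layers
    hidden_size=768,             # Table 3: hidden size 768
    num_attention_heads=12,      # Table 3: 12 self-attention heads
)
model = BertForPreTraining(config)   # randomly initialized, with MLM and NSP heads

# Hypothetical WordPiece tokenizer trained on the Tunisian web-scraped corpus.
tokenizer = BertTokenizerFast.from_pretrained("path/to/tunisian-tokenizer")

# 15% of tokens are selected for masking, as stated above.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(
    output_dir="tunbert-pretraining",
    max_steps=1_280_000,               # 1280K steps
    per_device_train_batch_size=32,    # assumed split: 4 GPUs x 32 = global batch 128
    learning_rate=1e-4,                # Adam
    lr_scheduler_type="cosine",        # cosine annealing
    warmup_ratio=0.01,
)
# Sequences are truncated to 128 tokens during tokenization; NSP sentence-pair
# construction and the training loop itself are omitted for brevity.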
The model was trained on two unsupervised prediction tasks using a large Tunisian text corpus: the Masked Language Modeling (MLM) task and the Next Sentence Prediction (NSP) task. For the MLM task, 15% of the words in each sequence are replaced with a [MASK] token, and the model attempts to predict the original masked token based on the context of the non-masked tokens in the sequence. For the NSP task, pairs of sentences are provided to the model, which has to predict whether the second sentence is the subsequent sentence in the original document. In this task, 50% of the sentence pairs are subsequent to each other in the original document; for the remaining 50%, a random sentence from the corpus is paired with the first sentence.

Table 3: Pretrained model configuration.
#Layers | Hidden size | #Self-attention heads
12 | 768 | 12

3.2 Pre-training Dataset

Because of the lack of available Tunisian dialect data (books, Wikipedia, etc.), we use a web-scraped dataset extracted from social media, blogs and websites, consisting of 500k sentences of text, to pretrain the model. The extracted data was preprocessed by removing links, emoji and punctuation symbols. Then, a filter was applied to ensure that only Arabic script is included. Pretraining dataset statistics are presented in Table 2. The training dataset size is 67.2 MB.

Table 2: Pretraining dataset statistics.
#Unique words | #Words | #Sentences
8,256K | 48,233K | 500K
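As a rough illustration of the cleaning step described above, the snippet below removes links, strips emoji, punctuation and other non-Arabic characters, and keeps only sentences that still contain Arabic script. The exact rules used by the authors are not reported, so the regular expressions and the example sentence are assumptions.

import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep the Arabic block plus Arabic presentation forms; strip everything else
# (Latin letters, digits, punctuation, emoji) except whitespace.
NON_ARABIC_RE = re.compile(r"[^\u0600-\u06FF\uFB50-\uFDFF\uFE70-\uFEFF\s]")

def clean_sentence(text: str) -> str:
    """Remove links, then drop emoji, punctuation and non-Arabic characters."""
    text = URL_RE.sub(" ", text)
    text = NON_ARABIC_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def keep(text: str) -> bool:
    """Keep only sentences that still contain Arabic script after cleaning."""
    return bool(re.search(r"[\u0600-\u06FF]", text))

raw = "جملة تونسية مع رابط http://example.com و emoji 😀 !!"   # made-up example
cleaned = clean_sentence(raw)
if keep(cleaned):
    print(cleaned)   # -> "جملة تونسية مع رابط و"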
4 Evaluation

We measure the performance of TunBERT by evaluating it on three tasks: sentiment analysis, dialect identification and reading comprehension question-answering. Fine-tuning was done independently, using the same configuration for all tasks. We did not run an extensive grid search to choose the best hyper-parameters, due to computational and time constraints; instead, we applied a configuration commonly used in the literature. We use the splits provided by the dataset authors when available and a standard 80%/20% split when not.

4.1 Sentiment Analysis

For the sentiment analysis task, we used two manually annotated Tunisian sentiment analysis datasets:

• The Tunisian Sentiment Analysis Corpus (TSAC) (Medhaffar et al., 2017), obtained from Facebook comments about popular TV shows. The TSAC dataset is composed of comments in Latin script, Arabic script and emoticons. We use only the Arabic-script comments.

• The Tunisian Election Corpus (TEC) (Sayadi et al., 2016), obtained from tweets about the 2014 Tunisian elections. Besides Tunisian dialect content, the TEC dataset also contains MSA content.

Statistics of TSAC and TEC are shown in Table 4.

Table 4: TSAC and TEC sentiment analysis dataset statistics.
Dataset | #Negative | #Positive | #Train | #Dev | #Test
TSAC | 4175 | 3277 | 4680 | 1170 | 1516
TEC | 1799 | 1244 | 1947 | 487 | 609

4.2 Tunisian Dialect Identification

This task focuses on identifying the Tunisian dialect of a given text among other Arabic dialects, especially on social media sources where there is no established standard orthography as in MSA. First attempts to tackle this challenge identified five Arabic dialect categories in addition to MSA: Maghrebi, Egyptian, Levantine, Gulf, and Iraqi (Zaidan and Callison-Burch, 2011). (El-Haj et al., 2018) proposed four Arabic dialect categories by merging Iraqi with Gulf. The Tunisian dialect was classified into the Maghrebi dialect along with Algerian, Moroccan, and other dialects. Nevertheless, even if the Maghrebi vocabulary is largely shared across North African countries, many differences exist, not only at the phonetic level (Harrat et al., 2018) but also at the lexical, morphological and syntactic levels (Horesh, 2019).

For evaluation, two sub-tasks were performed:

• Identification of the Tunisian dialect among other Arabic dialects (TADI): this is a binary classification task (Tunisian dialect versus non-Tunisian dialect) over an Arabic dialectal dataset. We used the Nuanced Arabic Dialect Identification (NADI) shared task dataset, with a total of 21,000 tweets covering 21 Arab countries. NADI is an imbalanced dataset in which the training set includes only 747 Tunisian tweets, the remaining tweets covering other dialects. To address this imbalance, we created a new dataset, TADI (Tunisian and Arabic Dialect Identification), by including a subset of the TSAC dataset as Tunisian comments so that the Tunisian dialect has the same number of examples as the other dialects, as shown in Table 5.

• Identification of the Tunisian dialect versus the Algerian dialect (TAD): for this sub-task, we used the Multi-Arabic Dialect Applications and Resources (MADAR) dataset (Bouamor et al., 2018). More specifically, we used the shared task dataset targeting a large set of country-level dialect labels (Bouamor et al., 2019) and filtered it to keep only the Tunisian and Algerian labeled data, as shown in Table 5.

Table 5: TADI and TAD dataset statistics.
Dataset | #Train | #Dev | #Test
TADI | 40500 | 2396 | 7192
TAD | 3200 | 400 | 400

4.3 Reading Comprehension Question-Answering

The open-domain Question-Answering (QA) task has been intensively studied to evaluate the language understanding capabilities of models. This task takes a textual question as input and looks for the corresponding answer within a large textual corpus. In (Mozannar et al., 2019), two MSA QA datasets were proposed. However, to the best of our knowledge, no study had previously addressed this task for any Arabic dialect. For this task, we built TRCD (Tunisian Reading Comprehension Dataset) as a question-answering dataset for the Tunisian dialect. We used a dialectal version of the Tunisian constitution, following the guidelines in (Chen et al., 2017). It is composed of 144 documents, where each document has exactly three paragraphs and three question-answer pairs are assigned to each paragraph. Questions were formulated by four native-speaker annotators, and each question is paired with a paragraph, as shown in Figure 1. To the best of our knowledge, this is the first Tunisian dialect dataset for the question-answering task. TRCD dataset statistics are shown in Table 6.

Figure 1: TRCD dataset example with its corresponding English translation.

Table 6: TRCD statistics.
 | #Train | #Dev | #Test
#Documents | 114 | 15 | 15
#Paragraphs | 342 | 45 | 45
#QA pairs | 1026 | 135 | 135
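The paper does not list the fine-tuning code, but the classification tasks above (sentiment analysis, TADI, TAD) correspond to standard sequence classification fine-tuning. The sketch below shows one plausible setup using the Hugging Face transformers API; the checkpoint path, CSV files, column names and hyperparameters are placeholders we assume for illustration, not the authors' actual NeMo-based pipeline.

# Illustrative fine-tuning sketch for the classification tasks (SA, TADI, TAD).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/tunbert-pretrained"            # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Hypothetical CSV files with "text" and "label" columns (e.g. a TSAC or TADI split).
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                batched=True)

args = TrainingArguments(output_dir="tunbert-finetuned",
                         learning_rate=2e-5,           # assumed value, not reported
                         per_device_train_batch_size=16,
                         num_train_epochs=3)
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=data["train"], eval_dataset=data["validation"])
trainer.train()
print(trainer.evaluate())   # loss on the dev split

The same recipe applies to each dataset independently, which matches the statement above that fine-tuning used one common configuration across tasks.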
5 Experiments and Discussion

5.1 Tunisian Sentiment Analysis

The efficiency of the TunBERT language model was evaluated against the mBERT, AraBERT and GigaBERT language models and against state-of-the-art performance when available. The performance of Tunisian sentiment analysis using TunBERT was further compared against the baseline systems that tackled the same datasets (word embeddings (word2vec), document embeddings (doc2vec) and Tw-StAR (Mulki et al., 2020)), as listed in Table 7 and Table 8.

Table 7: TSAC results.
Model | Accuracy | F1.macro
(Medhaffar et al., 2017) | 78% | 78%
word2vec (Mulki et al., 2020) | 77.4% | 78.2%
doc2vec (Mulki et al., 2020) | 57.2% | 61.7%
Tw-StAR (Mulki et al., 2020) | 86.5% | 86.2%
mBERT | 92.21% | 91.03%
GigaBERT | 94.92% | 93.39%
AraBERT | 95.63% | 94.91%
TunBERT | 96.98% | 96.98%

The results in Table 7 illustrate the outperformance of the pretrained contextualized text representation models over the previous techniques, namely word2vec and doc2vec. TunBERT achieved the best performance on the TSAC dataset. It reached 92.98% F1.macro, a high result compared to the 78.2%, 61.7% and 86.2% scored by word2vec, doc2vec and Tw-StAR, respectively. The results also show that TunBERT outperforms the pretrained language models mBERT, GigaBERT and AraBERT.

Table 8: TEC results.
Model | Accuracy | F1.macro
(Sayadi et al., 2016) | 71.1% | 63%
word2vec (Mulki et al., 2020) | 61.9% | 58.4%
doc2vec (Mulki et al., 2020) | 62.2% | 56.4%
Tw-StAR (Mulki et al., 2020) | 88.2% | 87.8%
mBERT | 58.45% | 36.89%
GigaBERT | 71.75% | 65.32%
AraBERT | 79.14% | 72.57%
TunBERT | 81.2% | 76.45%

Likewise, Table 8 shows that the BERT-based language models outperform word2vec and doc2vec on the TEC dataset. Nevertheless, the best performance was achieved by Tw-StAR, with an F1.macro of 87.8% compared to 76.45% and 72.57% scored by TunBERT and AraBERT, respectively. This could be explained by the noisy nature of the TEC dataset, with its mixed Tunisian and MSA content. The fact that mBERT achieved the worst performance may indicate that mBERT is not suitable for noisy data. The results also show that TunBERT outperforms the other pretrained language models.

5.2 Tunisian Dialect Identification

For Tunisian dialect identification, the results in Table 9 show that the TunBERT language model outperforms the other state-of-the-art language models. Indeed, our model achieved an F1.macro of 87.14%, compared to 68.93% for mBERT. TunBERT also outperforms the Arabic pretrained BERT model AraBERT. Likewise, it achieved an F1.macro of 93.25% on the Tunisian-Algerian dialect identification task, outperforming the other language models, as shown in Table 10.

Table 9: TADI results.
Model | Accuracy | F1.macro
mBERT | 75.21% | 68.93%
AraBERT | 79.57% | 76.7%
GigaBERT | 72.67% | 65.3%
TunBERT | 87.46% | 87.14%

Table 10: TAD results.
Model | Accuracy | F1.macro
mBERT | 86.75% | 86.4%
AraBERT | 87.5% | 87.37%
GigaBERT | – | 0%
TunBERT | 93.3% | 93.25%

5.3 Reading Comprehension Question-Answering

Fine-tuning TunBERT on the Tunisian Reading Comprehension Dataset did not give impressive results (exact match of 2.17%, F1 score of 13.66% and recall of 22.59%). Comparable results were obtained for GigaBERT (exact match of 0.7%, F1 score of 14.02% and recall of 21.65%). mBERT gave slightly better results (exact match of 4.25%, F1 score of 22.6% and recall of 31.3%). Meanwhile, we observed good results for AraBERT (exact match of 26.24%, F1 score of 58.74% and recall of 63.96%).

Adding a pre-training step on an MSA reading comprehension dataset (in our case the Arabic-SQuAD dataset (Mozannar et al., 2019)) led to large improvements for all models, especially for TunBERT. The strategy was to take the pretrained language model, fine-tune it for a few epochs on the MSA dataset, and then use the best checkpoint to train and test on the TRCD dataset. Following this strategy, TunBERT achieved strong results, with an exact match of 27.65%, an F1 score of 60.24% and a recall of 82.36%, as shown in Table 11.
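The two-stage strategy described above can be sketched as follows. As before, the Hugging Face transformers API stands in for the authors' NeMo pipeline, and the checkpoint path, epoch counts, learning rate and the pre-tokenized dataset variables are placeholders assumed for the example.

# Sketch of the two-stage QA fine-tuning: a few epochs on the MSA Arabic-SQuAD
# data, then continued training on the dialectal TRCD data.
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "path/to/tunbert-pretrained"   # hypothetical pretrained checkpoint

# Placeholders for datasets already tokenized into SQuAD-style features
# (input_ids, attention_mask, start_positions, end_positions).
arabic_squad_features = ...   # tokenized Arabic-SQuAD training set
trcd_train_features = ...     # tokenized TRCD training set

def finetune(init_checkpoint, train_dataset, output_dir, epochs):
    """Fine-tune a span-extraction QA head starting from `init_checkpoint`."""
    tokenizer = AutoTokenizer.from_pretrained(init_checkpoint)
    model = AutoModelForQuestionAnswering.from_pretrained(init_checkpoint)
    args = TrainingArguments(output_dir=output_dir,
                             num_train_epochs=epochs,
                             per_device_train_batch_size=16,
                             learning_rate=3e-5)        # assumed, not reported
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=train_dataset)
    trainer.train()
    trainer.save_model(output_dir)        # save weights so stage 2 can start from them
    tokenizer.save_pretrained(output_dir)
    return output_dir

# Stage 1: a few epochs on Arabic-SQuAD (MSA) to learn the QA task format.
stage1 = finetune(checkpoint, arabic_squad_features, "qa-msa", epochs=2)
# Stage 2: continue from the stage-1 checkpoint on the Tunisian TRCD data.
finetune(stage1, trcd_train_features, "qa-trcd", epochs=3)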
5.4 Discussion

The experimental results indicate that the proposed pretrained TunBERT model yields improvements compared to the mBERT, GigaBERT and AraBERT models, as shown in Tables 7 and 8 for the sentiment analysis task, Tables 9 and 10 for the dialect identification task, and Table 11 for the question-answering task.

Not surprisingly, GigaBERT, a BERT model customized for English-to-Arabic cross-lingual transfer, is not effective on the tackled tasks and is better suited to tasks using code-switched data, as suggested in (Wuwei et al., 2020). As AraBERT was trained on news from different Arab media, it shows good performance on the three tasks, since the datasets contain some formal text (MSA). TunBERT was trained on a dataset of web text, which is useful for casual text such as Tunisian dialect on social media. For this reason, it performed better than AraBERT on all the tasks. We show that pretraining a Tunisian model on a highly variable dataset from social media leads to better downstream performance compared to models trained on more uniform data.

Moreover, the results lead to the conclusion that a relatively small web-scraped dataset (67.2 MB) yields downstream performance as good as that of models pretrained on datasets that are orders of magnitude larger (24 GB for AraBERT and about 10.4B tokens for GigaBERT). This is confirmed by the QA-task experiments, where the created dataset contains only a small amount of dialectal text. The Arabic-SQuAD dataset was used to compensate for the missing MSA embeddings and to allow the fine-tuned model to effectively learn the QA task by providing more question-answering examples. The TunBERT model surpassed all the other models in terms of recall.

Table 11: TRCD results before and after pre-training on Arabic-SQuAD.

Fine-tuned on TRCD only:
Model | Exact match | F1 score | Recall
mBERT | 4.25 | 22.6 | 31.3
AraBERT | 26.24 | 58.74 | 63.96
GigaBERT | 0.7 | 14.02 | 21.65
TunBERT | 2.127 | 13.665 | 22.597

Fine-tuned on Arabic-SQuAD and TRCD:
Model | Exact match | F1 score | Recall
mBERT | 29.07 | 60.86 | 62.18
AraBERT | 24.11 | 63.53 | 70.43
GigaBERT | 29.78 | 62.44 | 66.34
TunBERT | 27.65 | 60.24 | 82.36

6 Conclusion

In this paper, we reported our efforts to develop a powerful Transformer-based language model for the Tunisian dialect: TunBERT. Our model is trained on a 67.2 MB web-scraped dataset consisting of 500k sentences of text extracted from social media, blogs and websites. When fine-tuned on the various labeled datasets, our TunBERT model achieves new state-of-the-art results on all the tasks and datasets. Compared to larger models such as GigaBERT and AraBERT, our TunBERT model has a better representation of the Tunisian dialect and yields better performance, in addition to being less computationally costly at inference time. Our models are publicly available for research5.

In the future, we plan to evaluate our models on more Arabic NLP tasks and to further pretrain them to improve their performance on the datasets where they are currently outperformed. On social media, Tunisian people tend to express themselves in an informal way called "TUNIZI" (Fourati et al., 2020), which is Tunisian Arabic text written using Latin characters and numerals rather than Arabic letters. For instance, the word "sou2el"6 is the Latin-script spelling of the word سؤال. A natural future step would be to build a multi-script Tunisian dialect language model covering both Arabic-script and Latin-script text.

5 To preserve anonymity, a link to the Github repository will be added to the camera-ready version if the paper is accepted.
6 The word "Question" is the English translation.

References

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, and Kemal Oflazer. 2018. The MADAR Arabic dialect corpus and lexicon. In Proceedings of the International Conference on Language Resources and Evaluation (LREC).

Houda Bouamor, Sabit Hassan, and Nizar Habash. 2019. The MADAR shared task on Arabic fine-grained dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 199–207.

José Canete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. ArXiv, abs/1704.00051.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7059–7069.

Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based language model. Computing Research Repository, arXiv:2001.06286. Version 2.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Mahmoud El-Haj, Paul Rayson, and Mariam Aboelezz. 2018. Arabic dialect identification in the context of bivalency and code-switching. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), pages 3622–3627.

Chayma Fourati, Abir Messaoudi, and Hatem Haddad. 2020. TUNIZI: a Tunisian Arabizi sentiment analysis dataset. In AfricaNLP Workshop, Putting Africa on the NLP Map, ICLR 2020, Virtual Event.

Salima Harrat, Karima Meftouh, and Kamel Smaïli. 2018. Maghrebi Arabic dialect processing: an overview. Journal of International Science and General Applications, 1.

Uri Horesh. 2019. Languages of the Middle East and North Africa. The SAGE Encyclopedia of Human Communication Sciences and Disorders, 1:1058–1061.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR).

Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbé, Laurent Besacier, and Didier Schwab. 2020. FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), pages 2479–2490.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692. Version 1.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219.

Salima Medhaffar, Fethi Bougares, Yannick Estève, and Lamia Hadrich-Belguith. 2017. Sentiment analysis of Tunisian dialects: Linguistic resources and experiments. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 55–61.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, Workshop Track Proceedings.

Hussein Mozannar, Elie Maamary, Karl El Hajal, and Hazem Hajj. 2019. Neural Arabic question answering. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 108–118, Florence, Italy. Association for Computational Linguistics.

Hala Mulki, Hatem Haddad, Mourad Gridach, and Ismail Babaoğlu. 2020. Syntax-ignorant n-gram embeddings for dialectal Arabic sentiment analysis. Natural Language Engineering, pages 1–24.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Alec Radford. 2018. Improving language understanding by generative pre-training.

Karim Sayadi, Marcus Liwicki, Rolf Ingold, and Marc Bui. 2016. Tunisian dialect and Modern Standard Arabic dataset for sentiment analysis: Tunisian election context. In Proceedings of the Second International Conference on Arabic Computational Linguistics (ACLING), pages 35–53.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.

Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luomaa, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. Computing Research Repository, arXiv:1912.07076. Version 1.

Lan Wuwei, Chen Yang, Xu Wei, and Ritter Alan. 2020. GigaBERT: Zero-shot transfer learning from English to Arabic. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Omar F. Zaidan and Chris Callison-Burch. 2011. The Arabic online commentary dataset: an annotated dataset of informal Arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 37–41.