TunBERT: Pretrained Contextualized Text Representation for Tunisian Dialect

Abir Messaoudi
Hatem Haddad
Moez BenHajhmida
Malek Naski
iCompass

Ahmed Cheikhrouhou
Nourchene Ferchichi
Abir Korched
Faten Ghriss
Amine Kerkeni
InstaDeep

arXiv:2111.13138v1 [cs.CL] 25 Nov 2021

Abstract
Pretrained contextualized text representation models learn an effective representation of a natural language to make it machine understandable. After the breakthrough of the attention mechanism, a new generation of pretrained models has been proposed, achieving good performance since the introduction of the Transformer. Bidirectional Encoder Representations from Transformers (BERT) has become the state-of-the-art model for language understanding. Despite their success, most of the available models have been trained on Indo-European languages, while similar research for under-represented languages and dialects remains sparse. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for under-represented languages, with a specific focus on the Tunisian dialect. We evaluate our language model on a sentiment analysis task, a dialect identification task and a reading comprehension question-answering task. We show that noisy web-crawled data is better suited to such a non-standardized language than structured data (Wikipedia, articles, etc.). Moreover, results indicate that a relatively small web-crawled dataset leads to performance as good as that obtained using larger datasets. Finally, our best performing TunBERT model reaches or improves the state of the art in all three downstream tasks. We release the TunBERT pretrained model and the datasets used for fine-tuning [1].

[1] To preserve anonymity, a link to Github repository will be added to the camera-ready version if the paper is accepted.

1 Introduction
In the last decade, natural language understanding has gained interest owing to the available hardware and data resources and to the evolution of pretrained contextualized text representation models. These models learn an effective representation of a natural language to make it machine understandable. Word2Vec (Mikolov et al., 2013) was one of the first proposed approaches, in which words are represented according to their semantic properties. Next, ELMo (Peters et al., 2018) combined the previous model with a BiLSTM in order to deal with the polysemy problem. Afterwards, pretrained models were first proposed with ULMFiT (Howard and Ruder, 2018), where they were fine-tuned for downstream tasks. These models achieved good performance, but they did not support long-term and multiple contexts of words.

After the breakthrough of the attention mechanism (Vaswani et al., 2017), a new generation of pretrained models has appeared. They have achieved strong performance since the introduction of Transformer-based models (Radford, 2018). In addition, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) was released and became the state-of-the-art model for language understanding, inspiring further development in the Natural Language Processing (NLP) field. Accordingly, many languages now have their own BERT-based language models. Specifically, the Arabic language has several language models: AraBERT (Antoun et al., 2020), GigaBERT (Wuwei et al., 2020), and the multilingual cased BERT model (hereafter mBERT) (Pires et al., 2019), which was simultaneously pretrained on 104 languages.

The Arabic language has more than 300 million native speakers around the world and is used as a native language in 26 countries. The official form of Arabic is called Modern Standard Arabic (MSA). However, each country has one or more locally spoken varieties of Arabic, called dialects. People in Tunisia use the Tunisian dialect (Fourati et al., 2020) in their daily communications, in most of their media (TV, radio, songs, etc.), and on the internet (social media, forums). Yet this dialect is not standardized, which means there is no unique way of writing and speaking it. In addition, it has its own lexicon, phonetics, and morphological structure, as shown in Table 1.

The need for a robust language model for the Tunisian dialect has become crucial for developing NLP-based applications (translation, information retrieval, sentiment analysis, etc.). To the best of our knowledge, no such model has yet been proposed in the literature.

In this paper, we describe the process of pretraining a PyTorch implementation of the NVIDIA BERT language model [2], called TunBERT (Tunisian BERT), on a web-scraped dataset of only 67.2 MB. We systematically compare our pretrained model on three NLP downstream tasks that are different in nature: (i) Sentiment Analysis (SA), (ii) Tunisian Dialect Identification (TDI), and (iii) Reading Comprehension Question-Answering (RCQA), against mBERT (Devlin et al., 2019), AraBERT (Antoun et al., 2020), GigaBERT (Wuwei et al., 2020) and the state-of-the-art performance when available. Our contributions can be summarized as follows:
• First release of a pretrained BERT model for the Tunisian dialect using a Tunisian large-scale web-scraped dataset.
• Application of TunBERT to three NLP downstream tasks: Sentiment Analysis (SA), Tunisian Dialect Identification (TDI) and Reading Comprehension Question-Answering (RCQA).
• Empirical evaluations showing that a small and diverse Tunisian training dataset can achieve performance similar to several baselines, including previous multilingual and single-language approaches trained on large-scale corpora.
• Public release of TunBERT and the datasets used on popular NLP libraries [3].

The rest of the paper is structured as follows. Section 2 provides a concise literature review of previous work on monolingual and multilingual language representation. Section 3 describes the methodology used to develop TunBERT. Section 4 describes the downstream tasks and benchmark datasets used for evaluation. Section 5 presents the experimental setup and discusses the results. Finally, Section 6 concludes and points to possible directions for future work.

[2] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
[3] To preserve anonymity, a link to Github repository will be added to the camera-ready version if the paper is accepted.

2 Related Works
Contextualized word representations, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and more recently ALBERT (Lan et al., 2020), improved the representational power of word embeddings such as word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017) by taking context into account. Following their success, large pretrained language models were extended to the multilingual setting, as with mBERT (Pires et al., 2019). Conneau and Lample (2019) showed that multilingual models can obtain results competitive with monolingual models by leveraging higher quality data from other languages on specific downstream tasks. Nevertheless, these models use large-scale pretraining corpora and consequently incur a high computational cost.

Recently, non-English monolingual models have been released: RobBERT for Dutch (Delobelle et al., 2020), FlauBERT (Le et al., 2020) and CamemBERT (Martin et al., 2020) for French, a Spanish BERT (Canete et al., 2020) and a Finnish BERT (Virtanen et al., 2019). Martin et al. (2020) showed that their French model trained on a 4 GB dataset performed similarly to the same model trained on the full 138 GB dataset. They also concluded that a model trained on a Common-Crawl-based corpus performed consistently better than one trained on the French Wikipedia. They suggested that a 4 GB dataset that is heterogeneous in terms of genre and style is large enough as a pretraining dataset to reach state-of-the-art results with the BASE architecture, better than those obtained with mBERT (pretrained on 60 GB of text). In (Virtanen et al., 2019), a Finnish BERT model trained from scratch outperformed mBERT on three reference tasks (part-of-speech tagging, named entity recognition, and dependency parsing). The authors suggested that language-specific deep transfer learning models for lower-resourced languages can outperform multilingual BERT models.

Tunisian: محلاها هالغناية
MSA: ما أحلى هذه الأغنية
English: How nice is this song

Tunisian: ماتعجبنيش كيفاش تتصرف
MSA: لا تعجبني تصرفاتها
English: I don't like how she behaves

Tunisian: وقتاه يبدا الماتش
MSA: متى تبدأ المباراة
English: When does the match start

Table 1: Examples of Tunisian sentences with their MSA and English translation.

Compared to the growing number of studies of contextualized word representations in Indo-European languages, similar research for the Arabic language is still very limited. AraBERT (Antoun et al., 2020), a BERT-based model, was released using a pretraining dataset of 70 million sentences, corresponding to 24 GB of text covering news from different Arab media. AraBERT was pretrained on a TPUv2-8 pod for 1,250K steps. It achieved state-of-the-art performance on three Arabic tasks: Sentiment Analysis, Named Entity Recognition, and Question Answering. Nevertheless, the pretraining dataset is mostly MSA-based. The authors concluded that there is a need for pretrained models that can tackle a variety of Arabic dialects. Lately, GigaBERT (Wuwei et al., 2020), a customized bilingual language model for English and Arabic, has outperformed AraBERT in several downstream tasks.

3 TunBERT
In this section, we describe the training setup and the pretraining data that were used for TunBERT.

3.1 Training Setup
The TunBERT model is based on the PyTorch implementation of NVIDIA NeMo BERT [4]. The model was pretrained using 4 NVIDIA Tesla V100 GPUs for 1280K steps. The pretrained model configuration is shown in Table 3. The Adam optimizer was used, with a learning rate of 1e-4, a batch size of 128, a maximum sequence length of 128 and a masking probability of 15%. Cosine annealing was used for learning rate scheduling with a warm-up ratio of 0.01. Training took 122 hours and 25 minutes for 330 epochs over all the tokens.
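
As a rough illustration of the schedule described above, the following sketch computes the learning rate at a given step from the reported hyper-parameters (peak rate 1e-4, 1280K steps, warm-up ratio 0.01). The linear shape of the warm-up and the decay down to zero are assumptions, since only cosine annealing and the warm-up ratio are stated; the function name `learning_rate` is illustrative. The last lines relate the reported figures to each other, assuming that the batch size of 128 is the global batch size.

```python
import math

PEAK_LR = 1e-4           # reported learning rate
TOTAL_STEPS = 1_280_000  # reported number of pretraining steps
WARMUP_RATIO = 0.01      # reported warm-up ratio
BATCH_SIZE = 128         # reported batch size (assumed to be the global batch size)
EPOCHS = 330             # reported number of epochs


def learning_rate(step: int) -> float:
    """Learning rate at `step`: linear warm-up (assumed), then cosine decay to zero (assumed)."""
    warmup_steps = round(TOTAL_STEPS * WARMUP_RATIO)  # 12,800 steps
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (TOTAL_STEPS - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# Number of sequences processed per epoch implied by the reported figures.
sequences_per_epoch = TOTAL_STEPS * BATCH_SIZE / EPOCHS

if __name__ == "__main__":
    for step in (0, 6_400, 12_800, 646_400, 1_280_000):
        print(step, learning_rate(step))
    # Roughly 496K sequences per epoch, the same order as the 500K sentences of Table 2.
    print(round(sequences_per_epoch))
```

Under these assumptions the rate rises to 1e-4 over the first 12,800 steps and then decays smoothly, and the implied number of sequences per epoch is of the same order as the corpus size reported in Table 2.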

[4] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT

#Uniq Words   #Words    #Sentences
8,256K        48,233K   500K
Table 2: Pretraining dataset statistics.

#Layers   Hidden Size   #Self-attention heads
12        768           12
Table 3: Pretrained model configuration.

The model was trained on two unsupervised prediction tasks using a large Tunisian text corpus: the Masked Language Modeling (MLM) task and the Next Sentence Prediction (NSP) task.

For the MLM task, 15% of the words in each sequence are replaced with a [MASK] token. Then, the model attempts to predict the original masked tokens based on the context provided by the non-masked tokens in the sequence. For the NSP task, pairs of sentences are provided to the model, which has to predict whether the second sentence is the subsequent sentence in the original document. In this task, 50% of the sentence pairs are subsequent to each other in the original document. For the remaining 50%, a random sentence from the corpus is paired with the first sentence.
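
The two objectives described above can be sketched as follows. This is a simplified illustration, not the NVIDIA implementation: tokens are whole words, every selected token is replaced by [MASK] as the text describes (the original BERT recipe additionally keeps or randomizes a fraction of the selected tokens), and the negative NSP sentence is drawn from the same list of sentences, which stands in for the corpus. The function names `mask_tokens` and `make_nsp_pairs` are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace about `mask_prob` of the tokens with [MASK]; return (inputs, labels)."""
    rng = rng or random.Random()
    if not tokens:
        return [], []
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    inputs = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    # The label is the original token at masked positions and None elsewhere.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels


def make_nsp_pairs(sentences, rng=None):
    """Build (first, second, is_next) triples from the consecutive sentences of a document."""
    rng = rng or random.Random()
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw a random negative")
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than the first one and its true successor.
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs
```

Here the 50/50 split between positive and negative pairs holds in expectation rather than exactly; a real pipeline would also operate on subword tokens produced by the model's tokenizer.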
3.2 Pre-training Dataset
Because of the lack of available Tunisian dialect data (books, Wikipedia, etc.), we use a web-scraped dataset extracted from social media, blogs and websites, consisting of 500k sentences of text, to pretrain the model. The extracted data was preprocessed by removing links, emoji and punctuation symbols. Then, a filter was applied to ensure that only Arabic script is included. Pretraining dataset statistics are presented in Table 2. The training dataset size is 67.2 MB.
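
A minimal sketch of such a cleaning step is given below. The exact rules used to build the corpus are not specified, so everything here is an assumption: links are matched with a simple regular expression, emoji and punctuation are identified through their Unicode categories, and a sentence is kept only if every remaining character lies in the basic Arabic Unicode block (U+0600 to U+06FF). The function name `preprocess` is illustrative.

```python
import re
import unicodedata

# Simple URL pattern (an assumption; real scraped data may need a stricter one).
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def preprocess(text):
    """Return the cleaned sentence, or None if it is empty or not purely Arabic script."""
    # 1. Remove links.
    text = URL_RE.sub(" ", text)
    # 2. Remove punctuation (Unicode categories P*) and symbols such as emoji (categories S*).
    text = "".join(" " if unicodedata.category(ch)[0] in "PS" else ch for ch in text)
    # 3. Normalize whitespace.
    text = " ".join(text.split())
    if not text:
        return None
    # 4. Keep the sentence only if all remaining characters are in the Arabic block.
    if all(ch == " " or "\u0600" <= ch <= "\u06FF" for ch in text):
        return text
    return None


if __name__ == "__main__":
    raw = ["محلاها هالغناية 😀!! https://example.com/x", "sou2el سؤال", "!!! 😀"]
    print([s for s in map(preprocess, raw) if s is not None])
```

With this sentence-level filter, a line that mixes Latin and Arabic script is dropped entirely rather than truncated; filtering character by character would be an equally plausible reading of the description.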

4 Evaluation
We measure the performance of TunBERT by evaluating it on three tasks: Sentiment Analysis, Dialect Identification and Reading Comprehension Question-Answering. Fine-tuning was done independently using the same configuration for all tasks. We did not run an extensive grid search to choose the best hyper-parameters due to computational and time constraints; instead, we applied a configuration commonly used in the literature. We use the splits provided by the dataset authors when available and a standard 80%/20% split when not.

Dataset   #Negative   #Positive   #Train   #Dev   #Test
TSAC      4175        3277        4680     1170   1516
TEC       1799        1244        1947     487    609
Table 4: TSAC and TEC Sentiment analysis datasets statistics.
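
An 80%/20% split of this kind can be sketched as below. The shuffling, the seed and the function name `split_80_20` are assumptions made for illustration. As a purely numerical observation, and not a description of the procedure actually followed, the train and dev sizes listed for TSAC and TEC in Table 4 coincide with an 80/20 division of the examples outside the test set.

```python
import random


def split_80_20(items, seed=0):
    """Shuffle `items` reproducibly and return (train, dev) with an 80%/20% ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # integer arithmetic avoids floating-point rounding
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Non-test example counts taken from Table 4: train + dev for each dataset.
    for name, n in (("TSAC", 4680 + 1170), ("TEC", 1947 + 487)):
        train, dev = split_80_20(range(n))
        print(name, len(train), len(dev))
```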
4.1 Sentiment Analysis
For the sentiment analysis task, we used two manually annotated Tunisian sentiment analysis datasets:
• Tunisian Sentiment Analysis Corpus (TSAC) (Medhaffar et al., 2017), obtained from Facebook comments about popular TV shows. The TSAC dataset is composed of comments written in Latin script, Arabic script and emoticons. We use only the Arabic script comments.
• Tunisian Election Corpus (TEC) (Sayadi et al., 2016), obtained from tweets about the Tunisian elections in 2014. Besides Tunisian content, the TEC dataset also contains MSA content.
Statistics of the TSAC and TEC datasets are shown in Table 4.

4.2 Tunisian Dialect Identification
This task focuses on distinguishing the Tunisian dialect from other Arabic dialects in a given text, especially in social media sources, where, unlike MSA, there is no established standard orthography. First attempts to tackle the challenge identified five Arabic dialect categories in addition to MSA: Maghrebi, Egyptian, Levantine, Gulf, and Iraqi (Zaidan and Callison-Burch, 2011). El-Haj et al. (2018) proposed four Arabic dialect categories by merging Iraqi with Gulf. The Tunisian dialect was classified as Maghrebi, along with Algerian, Moroccan, and other dialects. Nevertheless, even if the Maghrebi vocabulary is largely similar throughout North African countries, many differences exist not only at the phonetic level (Harrat et al., 2018) but also at the lexical, morphological and syntactic levels (Horesh, 2019).

For evaluation, two sub-tasks were performed:
• Identification of the Tunisian dialect among other Arabic dialects (TADI): this is a binary classification task (Tunisian dialect versus non-Tunisian dialect) on an Arabic dialectal dataset. We used the Nuanced Arabic Dialect Identification (NADI) shared task dataset, with a total of 21,000 tweets covering 21 Arab countries. NADI is an imbalanced dataset: its training set includes only 747 Tunisian tweets, and the remaining tweets cover other dialects. To solve this issue, we created a new dataset, TADI (Tunisian and Arabic Dialect Identification), by adding a subset of the TSAC dataset as Tunisian comments, so that the Tunisian dialect has the same number of examples as the other dialects, as shown in Table 5.
• Identification of the Tunisian dialect versus the Algerian dialect (TAD dataset): for this sub-task we used the Multi-Arabic Dialect Applications and Resources (MADAR) dataset (Bouamor et al., 2018). More specifically, we used the shared task dataset, which targets a large set of dialect labels at the country level (Bouamor et al., 2019), and filtered it to keep only the data labeled as Tunisian or Algerian, as shown in Table 5.

Dataset   #Train   #Dev   #Test
TADI      40500    2396   7192
TAD       3200     400    400
Table 5: TADI and TAD dataset statistics.
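
The construction of the binary TADI data can be illustrated with the sketch below. It assumes that each NADI example is a (text, country) pair, that the Tunisian class is balanced against all other dialects combined, and that the additional Tunisian texts come from a list such as the TSAC comments; the function name `build_binary_dataset` and the label string "Tunisia" are hypothetical.

```python
def build_binary_dataset(nadi_examples, extra_tunisian_texts, tunisian_label="Tunisia"):
    """Label examples as Tunisian (1) or not (0), then add Tunisian texts until balanced."""
    data = [(text, 1 if country == tunisian_label else 0) for text, country in nadi_examples]
    n_tunisian = sum(label for _, label in data)
    n_other = len(data) - n_tunisian
    # Number of extra Tunisian examples needed to match the non-Tunisian class.
    missing = max(0, n_other - n_tunisian)
    data.extend((text, 1) for text in extra_tunisian_texts[:missing])
    return data


if __name__ == "__main__":
    nadi = [("t1", "Tunisia"), ("t2", "Tunisia")] + [(f"o{i}", "Egypt") for i in range(6)]
    extra = [f"tsac{i}" for i in range(10)]
    balanced = build_binary_dataset(nadi, extra)
    print(sum(label for _, label in balanced), len(balanced))
```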

4.3 Reading Comprehension Question-Answering
The open-domain Question-Answering (QA) task has been intensively studied to evaluate the language understanding performance of models. This task takes a textual question as input and looks for the corresponding answers within a large textual corpus. In (Mozannar et al., 2019), two MSA QA datasets were proposed. However, to the best of our knowledge, no previous study has addressed this task for any Arabic dialect.

For this task, we built TRCD (Tunisian Reading Comprehension Dataset) as a Question-Answering dataset for the Tunisian dialect. We used a dialectal version of the Tunisian constitution, following the guidelines in (Chen et al., 2017). It is composed of 144 documents, where each document has exactly three paragraphs and three Question-Answer pairs are assigned to each paragraph. Questions were formulated by four native-speaker annotators, and each question is paired with a paragraph, as shown in Figure 1.

To the best of our knowledge, this is the first Tunisian dialect dataset for the Question-Answering task. TRCD dataset statistics are shown in Table 6.

Dataset      #Train   #Dev   #Test
#Document    114      15     15
#Paragraph   342      45     45
#QA          1026     135    135
Table 6: TRCD statistics.

Model                           Accuracy   F1.macro
(Medhaffar et al., 2017)        78%        78%
word2vec (Mulki et al., 2020)   77.4%      78.2%
doc2vec (Mulki et al., 2020)    57.2%      61.7%
Tw-StAR (Mulki et al., 2020)    86.5%      86.2%
mBERT                           92.21%     91.03%
GigaBERT                        94.92%     93.39%
AraBERT                         95.63%     94.91%
TunBERT                         96.98%     96.98%
Table 7: TSAC results.
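
The TRCD sizes in Table 6 follow directly from the structure described in Section 4.3 (three paragraphs per document and three question-answer pairs per paragraph). The short check below recomputes them from the number of documents per split; the variable and function names are illustrative.

```python
PARAGRAPHS_PER_DOCUMENT = 3
QA_PER_PARAGRAPH = 3
DOCUMENTS_PER_SPLIT = {"train": 114, "dev": 15, "test": 15}  # from Table 6


def split_statistics(n_documents):
    """Number of documents, paragraphs and QA pairs for a split with `n_documents` documents."""
    n_paragraphs = n_documents * PARAGRAPHS_PER_DOCUMENT
    return {
        "documents": n_documents,
        "paragraphs": n_paragraphs,
        "qa_pairs": n_paragraphs * QA_PER_PARAGRAPH,
    }


trcd = {name: split_statistics(n) for name, n in DOCUMENTS_PER_SPLIT.items()}
total_documents = sum(stats["documents"] for stats in trcd.values())
total_qa_pairs = sum(stats["qa_pairs"] for stats in trcd.values())

if __name__ == "__main__":
    for name, stats in trcd.items():
        print(name, stats)
    print(total_documents, total_qa_pairs)
```

The three splits add up to the 144 documents stated in the text, for 1,296 question-answer pairs overall.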

5 Experiments and Discussion
5.1 Tunisian Sentiment Analysis
The TunBERT language model was evaluated against the mBERT, AraBERT and GigaBERT language models and against the state-of-the-art performance when available. The performance of TunBERT on Tunisian Sentiment Analysis was further compared against the baseline systems that tackled the same datasets (word embeddings (word2vec), document embeddings (doc2vec) and Tw-StAR (Mulki et al., 2020)); the results are listed in Table 7 and Table 8.

The results in Table 7 show that the pretrained contextualized text representation models outperform the previous techniques, namely word2vec and doc2vec. TunBERT achieved the best performance on the TSAC dataset. It reached an F1.macro of 96.98%, a high result compared to the 78.2%, 61.7% and 86.2% scored by word2vec, doc2vec and Tw-StAR, respectively. The results show that TunBERT also outperforms the pretrained language models mBERT, GigaBERT and AraBERT.

Model                           Accuracy   F1.macro
(Sayadi et al., 2016)           71.1%      63%
word2vec (Mulki et al., 2020)   61.9%      58.4%
doc2vec (Mulki et al., 2020)    62.2%      56.4%
Tw-StAR (Mulki et al., 2020)    88.2%      87.8%
mBERT                           58.45%     36.89%
GigaBERT                        71.75%     65.32%
AraBERT                         79.14%     72.57%
TunBERT                         81.2%      76.45%
Table 8: TEC results.

Likewise, Table 8 shows that, on the TEC dataset, the BERT-based language models other than mBERT outperform word2vec and doc2vec. Nevertheless, the best performance was achieved by Tw-StAR, with an F1.macro of 87.8% compared to 76.45% and 72.57% scored by TunBERT and AraBERT, respectively. This could be explained by the noisy nature of the TEC dataset, which mixes Tunisian and MSA content. mBERT achieved the worst performance, which suggests that it is not well suited to noisy data. The results also show that TunBERT outperforms the other pretrained language models.
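
For reference, the two measures reported in Tables 7 and 8 can be computed as in the following sketch, which assumes the usual definitions: accuracy is the fraction of correct predictions, and macro-averaged F1 is the unweighted mean of the per-class F1 scores. The evaluation code actually used is not described, so the function names and the toy labels are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example: a classifier that always predicts the majority class.
    gold = ["neg", "neg", "neg", "pos"]
    pred = ["neg", "neg", "neg", "neg"]
    print(accuracy(gold, pred), macro_f1(gold, pred))
```

In the toy example, always predicting the majority class gives an accuracy of 0.75 but a macro-F1 of about 0.43, which shows how the two measures can diverge when one class is rarely predicted.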

5.2 Tunisian Dialect Identification
For Tunisian dialect identification, the results in Table 9 show that the TunBERT language model outperforms the other language models. Indeed, our model achieved an F1.macro of 87.14% compared to 68.93% achieved by mBERT. TunBERT also outperforms AraBERT, the BERT model pretrained on Arabic. Likewise, it achieved an F1.macro of 93.25% on the Tunisian-Algerian dialect identification task, outperforming the other language models, as shown in Table 10.

Model      Accuracy   F1.macro
mBERT      75.21%     68.93%
AraBERT    79.57%     76.7%
GigaBERT   72.67%     65.3%
TunBERT    87.46%     87.14%
Table 9: TADI results

Model      Accuracy   F1.macro
mBERT      86.75%     86.4%
AraBERT    87.5%      87.37%
GigaBERT   %          0%
TunBERT    93.3%      93.25%
Table 10: TAD results

Figure 1: TRCD dataset example with its corresponding English translation.

5.3 Reading Comprehension Question-Answering
Fine-tuning TunBERT on the Tunisian Reading Comprehension Dataset alone did not give impressive results (exact match of 2.17%, F1 score of 13.66% and recall of 22.59%). Comparable results were obtained for GigaBERT (exact match of 0.7%, F1 score of 14.02% and recall of 21.65%). mBERT gave slightly better results (exact match of 4.25%, F1 score of 22.6% and recall of 31.3%). Meanwhile, we observed good results for AraBERT (exact match of 26.24%, F1 score of 58.74% and recall of 63.96%).

Adding a pre-training step on an MSA reading comprehension dataset (in our case the Arabic-SQuAD dataset (Mozannar et al., 2019)) greatly improved the performance of all models, especially TunBERT. The strategy was to take the pretrained language model, fine-tune it for a few epochs on the MSA dataset, then use the best checkpoint to train and test on the TRCD dataset. Following this strategy, TunBERT achieved an exact match of 27.65%, an F1 score of 60.24% and a recall of 82.36%, as shown in Table 11.
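
The exact match, F1 and recall figures quoted in this section are presumably span-level measures in the style of the SQuAD evaluation; the precise normalization applied to Tunisian text is not specified. The sketch below therefore assumes plain whitespace tokenization and no further normalization, and the function names are illustrative.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the predicted answer equals the reference token for token, else 0.0."""
    return float(prediction.split() == reference.split())


def token_scores(prediction, reference):
    """Token-overlap precision, recall and F1 between a predicted and a reference answer."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def evaluate(predictions, references):
    """Average the per-question scores over a dataset and express them as percentages."""
    n = len(references)
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    scores = [token_scores(p, r) for p, r in zip(predictions, references)]
    recall = sum(s[1] for s in scores) / n
    f1 = sum(s[2] for s in scores) / n
    return {"exact_match": 100 * em, "f1": 100 * f1, "recall": 100 * recall}
```

For instance, a predicted span "a b c d" for the reference "a b" scores an exact match of 0 but a recall of 1.0 and an F1 of about 0.67: a prediction longer than the reference can keep recall high while exact match stays low.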

5.4 Discussion
The experimental results indicate that the proposed pretrained TunBERT model yields improvements compared to the mBERT, GigaBERT and AraBERT models, as shown in Tables 7 and 8 for the sentiment analysis task, Tables 9 and 10 for the dialect identification task, and Table 11 for the question-answering task.

Not surprisingly, GigaBERT, a BERT model customized for English-to-Arabic cross-lingual transfer, is not effective for the tackled tasks and should rather be applied to tasks using code-switched data, as suggested in (Wuwei et al., 2020).

As AraBERT was trained on news from different Arab media, it shows good performance on the three tasks, since the datasets contain some formal text (MSA). TunBERT was trained on a dataset including web text, which is useful for casual text such as the Tunisian dialect in social media. For this reason, it performed better than AraBERT on all the performed tasks.

We show that pretraining a Tunisian model on a highly variable dataset from social media leads to better downstream performance compared to models trained on more uniform data. Moreover, the results lead to the conclusion that a relatively small web-scraped dataset (67.2 MB) leads to downstream performance as good as that of models pretrained on much larger datasets (24 GB for AraBERT and about 10.4B tokens for GigaBERT). This is confirmed by the QA experiments, where the created dataset contains a small amount of dialectal text. The Arabic-SQuAD dataset was used to compensate for the missing MSA embeddings and to let the fine-tuned model learn the QA task effectively by providing more question-answering examples. With this additional step, the TunBERT model achieved the highest recall of all the models, with an exact match close to the best one (Table 11).

Fine-tuning datasets     Model      Exact match   F1 score   Recall
TRCD                     mBERT      4.25          22.6       31.3
TRCD                     AraBERT    26.24         58.74      63.96
TRCD                     GigaBERT   0.7           14.02      21.65
TRCD                     TunBERT    2.127         13.665     22.597
Arabic-SQuAD and TRCD    mBERT      29.07         60.86      62.18
Arabic-SQuAD and TRCD    AraBERT    24.11         63.53      70.43
Arabic-SQuAD and TRCD    GigaBERT   29.78         62.44      66.34
Arabic-SQuAD and TRCD    TunBERT    27.65         60.24      82.36
Table 11: TRCD results before and after pre-training on Arabic-SQuAD

6 Conclusion
In this paper, we reported our efforts to develop a powerful Transformer-based language model for the Tunisian dialect: TunBERT. Our models are trained on a 67.2 MB Common-Crawl-based dataset extracted from social media, consisting of 500k sentences of text. When fine-tuned on the various labeled datasets, our TunBERT model achieves new SOTA on all the tasks on all datasets. Compared to larger models such as GigaBERT and AraBERT, our TunBERT model has a better representation of the Tunisian dialect and yields better performance, in addition to being less computationally costly at inference time. Our models are publicly available for research [5]. In the future, we plan to evaluate our models on more Arabic NLP tasks and to further pretrain them to improve their performance on the datasets where they are currently outperformed.

On social media, Tunisian people tend to express themselves in an informal way called "TUNIZI" (Fourati et al., 2020), in which Tunisian Arabic text is written using Latin characters and numerals rather than Arabic letters. For instance, "sou2el" [6] is the Latin-script form of the word سؤال. A natural future step would be to build a multi-script Tunisian dialect language model covering both Arabic script and Latin script.

[5] To preserve anonymity, a link to Github repository will be added to the camera-ready version if the paper is accepted.
[6] The word "Question" is the English translation.

References
Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.
Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
Houda Bouamor, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann, and Kemal Oflazer. 2018. The MADAR Arabic dialect corpus and lexicon. In The International Conference on Language Resources and Evaluation.
Houda Bouamor, Sabit Hassan, and Nizar Habash. 2019. The MADAR shared task on Arabic fine-grained dialect identification. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 199–207.
José Canete, Gabriel Chaperon, Rodrigo Fuentes, Jou-Hui Ho, Hojin Kang, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
Danqi Chen, A. Fisch, J. Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. ArXiv, abs/1704.00051.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7059–7069.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. Computing Research Repository, arXiv:1907.11692. Version 1.
Pieter Delobelle, Thomas Winters, and Bettina Berendt. 2020. RobBERT: a Dutch RoBERTa-based language model. Computing Research Repository, arXiv:2001.06286. Version 2.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages
4171–4186.
Mahmoud El-Haj, Paul Rayson, and Mariam Aboelezz.
2018. Arabic dialect identification in the context of
bivalency and code-switching. In Proceedings of the
Eleventh International Conference on Language Resources and Evaluation (LREC), pages 3622–3627.
Chayma Fourati, Abir Messaoudi, and Hatem Haddad. 2020. TUNIZI: a Tunisian Arabizi sentiment analysis dataset. In AfricaNLP Workshop, Putting Africa on the NLP Map. ICLR 2020, Virtual Event, volume arXiv:3091079.
Salima Harrat, Karima Meftouh, and Kamel Smaïli. 2018. Maghrebi Arabic dialect processing: an overview. Journal of International Science and General Applications, 1.
Uri Horesh. 2019. Languages of the Middle East and North Africa. The SAGE Encyclopedia of Human Communication Sciences and Disorders, 1:1058–1061.
Jeremy Howard and Sebastian Ruder. 2018. Universal
language model fine-tuning for text classification. In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), pages 328–339, Melbourne, Australia.
Association for Computational Linguistics.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
Kevin Gimpel, Piyush Sharma, and Radu Soricut.
2020. ALBERT: A lite BERT for self-supervised
learning of language representations. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoit Crabbé, Laurent Besacier, and Didier
Schwab. 2020. FlauBERT: Unsupervised language
model pre-training for French. In Proceedings of the
Eleventh International Conference on Language Resources and Evaluation (LREC), pages 2479–2490.
Hala Mulki, Hatem Haddad, Mourad Gridach, and
Ismail Babaoğlu. 2020. Syntax-ignorant n-gram
embeddings for dialectal arabic sentiment analysis.
Natural Language Engineering, pages 1–24.
Jeffrey Pennington, Richard Socher, and Christopher
Manning. 2014. GloVe: Global vectors for word
representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt
Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages
2227–2237, New Orleans, Louisiana. Association
for Computational Linguistics.
Telmo Pires, Eva Schlinger, and Dan Garrette. 2019.
How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–
5001, Florence, Italy. Association for Computational Linguistics.
A. Radford. 2018. Improving language understanding
by generative pre-training.
Karim Sayadi, Marcus Liwicki, Rolf Ingold, and Marc
Bui. 2016. Tunisian dialect and modern standard
arabic dataset for sentiment analysis: Tunisian election context. In Proceedings of The Second International Conference on Arabic Computational Linguistics, ACLING, pages 35–53.
Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric
de la Clergerie, Djamé Seddah, and Benoît Sagot.
2020. CamemBERT: a tasty French language model.
In Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, pages
7203–7219.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Salima Medhaffar, Fethi Bougares, Yannick Estève,
and Lamia Hadrich-Belguith. 2017. Sentiment analysis of Tunisian dialects: Linguistic ressources and
experiments. In Proceedings of the Third Arabic
Natural Language Processing Workshop, pages 55–
61.
Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luomaa, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. Computing Research Repository, arXiv:1912.07076. Version 1.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, Workshop Track
Proceedings.
Hussein Mozannar, Elie Maamary, Karl El Hajal, and
Hazem Hajj. 2019. Neural Arabic question answering. In Proceedings of the Fourth Arabic Natural Language Processing Workshop, pages 108–118,
Florence, Italy. Association for Computational Linguistics.
Lan Wuwei, Chen Yang, Xu Wei, and Ritter Alan. 2020. GigaBERT: Zero-shot transfer learning from English to Arabic. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Omar F. Zaidan and Chris Callison-Burch. 2011. The
Arabic online commentary dataset: an annotated
dataset of informal Arabic with high dialectal content. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language Technologies, pages 37–41.