1. Introduction
The Arabic language is spoken by more than 400 million people in 22 countries, including the Kingdom of Saudi Arabia [
1], and is the fourth most frequently used language on the World Wide Web [
1]. An estimated 58% of the Saudi Arabian population, approximately
million users [
2], are reported to have access to the Internet. Generally, the Arabic language is classified into three main types: classical Arabic, modern standard Arabic (MSA), and Arabic dialects (ADs). ADs are the spoken or written varieties of Arabic used in different Arab regions. ADs have recently become more frequently used for informal written communication on the Web, owing to the wide range of available social media applications.
Language identification, and more recently, dialect identification and discrimination, has become a compelling task in language recognition and natural language processing (NLP). It is the task of automatically classifying the language vocabulary used by a specific community into the geographical region that the native speaker belongs to [
3]. This task requires a more fine-grained level of identification, as it is the most challenging language identification task [
4].
Nevertheless, the knowledge obtained from dialect identification is very helpful for many applications, such as document retrieval, where documents are classified by dialect according to user preferences [
5], complementing existing recognition language modelling systems, and building natural language generation systems using the generated dialectal mappings [
6].
Arabic dialect identification (ADI) has received increasing attention in recent years. Early works on ADI focused either on distinguishing dialects from MSA or on discriminating among country-level dialects [
5,
7].
According to a recent review [
8], fine-grained dialects have not been thoroughly investigated in the ADI literature. Recently, Salameh et al. [
9] created a fine-grained dialect dataset covering 25 cities from a number of different Arab countries. Their results were one of the main motivations of this study: to investigate the problem of fine-grained Arabic dialect identification restricted to dialects from very close regions within one country, using short sentences of only a few words and without the notion of a word.
Character-level models, whether used with classical machine learning or deep learning, have in fact shown better applicability in the field of natural language processing (NLP) than word-level models [
10,
11]. This is because the word-level model has a number of shortcomings. First, it represents each word as a separate token, so words sharing a common root but differing only in a prefix or suffix are treated as distinct words, which makes the word-level model statistically inefficient.
Second, the vocabulary of a word-level model is fixed by the training corpus, so the model usually fails to handle unseen words at test time. Consequently, it cannot capture small changes in words, such as the small character-level differences between fine-grained dialects.
Character-level models, on the other hand, are effectively unrestricted in their vocabulary. A number of researchers have found that character-level models can overcome these word-level issues in text classification problems, provided the text is represented as a sequence of one-hot vectors, without changing the machine learning models themselves. Moreover, character-level models reduce the data preprocessing required compared to word-level models, which is particularly valuable for a language as challenging as Arabic.
Based on this intuition, the character-level model was investigated in this study to solve the ADI problem, more precisely for fine-grained dialects of regions within the same country. The Saudi dialects were chosen as the case study, as each dialect has unique phrases or words that can be very informative. Because the order or meaning of these phrases or words can be disregarded, a CNN approach can be applied in this situation, much as it has been successfully applied to computer vision tasks.
Saudi dialects share the same Arabic characters as other dialects spoken in nearby geographical regions. They differ from MSA at all levels of linguistic representation and, unlike MSA, do not follow standardised grammar rules.
Only two studies were found (i.e., [
12,
13]) that focus specifically on Saudi dialects. Other studies consider Saudi dialects as part of the Gulf dialects and classify them against MSA or other Arabic regional dialects, such as the Maghrebi, Iraqi, Egyptian, and Levantine dialects.
In this study, the main aim was to investigate the use of a character-level model for solving the problem of automatically identifying Saudi dialects, considering a number of supervised machine learning approaches.
Therefore, the main objectives of this study are as follows:
1. To collect a fine-grained Saudi dialect corpus consisting of short dialect sentences and covering the four main Saudi regional dialects.
2. To train and test several classical machine learning models on character-level input features, without the notion of words.
3. To investigate character-level deep learning models for the ADI problem.
The rest of the paper is organised as follows.
Section 2 explains the related works on ADI.
Section 3 describes our proposed Saudi dialect identification approach.
Section 4 discusses our experimental results. Finally,
Section 5 presents the concluding remarks and suggests future research lines.
2. Related Work
ADI has been receiving significant attention in recent years. The methodologies used to tackle this intricate problem can be divided into four main categories (see [
14] for a survey of the literature): first, nonautomated manual methods that rely on lexicons and linguistic rules; second, language models that estimate the probability that different linguistic units belong to a particular dialect; third, classical machine learning models; and fourth, deep learning models.
As this study focuses on solving the ADI problem using classical machine learning and deep learning approaches based on character-level input features, this section will only review the studies that covered these approaches.
More recently, a number of works on ADI have involved the use of machine learning (ML) approaches with feature engineering, where the performance of different models using different features is assessed and compared [
14].
Sadat et al. [
15] compiled a dataset from social media outlets covering 18 different dialects, in an attempt to provide a framework for multiclass ADI tasks. They implemented a Markov language model and a naïve Bayes (NB) classifier using unigram, bigram, and trigram character-level features. The best performance was recorded for the character-level bigram NB, with an F1 score of 80%.
Salameh et al. [
9] performed a fine-grained ADI task covering 25 Arabic dialects plus MSA. They used two datasets. The first consisted of 2000 sentences translated into the dialects of 25 cities plus MSA (Corpus-26). The second added 10,000 sentences translated into the dialects of five cities plus MSA (Corpus-6). A number of different feature combinations were used to train the models, including word n-grams, character n-grams, and language model probability scores. Two machine learning algorithms were applied: linear support vector machine (SVM) and multinomial naïve Bayes (MNB), with MNB reported as the best-performing model. The best-performing feature set combined character-level uni-, bi-, and trigrams, word unigrams, and the probability score of a five-gram character language model.
A study by Adouane et al. [
16] used an SVM to distinguish between Arabicised Berber and seven ADs at the country level (i.e., Algerian, Egyptian, Gulf, Levantine, Mesopotamian, Moroccan, and Tunisian) plus MSA. The dataset was a manually annotated corpus of blogs and newspapers that they had collected. The best feature set combined character-level five-grams and six-grams, and a lexicon that they constructed, weighted using the term frequency–inverse document frequency (TF-IDF). The SVM using the features mentioned resulted in an F1 score of 92.94%.
Malmasi et al. [
17] put forward an ADI problem to distinguish between the transcriptions of the conversational speech of four Arabic dialects (i.e., Egyptian, Gulf, Levantine, and North African) and MSA. Adouane et al. [
18], in an attempt to solve the ADI task posed by [
17], also used a linear SVM. The best performance was achieved using character-level five-gram and six-gram features (F1 score = 49.5%). Using the same data as in [
17], Eldesouki [
19] reported that the best-performing model was an SVM with character-level bi-, tri-, four-, and five-grams weighted using TF-IDF (accuracy = 51.36%). The second-best-performing model was a logistic regression with the same features (accuracy = 50.82%).
Another study by Malmasi et al. [
5] conducted a six-way classification ADI task using a linear SVM. They used the dataset compiled by [
20] containing a collection of 2000 parallel sentences in five ADs plus MSA for training the model. The results showed that the character n-grams were the best features compared to other features such as word n-grams. The best performance was achieved using a combination of character uni-, bi-, and trigrams with an accuracy of 66.48%.
In the ADI task stated in [
17], to discriminate between four Arabic dialects and MSA, deep learning models showed encouraging results compared to the ML techniques that were mentioned earlier for the same task [
18,
19]. For example, Guggilla [
21] used an adaptation of a CNN that included four layers: input, convolution, max-pooling and a fully connected softmax layer. He used randomly generated embeddings as features that kept updating during training. The model achieved an F1 score of 43.29%. Belinkov and Glass [
22] employed a character-level CNN with seven layers: embedding, dropout, multiple parallel convolutions, max-pooling, fully connected, and softmax layers. The CNN using character embeddings as features reached an F1 score of 48.34%, outperforming the model in [
21].
For the same ADI task mentioned earlier in [
23] with data provided by [
24], Ali [
25] employed a character-level CNN combined with dialect embedding vectors and a representation extracted from linguistic features. He experimented with three CNN architectures that differed in the input layer before the convolution layer. The first CNN used a one-hot character representation for the input layer. The second used an embedding layer before the convolution layer and the third used a GRU recurrent layer before the convolution layer. The three architectures scored accuracies of 57.11%, 56.97% and 57.59%, respectively.
All of the studies mentioned above mainly involved the problem of distinguishing the Arabic dialects spoken in different countries and MSA. However, there has been an increasing level of interest in studying dialects within the same country in the past two years. Differentiating the dialects of different provinces of the same country is an even more complex task, as capturing linguistic differences becomes more intricate and convoluted.
In an attempt to explore the problem of ADI on a province level, Abdul-Mageed et al. [
26] recently shared a Twitter-based dataset covering a total of 100 provinces from 21 Arab countries.
Using the same data as in [
26], Nayel et al. [
27] built an ensemble of five models to classify province-level dialects: complement Naïve Bayes (CNB), decision tree (DT), logistic regression (LR), random forest (RF), and support vector machine (SVM). In addition, they used TF-IDF with unigram features to train the system. The SVM outperformed the other models on the training data, with an F1 score of 4.73. The final ensemble classifier reached an accuracy of 4.8 and an F score of 4.55, outperforming other transformer techniques used for the same ADI task [
26,
28].
In summary, solving the ADI problem for province-level dialects is still in its early stages, and more work needs to be done. As seen for country-level ADI, models using character n-gram features achieved higher accuracies than those using word n-gram features. Therefore, character-level features were chosen for further investigation in this study to solve the ADI problem for province-level dialects.
3. Saudi Dialect Identification Approach
The main task of automatic dialect identification is to build a model that can predict in which dialect a term or word
w is written. This process requires a more fine-grained level of identification, as it is the most challenging language identification task [
4].
The implemented ADI approach, as shown in
Figure 1, consists of five main phases: dialect data collection, data preprocessing and labelling, character-based feature extraction, CNN character-based model/classical machine learning character-based models, and model evaluation performance. The following subsections explain these phases in more detail.
3.1. Saudi Dialect Data Collection Phase
Previous studies have shown that isolated words and individual phonemes can be successfully used for dialect identification [
29,
30]. Accordingly, a dialect corpus that mostly consisted of words or very short sentences was built for the purpose of this study.
In fact, the Saudi regional dialect is divided into four main categories [
31]; these are:
Hijazi: The dialect spoken by native speakers in the west of Saudi Arabia, which includes the Makkah and Al-Madinah regions.
Najdi: The dialect spoken by native speakers in central Saudi Arabia, which includes the Riyadh and Al-Qasim regions.
Janobi: The dialect spoken by native speakers in the south of Saudi Arabia, which includes the Aseer region, Najran city, and Jazan city.
Hasawi: The dialect spoken by native speakers in the east of Saudi Arabia, which includes the Al-Hasa region and Al-Dammam city.
Therefore, data for the four Saudi regional dialects were collected from Arabic social media, such as blogs, discussion forums, and reader commentaries, given that the language of such social media is typically dialectal. Data were also collected from Twitter using regional dialect hashtags over the period from January 2020 to May 2020, including #اللهجة_الحجازية, #اللهجة_النجدية, #اللهجة_الجنوبية, and #اللهجة_الحساوية.
Table 1 shows the statistics for the collected dataset. As displayed, there were a total of 3768 dialect sentences.
3.2. Data Preprocessing and Labelling
After the data were harvested, some preprocessing steps were applied, comprising two main steps: data cleaning and normalisation. For data cleaning, the decision was made to apply it as sparingly as possible, because applying text-cleaning techniques to such a fine-grained dialect language might affect the contextual meaning.
However, punctuation marks, extra spaces, and all diacritics and elongations were deleted. Some duplicate words were found in more than one dialect language. Therefore, to validate these data, a human annotator was asked to check these duplicates and to validate the collected data in general.
For the normalisation step, the following rules were applied to normalise Arabic letters (a sketch is given after this list):
أ, إ, and آ were replaced with ا (here the different forms of the letter alif were replaced with the standard alif form);
ئ was replaced with ا, (here the letter ya with hamza above was replaced with the standard alif form);
ى was replaced with ي, (here the letter ya was replaced with the standard ya form);
ة was replaced with ه, (here the letter ta marbuta was replaced with the letter ha);
ؤ was replaced with و, (here the letter waw with hamza above was replaced with the standard waw form);
كـ was replaced with ك, (here the letter kaf with the initial shape was replaced with the isolated kaf shape);
Arabic stop words were removed, such as من, في, على, and الى.
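To make these rules concrete, the following is a minimal Python sketch of the normalisation step; the regular expressions, the tatweel handling, and the partial stop-word list shown here are illustrative assumptions rather than the exact implementation used in this study.

```python
import re

# Illustrative sketch of the normalisation rules listed above (not the exact
# implementation used in this study).
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")   # tashkeel (diacritics)
STOP_WORDS = {"من", "في", "على", "الى"}              # partial stop-word list

def normalise(text: str) -> str:
    text = re.sub("[أإآ]", "ا", text)        # unify alif forms
    text = text.replace("ئ", "ا")            # ya with hamza above -> alif
    text = text.replace("ى", "ي")            # alif maqsura -> standard ya
    text = text.replace("ة", "ه")            # ta marbuta -> ha
    text = text.replace("ؤ", "و")            # waw with hamza above -> waw
    text = text.replace("ـ", "")             # drop elongation (tatweel), e.g. كـ -> ك
    text = ARABIC_DIACRITICS.sub("", text)   # remove diacritics when normalising
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```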
The last stage of this phase was the labelling process. Four annotators were asked to label and validate the collected data. Annotators were required to have lived in at least two regions, to ensure that they could distinguish between at least one pair of dialects. Each annotator then validated a pair of dialects, resolved the overlapping texts, and highlighted the items whose origin dialect they could not determine. For these items, a second annotator was consulted, and if the first two could not agree, a third annotator was asked to determine the origin dialect of the word. The three annotators were unable to agree on the origin dialect for a small number of words, so these words were deleted. In total, six annotators were involved in completing the labelling and validation of the collected data.
3.3. Character-Based Feature Extraction
In this phase, the main aim was to convert each sentence in our dialect dataset into character-based features. As two different techniques were used for building the prediction models, each had a different feature extraction method.
For deep learning, each sentence was represented by a sequence of numeric character vectors of length L, where L is the maximum sentence length in our dataset, equal to 55. First, each character in the alphabet set was encoded as a one-hot vector of size c, where c is the total number of characters in the alphabet set. Then, the sequence of characters in each sentence was transformed into a sequence of such vectors.
Two different alphabet sets were used in the experiments: the first consisted of 30 characters (the 28 Arabic alphabet characters plus the hamza and the space), while the second consisted of 37 characters, namely the same 30 plus the 7 diacritics of the Arabic language: Tashdid, Fatha, Damma, Tanwin Damm, Kasra, Tanwin Kasr, and Sukun.
The experiment was run for each character set: each sentence and its corresponding dialect label were encoded into a sequence of vectors over the fixed character set, and a sequence of 55 vectors was obtained as the input layer.
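As an illustration, a one-hot character encoding of this kind can be sketched as follows; the exact alphabet ordering and the zero-padding/truncation convention are assumptions, not taken from the original implementation.

```python
import numpy as np

# Character set 1: the 28 Arabic letters plus the hamza and the space (c = 30).
ALPHABET = list("ءابتثجحخدذرزسشصضطظعغفقكلمنهوي") + [" "]
CHAR2IDX = {ch: i for i, ch in enumerate(ALPHABET)}
L = 55  # maximum sentence length in the dataset

def encode_sentence(sentence: str) -> np.ndarray:
    """Encode a sentence as an (L, c) matrix of one-hot character vectors."""
    c = len(ALPHABET)
    matrix = np.zeros((L, c), dtype=np.float32)
    for pos, ch in enumerate(sentence[:L]):   # truncate sentences longer than L
        idx = CHAR2IDX.get(ch)
        if idx is not None:                   # unknown characters stay as zero vectors
            matrix[pos, idx] = 1.0
    return matrix                             # shorter sentences are zero-padded
```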
On the other hand, for classical machine learning, TF-IDF was used to construct the character-based feature set from the dataset.
TF-IDF combines two scores: the term frequency (TF), which counts the occurrences of a character n-gram in each sentence, and the inverse document frequency (IDF), which reduces the weights of character n-grams that appear in many sentences and increases the weights of character n-grams that appear rarely. Therefore, TF-IDF is defined as follows:

TF-IDF(g, s) = TF(g, s) × IDF(g)        (1)

Here, TF(g, s) is the number of times the character n-gram g appears in sentence s, and the IDF is defined as shown in Equation (2), where D is the total number of sentences in the dataset and df(g) is the number of sentences in which the character n-gram g appears:

IDF(g) = log(D / df(g))        (2)
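Since the experiments below use scikit-learn, the character-level TF-IDF features can be built roughly as follows; note that scikit-learn computes a smoothed variant of Equation (2), and the n-gram range and example sentences here are purely illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["وش تبغى", "ايش تبي"]  # toy example sentences, not from the corpus

# Character n-gram TF-IDF (illustrative n-gram range).
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))
X = vectorizer.fit_transform(sentences)   # sparse matrix: sentences x character n-grams
```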
3.4. Deep Learning Character-Based Model
Our classification problem is a multiclass problem, where each dialect represents a separate class. Therefore, given a sentence s and its dialect label l, the task is to predict the Saudi dialect in which s is written, using only its character sequence.
The CNN architecture described in [
10] was adapted. The architecture consists of two main steps, as shown in
Figure 2: in the first step, the CNN layers act as a feature extractor, and in the second step the convolution output feeds directly into a long short-term memory (LSTM) layer to capture long-term dependencies.
The input to the convolution layer was the sequence of vectors output by the previous phase. In the convolution layer, two 1D convolutions were applied in parallel to the character input to map the input sequence x into a hidden sequence h, with 64 and 100 filters, respectively, each with a kernel size of 3 and a pool size of 2.
After each convolution operation, a nonlinear activation of the rectifier linear unit (ReLU) [
32] type was applied. Then, a temporal max-pooling layer was applied to retain the most strongly activated nodes in the sequence. Next, a dense layer with a size of 128 and a ReLU activation function was applied. The two parallel 1D convolution outputs were concatenated and passed to a dropout layer, in which
of the input units were dropped to reduce overfitting. Afterwards, a TimeDistributed layer was applied to make the sequence suitable for the bidirectional LSTM (BiLSTM) layer.
Then, a BiLSTM was used, producing two sequences in the forward and backward directions with hidden sizes of 128 each. The final outputs of the BiLSTM layer for both directions were concatenated to yield 256-dimensional hidden units, and a fully connected (Dense) layer with a size of 128 and a ReLU activation function was applied. To reduce overfitting, two dropout layers with a drop rate of were applied before and after the Dense layer. Finally, the resulting vectors were passed to the softmax layer to produce the final probability distribution over our k dialect classes.
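The following Keras sketch reflects this architecture under stated assumptions: the dropout rates, the layer wrapped by TimeDistributed, and the concatenation axis are not fully specified above, so the values below are illustrative rather than the exact configuration used in this study.

```python
# Minimal Keras sketch of the character-level CNN-BiLSTM described above.
from tensorflow.keras import layers, models

L = 55          # maximum sentence length
c = 37          # alphabet size (30 characters, or 37 with diacritics)
k = 4           # number of dialect classes

inputs = layers.Input(shape=(L, c))            # sequence of one-hot character vectors

branches = []
for n_filters in (64, 100):                    # two parallel 1D convolution branches
    x = layers.Conv1D(n_filters, kernel_size=3, activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Dense(128, activation="relu")(x)
    branches.append(x)

x = layers.Concatenate()(branches)             # concatenate the two branches
x = layers.Dropout(0.5)(x)                     # assumed drop rate
x = layers.TimeDistributed(layers.Dense(128))(x)
x = layers.Bidirectional(layers.LSTM(128))(x)  # 128 units per direction -> 256-dim output
x = layers.Dropout(0.5)(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(k, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```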
3.5. Classical Machine Learning Character-Based Model
As the main aim of this project was to predict the dialect in which a word or term is written, a variety of popular and powerful supervised classification algorithms were applied to the collected dataset, including logistic regression (LR), the stochastic gradient descent classifier (SGDC), variants of the naive Bayes (NB) model, and support vector classification (SVC).
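For reference, these classical models (as also listed in the conclusions) map onto scikit-learn estimators roughly as follows; the hyperparameters are library defaults here and were tuned via the grid search described in Section 4.

```python
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.svm import SVC, LinearSVC, NuSVC

# Classical character-based classifiers considered in this study; scikit-learn
# defaults shown here, with the actual parameters selected by grid search.
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "SGDC": SGDClassifier(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
    "SVC": SVC(),
    "LinearSVC": LinearSVC(),
    "NuSVC": NuSVC(),
}
```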
3.6. Model Evaluation Performance
The performance of the classical machine learning algorithms and the CNN was evaluated based on the following widely used metrics: accuracy, recall, precision, and F-measure. To calculate them, a confusion matrix containing the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts must be built.
The accuracy can be calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The recall is the ratio of the correctly predicted dialects to the total number of dialects in the actual class, and it can be calculated as follows:

Recall = TP / (TP + FN)

The precision is the ratio of correctly predicted dialects to the total number of predicted dialects. It is calculated as follows:

Precision = TP / (TP + FP)

The F-measure represents the weighted average of precision and recall, and it is calculated as follows:

F-measure = 2 × (Precision × Recall) / (Precision + Recall)
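These metrics can be computed directly with scikit-learn; the weighted averaging across the dialect classes shown below is an assumption about how the per-class scores are combined.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Hijazi", "Najdi", "Hasawi", "Najdi"]   # toy ground-truth labels
y_pred = ["Hijazi", "Hasawi", "Hasawi", "Najdi"]  # toy predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f_measure, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
```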
4. Experiments and Results
All the experiments were run on an Apple Macintosh computer with a 2 GHz quad-core Intel Core i5 and 16 GB of memory, and the implementation was carried out in Python 3 (version 3.7.6). For the classical machine learning classification algorithms, the Natural Language Toolkit (NLTK) [
33] was used, as well as the scikit-learn Python library [
34]. For the CNN approach, the Keras library with TensorFlow as the back-end was used [
35].
Two different experiments were run: the first as a multiway ADI problem, including four-way and three-way settings, and the second as a two-way ADI problem. For both experiments, the considered CNN and classical machine learning algorithms were trained on 80% of our dataset and tested on the remaining 20%.
For classical machine learning, in both experiments the models were trained using the popular five-fold cross-validation approach, in which the model was trained on four folds and the remaining fold was used as the validation set. This procedure was repeated five times, and the average result was recorded. In addition, for both experiments a pipeline was developed to determine the best parameters using a grid-search approach, which included the following tested parameters (a sketch of this grid search is given after the list):
Different values for the maximum document frequency threshold (max_df), which is required when calculating TF-IDF, including 0.5, 0.75, and 1.0.
Different values for the minimum document frequency threshold (min_df), including 1, 5 and 10.
Different character n-gram combinations for the character model, as shown in Table 2.
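A hedged sketch of how this grid search could be set up with scikit-learn is given below; the classifier shown (logistic regression), the pipeline step names, and the specific n-gram combinations are illustrative stand-ins for the parameters listed above and in Table 2.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char")),     # character-level TF-IDF
    ("clf", LogisticRegression(max_iter=1000)),      # illustrative classifier
])

param_grid = {
    "tfidf__max_df": [0.5, 0.75, 1.0],
    "tfidf__min_df": [1, 5, 10],
    "tfidf__ngram_range": [(1, 2), (1, 3), (1, 4)],  # candidate combinations (cf. Table 2)
}

# Five-fold cross-validated grid search over the 80% training split.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
# search.fit(train_sentences, train_labels)
```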
For the CNN approach, models were tested both with the data normalised by removing tashkeel from the text (Models 2 and 4) and without this normalisation (Models 1 and 3). The effect of adding two drop layers during training was also tested (Models 3 and 4), versus no drop layers (Models 1 and 2).
Table 3 shows the combination of these tested parameters.
Table 4,
Table 5 and
Table 6 present the results of the considered algorithms for the four-way, three-way, and two-way ADI problems, respectively. The following subsections discuss these results and their experiments in more detail.
4.1. Results of the Multiway ADI Problem
For the multiway ADI problem, a four-way ADI problem was run including all of the four classes in the collected data, and a three-way ADI problem, which included the Hijazi, Najdi, and Hasawi dialects.
Table 4 shows the results of the four-way ADI problem for all of our considered algorithms, including the CNN approach (Models 1, 2, 3, and 4) and the classical machine learning algorithms.
For the CNN approach, the best result was achieved for Model 1, which reached an accuracy of . In the model, the tashkeel was considered as a character (resulting in a 67-character list), and there were no drop layers.
In the case of removing tashkeel and adding drop layers (Model 4), the performance of the CNN model increased slightly, by percent, compared with the performance without the drop layers (Model 3).
On the other hand, when tashkeel was included in the character list and no drop layers were added (Model 1), the performance increased by compared with the same case but with the drop layers (Model 3). Therefore, the results indicated that this slight improvement might stem from the combined effect of two factors, namely including tashkeel in the character list and omitting the drop layers from the CNN architecture, rather than from the tashkeel alone. Thus, the decision was made to normalise the dataset by removing the tashkeel from the Arabic sentences and to run the considered classical machine learning algorithms for the three- and two-way ADI problems.
For the classical machine learning algorithms, our grid search clearly indicated that most of them achieved their highest performance using TF-IDF with max_df equal to 0.75, min_df equal to 1, and ngram_range equal to (1, 4). In the case of the four-way ADI problem, all the considered classical machine learning algorithms outperformed the CNN models, with the best performance achieved by LR and NuSVC, reaching an accuracy of 40.9%.
Table 5 shows the results of the three-way ADI problem. In this problem, the worst-predicted class, Janobi, was removed, while the other classes were kept to test the performance of the considered algorithms in that setting. The best performance was achieved by the CNN approach (Model 6), in which the tashkeel was removed and no drop layer was added, reaching an accuracy of 47.0%, followed closely by the LR algorithm, which achieved . The classical machine learning algorithms also outperformed the other CNN models in the three-way ADI problem. It should be noted that the overall increase in performance for all the considered algorithms in the three-way ADI problem compared with the four-way problem was due to the removal of the worst-predicted class, namely the Janobi class.
4.2. Results of the Two-Way ADI Problem
For the two-way ADI problem, only the Hijazi and Hasawi dialects were included in the dataset. Only these two classes were considered because they originate from the two most distant regions in the Kingdom of Saudi Arabia (Hijazi is the dialect of the west, while Hasawi is the dialect of the east). Therefore, it was assumed that the difference between these dialects might be more easily distinguished by our considered machine learning algorithms than the differences between other dialects.
As Model 1 had achieved the best performance among the CNN models in the multiway ADI problem, it was chosen for our investigation of the two-way ADI problem.
Table 6 shows the results of the CNN approach (Model 1) and the classical machine learning algorithms in terms of precision, recall, F-measure, and accuracy for the two-way ADI problem.
The results revealed that all the considered algorithms improved in performance. Model 1 achieved the best performance, with an accuracy of and an F-measure of , outperforming the classical machine learning algorithms. Among the latter, the MultinomialNB algorithm, whose parameters were determined through a grid search, achieved an accuracy of , followed by SVC, which achieved .
5. Discussion
In this study, the main aim was to investigate the use of a character-based model, with classical and CNN machine learning algorithms, to solve the problem of identifying fine-grained Arabic dialects in short written texts and without the notion of words.
In general, the results of all the considered algorithms demonstrated low performance on the four-way task, indicating how difficult the problem is for existing approaches. Similar results have been reported in the literature for such very fine-grained ADI problems, particularly when the dialects come from very close regions (i.e., at the province level), because the degree of similarity between them is very high.
However, the results of the four-way ADI problem revealed that the classical machine learning algorithms based on a character model outperformed the CNN approach, which was also based on a character model. This outcome indicates a need for further development of the CNN architecture to deal with such fine-grained dialects in a multiway ADI problem. By contrast, the CNN approach outperformed the considered classical machine learning algorithms in the two-way ADI problem.
The results of the classical machine learning algorithms revealed that the best TF-IDF parameters were a character n-gram range from unigrams to four-grams, with max_df equal to 0.75 and min_df equal to 1. Among the considered algorithms, LR achieved the highest performance in the four-way and three-way ADI problems.
The reason why most of the considered algorithms performed poorly in identifying the Janobi class was the high degree of similarity between the Janobi class and the other classes, particularly the Najdi and Hijazi classes. Most of the disagreements between our annotators also occurred in that class. This situation might be due to the fact that many people from the Janobi region have, at some point in their lives, lived or grown up in the Hijaz or Najd regions of Saudi Arabia. This suggests that the Janobi dialect text found on the Internet has become mingled with other Saudi dialects.
Overall, it is not surprising that the accuracy of all considered algorithms ranged from to for the four-way, and from to for the three-way identification problem.
6. Conclusions
This study investigated the use of a character-level model to solve the ADI problem for short Arabic sentences, focusing on the Saudi dialects and specifically testing two-, three-, and four-way identification tasks. The adopted approach consisted of five phases: dialect data collection, data preprocessing and labelling, character-based feature extraction, classical machine learning/deep learning character-based models, and model evaluation. In the first phase, 3768 short dialect texts were collected from the Internet in the four main Saudi dialects: Hijazi, Najdi, Janobi, and Hasawi. The MultinomialNB, BernoulliNB, LogisticRegression, SGDClassifier, SVC, LinearSVC, NuSVC, and CNN approaches were then used in the learning phase, and their performance was evaluated and compared. The results showed that the best-performing algorithms in the four-way task were LR and NuSVC, reaching 40.9% accuracy. In the three-way task, the CNN approach (Model 6), in which the tashkeel was removed and no drop layers were added, outperformed the other models, reaching 47.0% accuracy. Moreover, TF-IDF with a combination of character n-grams ranging from unigrams to four-grams achieved the best performance for the considered classical machine learning algorithms.
In future research endeavours in ADI, the plan is to further improve the CNN approach, especially for the multiway problem. Different character-based feature construction techniques could also be considered, such as representing short parts of words rather than single characters.