  • I am a computational linguist. My main interest is how computational linguistics can deepen our understanding of the ...
Comparative studies of different translations of the same source text can be valuable sources of insight into the fluid notion of 'translation style'. Such studies can employ a wide variety of techniques, including computational analysis that targets specific elements in the text to allow for a systematic view of translator style. This study attempts a computational-stylistic analysis of the two English translations of Naguib Mahfouz's controversial novel Awlad Haratina (literally, Children of our Alley). The aim of the study is twofold. First, it aims to show how quantifiable computational and distant reading techniques can help identify patterns of stylistic differences between these two translations. Second, it attempts to situate the results of this analysis within the wider social context of the two English translations (Stewart 1981 and Theroux 1996) of one of the most famous modern Arabic novels. The results clearly show patterns of linguistic use specific to each of the two translations, highlighting differences in lexical variety and richness, sentence structure, readability level, and stylometric profile, as well as some lexical choices. These results can be interpreted within the social context of producing those two translations, with particular reference to characteristics of retranslation as discussed in the literature.
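As a rough illustration of two of the measures named in this abstract (lexical variety and sentence structure), the Python sketch below computes a type-token ratio and a mean sentence length. The regex tokenizer, the sentence splitter, and the sample sentences are all assumptions made for illustration; they are not the study's method or data.

```python
import re

def tokenize(text):
    """Lowercase word tokens from a deliberately naive regex tokenizer (assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Distinct word forms divided by total word tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)

# Invented sample sentences, not taken from either translation.
sample_a = "The alley was quiet. The alley was dark."
sample_b = "Silence filled the narrow lane, and darkness settled over it."

for name, text in (("A", sample_a), ("B", sample_b)):
    print(name, round(type_token_ratio(text), 3), round(mean_sentence_length(text), 2))
```

The raw type-token ratio falls as texts get longer, so comparing two full-length translations would normally call for equal-sized samples or a length-normalised variant.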
This article describes the system submitted by the RGCL team to the IDAT 2019 Shared Task: Irony Detection in Arabic Tweets. The system detects irony in Arabic tweets using deep learning. The paper evaluates the performance of several deep learning models, as well as how text cleaning and text pre-processing influence the accuracy of the system. Several runs were submitted. The highest F1 score achieved by one of the submissions was 0.818, ranking the RGCL team 4th out of 10 teams in the final results. Overall, we present a system that uses minimal pre-processing but is capable of achieving competitive results.
The primary purpose of this article is author verification of the Nahj Al-Balagha, a book attributed to Imam Ali and over which Sunni and Shi'i Muslims are proposing different theories. Given the morphologically complex nature of Arabic, we test whether morphological segmentation, applied to the book and works by the two authors suspected by Sunnis to have authored the texts, can be used for author verification of the Nahj Al-Balagha. Our findings indicate that morphological segmentation may lead to slightly better results than whole words and that regardless of the feature sets, the three sub-corpora cluster into three distinct groups using principal component analysis, hierarchical clustering, multi-dimensional scaling, and bootstrap consensus trees. Supervised classification methods such as Naive Bayes, Support Vector Machines, k Nearest Neighbours, Random Forests, AdaBoost, Bagging, and Decision Trees confirm the same results, which is a clear indication that (1) the book is internally consistent and can thus be attributed to a single person and (2) it was not authored by either of the suspected authors.
In translating text where sentiment is the main message, human translators give particular attention to sentiment-carrying words. The reason is that an incorrect translation of such words would miss the fundamental aspect of the source text, i.e. the author's sentiment. In the online world, MT systems are extensively used to translate User-Generated Content (UGC) such as reviews, tweets, and social media posts, where the main message is often the author's positive or negative attitude towards the topic of the text. It is important in such scenarios to accurately measure how far an MT system can be a reliable real-life utility in transferring the correct affect message. This paper tackles an under-recognised problem in the field of machine translation evaluation which is judging to what extent automatic metrics concur with the gold standard of human evaluation for a correct translation of sentiment. We evaluate the efficacy of conventional quality metrics in spotting a mistra...
We combine textual information from a corpus of film scripts and the images of important scenes from IMDB that correspond to these films to create a bimodal dataset (the dataset and scripts can be obtained from https://tinyurl.com/se9tlmr) for film age appropriateness classification with the objective of improving the prediction of age appropriateness for parents and children. We use state-of-the-art deep learning image feature extraction, including DenseNet, ResNet, Inception, and NASNet. We have tested several machine learning algorithms and have found XGBoost to yield the best results. Previously reported classification accuracies, using only textual features, were 79.1% and 65.3% for American MPAA and British BBFC classification respectively. Using images alone, we achieve 64.8% and 56.7% classification accuracy. The most consistent combination of textual and image features achieves 81.1% and 66.8%, both statistically significant improvements over the use of text only.
In order to study how Judaism, Christianity, and Islam are represented in Wikipedia, I use corpus linguistics tools to extract the noun collocates of the adjectives Jewish, Christian, and Islamic from the 2013 English Wikipedia in order to find out their semantic prosody. I then rank the positive and negative noun collocates using their logDice scores in order to find whether there is a statistically significant difference between them. In the case of negative nouns, an ANOVA test found a statistically significant difference. Pair-wise comparisons suggest that Islamic is more negative than either Christian or Jewish while there is no statistically significant difference between Jewish and Christian. On the positive side, there is no statistically significant difference between the adjectives. Intra-adjectival comparisons suggest that there is no statistically significant difference between Islamic’s positive and negative collocates while both Christian and Jewish are more posit...
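For readers unfamiliar with the logDice score mentioned above, the sketch below computes it from raw frequencies using the standard definition, logDice = 14 + log2(2·f_xy / (f_x + f_y)), and ranks collocates by it. The noun names and counts are hypothetical placeholders, not figures from the Wikipedia corpus.

```python
import math

def log_dice(f_xy, f_x, f_y):
    """logDice = 14 + log2(2 * f_xy / (f_x + f_y)).

    f_xy: co-occurrence frequency of the adjective-noun pair
    f_x:  frequency of the adjective
    f_y:  frequency of the noun
    """
    if f_xy <= 0:
        raise ValueError("co-occurrence frequency must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Hypothetical counts: (pair frequency, adjective frequency, noun frequency).
counts = {
    "noun_a": (50, 1000, 200),
    "noun_b": (5, 1000, 4000),
}

# Rank collocates from strongest to weakest association.
ranked = sorted(counts, key=lambda noun: log_dice(*counts[noun]), reverse=True)
for noun in ranked:
    print(noun, round(log_dice(*counts[noun]), 2))
```

The theoretical maximum is 14, reached when the two words only ever occur together. Because the score does not depend on corpus size, it is convenient for ranking collocates.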
While morphological segmentation has always been a hot topic in Arabic, due to the morphological complexity of the language and the orthography, most effort has focused on Modern Standard Arabic. In this paper, we focus on pre-MSA texts. We use the Gradient Boosting algorithm to train a morphological segmenter with a corpus derived from Al-Manar, a late 19th/early 20th century magazine that focused on the Arabic and Islamic heritage. Since most of the cultural heritage Arabic available suffers from substandard orthography, we have trained a machine learner to standardize the text. Our segmentation accuracy reaches 98.47%, and the orthography standardization an F-macro of 0.98 and an F-micro of 0.99. We also produce stemming as a by-product of segmentation.
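The abstract reports both a macro and a micro F-score. As a reminder of how the two differ, here is a minimal Python sketch for single-label classification: macro-F1 averages per-class F1 scores, while micro-F1 pools the counts across classes first. The label names and the toy predictions are invented for illustration and are unrelated to the Al-Manar data.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_micro_f1(gold, predicted):
    """Return (macro-F1, micro-F1) for two equal-length label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # gold class gets a false negative
    labels = set(gold) | set(predicted)
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro

# Hypothetical labels: "keep" = leave the character as is, "fix" = standardize it.
gold = ["keep", "keep", "keep", "keep", "fix", "fix"]
pred = ["keep", "keep", "keep", "fix", "fix", "keep"]

macro, micro = macro_micro_f1(gold, pred)
print(round(macro, 3), round(micro, 3))
```

In the single-label setting micro-F1 equals plain accuracy, so a gap between the two scores usually means the rarer classes are harder to predict.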
In order to examine whether Arabic has Heavy Noun Phrase Shifting (HNPS), I have extracted from the Prague Arabic Dependency Treebank a data set in which a verb governs either an object NP and an Adjunct Phrase (PP or AdvP) or a subject NP and an Adjunct Phrase. I have used binary logistic regression where the criterion variable is whether the subject/object NP shifts, and used as predictor variables heaviness (the number of tokens per NP, adjunct), part of speech tag, verb disposition (i.e., whether the verb has a history of taking double objects or sentential objects), NP number, NP definiteness, and the presence of referring pronouns in either the NP or the adjunct. The results show that only object heaviness and adjunct heaviness are useful predictors of object HNPS, while subject heaviness, adjunct heaviness, subject part of speech tag, definiteness, and adjunct head POS tags are active predictors of subject HNPS. I also show that HNPS can in principle be predicted from sentence ...
We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
In this paper, we compare two novel methods for part of speech tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags without any word segmentation; the second approach is segmentation-based, using a machine learning segmenter. Surprisingly, word-based POS tagging yields the best results, with a word accuracy of 94.74%.
We annotate a small corpus of religious Arabic with morphological segmentation boundaries and fine-grained segment-based part of speech tags. Experiments on both segmentation and POS tagging show that the religious corpus-trained segmenter and POS tagger outperform the Arabic Treebank-trained ones although the latter is 21 times as big, which shows the need for building religious Arabic linguistic resources. The small corpus we annotate improves segmentation accuracy by 5% absolute (from 90.84% to 95.70%), and POS tagging by 9% absolute (from 82.22% to 91.26%) when using gold standard segmentation, and by 9.6% absolute (from 78.62% to 88.22%) when using automatic segmentation.
Film age appropriateness classification is an important problem with a significant societal impact that has so far attracted little interest from Natural Language Processing and Machine Learning researchers. To this end, we have collected a corpus of 17000 films along with their age ratings. We use the textual contents in an experiment to predict the correct age classification for the United States (G, PG, PG-13, R and NC-17) and the United Kingdom (U, PG, 12A, 15, 18 and R18). Our experiments indicate that gradient boosting machines beat FastText and various Deep Learning architectures. We reach an overall accuracy of 79.3% for the US ratings compared to a projected superhuman accuracy of 84%. For the UK ratings, we reach an overall accuracy of 65.3% compared to a projected superhuman accuracy of 80.0%.
The morphological and syntactic properties of Arabic are complex, which makes the language difficult to handle in natural language processing (NLP). Moreover, the performance of a statistical machine translation system depends considerably on the quantity and quality of the training corpora. In this study, we show that preprocessing based on the words of the source language (Arabic), together with the introduction of a few linguistic rules relating to the syntax of the target language (French), yields improvements ...
This paper introduces a novel approach to the study of lexical and pragmatic meaning called ‘sociolexical profiling’, which aims at correlating the use of lexical items with author-attributed demographic features, such as gender, age, profession, and education. The approach was applied to a case study of a set of English idioms derived from the Pattern Dictionary of English Verbs (PDEV), a corpus-driven lexical resource which defines verb senses in terms of the phraseological patterns in which a verb typically occurs. For each selected idiom, a gender profile was generated based on data extracted from the Blog Authorship Corpus (BAC) in order to establish whether any statistically significant differences can be detected in the way men and women use idioms in everyday communication. A quantitative and qualitative analysis of the gender profiles was subsequently performed, enabling us to test the validity of the proposed approach. If performed on a large scale, we believe that sociol...
We present an annotation and morphological segmentation scheme for Egyptian Colloquial Arabic (ECA) in which we annotate user-generated content that significantly deviates from the orthographic and grammatical rules of Modern Standard Arabic and thus cannot be processed by the commonly used MSA tools. A per-letter classification scheme, in which each letter is classified as either a segment boundary or not, combined with a memory-based classifier that uses only word-internal context, proves effective and achieves a 92% exact match accuracy at the word level. The well-known MADA system achieves 81% while the per-letter classification scheme using the ATB achieves 82%. Error analysis shows that the major problem is that of character ambiguity since the ECA orthography overloads the characters which would otherwise be more specific in MSA, like the differences between y (ي) and Y (ى) and A (ا), > (أ), and < (إ) which are collapsed to y (ي) and A (ا) respectively or even total...
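The per-letter scheme described above can be pictured as a sequence-labelling encoding: each letter carries a binary label saying whether a segment ends at that letter. The Python sketch below shows only this encoding, its inverse, and word-level exact match scoring; the memory-based classifier itself is not shown. The "+" separator, the function names, and the transliterated example word are assumptions for illustration.

```python
def to_letter_labels(segmented, sep="+"):
    """Turn 'w+b+Al+byt' into ('wbAlbyt', [1, 1, 0, 1, 0, 0, 0]).

    Label 1 means a segment boundary follows this letter; 0 means it does not.
    """
    letters, labels = [], []
    for ch in segmented:
        if ch == sep:
            if labels:
                labels[-1] = 1
        else:
            letters.append(ch)
            labels.append(0)
    return "".join(letters), labels

def from_letter_labels(word, labels, sep="+"):
    """Rebuild the segmented form from a word and its per-letter labels."""
    out = []
    for ch, label in zip(word, labels):
        out.append(ch)
        if label == 1:
            out.append(sep)
    return "".join(out).rstrip(sep)

def exact_match_accuracy(gold_words, predicted_words):
    """Share of words whose whole predicted segmentation equals the gold one."""
    matches = sum(g == p for g, p in zip(gold_words, predicted_words))
    return matches / len(gold_words)

# Illustrative transliterated strings, not taken from the annotated corpus.
word, labels = to_letter_labels("w+b+Al+byt")
print(word, labels)
print(from_letter_labels(word, labels))
print(exact_match_accuracy(["w+b+Al+byt", "ktAb"], ["w+b+Al+byt", "k+tAb"]))
```

Exact match at the word level is a strict measure: a word counts as correct only if every letter label in it is correct, which is why it is lower than per-letter accuracy would be.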
One very common type of fake news is satire, which comes in the form of a news website or an online platform that parodies reputable real news agencies to create a sarcastic version of reality. This type of fake news is often disseminated by individuals on their online platforms as it has a much stronger effect in delivering criticism than a straightforward message. However, when the satirical text is disseminated via social media without mention of its source, it can be mistaken for real news. This study conducts several exploratory analyses to identify the linguistic properties of Arabic fake news with satirical content. It shows that although it parodies real news, Arabic satirical news has distinguishing features on the lexico-grammatical level. We exploit these features to build a number of machine learning models capable of identifying satirical fake news with an accuracy of up to 98.6%. The study introduces a new dataset (3185 articles) scraped from two Arabic satirical ...
We follow the Muslim Brotherhood’s (MB) English website Islamweb.com in a qualitative-cum-quantitative analysis in search of the key issues that preoccupied the MB from 2009 to 2012. Our findings indicate that the bulk of the content on the MB English website focuses on political participation, the relationship between Islamists and democracy, human rights violations in Egypt under Mubarak, and the violations of the police against MB university students in Egypt. Issues like the Palestinian question, women’s rights, and non-Muslim minorities rank very low on the MB’s English website agenda. We have also found that MB tries to avoid negative connotations, for example, by translating their top executive office as chairman, while the literal one is Guide, possibly to detach itself from the image of the Supreme Guide of the Islamic Revolution in Iran.
The Arabic orthography is problematic in two ways: (1) it lacks the short vowels, and this leads to ambiguity as the same orthographic form can be pronounced in many different ways each of which can have its own grammatical category, and (2) the Arabic word may contain several units like pronouns, conjunctions, articles and prepositions without an intervening white space. These two problems lead to difficulties in the automatic processing of Arabic. The thesis proposes a pre-processing scheme that applies word segmentation and word vocalization for the purpose of grammatical analysis: part of speech tagging and parsing. The thesis examines the impact of human-produced vocalization and segmentation on the grammatical analysis of Arabic, then applies a pipeline of automatic vocalization and segmentation for the purpose of Arabic part of speech tagging. The pipeline is then used, along with the POS tags produced, for the purpose of dependency parsing, which produces grammatical relatio...
The primary purpose of this paper is author verification of the Nahj Al-Balagha, a book attributed to Imam Ali and over which Sunni and Shi'i Muslims are proposing different theories. Given the morphologically complex nature of Arabic, we test whether morphological segmentation, applied to the book and works by the two authors suspected by Sunnis to have authored the texts, can be used for author verification of the Nahj Al-Balagha. Our findings indicate that morphological segmentation may lead to slightly better results than whole words, and that regardless of the feature sets, the three sub-corpora cluster into three distinct groups using Principal Component Analysis, Hierarchical Clustering, Multi-dimensional Scaling and Bootstrap Consensus Trees. Supervised classification methods such as Naive Bayes, Support Vector Machines, k Nearest Neighbours, Random Forests, AdaBoost, Bagging and Decision Trees confirm the same results, which is a clear indication that (a) the book is internally consistent and can thus be attributed to a single person, and (b) it was not authored by either of the suspected authors.
There has not been any research that provides an evaluation of the linguistic features extracted from the matn (text) of a Hadith. Moreover, none of the fairly large corpora are publicly available as a benchmark corpus for Hadith authenticity, and there is a need to build a “gold standard” corpus for good practices in Hadith authentication. We write a scraper in the Python programming language and collect a corpus of 3651 authentic prophetic traditions and 3593 fake ones. We process the corpora with morphological segmentation and perform extensive experimental studies using a variety of machine learning algorithms, mainly through Automatic Machine Learning, to distinguish between these two categories. With a feature set including words, morphological segments, characters, top N words, top N segments, function words and several vocabulary richness features, we analyse the results in terms of both prediction and interpretability to explain which features are more characteristic of each class. Many experiments have produced good results, and the highest accuracy (78.28%) is achieved with the Multinomial Naive Bayes classifier using word n-grams as features. Our extensive experimental studies conclude that, at least for Digital Humanities, feature engineering may still be desirable due to the high interpretability of the features. The corpus and software (scripts) will be made publicly available to other researchers in an effort to promote progress and replicability.
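To make the best-performing configuration named above more concrete (word n-grams fed to a Multinomial Naive Bayes classifier), here is a small self-contained Python sketch. It is a toy re-implementation with add-one smoothing, not the authors' pipeline; the class name TinyMultinomialNB, the n-gram range, and the placeholder training strings and labels are all assumptions.

```python
import math
from collections import Counter, defaultdict

def word_ngrams(text, n_max=2):
    """Word unigrams up to n_max-grams from whitespace-separated tokens."""
    tokens = text.split()
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats

class TinyMultinomialNB:
    """Multinomial Naive Bayes over n-gram counts with additive smoothing."""

    def fit(self, texts, labels, alpha=1.0):
        self.alpha = alpha
        self.class_docs = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(word_ngrams(text))
        self.vocab = {f for c in self.feature_counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in word_ngrams(text):
                if feat in self.vocab:  # features unseen in training are skipped
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Placeholder strings and labels standing in for real documents and classes.
texts = ["red apple pie", "red apple tart", "blue sky rain", "blue sky cloud"]
labels = ["x", "x", "y", "y"]

model = TinyMultinomialNB().fit(texts, labels)
print(model.predict("red apple"), model.predict("blue sky"))
```

One reason such a model is easy to interpret is that the per-class smoothed n-gram probabilities can be inspected directly to see which word sequences pull a text toward each class.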
Film age appropriateness classification is an important problem with a significant societal impact that has so far attracted little interest from Natural Language Processing and Machine Learning researchers. To this end, we have collected a corpus of 17000 film transcripts along with their age ratings. We use the textual contents in an experiment to predict the correct age classification for the United States (G, PG, PG-13, R and NC-17) and the United Kingdom (U, PG, 12A, 15, 18 and R18). Our experiments indicate that gradient boosting machines beat FastText and various Deep Learning architectures. We reach an overall accuracy of 79.3% for the US ratings compared to a projected superhuman accuracy of 84%. For the UK ratings, we reach an overall accuracy of 65.3% compared to a projected superhuman accuracy of 80.0%.
In this paper, we use a corpus of about 100,000 happy moments written by people of different genders, marital statuses, parenthood statuses, and ages to explore the following questions: Are there differences between men and women, married and unmarried individuals, parents and non-parents, and people of different age groups in terms of their causes of happiness and how they express happiness? Can gender, marital status, parenthood status and/or age be predicted from textual data expressing happiness? The first question is tackled in two steps: first, we transform the happy moments into a set of topics, lemmas, part of speech sequences, and dependency relations; then, we use each set as predictors in multi-variable binary and multinomial logistic regressions to rank these predictors in terms of their influence on each outcome variable (gender, marital status, parenthood status and age). For the prediction task, we use character, lexical, grammatical, semantic, and syntactic features in a machine learning document classification approach. The classification algorithms used include logistic regression, gradient boosting, and fastText. Our results show that textual data expressing moments of happiness can be quite beneficial in understanding the “causes of happiness” for different social groups, and that social characteristics like gender, marital status, parenthood status, and, to some extent, age can be successfully predicted from such textual data. This research aims to bring together elements from philosophy and psychology to be examined by computational corpus linguistics methods in a way that promotes the use of Natural Language Processing for the Humanities.
We annotate 60,000 words of Classical Arabic, covering topics in philosophy, religion, literature, and law, with fine-grained segment-based morphological descriptions. We use these annotations to build a morphological segmenter and part-of-speech tagger for Classical Arabic. With character-level classification and features from the word and its lexical context, the segmenter achieves a word accuracy of 96.8%, with the main issue being a high rate of out-of-vocabulary words. A token-based part-of-speech tagger achieves an accuracy of 96.22% (97.72% on known tokens) in spite of the small size of the corpus. An error analysis shows that most of the tagging errors are the result of segmentation errors and that quality improves as more data is added. The morphological segmenter/tagger has a wide range of potential applications in processing Classical Arabic, a low-resource variety of the language.
Discourse markers are lexical items that play the role of conveying the speaker’s attitude towards the topic of conversation. Although discourse markers have this function, they have little semantic content, yet their importance for understanding (oral) discourse can hardly be overestimated. As such, they have been widely studied in English. While the Qurʾān has a number of these discourse markers, none of them seem to have been properly noticed, let alone studied, by Arabic linguists and Qurʾān commentators. This article introduces what I believe to be the most frequent of these in the Qurʾān: araʾaytum (literally: “have you seen?”) in its various morphological manifestations. This article uses concepts from historical linguistics, pragmatics, and corpus linguistics – and in particular lexical cooccurrences – to examine the development of this form from a sense verb that simply means “to see” to a pragmatic attitudinal marker that is semantically vacuous and whose main function is to express the speaker’s dissatisfaction with, resentment at, or disapproval of the topic of conversation. While the analysis provided in this article is mainly linguistic, the findings will affect the way we read the Arabic-Islamic heritage, especially as regards the authenticity of what are known as the Satanic Verses, also known as the episode of the High-Flying Cranes (Qiṣṣat al-ġarānīq). This article also provides suggestions for the translation of this discourse marker.

http://booksandjournals.brillonline.com/content/journals/10.1163/22321969-12340050
In order to study how Judaism, Christianity, and Islam are represented in Wikipedia, I use corpus linguistics tools to extract the noun collocates of the adjectives Jewish, Christian, and Islamic from the 2013 English Wikipedia in order to find out their semantic prosody. I then rank the positive and negative noun collocates using their logDice scores in order to find whether there is a statistically significant difference between them. In the case of negative nouns, an ANOVA test found a statistically significant difference. Pair-wise comparisons suggest that Islamic is more negative than either Christian or Jewish, while there is no statistically significant difference between Jewish and Christian. On the positive side, there is no statistically significant difference between the adjectives. Intra-adjectival comparisons suggest that there is no statistically significant difference between Islamic's positive and negative collocates, while both Christian and Jewish are more positive than negative.
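As a rough illustration of the ranking step, logDice is commonly defined as 14 + log2(2 * f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of an adjective-noun pair and f_x, f_y are the individual frequencies of the two words. The sketch below assumes that standard definition. The function name, the noun labels, and the counts are hypothetical and are not taken from the Wikipedia data.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """Standard logDice association score for a collocation pair."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: (co-occurrence, adjective frequency, noun frequency).
toy_counts = {
    "noun_a": (50, 1000, 200),
    "noun_b": (5, 1000, 4000),
}

# Rank the collocates from strongest to weakest association.
ranked = sorted(toy_counts, key=lambda n: log_dice(*toy_counts[n]), reverse=True)
print(ranked)
```

Because the score depends only on relative frequencies, it stays comparable across corpora of different sizes, which is one reason it suits ranking collocates of several adjectives side by side.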
This paper discusses the shift in authority in the digital age. While traditional institutions used to dominate the scene, with the advent of the Internet, new entities have stepped in and occupied their spots. More and more Muslims consult "independent" fatwa websites, but this process has led not to the establishment of new authorities, but to the reinstatement of minority opinions that were long thought forgotten. The theological, legal, and political consequences are unprecedented.
THIS PAPER WILL BE PUBLISHED IN THE SPRING OF 2017.
Using digital humanities tools and methods, we extract, classify, and analyze 1,006 Jihad fatwas from a corpus of 164,000 online fatwas. We use the questions and page hits to rank clusters of fatwas in order to discover what Jihad questions Muslims ask, what Jihad issues interest them the most, and what the targets of Jihad may be. We focus more on the questions than the answers since it is questions that can give us a window into what may be called the Muslim collective mind. The results show that Jihad questions are interwoven with several topics from performing prayers to expiation for homosexuality. While Prophet Muhammad's military expeditions were the most asked about and most viewed category, since they are seen as a model of what Jihad is, the second most important category was concubinage, and most questions pre-date ISIS and Boko Haram. When there was a target, Jews were found in 73% of the questions.
This study examines how the post-Arab Spring Muslim Brotherhood attempts to use English-language digital media to influence the Western approach toward the post-Arab Spring political scene. It investigates how news is framed, the implications of the coverage relating to the message they try to convey to the audience and their political standpoints on the US-Western interests in the region and their signification to covering Islam in general. ...
Object and Subject Heavy-NP shift in Arabic
Emad Mohamed
emohamed@umail.iu.edu

Abstract
In order to examine whether Arabic has Heavy Noun Phrase Shifting (HNPS), I have extracted from the Prague Arabic Dependency Treebank a data set in which a verb governs both an object NP and an Adjunct Phrase (PP or AdvP) or in which a verb governs a subject NP and an Adjunct Phrase. I have used binary logistic regression where the criterion variable is whether the subject/object NP shifts, and used as predictor variables heaviness (the number of tokens per NP, Adjunct), part of speech tag, verb disposition (i.e. whether the verb has a history of taking double objects or sentential objects), NP number, NP definiteness, and the presence of referring pronouns in either the NP or the Adjunct. The results show that only object heaviness and Adjunct heaviness are useful predictors of object HNPS, while subject heaviness, Adjunct heaviness, subject part of speech tag, definiteness, and Adjunct head POS tag are active predictors of subject HNPS. I have also shown that HNPS can in principle be predicted from the sentence structure.
Key Words: Heavy NP Shift, Arabic, Corpus-based syntax, logistic regression
... SANDRA KÜBLER¹ and EMAD MOHAMED² ¹Department of Linguistics, Indiana University, Bloomington, IN 47405, USA, e-mail: skuebler@indiana.edu ²Computer Science Program, Carnegie Mellon ... Diab (2009) described AMIRA, the second generation of this toolset. ...
Arabic is a morphologically rich language, which presents a challenge for part of speech tagging. In this paper, we compare two novel methods for POS tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic ...
We annotate a small corpus of religious Arabic with morphological segmentation boundaries and fine-grained segment-based part of speech tags. Experiments on both segmentation and POS tagging show that the segmenter and POS tagger trained on the religious corpus outperform those trained on the Arabic Treebank, although the latter is 21 times as big, which shows the need for building religious Arabic linguistic resources. The small corpus we annotate improves segmentation accuracy by 5% absolute (from 90.84% to 95.70%), and POS tagging by 9% absolute (from 82.22% to 91.26%) when using gold standard segmentation, and by 9.6% absolute (from 78.62% to 88.22%) when using automatic segmentation.
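The gains above are quoted as absolute percentage points. The short sketch below recomputes them from the accuracies in the abstract and, as an additional view that the abstract itself does not report, expresses each gain as a relative reduction in error rate. The function names and task labels are illustrative only.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return after - before


def relative_error_reduction(before: float, after: float) -> float:
    """Share of the original errors that the new system removes."""
    error_before = 100.0 - before
    error_after = 100.0 - after
    return (error_before - error_after) / error_before


# Accuracies (percent) as quoted in the abstract: (Treebank-trained, religious-corpus-trained).
results = {
    "segmentation": (90.84, 95.70),
    "POS tagging, gold segmentation": (82.22, 91.26),
    "POS tagging, automatic segmentation": (78.62, 88.22),
}

for task, (before, after) in results.items():
    print(
        f"{task}: +{absolute_gain(before, after):.2f} points absolute, "
        f"{relative_error_reduction(before, after):.1%} of errors removed"
    )
```

The recomputed segmentation gain is 4.86 points, which the abstract rounds to 5%.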
We present a method for generating Colloquial Egyptian Arabic (CEA) from morphologically disambiguated Modern Standard Arabic (MSA). When used in POS tagging, this process improves the accuracy from 73.24% to 86.84% on unseen CEA text, and reduces the percentage of out-of-vocabulary words from 28.98% to 16.66%. The process holds promise for any NLP task targeting the dialectal varieties of Arabic; e.g., this approach may provide a cheap way to leverage MSA data and morphological resources to create resources for colloquial Arabic to English machine translation. It can also considerably speed up the annotation of Arabic dialects.
In this paper, we compare two novel methods for part of speech tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags without any word segmentation; the second approach is segmentation-based, using a machine learning segmenter. Surprisingly, word-based POS tagging yields the best results, with a word accuracy of 94.74
We use an automatic pipeline of word tokenization, stemming, POS tagging, and vocalization to perform real-world Arabic dependency parsing. In spite of the high accuracy on the modules, the very few errors in tokenization, which reaches an accuracy of 99.34%, lead to a drop of more than 10% in parsing, indicating that no high quality dependency parsing of Arabic, and possibly other morphologically rich languages, can be reached without (semi-)perfect tokenization. The other module components, stemming, vocalization, and part of speech tagging, do not have the same profound effect on the dependency parsing process
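One way to build intuition for why a tokenizer that is 99.34% accurate can still hurt parsing: if token errors were independent (a simplifying assumption that the abstract does not make), the chance that an entire sentence is tokenized correctly would shrink geometrically with sentence length. The sketch below only illustrates that compounding effect. The function name and the sentence lengths are hypothetical, and the numbers it prints are not results from the paper.

```python
def prob_all_tokens_correct(token_accuracy: float, sentence_length: int) -> float:
    """Probability that every token in a sentence is tokenized correctly,
    assuming independent per-token errors (an illustrative simplification)."""
    if not 0.0 <= token_accuracy <= 1.0:
        raise ValueError("token_accuracy must be between 0 and 1")
    if sentence_length < 0:
        raise ValueError("sentence_length must be non-negative")
    return token_accuracy ** sentence_length


# 0.9934 is the tokenization accuracy quoted in the abstract;
# the sentence lengths are arbitrary examples.
for length in (10, 30, 50):
    share = prob_all_tokens_correct(0.9934, length)
    print(f"{length} tokens: {share:.1%} of sentences fully correct")
```

A single tokenization error can also change the number of nodes in the tree, so it may invalidate several dependency arcs at once, which this per-sentence estimate does not capture.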