
Nazlia Omar

Sentiment lexicon generated using Unsupervised Context-Aware Gloss Expansion
Vital to the task of mining sentiment from text is a sentiment lexicon, a dictionary of terms annotated for their a priori information along the semantic dimension of sentiment. Each term is assigned a general, out-of-context sentiment polarity. Unfortunately, online dictionaries and similar lexical resources do not readily include information on the sentiment properties of their entries. Moreover, manually compiling sentiment lexicons is tedious in terms of annotator time and effort. This has resulted in a large volume of research on automated sentiment lexicon generation algorithms. Most of these algorithms were designed for English, owing to the abundance of readily available lexical resources in that language. This is not the case for low-resource languages such as Malay. Although research on Malay sentiment analysis has increased sharply over the past few years, the subtask of sentiment lexicon induction for this language remains under-investigated. We present a minimally supervised sentiment lexicon induction model specifically designed for the Malay language. It takes as input only two initial paradigm terms, one positive and one negative, and mines WordNet Bahasa's synonym chains and Kamus Dewan's gloss information to extract subjective, sentiment-laden terms. The model automatically bootstraps a reliable, high-coverage sentiment lexicon that can be employed in Malay sentiment analysis of full text. Intrinsic evaluation against a manually annotated test set demonstrates that the model's ability to assign sentiment properties to terms is on par with human judgement.
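The core bootstrapping step can be pictured as label propagation from the two seed terms along synonym links. The sketch below is a minimal Python illustration: the synonym graph, seed words, depth limit, and first-label-wins conflict rule are all illustrative assumptions, since the paper's actual mining of WordNet Bahasa synsets and Kamus Dewan glosses is far richer.

    from collections import deque

    # Hypothetical synonym edges between Malay terms; the real model mines
    # WordNet Bahasa synsets and Kamus Dewan glosses, which are not modelled here.
    SYNONYMS = {
        "baik": {"bagus", "elok"},
        "bagus": {"baik", "hebat"},
        "buruk": {"teruk", "jahat"},
        "teruk": {"buruk"},
    }

    def bootstrap_lexicon(pos_seed, neg_seed, synonyms, max_depth=3):
        """Propagate the two seed polarities breadth-first along synonym links."""
        lexicon = {pos_seed: "positive", neg_seed: "negative"}
        queue = deque([(pos_seed, 0), (neg_seed, 0)])
        while queue:
            term, depth = queue.popleft()
            if depth >= max_depth:
                continue
            for syn in synonyms.get(term, ()):
                if syn not in lexicon:          # first label wins; conflicts are skipped
                    lexicon[syn] = lexicon[term]
                    queue.append((syn, depth + 1))
        return lexicon

    print(bootstrap_lexicon("baik", "buruk", SYNONYMS))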
Part-of-speech (POS) tagging effectiveness is essential in the era of the fourth industrial revolution, as high-technology machines such as cars and smart homes can be controlled by human voice commands. A POS tagger is important in many domains, including information retrieval. POS tags such as verb or noun can, in turn, be used as features for higher-level natural language processing (NLP) tasks such as Named Entity Recognition, Sentiment Analysis, and Question Answering chatbots. However, research on developing an effective POS tagger for the Malay language is still in its infancy. Many existing methods that have been tested on English have not been tested for Malay. This study presents an experiment on tagging Malay words using the supervised machine learning (ML) approach. The purpose of this work is to investigate the performance of supervised ML approaches in tagging Malay words and the effectiveness of affix-based feature patterns. Naive Bayes and k-nearest neighbour models are used to assign a specific tag to each word. A corpus obtained from Dewan Bahasa dan Pustaka (DBP), for which DBP has defined 21 tagsets (categories), is used in this experiment. Two corpus sizes, 20,000 tokens and 40,000 tokens, are used in the tests. Moreover, affix-based feature patterns are extracted from the corpora to improve the tagging process.
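As a concrete illustration of affix-based features feeding a Naive Bayes tagger, here is a minimal sketch assuming scikit-learn; the affix lists, toy training pairs, and two-tag output are invented stand-ins for DBP's 21-tag corpus.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    PREFIXES = ("me", "ber", "ter", "pe", "ke")   # assumed Malay prefix inventory
    SUFFIXES = ("kan", "an", "i", "nya")          # assumed Malay suffix inventory

    def affix_features(word):
        """Represent a word by its surface form plus any matching affixes."""
        w = word.lower()
        feats = {"word=" + w: 1}
        for p in PREFIXES:
            if w.startswith(p):
                feats["prefix=" + p] = 1
        for s in SUFFIXES:
            if w.endswith(s):
                feats["suffix=" + s] = 1
        return feats

    # Tiny hypothetical training set of (word, tag) pairs.
    train = [("makan", "VERB"), ("berlari", "VERB"),
             ("makanan", "NOUN"), ("minuman", "NOUN")]
    tagger = make_pipeline(DictVectorizer(), MultinomialNB())
    tagger.fit([affix_features(w) for w, _ in train], [t for _, t in train])
    print(tagger.predict([affix_features("pakaian")]))   # suffix "-an" points to NOUN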
With the evolution of user-generated web content, people naturally and freely share their opinions across numerous domains. Labelling training data for every domain, however, is massively costly and prevents us from taking advantage of information shared across domains. As a result, cross-domain sentiment analysis is a challenging NLP task due to feature and polarity divergence. The main aim of this work is to automatically create a bidirectional thesaurus that can be used to transfer feature vectors between the source and target domains. To this end, we design a feature-transfer algorithm that selects and transfers informative and representative features between the two domains. Several experiments were conducted to evaluate the proposed model, and the results were compared against similar known baseline methods.
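The selection step can be sketched as ranking features that occur in both domains; the min-count ranking below is an illustrative stand-in and not the paper's actual criterion, which the abstract does not specify.

    from collections import Counter

    def shared_features(source_docs, target_docs, k=3):
        """Rank features seen in both domains by the smaller of their two counts."""
        src = Counter(w for d in source_docs for w in d.split())
        tgt = Counter(w for d in target_docs for w in d.split())
        common = set(src) & set(tgt)
        return sorted(common, key=lambda w: min(src[w], tgt[w]), reverse=True)[:k]

    # Toy reviews from two domains (books vs. kitchen appliances).
    books = ["great story great plot", "boring story weak plot"]
    kitchen = ["great blender weak motor", "great price boring design"]
    print(shared_features(books, kitchen))   # e.g. ['great', 'weak', 'boring']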
In a multi-label classification problem, each document is associated with a subset of labels and is typically described by many features. Feature selection is therefore an important task in machine learning, as it attempts to remove irrelevant and redundant features that can hinder performance. This paper suggests transforming the multi-label documents into single-label documents before applying a standard feature selection algorithm. Under this transformation, each document is copied once for every label it carries, with all of its features assigned to each copy. In this context, we conducted a comparative study of five feature selection methods. These methods are incorporated into traditional Naive Bayes classifiers adapted to deal with multi-label documents. Experiments on benchmark datasets showed that the multi-label Naive Bayes (MLNB) classifier coupled with the GSS method delivered better performance than the MLNB classifier using the other methods.
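The copy transformation itself is simple enough to state in a few lines; this is a minimal sketch with invented toy documents, not the paper's evaluation pipeline.

    def copy_transform(docs):
        """(features, labels) pairs -> one (features, label) pair per label."""
        single = []
        for features, labels in docs:
            for label in labels:
                single.append((features, label))   # same features, one label per copy
        return single

    # Hypothetical multi-label documents: bags of words with their label sets.
    docs = [
        ({"goal": 2, "match": 1}, {"sports"}),
        ({"election": 1, "match": 1}, {"politics", "sports"}),
    ]
    for features, label in copy_transform(docs):
        print(label, features)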
Adverse drug reactions (ADRs) are important information for verifying a patient's view of a particular drug. Ordinary user comments and reviews are collected to extract ADR mentions, i.e., cases where a user reports a side effect after taking a specific medication. In the literature, most researchers have focused on machine learning techniques to detect ADRs; these methods train a classification model on annotated medical review data. Yet ADR extraction still faces many challenging issues, especially detection accuracy. The main aim of this study is to propose latent semantic analysis (LSA) with artificial neural network (ANN) classifiers for ADR detection. The findings show the effectiveness of utilizing LSA with an ANN in extracting ADRs.
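A minimal sketch of such a pipeline, assuming scikit-learn as the implementation: TF-IDF features are projected into an LSA latent space via truncated SVD and fed to a small neural classifier. The reviews, labels, and hyperparameters are toy stand-ins, not the paper's setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline

    reviews = [
        "this drug gave me severe headaches and nausea",
        "felt dizzy after the second dose",
        "worked well, no problems at all",
        "symptoms improved within a week",
    ]
    labels = [1, 1, 0, 0]   # 1 = review mentions an adverse drug reaction

    model = make_pipeline(
        TfidfVectorizer(),
        TruncatedSVD(n_components=2, random_state=0),   # LSA: low-rank latent space
        MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0),
    )
    model.fit(reviews, labels)
    print(model.predict(["I had terrible nausea on this medication"]))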
The number of online documents has grown rapidly, and with the expansion of the Web, document analysis, or text analysis, has become an essential task for preparing, storing, visualizing and mining documents. The texts generated daily on social media platforms such as Twitter, Instagram and Facebook are vast and unstructured. Most of these texts come in the form of short text and need special analysis, because short texts suffer from a lack of information and from sparsity. Thus, this topic has attracted growing attention from researchers in the data storage and processing community for knowledge discovery. Short text clustering (STC) has become a critical task for automatically grouping various unlabelled texts into meaningful clusters. STC is a necessary step in many applications, including Twitter personalization, sentiment analysis, spam filtering, customer reviews and many other social-network-related applications. In the last few years, the natural-language-processing resear...
With the evolution of user-generated web content, people naturally and freely share their opinions across numerous domains. Labelling training data for every domain, however, is massively costly and prevents us from taking advantage of information shared across domains. As a result, cross-domain sentiment analysis is a challenging NLP task due to feature and polarity divergence. To build a sentiment-sensitive thesaurus that groups different features expressing the same sentiment for cross-domain sentiment classification, different co-occurrence measures are used. This paper presents a comparative study covering different co-occurrence methods for building a cross-domain sentiment thesaurus. This work also defines a Bidirectional Conditional Probability (BCP) to handle the asymmetric co-occurrence problem. Two machine learning classifiers (Naïve Bayes (NB) and Support Vector Machine (SVM)) and three feature selection methods (information gain, odds ratio, chi-square) are used to evaluate the proposed model. Experimental results show that BCP outperforms four baseline co-occurrence measures (PMI, PMI-square, EMI, and G-means) in the task of cross-domain sentiment analysis.
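For concreteness, the sketch below computes standard PMI from raw co-occurrence counts alongside a bidirectional conditional probability. Averaging P(u|v) and P(v|u) is an assumed form for BCP, since the abstract does not give the paper's exact formula; the point it illustrates is that conditioning in both directions removes the asymmetry of a single conditional probability.

    import math

    def pmi(c_uv, c_u, c_v, n):
        """Pointwise mutual information from co-occurrence counts over n contexts."""
        return math.log((c_uv * n) / (c_u * c_v))

    def bcp(c_uv, c_u, c_v):
        """Bidirectional conditional probability (assumed form: mean of both directions)."""
        return 0.5 * (c_uv / c_v + c_uv / c_u)   # P(u|v) and P(v|u)

    # Toy counts: "excellent" and "great" co-occur in 30 of 1000 review contexts.
    print(pmi(30, 80, 120, 1000))   # ~1.14
    print(bcp(30, 80, 120))         # 0.3125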
Semantic measures are used to handle different issues in several research areas, such as artificial intelligence, natural language processing, knowledge engineering, bioinformatics, and information retrieval. Hierarchical feature-based semantic measures have been proposed to estimate the semantic similarity between two concepts/words based on the features extracted from a semantic taxonomy (hierarchy) of a given lexical source. The central issue in these measures is the constant-weighting assumption: that all elements in the semantic representation of a concept possess the same relevance. In this paper, a new weighting-based semantic similarity measure is proposed to address this issue in hierarchical feature-based measures. Four mechanisms are introduced to weight the relevance of features in the semantic representation of a concept, using topological parameters (edge, depth, descendants, and density) in a semantic taxonomy. With the semantic taxonomy of WordNet, ...
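To make the weighting idea concrete, the sketch below scores two concepts by a weighted Jaccard overlap of their ancestor features, weighting each feature by its taxonomy depth. The depth-proportional weight and toy taxonomy are illustrative assumptions, not one of the paper's four mechanisms.

    def weighted_similarity(features_a, features_b, depth, max_depth):
        """Weighted Jaccard over taxonomy features, weighting each by its depth."""
        def w(f):
            return depth[f] / max_depth      # deeper (more specific) nodes weigh more
        shared = sum(w(f) for f in features_a & features_b)
        union = sum(w(f) for f in features_a | features_b)
        return shared / union if union else 0.0

    # Toy taxonomy fragments: ancestor sets of two concepts, with node depths.
    depth = {"entity": 0, "animal": 1, "pet": 2, "dog": 3, "cat": 3}
    a = {"entity", "animal", "pet", "dog"}
    b = {"entity", "animal", "pet", "cat"}
    print(weighted_similarity(a, b, depth, max_depth=3))   # 1.0 / 3.0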
Wikipedia has become a high-coverage knowledge source that has been used in many research areas, such as natural language processing, text mining and information retrieval. Several methods have been introduced for extracting explicit or implicit relations from Wikipedia to represent the semantics of concepts/words. However, the main challenge in semantic representation is how to incorporate different types of semantic relations to capture more semantic evidence of the associations between concepts. In this article, we propose a semantic concept model that incorporates different types of semantic features extracted from Wikipedia. For each concept that corresponds to an article, four semantic features are introduced: template links, categories, salient concepts and topics. The proposed model is based on probability distributions defined over these semantic features of a Wikipedia concept. The template links and categories are document-level features which are directly extrac...
The alignment of WordNet and Wikipedia has received wide attention from researchers in computational linguistics aiming to build a new lexical knowledge source or to enrich the semantic information of WordNet entities. The main challenge of this alignment is how to handle the synonymy and ambiguity issues in the contents of two units from different sources. This paper therefore introduces a mapping method that links an Arabic WordNet synset to its corresponding article in Wikipedia. The method uses monolingual and bilingual features to overcome the lack of semantic information in Arabic WordNet. To evaluate the method, an Arabic mapping dataset containing 1,291 synset–article pairs was compiled. The experimental analysis shows that the proposed method achieves promising results and outperforms state-of-the-art methods that depend only on monolingual features. The mapping method has also been used to increase the coverage of Arabic WordNet by inserting new synsets from...
The process of eliminating irrelevant, redundant and noisy features while keeping information loss to a minimum is known as the feature selection problem. Given the vast amount of textual data generated and shared on the internet, such as news reports, articles, tweets and product reviews, the need for an effective text-feature selection method becomes increasingly important. Recently, stochastic optimization algorithms have been adopted to tackle this problem. However, the efficiency of these methods decreases on high-dimensional problems, a decrease that can be attributed to premature convergence, where population diversity is not well maintained. As an innovative attempt, a cooperative Binary Bat Algorithm (BBACO) is proposed in this work to select the optimal text-feature subset for classification purposes. The proposed BBACO uses a new mechanism to control the population's diversity during the optimization process and to improve the performance of BBA-based...
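For orientation, the sketch below shows a plain binary bat-style search over feature masks: velocities are pulled toward the best mask and a sigmoid transfer function decides each bit. The toy fitness function, the parameters, and the toward-best velocity update are simplified illustrative choices; BBACO's cooperative diversity-control mechanism is not modelled here.

    import math, random

    random.seed(0)

    def fitness(mask, relevance):
        """Toy fitness: total relevance of chosen features minus a size penalty."""
        chosen = [r for bit, r in zip(mask, relevance) if bit]
        return sum(chosen) - 0.1 * len(chosen)

    def binary_bat(relevance, n_bats=10, n_iter=50):
        n = len(relevance)
        bats = [[random.randint(0, 1) for _ in range(n)] for _ in range(n_bats)]
        vel = [[0.0] * n for _ in range(n_bats)]
        best = max(bats, key=lambda m: fitness(m, relevance))[:]
        for _ in range(n_iter):
            for i, bat in enumerate(bats):
                freq = random.random()                       # random pulse frequency in [0, 1]
                for j in range(n):
                    vel[i][j] += (best[j] - bat[j]) * freq   # pull velocity toward the best mask
                    prob = 1 / (1 + math.exp(-vel[i][j]))    # sigmoid transfer function
                    bat[j] = 1 if random.random() < prob else 0
                if fitness(bat, relevance) > fitness(best, relevance):
                    best = bat[:]
        return best

    # Toy per-feature relevance scores (e.g. chi-square values for 8 text features).
    print(binary_bat([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.02, 0.6]))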
Compared with other languages, only a limited body of research has been conducted on automated Arabic Text Categorization (TC), owing to the complex and rich nature of the Arabic language. Most such research uses supervised Machine Learning (ML) approaches such as Naive Bayes (NB), K-Nearest Neighbour (KNN), Support Vector Machine and Decision Tree. Most of these techniques have complex mathematical models and do not usually lead to accurate results for Arabic TC. Moreover, previous research has tended to treat Feature Selection (FS) and classification as independent problems in automatic TC, which leads to costly and computationally complex solutions. Hence the need arises for new techniques suited to the Arabic language and its complex morphology. A new approach to Arabic TC, termed the Frequency Ratio Accumulation Method (FRAM), which has a simple mathematical model, is applied in this study. The categorization ta...
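A minimal sketch of a frequency-ratio categorizer in this spirit: each term contributes to every category the ratio of its in-category frequency to its total frequency, and the accumulated score picks the category. The exact FRAM formula is given in the paper; this form and the toy counts are assumptions for illustration.

    from collections import Counter, defaultdict

    # Toy per-category term frequencies (e.g. counts from a labelled Arabic corpus).
    freq = {
        "sports":  Counter({"goal": 40, "match": 30, "bank": 2}),
        "economy": Counter({"bank": 50, "market": 35, "match": 5}),
    }

    def frequency_ratio(term, category):
        """Share of the term's total occurrences that fall in this category."""
        total = sum(freq[c][term] for c in freq)
        return freq[category][term] / total if total else 0.0

    def categorise(doc_terms):
        scores = defaultdict(float)
        for term in doc_terms:
            for category in freq:
                scores[category] += frequency_ratio(term, category)   # accumulate ratios
        return max(scores, key=scores.get)

    print(categorise(["match", "goal", "bank"]))   # -> "sports"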
It is practically impossible for a pure machine translation approach to handle all translation problems; Rule-Based Machine Translation and Statistical Machine Translation (RBMT and SMT) use different architectures to perform the translation task. Lexical and syntactic analysis are handled by the rule-based component, and some amount of ambiguity is left to be resolved by the Expectation–Maximization (EM) algorithm, an iterative statistical algorithm for finding maximum-likelihood estimates. In this paper we propose an integrated Hybrid Machine Translation (HMT) system. The goal is to combine the best properties of each approach. Initially, Arabic text is keyed into the RBMT component; its output is then refined by the EM algorithm to generate the final English translation. As previous work on the performance and enhancement of the EM algorithm has shown, the key to its performance is the ability to accurately transfer frequency information from one language to another. Results showing ...
Sentiment analysis techniques are increasingly exploited to categorize opinion text into one or more predefined sentiment classes for the creation and automated maintenance of review-aggregation websites. In this paper, a Malay sentiment analysis classification model is proposed to improve classification performance based on semantic orientation and machine learning approaches. First, a total of 2,478 Malay sentiment-lexicon phrases and words are assigned synonyms and stored with the help of more than one native Malay speaker, and each polarity is manually assigned a score. Next, supervised machine learning approaches and the lexicon-knowledge method are combined for Malay sentiment classification, evaluating thirteen features. Finally, three individual classifiers and a combined classifier are used to evaluate classification accuracy. A wide range of comparative experiments is conducted on a Malay Reviews Corpus (MRC), and it...
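One simple way to combine lexicon knowledge with a supervised learner, sketched below under the assumption of scikit-learn, is to append the summed lexicon polarity of a review as one extra feature next to its bag-of-words counts. The four-entry lexicon and toy reviews stand in for the paper's 2,478-entry lexicon and the MRC; the paper's thirteen features and classifier combination are not reproduced.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    LEXICON = {"bagus": 1, "hebat": 1, "teruk": -1, "buruk": -1}   # toy Malay lexicon

    def lexicon_score(text):
        """Sum of the polarity scores of the lexicon words found in the text."""
        return sum(LEXICON.get(w, 0) for w in text.split())

    train = ["filem ini bagus dan hebat", "servis teruk dan buruk",
             "cerita bagus", "makanan buruk"]
    y = [1, 0, 1, 0]   # 1 = positive review

    vec = CountVectorizer()
    X = np.hstack([vec.fit_transform(train).toarray(),
                   [[lexicon_score(t)] for t in train]])   # append the lexicon feature
    clf = LogisticRegression().fit(X, y)

    test = "filem hebat"
    x = np.hstack([vec.transform([test]).toarray(), [[lexicon_score(test)]]])
    print(clf.predict(x))   # -> [1]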
... Abdul Kadir, R., Tengku Sembok, T.M., Halimah, B.Z.: Improvement of document understanding ability through the notion of answer literal expansion in logical-linguistic approach. WSEAS Transactions on Information Science and Applications 6(6), 966–975 (2009) 15. Prince, V ...
This paper presents a method for measuring the compositionality score of multiword expressions (MWEs). Using Wikipedia (WP) as a lexical resource, multiword expressions are identified as the titles of Wikipedia articles that are made up of more than one word, without further processing. For the semantic representation, the method exploits the hierarchical taxonomy in Wikipedia to represent a concept (single word or multiword) as a feature vector of the WP articles belonging to the concept's categories and sub-categories. The literality and multiplicative composition scores, both based on semantic similarity, are used to measure the compositionality of an MWE. The proposed method is evaluated by comparing the compositionality scores against human judgments on a dataset of 100 Arabic noun-noun compounds.
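The two scores can be sketched over toy vectors as follows: literality compares the MWE vector with each constituent's vector, while the multiplicative score compares it with the element-wise product of the constituents. The cosine measure, the averaging in the literality score, and the vectors themselves are illustrative assumptions, not the paper's exact formulation.

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def literality(mwe_vec, part_vecs):
        """Mean similarity between the MWE and each of its constituent words."""
        return sum(cosine(mwe_vec, p) for p in part_vecs) / len(part_vecs)

    def multiplicative(mwe_vec, part_vecs):
        """Similarity between the MWE and the element-wise product of its parts."""
        composed = [a * b for a, b in zip(*part_vecs)]
        return cosine(mwe_vec, composed)

    # Toy feature vectors over four hypothetical Wikipedia categories.
    mwe = [0.9, 0.1, 0.4, 0.0]            # the compound itself
    parts = [[0.8, 0.2, 0.3, 0.1],        # first noun
             [0.7, 0.0, 0.5, 0.2]]        # second noun
    print(literality(mwe, parts), multiplicative(mwe, parts))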
... B.Z. Halimah, A. Azlina, T.M. Sembok, I. Sufian, Sharul Azman M.N., A.B. Azuraliza, A.O. Zulaiha, O. Nazlia, A. Salwani, A. Sanep, M.T. Hailani, M.Z. Zaher, J. Azizah, M.Y. Nor Faezah, W.O. Choo, Chew ... World Bank, New York (2005) [6] Uzair, M.: Interest Free Banking. ...
Automated Grammar Checking of Tenses for ESL. Nazlia Omar, Nur Asma Mohd. Razali, and Saadiyah Darus. Faculty of Information Science and Technology, ... In: P. Wen et al. (Eds.): RSKT 2009, LNCS 5589, pp. 475–482. Springer-Verlag Berlin Heidelberg (2009)
Although there is no machine learning technique that fully meets human requirements, each demonstrating its own advantages and disadvantages, finding a quick and efficient translation mechanism has become an urgent necessity, owing to the differences between the languages spoken in the world's communities and the vast development that has occurred worldwide. The purpose of this paper is therefore to shed light on some of the machine translation techniques available in the literature, to encourage researchers to study them. We discuss some of the linguistic characteristics of the Arabic language. Features of Arabic that are relevant to machine translation are discussed in detail, along with the difficulties they might present. This paper summarizes the major techniques used in machine translation from Arabic into English and discusses their strengths and weaknesses.
Abstract: Students entering institutions of higher learning are expected to have good English language skills, enabling them to complete coursework and academic activities in English. One of these tasks is essay writing. ...


"The rapid growth of computer technologies creates a plethora of ways in which technology can be integrated into one of the alternatives to facilitate essay marking. Automated essay marking systems developed from the late 1960s have... more
"The rapid growth of computer technologies creates a plethora of ways in which technology can be integrated into one of the alternatives to facilitate essay marking.  Automated essay marking systems developed from the late 1960s have attempted to prove that computers can evaluate essays as competently as human expert. Several computer-based essay marking (CBEM) systems have been developed to mark students’ essays and they can be divided into semi-automated and automated systems. This paper illustrates the development of an Automated Tool for Detecting Errors in Tenses (ATDEiT™). The first phase analysed the errors found in 400 essays written by 112 English as second language (ESL) learners at tertiary level using Markin 3.1 software. The results showed that the most common errors were found in tenses. This finding led to the second phase of the research, which was the design of an automated marking tool. Consequently, the techniques and algorithm for error analysis marking tool for ESL learners were developed. An initial testing was conducted to evaluate the results of the marking tool using 50 essays. Findings showed that ATDEiT™ achieved a high level (93.5%) of recall and an average level (78.8%) of precision. This proves that ATDEiT™ has the potential to be used as an automated tool for detecting errors in tenses for ESL learners.
"
Although computers and artificial intelligence have been proposed as tools to facilitate the evaluation of student essays, they have not been specifically developed for Malaysian ESL (English as a second language) learners. A marking tool specifically developed to analyse errors in ESL writing is very much needed. Though numerous techniques have been adopted in automated essay marking, research on the formation and use of heuristics to aid the construction of computer-based essay marking systems has been scarce. Thus, this paper aims to introduce new heuristics that can be used to mark essays automatically and detect grammatical errors in tenses. This approach, which uses natural language processing techniques, can be applied as part of the software requirements for a CBEM (Computer-Based Essay Marking) system for ESL learners. The preliminary result based on the training set shows that the heuristics are useful and can improve the effectiveness of an automated essay marking tool for writing in ESL.
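To illustrate the flavour of such heuristics, the sketch below flags a likely tense error when a past-time adverbial co-occurs with a verb that does not look past-tense. The signal words, the naive verb-after-pronoun guess, and the example sentence are all invented illustrations, not the paper's actual rules.

    import re

    PAST_SIGNALS = ("yesterday", "ago", "last week", "last year")
    IRREGULAR_PAST = {"went", "ate", "saw", "wrote", "took"}
    PRONOUNS = {"i", "he", "she", "we", "they", "it", "you"}

    def looks_past(verb):
        return verb.endswith("ed") or verb in IRREGULAR_PAST

    def flag_tense_errors(sentence):
        """Flag the verb after a subject pronoun when a past-time signal is present."""
        text = sentence.lower()
        if not any(sig in text for sig in PAST_SIGNALS):
            return []
        words = re.findall(r"[a-z]+", text)
        errors = []
        for i, w in enumerate(words[:-1]):
            if w in PRONOUNS and not looks_past(words[i + 1]):   # naive verb guess
                errors.append(words[i + 1])
        return errors

    print(flag_tense_errors("Yesterday I go to the market."))   # -> ['go']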