We are interested in classifying proteins based on their primary structures. The goal is to automatically classify protein sequences according to their families. This task requires extracting a set of descriptors that are then presented to supervised learning algorithms. Many types of descriptors are used in the literature. The most popular is the n-gram, a series of n consecutive characters. The standard n-gram approach consists of first setting the parameter n, extracting the corresponding n-gram descriptors, and working with this value throughout the whole data mining process. In this paper, we propose a hierarchical approach to n-gram construction. The goal is to obtain descriptors of varying length for a better characterization of the protein families. This approach is intended to reflect the domain knowledge of biologists: the patterns that characterize a protein family most often vary in length. Our idea is to transpose the frequent itemset extraction principle, mainly used for association rule mining, to n-gram extraction in the protein classification context. The experiments show that the new approach is consistent with biological reality and achieves the same accuracy as the standard approach.
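To make the idea of transposing frequent itemset extraction to n-grams more concrete, here is a minimal sketch, not the authors' algorithm. It assumes that an n-gram is "frequent" when it occurs in at least a fraction `min_support` of the sequences, and that a candidate of length n+1 is only counted when its length-n prefix and suffix were both frequent (Apriori-style pruning). The function name, parameters, and toy sequences are all hypothetical.

```python
from collections import Counter


def frequent_ngrams(sequences, min_support=0.5, max_n=5):
    """Level-wise extraction of variable-length n-grams (illustrative only).

    Returns a dict mapping each kept n-gram to the number of sequences
    that contain it.
    """
    threshold = min_support * len(sequences)
    kept = {}
    previous = None  # frequent n-grams found at the previous level
    for n in range(1, max_n + 1):
        counts = Counter()
        for seq in sequences:
            seen = set()  # count each n-gram at most once per sequence
            for i in range(len(seq) - n + 1):
                gram = seq[i:i + n]
                # Apriori-style pruning: both sub-grams of length n-1
                # must have been frequent at the previous level.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                seen.add(gram)
            counts.update(seen)
        current = {g: c for g, c in counts.items() if c >= threshold}
        if not current:
            break  # no frequent n-gram of this length, so stop growing
        kept.update(current)
        previous = set(current)
    return kept


# Toy sequences (made up for illustration, not real protein data).
toy = ["MKVLA", "MKVTA", "GKVLA"]
result = frequent_ngrams(toy, min_support=0.6)
```

On this toy input the descriptors that survive have lengths 1 to 4, which illustrates how a single run can yield n-grams of varying length instead of fixing n in advance.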
This study investigates Bayes classification of online Arabic characters using histograms of tangent differences and Gibbs modeling of the class-conditional probability density functions. The parameters of these Gibbs density functions are estimated following the Zhu et al. constrained maximum entropy formalism, originally introduced for image and shape synthesis. We investigate two partition function estimation methods: one uses the training sample, and the other draws from a reference distribution. The efficiency of the corresponding Bayes decision methods, and of a combination of these, is shown in experiments using a database of 9,504 freely written samples by 22 writers. Comparisons to the nearest neighbor rule method and a Kohonen neural network method are provided.
In many application domains, classification tasks have to tackle multiclass imbalanced training sets. We have been looking for a CBA approach (Classification Based on Association rules) in such difficult contexts. Actually, most of the CBA-like methods are one-vs-all approaches (OVA), i.e., selected rules characterize a class with what is relevant for this class and irrelevant for the union of the other classes. Instead, our method considers that a rule has to be relevant for one class and irrelevant for every other class taken separately. Furthermore, a constrained hill climbing strategy spares users tuning parameters and/or spending time in tedious post-processing phases. Our approach is empirically validated on various benchmark data sets.
A pattern discovered from a collection of data is usually considered potentially interesting if its information content can assist the user in their decision making process. To that end, we have defined the potential interestingness of a pattern based on whether it provides statistical knowledge that is able to affect one's belief system. In previous work, we proposed two novel
Large repositories of data contain sensitive information that must be protected against unauthorized access. The protection of the confidentiality of this information has been a long-term goal for the database security research community and for the government statistical agencies. Recent advances in data mining and machine learning algorithms have increased the disclosure risks that one may encounter when releasing data to outside parties. A key problem, and still not sufficiently investigated, is the need to balance the ...
Graph-based Association Rules Mining (ARM) is a research area that represents a transactional database as a graph structure to optimize the search for frequent item sets. Sub-graph search is the process of pruning the search by looking for the best representation of connected nodes in a graph to represent the fully connected graphs. The Triangle Counting Approach is one of the sub-graph search approaches used to find the most represented graph. This study aims to employ the Triangle Counting Approach for graph-based association rules mining. A triangle counting method for graph-based ARM is proposed to prune the graph in the search for frequent item sets. The triangle counting is integrated with one of the graph-based ARM methods. It consists of four important phases: data representation, triangle construction, bit vector representation, and triangle integration with the graph-based ARM method. The performance of the proposed method is compared with the original graph-based ARM. Experimental results show that the proposed method reduces the execution time of rule generation and produces fewer rules with higher confidence.
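As a rough illustration of how triangles can arise in a graph built from transactions, the sketch below links two items when they co-occur in at least `min_count` transactions and then enumerates the triangles of that graph. This is an assumed construction for illustration only: the abstract does not specify how the graph is built, and the bit vector phase is not reproduced here. All names and the toy data are hypothetical.

```python
from itertools import combinations


def cooccurrence_graph(transactions, min_count):
    """Build an undirected graph linking items that co-occur often enough."""
    pair_counts = {}
    for t in transactions:
        for a, b in combinations(sorted(set(t)), 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    adj = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def triangles(adj):
    """List every triangle (a, b, c) with a < b < c exactly once."""
    found = []
    for a in sorted(adj):
        for b in sorted(adj[a]):
            if b <= a:
                continue
            # A common neighbour of a and b closes a triangle.
            for c in sorted(adj[a] & adj[b]):
                if c > b:
                    found.append((a, b, c))
    return found


# Toy transactions (made up for illustration).
toy = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"], ["c", "d"]]
graph = cooccurrence_graph(toy, min_count=2)
found = triangles(graph)
```

A triangle here only guarantees that the three pairs are frequent, so in a sketch like this it would serve as a candidate 3-item set whose support still has to be checked against the transactions.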
Detecting an elevated risk of diabetes mellitus at an early stage is critical in global clinical management. This work aims to apply association rule mining to electronic medical records (EMR) to detect sets of risk factors and their corresponding subpopulations of patients. On high-dimensional EMR data, association rule mining produces a very large set of rules summarizing the risk of diabetes. We review association rule set summarization techniques and conduct a comparative evaluation, based on their merits and demerits, to identify the one that provides the best summary. The paper discusses various methods for accurately summarizing a high risk of diabetes.
This article sets forth a detailed theoretical proposal of how the truth of ordinary empirical statements, often atomic in form, is computed. The method of computation draws on psychological concepts such as those of associative networks and spreading activation, rather than the concepts of philosophical or logical theories of truth. Axioms for a restricted class of cases are given, as well as some detailed examples. http://www.sciencedirect.com/science/article/pii/S1570868304000461
Nowadays, one of the most important uses of machine learning is the diagnosis of diverse diseases. In this work, we introduce a model for diagnosing erythemato-squamous diseases based on Catfish binary particle swarm optimization (CatfishBPSO), kernelized support vector machines (KSVM), and association rules (AR) as the feature selection method. The proposed model consists of two stages. In the first stage, AR is used to select the optimal feature subset from the original feature set. Next, because the kernel parameter setting in the SVM training procedure significantly influences classification accuracy, and because CatfishBPSO is a promising tool for global search, a CatfishBPSO-based approach is employed to determine the KSVM parameters. Experimental results show that the proposed AR-CatfishBPSO-KSVM model achieves 99.09% classification accuracy using 24 features of the erythemato-squamous disease dataset, which is more accurate than other popular methods in the literature such as support vector machines and AR-MLP (association rules - multilayer perceptron). The dataset was taken from the University of California Irvine machine learning database.
Data mining involves the use of advanced data analysis tools to find new, suitable patterns and to reveal previously unknown relationships among those patterns. In data mining, association rule learning is a popular and familiar method for discovering new relations between variables in large databases. One of the emerging research areas in data mining is social networks. The objective of this paper is to formulate association rules that can support decisions in future endeavours. The research applies the Apriori algorithm, one of the classical algorithms for deriving association rules, to the Facebook 100 university dataset, which originated from Adam D’Angelo of Facebook. The dataset contains self-defined characteristics of a person, including variables such as residence, year, major, second major, gender, and school. To begin with, the research uses only ten universities. It highlights the formation of association rules between the attributes or variables, explores the association rule between a course and gender, and examines the influence of gender on the choice of course. This paper attempts to cover the main algorithms used for clustering, with a brief and simple description of each. Previous research with this dataset has applied only regression models, and this is the first time association rules have been applied to it.
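For readers unfamiliar with Apriori, the following is a minimal sketch of the level-wise search and of rule confidence. It is not the paper's implementation, and the four records are invented attribute=value items, not entries from the Facebook dataset. The function names and the `min_support` value are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    singles = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in singles if support(s) >= min_support}
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must already be frequent.
        level = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Invented records for illustration only.
records = [
    {"major=CS", "gender=M"},
    {"major=CS", "gender=M"},
    {"major=CS", "gender=F"},
    {"major=Bio", "gender=F"},
]
freq = apriori(records, min_support=0.5)
conf_major_to_gender = confidence(freq, {"major=CS"}, {"gender=M"})
conf_gender_to_major = confidence(freq, {"gender=M"}, {"major=CS"})
```

The two confidences differ (about 0.67 versus 1.0 on these invented records), which is why a rule between a course and gender has to be read in a stated direction.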
With the rapid advance of microarray technology, gene expression data have attracted growing interest as a source of biological information about gene functions and their relation to health. Data mining techniques are effective and efficient in extracting useful patterns. Most current data mining algorithms suffer from high processing time while generating frequent itemsets. The aim of this paper is to provide a comparative study of two Closed Frequent Itemset (CFI) algorithms, dCHARM and RISS. They are examined on high-dimensional data, specifically gene expression data. Nine experiments are conducted with different numbers of genes to examine the performance of both algorithms. It is found that RISS outperforms dCHARM in terms of processing time.
Data mining, also known as knowledge discovery in databases, has been recognized as a promising new area for database research. The work proposed in this paper optimizes the data with clustering and fuzzy association rules using multi-objective genetic algorithms. The algorithm is implemented in two phases. In the first phase, it uses clustering to optimize the data and reduce the number of comparisons. In the second phase, it uses multi-objective genetic algorithms to find the optimum number of fuzzy association rules based on a threshold value and a fitness function.
This paper presents a novel fuzzy-based intelligent architecture that aims to find relevant and important associations between the embedded-agent-based services that form Ambient Intelligent Environments (AIEs). The embedded agents are used in two ways. First, they monitor the inhabitants of the AIE, learning their behaviours in an online, non-intrusive and life-long fashion with the aim of pre-emptively setting the environment to the users' preferred state. Second, they evaluate the relevance and significance of the associations to various services with the aim of eliminating redundant associations in order to minimize the agent computational latency within the AIE. The embedded agents employ fuzzy logic because of its robustness to the uncertainties, noise and imprecision encountered in AIEs. We describe unique real-world experiments that were conducted in the Essex intelligent Dormitory (iDorm) to evaluate and validate the significance of the proposed architecture and methods.
Current DSS tools are generally built as “desktop applications” and designed for use by data mining experts. In this paper, the design and implementation of ASMINER, a new web-based data mining exploration and reporting tool, is introduced. ASMINER enables both decision makers and knowledge workers to explore and report with three data mining techniques (decision trees, clustering and association rule mining) by presenting a scalable, user-friendly and fully web-based thin-client data mining tool. The approach and tools of ASMINER have significant potential to give knowledge workers an opportunity to participate in decision-making phases.