In recent years, the use of graphs has been the subject of much work, notably in databases, machine learning, bioinformatics, and social network analysis. Frequent subgraph mining is a major challenge in the context of very large graph databases. In this paper, we present a new approach based on the MapReduce paradigm for large-scale frequent subgraph mining. The proposed approach offers a new partitioning technique that takes the characteristics of the data into account and improves on the default MapReduce partitioning. A performance study of our approach, carried out on a private cloud, showed its effectiveness.
Recently, graph mining approaches have become very popular, especially in domains such as bioinformatics, chemoinformatics, and social networks. In this scope, one of the most challenging tasks is frequent subgraph discovery. This task has been motivated by the tremendously increasing size of existing graph databases, which raises the problem of designing efficient and scalable approaches for frequent subgraph discovery on large clusters. However, failures are the norm rather than the exception in large clusters. In this context, the MapReduce framework was designed so that node failures are handled automatically by the framework. In this paper, we propose a large-scale and fault-tolerant approach to subgraph mining by means of a density-based partitioning technique, using MapReduce. Our partitioning aims to balance the computation load across a collection of machines. We experimentally show that our approach significantly decreases the execution time and scales the subgraph discovery process to large graph databases.
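The density-based partitioning idea can be illustrated with a minimal sketch. The abstract does not give the actual algorithm, so everything here is an assumption: graphs are reduced to a hypothetical (vertex count, edge count) pair, density is the standard 2|E| / (|V|(|V| - 1)) for simple undirected graphs, and the helper names `density` and `density_partition` are illustrative. Graphs are sorted by density and dealt round-robin so that each partition receives a similar mix of sparse and dense graphs.

```python
from typing import List, Tuple

# Hypothetical compact representation: (number of vertices, number of edges).
Graph = Tuple[int, int]


def density(graph: Graph) -> float:
    """Density of a simple undirected graph: 2|E| / (|V| (|V| - 1))."""
    n_vertices, n_edges = graph
    if n_vertices < 2:
        return 0.0
    return 2.0 * n_edges / (n_vertices * (n_vertices - 1))


def density_partition(graphs: List[Graph], n_parts: int) -> List[List[Graph]]:
    """Sort graphs by density, then deal them round-robin into n_parts buckets.

    This is an assumed balancing heuristic, not the method from the paper.
    """
    ordered = sorted(graphs, key=density)
    parts: List[List[Graph]] = [[] for _ in range(n_parts)]
    for index, graph in enumerate(ordered):
        parts[index % n_parts].append(graph)
    return parts


if __name__ == "__main__":
    database = [(4, 6), (4, 3), (4, 1), (4, 5)]
    for part in density_partition(database, 2):
        print(part)
```

In a MapReduce setting, each resulting partition would be handed to one mapper; the sketch covers only the partitioning step, not the mining itself.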
Journal of computational biology : a journal of computational molecular cell biology, Jan 20, 2015
Ionizing-radiation-resistant bacteria (IRRB) are important in biotechnology. In this context, in silico methods of phenotypic prediction and genotype-phenotype relationship discovery are limited. In this work, we analyzed basal DNA repair proteins of most known proteome sequences of IRRB and ionizing-radiation-sensitive bacteria (IRSB) in order to learn a classifier that correctly predicts this bacterial phenotype. We formulated the problem of predicting bacterial ionizing radiation resistance (IRR) as a multiple-instance learning (MIL) problem, and we proposed a novel approach for this purpose. We provide a MIL-based prediction system that classifies a bacterium as either IRRB or IRSB. The experimental results of the proposed system are satisfactory, with 91.5% successful predictions.
Engineering Applications of Artificial Intelligence, 2015
Cloud computing gives access to virtually unlimited resources and appears to be a promising new opportunity for solving scientific computing problems. The MapReduce parallel programming model is a framework that favors the design of algorithms for cloud computing. It supports processing problems across huge datasets using a large number of heterogeneous computers over the web. In this paper, we evaluate how the MapReduce framework can provide an innovative way of solving operational research problems. We propose a MapReduce-based approach for the shortest path problem in large-scale real road networks. This problem is the cornerstone of any real-world routing problem, including the dial-a-ride problem (DARP), the pickup and delivery problem (PDP), and its dynamic variants. Most efficient methods dedicated to these routing problems rely on shortest path algorithms to construct the distance matrix between each pair of nodes, which can be time-consuming on a large-scale network. We focus on the design of an efficient MapReduce-based approach because a classical shortest path algorithm cannot accomplish this task efficiently. Our objective is not to guarantee optimality but to provide high-quality solutions in acceptable computational time. The proposed approach partitions the original graph into a set of subgraphs, then solves the shortest path problem on each subgraph in parallel to obtain a solution for the original graph. An iterative improvement procedure is then applied to improve the solution. The approach is benchmarked on a graph modeling French road networks extracted from OpenStreetMap. The experimental results show that the approach achieves a significant gain in computational time.
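The partition-then-solve step described above can be sketched as follows. This is only an illustration under stated assumptions: the adjacency-list format, the given node partitions, and the helper names `induced_subgraph`, `dijkstra`, and `local_distances` are hypothetical, and the paper's merging and iterative improvement procedure is not reproduced. Each subgraph is solved independently with Dijkstra's algorithm, so the local distances are upper bounds on the true distances in the full graph, in line with the stated goal of good rather than optimal solutions.

```python
import heapq
from typing import Dict, List, Tuple

# Assumed format: node -> list of (neighbor, non-negative edge weight).
Adjacency = Dict[str, List[Tuple[str, float]]]


def induced_subgraph(adj: Adjacency, nodes: List[str]) -> Adjacency:
    """Keep only the given nodes and the edges between them."""
    keep = set(nodes)
    return {u: [(v, w) for v, w in adj.get(u, []) if v in keep] for u in keep}


def dijkstra(adj: Adjacency, source: str) -> Dict[str, float]:
    """Shortest distances from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


def local_distances(adj: Adjacency, partitions: List[List[str]]) -> List[Dict[str, Dict[str, float]]]:
    """The 'map' step: all-pairs distances inside each partition, independently."""
    results = []
    for nodes in partitions:
        sub = induced_subgraph(adj, nodes)
        results.append({u: dijkstra(sub, u) for u in sub})
    return results


if __name__ == "__main__":
    road = {"a": [("b", 1.0), ("c", 5.0)], "b": [("c", 1.0)], "c": [("d", 1.0)], "d": []}
    print(local_distances(road, [["a", "b"], ["c", "d"]]))
```

Because the partitions share no state, each call to `dijkstra` on a subgraph could run on a separate worker; a real MapReduce job would also need a reduce step that stitches the local results together across partition boundaries.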
The choice of architecture for an artificial neural network (ANN) is still a challenging task that users face every time. It greatly affects the accuracy of the resulting network. In fact, there is no optimal method that is applicable to various implementations at the same time. In this paper, we propose a clustering-based method for constructing ANNs that resolves the problems of random and ad hoc approaches to multilayer ANN architecture. Our method can be applied to regression problems. Experimental results obtained with different datasets reveal the efficiency of our method.
Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine - BCB '12, 2012
Feature extraction is an unavoidable task, especially in the critical step of preprocessing biological sequences. This step consists, for example, in transforming the biological sequences into vectors of motifs, where each motif is a subsequence that can be seen as a property (or attribute) characterizing the sequence. Hence, we obtain an object-property table where objects are sequences and properties are motifs extracted from the sequences. This output can be used to apply standard machine learning tools to perform data mining tasks such as classification. Several previous works have described feature extraction methods for bio-sequence classification, but none of them discussed the robustness of these methods when the input data are perturbed. In this work, we introduce the notion of stability of the generated motifs in order to study the robustness of motif extraction methods. We express this robustness in terms of the ability of the method to reveal any change occurring in the input data and also its ability to target the interesting motifs. We use these criteria to evaluate and experimentally compare four existing extraction methods for biological sequences.
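The object-property table described above can be sketched in a few lines. As an assumption for illustration, motifs here are simply all substrings of a fixed length k (k-mers); this is not claimed to be one of the four extraction methods compared in the paper, and the names `kmers` and `motif_table` are hypothetical.

```python
from typing import List, Set, Tuple


def kmers(sequence: str, k: int) -> Set[str]:
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def motif_table(sequences: List[str], k: int) -> Tuple[List[str], List[List[int]]]:
    """Build a binary object-property table.

    Rows are sequences (objects), columns are motifs (properties),
    and a cell is 1 when the motif occurs in the sequence.
    """
    motifs = sorted(set().union(*(kmers(s, k) for s in sequences)))
    rows = [[1 if motif in s else 0 for motif in motifs] for s in sequences]
    return motifs, rows


if __name__ == "__main__":
    columns, table = motif_table(["ACGT", "CGTA"], 2)
    print(columns)
    for row in table:
        print(row)
```

The resulting rows can be fed directly to a standard classifier; studying stability in the sense of the abstract would amount to perturbing the input sequences and comparing the motif columns obtained before and after.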
Ionizing-radiation-resistant bacteria (IRRB) are important in biotechnology. The use of these bacteria for the treatment of radioactive wastes is determined by their surprising capacity to adapt to radionuclides and a variety of toxic molecules. In silico methods are unavailable for the purpose of phenotypic prediction and genotype-phenotype relationship discovery. We analyze basal DNA repair proteins of most known proteome sequences of IRRB and ionizing-radiation-sensitive bacteria (IRSB) in order to learn a classifier that correctly predicts unseen bacteria. In this work, we formulate the problem of predicting IRRB as a multiple-instance learning (MIL) problem and propose a novel approach for solving it. We use a local alignment technique to measure the similarity between protein sequences. The first results are satisfactory and provide a MIL-based prediction system that predicts whether a bacterium belongs to IRRB or to IRSB.
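As an illustration of measuring similarity by local alignment, the sketch below computes a Smith-Waterman score with a linear gap penalty. The abstract does not say which alignment tool or scoring scheme was used, so the match, mismatch, and gap values and the function name `local_alignment_score` are assumptions; a real protein comparison would normally use a substitution matrix rather than a flat match/mismatch score.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best Smith-Waterman local alignment score between a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic programming table, since only the score is needed.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


if __name__ == "__main__":
    print(local_alignment_score("TTACGTT", "GGACGGG"))
```

In a multiple-instance setting, each bacterium would be a bag of protein sequences, and scores like this one could serve as the instance-level similarity from which a bag-level prediction is derived.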
Papers by Sabeur Aridhi