This paper proposes a novel approach for storing and retrieving massive DNA sequences. The method is based on a perceptual hash function, commonly used to determine the similarity between digital images, which we adapted for DNA sequences. The perceptual hash function presented here is based on the Discrete Cosine Transform Sign Only (DCT-SO). Each nucleotide is encoded as a fixed gray-level intensity pixel, and the hash is calculated from its significant frequency characteristics. This results in a drastic data reduction between the sequence and the perceptual hash. Unlike cryptographic hash functions, perceptual hashes are not affected by the "avalanche effect" and thus can be compared. The similarity distance between two hashes is estimated with the Hamming distance, which is used to retrieve DNA sequences. Our experiments show that the approach is relevant for storing and retrieving massive DNA sequences.
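The pipeline described in this abstract (encode nucleotides as gray levels, take a DCT, keep only coefficient signs, compare hashes with the Hamming distance) can be illustrated with a minimal sketch. It is not the authors' implementation: the gray-level mapping `GRAY`, the use of a 1-D DCT-II rather than an image-style 2-D transform, the choice to skip the DC coefficient, and the 16-bit hash length are all assumptions made for illustration.

```python
import math

# Hypothetical gray levels: the abstract only says each nucleotide is
# encoded as a fixed gray-level intensity, not which values are used.
GRAY = {"A": 0, "C": 85, "G": 170, "T": 255}


def dct2(signal):
    """Unnormalised 1-D DCT-II, written out to avoid external dependencies."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def sign_hash(seq, bits=16):
    """Keep only the signs of the first `bits` low-frequency coefficients.

    The DC term (index 0) is skipped because it is never negative for
    non-negative intensities and so carries no sign information
    (an assumption of this sketch).
    """
    coeffs = dct2([GRAY[base] for base in seq.upper()])
    return [1 if c >= 0 else 0 for c in coeffs[1:bits + 1]]


def hamming(h1, h2):
    """Number of positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


if __name__ == "__main__":
    reference = "ACGTTGCAACGTAGCTAGCTAGGCTA"
    candidate = "ACGTTGCAACGTAGCTAGCTAGGCTT"  # differs in the last base
    print(hamming(sign_hash(reference), sign_hash(candidate)))
```

Retrieval would then amount to returning the stored sequences whose hashes lie within a chosen Hamming distance of the query hash; the threshold is application-specific and is not given in the abstract.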
In the last few years, the amount of data collected in various computer science applications has grown considerably. These large volumes of data need to be analyzed in order to extract useful hidden knowledge. This work focuses on association rule extraction, one of the most popular techniques in data mining. Nevertheless, the number of extracted association rules is often very high, and many of them are redundant. In this paper, we propose a new algorithm called PRINCE. Its main feature is the construction of a partially ordered structure for extracting subsets of association rules, called generic bases. These subsets represent the whole association rule set without loss of information. To reduce the cost of this construction, the partially ordered structure is built from the minimal generators associated with frequent closed patterns. The closed patterns are derived simultaneously with the generic bases through a simple bottom-up traversal of the obtained structure. Experiments carried out in benchmark and "worst case" contexts showed the efficiency of the proposed algorithm compared to algorithms like CLOSE, A-CLOSE and TITANIC.
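To make the notions of closed pattern, minimal generator and generic base concrete, here is a brute-force sketch on a toy dataset. It is not the PRINCE algorithm: it enumerates every itemset instead of building the partially ordered structure, and the transactions, the `minsup` value and all function names are hypothetical. It derives exact rules of the form "minimal generator implies the rest of its closure", which is the usual shape of a generic base for exact rules.

```python
from itertools import combinations

# Toy transaction database (hypothetical data for illustration only).
TRANSACTIONS = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"c"}]


def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def closure(itemset, transactions):
    """Largest itemset shared by all transactions that contain `itemset`."""
    covering = [t for t in transactions if itemset <= t]
    if not covering:
        return frozenset(itemset)
    return frozenset(set.intersection(*covering))


def is_minimal_generator(itemset, transactions):
    """True if removing any single item strictly increases the support."""
    own = support(itemset, transactions)
    return all(support(itemset - {i}, transactions) > own for i in itemset)


def exact_generic_rules(transactions, minsup):
    """Exact rules g -> closure(g) minus g for frequent minimal generators g."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            g = frozenset(combo)
            if support(g, transactions) < minsup:
                continue
            if not is_minimal_generator(g, transactions):
                continue
            conclusion = closure(g, transactions) - g
            if conclusion:
                rules.append((sorted(g), sorted(conclusion)))
    return rules


if __name__ == "__main__":
    for premise, conclusion in exact_generic_rules(TRANSACTIONS, minsup=2):
        print(premise, "->", conclusion)
```

On this toy data the itemset {a, b} is frequent but is not a minimal generator, because {a} alone already has the same support, so no rule is generated from it. This is the kind of redundancy that generic bases are meant to avoid.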
Twenty Third International Flairs Conference, 2010
... {tarek.hamrouni@fst.rnu.tn, hamrouni@cril.univ-artois.fr} ... On the other hand, in many real-life applications like market basket analysis, medical data analysis, social network analysis and bioinformatics, etc., the disjunctive connector linking items can bring key information as ...
One of the most powerful techniques to study proteins is to look for recurrent fragments (also called substructures or spatial motifs), then use them as patterns to characterize the proteins under study. An emerging trend consists in parsing proteins' three-dimensional (3D) structures into graphs of amino acids. The search for recurrent substructures is then formulated as a process of frequent subgraph discovery, where each subgraph represents a 3D-motif. In this scope, several efficient approaches for frequent 3D-motif discovery have been proposed in the literature. However, the set of discovered 3D-motifs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent 3D-motifs by selecting the representative ones. Existing pattern selection approaches do not exploit domain knowledge. In our approach, by contrast, we incorporate the evolutionary information of amino acids defined in substitution matrices in order to select the representative 3D-motifs. We show the effectiveness of our approach on a number of real datasets. Our experimental results show that the approach detects relations between patterns that current subgraph selection approaches fail to detect, and that it considerably decreases the number of motifs while enhancing their interestingness.
Choosing the architecture of an artificial neural network (ANN) is still a challenging task that users face every time. It greatly affects the accuracy of the built network. In fact, there is no optimal method that is applicable to various implementations at the same time. In this paper, we propose a method to construct ANNs based on clustering, which resolves the problems of random and ad hoc approaches for multilayer ANN architecture. Our method can be applied to regression problems. Experimental results obtained with different datasets reveal the efficiency of our method.
Multi-layer neural networks have been successfully applied in a wide range of supervised and unsupervised learning applications. As they often produce incomprehensible models, they are not widely used in data mining applications. To avoid such limitations, comprehensible models have previously been introduced that make use of a priori knowledge to build the network architecture. They allow neural network methods to earn a place in the toolboxes of data mining specialists. However, as a priori knowledge is not always available for every new dataset, we propose a novel approach that generates a concept semi-lattice from the initial dataset to directly build the neural network architecture. Experiments showed the soundness and efficiency of our approach on various UCI datasets.
One of the most powerful techniques to study protein structures is to look for recurrent fragments (also called substructures or spatial motifs), then use them as patterns to characterize the proteins under study. An emerging trend consists in parsing proteins' three-dimensional (3D) structures into graphs of amino acids. The search for recurrent spatial motifs is then formulated as a process of frequent subgraph discovery, where each subgraph represents a spatial motif. In this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selection approaches do not exploit domain knowledge. In our approach, by contrast, we incorporate the evolutionary information of amino acids defined in substitution matrices in order to select the representative subgraphs. We show the effectiveness of our approach on a number of real datasets. Our experimental results show that the approach considerably decreases the number of motifs while enhancing their interestingness.
Papers by Engelbert Mephu Nguifo