Although most research on density-based clustering algorithms has focused on finding distinct clusters, many real-world applications (such as gene functions in a gene regulatory network) have inherently overlapping clusters. Even with overlapping features, density-based clustering methods do not define a probabilistic model of the data. It is therefore hard to determine how “good” a clustering is, or how well new data can be predicted and assigned to existing clusters. A probability model for overlapping density-based clustering is thus a critical need for large-scale data analysis. In this paper, a new Bayesian density-based method (Bayesian-OverDBC) for modeling overlapping clusters is presented. Bayesian-OverDBC can predict the formation of a new cluster. It can also predict the overlap of a cluster with existing clusters. Bayesian-OverDBC has been compared with other algorithms (nonoverlapping and overlapping models). The results show that Bayesian-OverDBC can be significantly better than other me...
Using microarray techniques, it is possible to measure the expression levels of thousands of genes under several experimental conditions. Extracting information from microarray data is an important problem in bioinformatics. Producing overlapping clusters is a major issue in clustering methods. While most research in this area has focused on clustering with disjoint clusters, many real microarray datasets, and as a result many gene regulatory networks, have inherently overlapping partitions. Genes can have more than one function by coding for proteins that participate in multiple metabolic pathways. Overlapping clusters therefore play an important role in discovering relationships between genes and in finding overlapping gene regulatory networks. Recently proposed clustering methods rely on the search for optimal disjoint clusters. In this paper, we propose a new density-based clustering method (OverDBC) with a bound on the number of overlapping clusters. OverDBC allows a gene to belong to a restricted number of clusters, while the total number of clusters is unbounded. We define closeness as a new concept, used alongside density, for finding core genes. We compare OverDBC with DBSCAN (a non-overlapping density-based clustering algorithm). We show that OverDBC may be significantly better than non-overlapping clustering on microarray data.
Traditional antiviral therapies are expensive, of limited availability, and cause several side effects. Designing antiviral peptides is currently very important because these peptides interfere with a key stage of the virus life cycle. Most antiviral peptides are derived from viral proteins, for example peptides derived from the HIV-1 capsid protein. Because of the importance of these peptides, this study uses the concept of pseudo-amino acid composition (PseAAC) and machine learning methods to classify or identify antiviral peptides.
This paper addresses the problem of object detection in a biosonar-based mobile robot in a natural environment. In our previous work (9) we presented a time-resolved spectrum kernel to extract the similarities between subsequences of the echoes reflected by different trees, and we obtained higher accuracy than methods that used specific features in all echoes.
Microbial resistance to antibiotics is a rising concern among health care professionals, driving them to search for alternative therapies. In the past few years, antimicrobial peptides (AMPs) have attracted a lot of attention as a substitute for conventional antibiotics. Antimicrobial peptides have a broad spectrum of activity and can act as antibacterial, antifungal, antiviral and sometimes even anticancer drugs. Antibacterial peptides have little sequence homology despite their common properties. Since there is a need for a computational method to predict antibacterial peptides, in the present study we have applied the concept of Chou's pseudo-amino acid composition (PseAAC) and machine learning methods to their classification. Our results demonstrate that using the concept of PseAAC and applying a Support Vector Machine (SVM) can provide useful information for predicting antibacterial peptides.
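The PseAAC features mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly cited type-1 formulation (20 normalized amino acid frequencies followed by λ sequence-order correlation factors) and a single physicochemical property. The Kyte-Doolittle hydropathy index is swapped in here purely as an illustrative property, and `lam` and `w` are hypothetical defaults; none of these choices is taken from the paper.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index: an illustrative property choice (assumption),
# not necessarily the property set used in the paper.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize(prop):
    """Rescale a property to zero mean and unit variance over the 20 amino acids."""
    values = [prop[a] for a in AMINO_ACIDS]
    mu, sigma = mean(values), pstdev(values)
    return {a: (prop[a] - mu) / sigma for a in AMINO_ACIDS}


def pse_aac(seq, prop=KYTE_DOOLITTLE, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional PseAAC-style vector for one sequence."""
    seq = seq.upper()
    if any(residue not in prop for residue in seq):
        raise ValueError("sequence contains a non-standard residue")
    if len(seq) <= lam:
        raise ValueError("sequence must be longer than lam")

    h = standardize(prop)
    counts = Counter(seq)
    freqs = [counts[a] / len(seq) for a in AMINO_ACIDS]

    # Sequence-order correlation factor of tier k: mean squared property
    # difference between residues that are k positions apart.
    thetas = [
        mean((h[a] - h[b]) ** 2 for a, b in zip(seq, seq[k:]))
        for k in range(1, lam + 1)
    ]

    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


features = pse_aac("GIGKFLHSAKKFGKAFVGEIMNS")
print(len(features))
```

A vector like this would then be passed to a classifier such as an SVM; the classifier step is omitted here to keep the sketch dependency-free.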
Because of the importance of proteins in inducing allergenic reactions, the ability to predict their potential allergenicity has become an important issue. Bioinformatics offers valuable tools for analyzing allergens, and these complementary approaches can support traditional techniques for studying allergens. This work proposes a computational method for predicting allergenic proteins. The prediction was performed using pseudo-amino acid composition (PseAAC) and Support Vector Machines (SVMs). The predictor's efficiency was evaluated by five-fold cross-validation. The overall prediction accuracy and Matthews correlation coefficient (MCC) obtained by this method were 91.19% and 0.82, respectively. Furthermore, the minimum Redundancy and Maximum Relevance (mRMR) feature selection method was used to measure the effect and power of each feature. Interestingly, in our study all six characters (hydrophobicity, hydrophilicity, side chain mass, pK1, pK2 and pI) are present among the 10 highest-ranked features obtained from the mRMR feature selection method.
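For readers unfamiliar with the two metrics reported above, the following is a small sketch of how overall accuracy and MCC are computed from a binary confusion matrix. The function names and the example counts are hypothetical and are not taken from the paper's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction);
    returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for illustration only.
print(accuracy(45, 45, 5, 5), mcc(45, 45, 5, 5))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when classes may be imbalanced.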
Cancer is a major cause of death worldwide. Traditional cytotoxic therapies, such as radiation and chemotherapy, are expensive and cause severe side effects. Currently, the design of anticancer peptides is a more effective approach to cancer treatment, so there is a need for a computational method to predict anticancer peptides. In the present study, two methods have been developed to predict these peptides using the support vector machine (SVM) as a powerful machine learning algorithm. The classifiers are based on the concept of Chou's pseudo-amino acid composition (PseAAC) and on a local alignment kernel. Since a number of HIV-1 proteins have a cytotoxic effect, we predicted the anticancer effect of the HIV-1 p24 protein with these methods. After the prediction, the mutagenicity of 2 anticancer peptides and 2 non-anticancer peptides was investigated by the Ames test. Our results show that the accuracy and specificity of the local alignment kernel-based method are 89.7% and 92.68%, respectively. The accuracy and specificity of the PseAAC-based method are 83.82% and 85.36%, respectively. By computational analysis, out of 22 peptides of the p24 protein, 4 are predicted to be anticancer and 18 non-anticancer. The Ames test results show that the anticancer peptides (ARP788.8 and ARP788.21) are not mutagenic. The results therefore demonstrate that the described computational methods are useful for identifying potential anticancer peptides that are worthy of further experimental validation, and that 2 peptides (ARP788.8 and ARP788.21) of the HIV-1 p24 protein can be used as new anticancer candidates without mutagenicity.
Journal of Structural and Functional Genomics, 2011
Matrix metalloproteinases (MMPs) and disintegrin and metalloproteases (ADAMs) belong to the zinc-dependent metalloproteinase family of proteins. These proteins participate in various physiological and pathological states. Thus, prediction of these proteins from the amino acid sequence would be helpful. We have developed a method to predict these proteins based on features derived from Chou's pseudo-amino acid composition (PseAAC) server and the support vector machine (SVM) as a powerful machine learning approach. With this method, an overall accuracy of 95.89% and a Matthews correlation coefficient (MCC) of 0.90 were achieved for the ADAM and MMP families. Furthermore, the method is able to predict two major subclasses of the MMP family, furin-activated secreted MMPs and type II transmembrane MMPs, with MCCs of 0.89 and 0.91, respectively. The overall accuracy for furin-activated secreted MMPs and type II transmembrane MMPs was 98.18% and 99.07%, respectively. Our data demonstrate an effective classification of the metalloproteinase family based on the concept of PseAAC and SVM.
Many classifiers are designed with the assumption of well-balanced datasets. But in real problems, like protein classification and remote homology detection, when using binary classifiers like the support vector machine (SVM) and kernel methods, we face imbalanced data in which we have a low number of protein sequences as positive data (minor class) compared with negative data (major class). A widely used solution to that issue in protein classification is to use a different error cost or decision threshold for positive and negative data to control the sensitivity of the classifiers. Our experiments show that when the datasets are highly imbalanced, and especially with overlapped datasets, the efficiency and stability of that method decrease. This paper shows that a combination of the above method and our suggested oversampling method for protein sequences can increase the sensitivity and also the stability of the classifier. Our method of oversampling involves creating synthetic protein sequences...
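The widely used baseline this abstract refers to, shifting the decision threshold to control sensitivity, can be sketched as follows. This is a minimal illustration with made-up classifier scores; it shows only the threshold baseline, not the paper's oversampling method, whose description is truncated above. The function name and toy data are hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold=0.0):
    """Classify score >= threshold as positive and report (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy, made-up decision values for an imbalanced set: 4 positives, 6 negatives.
scores = [1.2, 0.3, -0.2, -0.4, -0.1, -0.6, -0.9, -1.1, -1.5, -2.0]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

for threshold in (0.0, -0.5):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(threshold, sens, round(spec, 3))
```

Lowering the threshold recovers more of the minority (positive) class at the cost of some specificity, which is the trade-off the abstract describes as becoming unstable on highly imbalanced, overlapping data.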
The amino acid gamma-aminobutyric acid receptors (GABA(A)Rs) belong to the ligand-gated ion channel (LGIC) superfamily. GABA(A)Rs are highly diverse in the central nervous system. These channels play a key role in regulating behavior. As a result, the prediction of GABA(A)Rs from the amino acid sequence would be helpful for research on these receptors. We have developed a method to predict these proteins using features obtained from Chou's pseudo-amino acid composition concept and the support vector machine as a powerful machine learning approach. The predictor's efficiency was assessed by five-fold cross-validation. This method achieved an overall accuracy and Matthews correlation coefficient (MCC) of 94.12% and 0.88, respectively. Furthermore, to evaluate the effect and power of each feature, the minimum Redundancy and Maximum Relevance (mRMR) feature selection method was implemented. An interesting finding in this study is the presence of all six characters (hydrophobicity, hydrophilicity, side chain mass, pK1, pK2 and pI), or combinations of them, among the 5 highest-ranked features (pK2 and pI, hydrophobicity and mass, pK1, hydrophilicity and mass) obtained from the mRMR feature selection method. The results show a biologically justifiable ranking of attributes: pK2 and pI; hydrophobicity, hydrophilicity and mass; mass and pK1; pK2 and mass. Based on our results, using the concept of Chou's pseudo-amino acid composition with a support vector machine is an effective approach for the prediction of GABA(A)Rs.
Papers by Majid Beigi