
Computational Technique for an Efficient Classification of Protein Sequences With Distance-Based Sequence Encoding Algorithm

Authors:

Muhammad Javed Iqbal,

Brahim Belhaouari Samir

Computational Intelligence, Volume 33, Issue 1

Pages 32 - 55

https://doi.org/10.1111/coin.12069

Published: 01 February 2017

Abstract

Machine learning is increasingly applied in bioinformatics and computational biology to solve challenging problems that have emerged in the analysis and modeling of biological data such as DNA, RNA, and protein sequences. Classifying protein sequences into existing families or superfamilies poses three major problems: selecting a suitable sequence encoding method, extracting an optimized subset of features that carries significant discriminatory information, and adapting an appropriate learning algorithm that classifies protein sequences with high accuracy. Accurate classification of protein sequences would help in determining the structure and function of novel protein sequences. In this article, we propose a distance-based sequence encoding algorithm that captures a sequence's statistical characteristics along with its amino acid sequence-order information. A statistical metric-based feature selection algorithm is then adopted to identify a reduced set of features that represents the original feature space. The performance of the proposed technique is validated using some of the best-performing classifiers previously applied to protein sequence classification. An average classification accuracy of 92% was achieved on the yeast protein sequence data set downloaded from the benchmark UniProtKB database.
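The abstract does not spell out the encoding itself, so the following is only an illustrative sketch of what a distance-based encoding could look like, not the authors' algorithm. It assumes a hypothetical function `encode` that builds, for each of the 20 standard amino acids, one statistical feature (relative frequency) and one order-sensitive feature (mean gap between consecutive occurrences, normalized by sequence length).

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode(sequence):
    """Hypothetical encoder: 20 composition values followed by 20 mean-gap values."""
    seq = sequence.upper()
    n = len(seq)

    # Record the positions at which each residue occurs.
    positions = defaultdict(list)
    for i, residue in enumerate(seq):
        positions[residue].append(i)

    composition = []
    mean_gaps = []
    for aa in AMINO_ACIDS:
        pos = positions.get(aa, [])
        # Statistical characteristic: relative frequency of the residue.
        composition.append(len(pos) / n if n else 0.0)
        # Order information: average distance between consecutive occurrences,
        # scaled by sequence length; 0.0 when the residue occurs fewer than twice.
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        mean_gaps.append(sum(gaps) / len(gaps) / n if gaps else 0.0)

    return composition + mean_gaps
```

Every sequence maps to a fixed-length vector of 40 values regardless of its length, which is what lets a feature selection step and a standard classifier operate on sequences of different sizes.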

Index Terms

1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning

Index terms have been assigned to the content through auto-classification.

Published In

Computational Intelligence, Volume 33, Issue 1

February 2017

142 pages

ISSN: 0824-7935

Publisher

Blackwell Publishers, Inc.

United States
