
Computational Technique for an Efficient Classification of Protein Sequences With Distance-Based Sequence Encoding Algorithm

Published: 01 February 2017

    Abstract

    Machine learning is increasingly applied in bioinformatics and computational biology to solve challenging problems that emerge in the analysis and modeling of biological data such as DNA, RNA, and protein sequences. Classifying protein sequences into existing families or superfamilies poses three major problems: selecting a suitable sequence encoding method, extracting an optimized subset of features that carries significant discriminatory information, and adopting a learning algorithm that classifies protein sequences with high accuracy. Accurate classification of protein sequences helps in determining the structure and function of novel protein sequences. In this article, we propose a distance-based sequence encoding algorithm that captures a sequence's statistical characteristics along with its amino acid sequence order information. A statistical metric-based feature selection algorithm is then adopted to identify a reduced set of features that represents the original feature space. The performance of the proposed technique is validated using some of the best-performing classifiers previously applied to protein sequence classification. An average classification accuracy of 92% was achieved on the yeast protein sequence data set downloaded from the benchmark UniProtKB database.
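To make the idea of combining composition statistics with order information concrete, here is a minimal Python sketch. It is not the algorithm proposed in the article, whose details are not given in the abstract. It assumes one simple, hypothetical encoding: for each of the 20 standard amino acids, the relative frequency (a statistical characteristic) and the mean distance between consecutive occurrences (a rough proxy for sequence order). The function name `encode_sequence` and the 40-dimensional layout are illustrative assumptions.

```python
# Illustrative sketch only: a hypothetical distance-based encoding,
# NOT the encoding algorithm defined in the article.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def encode_sequence(seq):
    """Encode a protein sequence as a 40-dimensional feature vector.

    For each standard amino acid, two values are emitted in order:
      1. relative frequency of the residue in the sequence
      2. mean distance between consecutive occurrences
         (0.0 when the residue occurs fewer than two times)
    Non-standard symbols count toward the length but get no features.
    """
    seq = seq.upper()
    positions = {aa: [] for aa in AMINO_ACIDS}
    for index, residue in enumerate(seq):
        if residue in positions:
            positions[residue].append(index)

    length = len(seq)
    features = []
    for aa in AMINO_ACIDS:
        pos = positions[aa]
        frequency = len(pos) / length if length else 0.0
        # Distances between each occurrence and the next one.
        gaps = [later - earlier for earlier, later in zip(pos, pos[1:])]
        mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
        features.extend([frequency, mean_gap])
    return features


if __name__ == "__main__":
    vector = encode_sequence("ACAA")
    print(len(vector), vector[:4])
```

A fixed-length vector of this kind could then be passed to a feature selection step and a standard classifier, which is the general pipeline the abstract outlines.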

                Published In

                Computational Intelligence, Volume 33, Issue 1
                February 2017
                142 pages

                Publisher

                Blackwell Publishers, Inc.

                United States

                Author Tags

                1. biological data mining
                2. feature selection
                3. protein classification algorithm
                4. sequence encoding
                5. superfamily
