Abstract
Sequence-derived structural and physicochemical features have been used to develop models for predicting protein families. Here, we test the hypothesis that high-level functional groups of proteins may be classified by a very small set of global features directly extracted from sequence alone. To test this, we represent each protein using a small number of normalized global sequence features and classify them into functional groups, using support vector machines (SVM). Furthermore, the contribution of specific subsets of features to the classification quality is thoroughly investigated. The representation of proteins using global features provides effective information for protein family classification, with comparable results to those obtained by representation using local sequence alignment scores. Furthermore, a combination of global and local sequence features significantly improves classification performance.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)
Scheeff, E.D., Bourne, P.E.: Application of protein structure alignments to iterated hidden markov model protocols for structure prediction. BMC Bioinformatics 7, 410 (2006)
Portugaly, E., Harel, A., Linial, N., Linial, M.: Everest: automatic identification and classification of protein domains in all protein sequences. BMC Bioinformatics 7, 277 (2006)
Gribskov, M., McLachlan, A.D., Eisenberg, D.: Profile analysis: detection of distantly related proteins. PNAS 84(13), 4355–4358 (1987)
Yona, G., Levitt, M.: Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315(5), 1257–1275 (2002)
Levitt, M., Gerstein, M.: A unified statistical framework for sequence comparison and structure comparison. PNAS 95(11), 5913–5920 (1998)
Rost, B.: Topits: threading one-dimensional predictions into three-dimensional structures. In: Proc. Int. Conf. Intell. Syst. Mol. Biol., vol. 3, pp. 314–321 (1995)
Frith, M.C., et al.: The abundance of short proteins in the mammalian proteome. PLoS Genet 2(4), e52 (2006)
Friedberg, I., Kaplan, T., Margalit, H.: Glimmers in the midnight zone: characterization of aligned identical residues in sequence-dissimilar proteins sharing a common fold. In: Proc. Int. Conf. Intell. Syst. Mol. Biol., vol. 8, pp. 162–170 (2000)
Wu, C.H., Apweiler, R., Bairoch, A., Natale, D.A., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Mazumder, R., O’Donovan, C., Redaschi, N., Suzek, B.: The universal protein resource (uniprot): an expanding universe of protein information. Nucleic Acids Res. 34(Database issue), 187–191 (2006)
Kunik, V., Solan, Z., Edelman, S., Ruppin, E., Horn, D.: Motif Extraction and Protein Classification. In: IEEE Computational Systems Bioinformatics Conference (CSB 2005), pp. 80–85. IEEE Computer Society Press, Los Alamitos (2005)
Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z.: Svm-prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31(13), 3692–3697 (2003)
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., Sonnhammer, E.L.: The pfam protein families database. Nucleic Acids Res. 30(1), 276–280 (2002)
Syed, U., Yona, G.: Using a mixture of probabilistic decision trees for direct prediction of protein function. In: Proceedings of RECOMB, pp. 224–234 (2003)
Chou, K.C.: Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21(1), 10–19 (2005)
Kahsay, R.Y., Gao, G., Liao, L.: An improved hidden markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics 21(9), 1853–1858 (2005)
Chou, K.C., Cai, Y.D.: Predicting protein quaternary structure by pseudo amino acid composition. Proteins 53(2), 282–289 (2003)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182 (2003)
Camon, E., Barrell, D., Lee, V., Dimmer, E., Apweiler, R.: The gene ontology annotation (goa) database–an integrated resource of go annotations to the uniprot knowledgebase. Silico Biol. 4(1), 5–6 (2004)
Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147(1), 195–197 (1981)
Hulo, N., et al.: The prosite database. Nucleic Acids Res. 34(Database issue), D227–D230 (2006)
Gasteiger, E., Gattiker, A., Hoogland, C., Ivanyi, I., Appel, R.D., Bairoch, A.: Expasy: The proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31(13), 3784–3788 (2003)
Jones, D.T.: Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292(2), 195–202 (1999)
Eichacker, L.A., Granvogl, B., Mirus, O., Muller, B.C., Miess, C., Schleiff, E.: Hiding behind hydrophobicity. transmembrane segments in mass spectrometry. J. Biol. Chem. 279(49), 50915–50922 (2004)
Skoufos, E.: Conserved sequence motifs of olfactory receptor-like proteins may participate in upstream and downstream signal transduction. Receptors Channels 6(5), 401–413 (1999)
Henikoff, J.G., et al.: Increased coverage of protein families with the blocks database servers. Nucl. Acids Res. 28(1), 228–230 (2000)
Conticello, S.G., Pilpel, Y., Glusman, G., Fainzilber, M.: Position-specific codon conservation in hypervariable gene families. Trends Genet 16(2), 57–59 (2000)
Paulsen, I.T., Park, J.H., Choi, P.S., Saier, M.H.: A family of gram-negative bacterial outer membrane factors that function in the export of proteins, carbohydrates, drugs and heavy metals from gram-negative bacteria. FEMS Microbiology Letters 156(1), 1–8 (1997)
Chakrabarti, S., Lanczycki, C.J.: Analysis and prediction of functionally important sites in proteins. Protein Sci. 16(1), 4–13 (2007)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Varshavsky, R., Fromer, M., Man, A., Linial, M. (2007). When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features. In: Giancarlo, R., Hannenhalli, S. (eds) Algorithms in Bioinformatics. WABI 2007. Lecture Notes in Computer Science(), vol 4645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74126-8_3
Download citation
DOI: https://doi.org/10.1007/978-3-540-74126-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74125-1
Online ISBN: 978-3-540-74126-8
eBook Packages: Computer ScienceComputer Science (R0)