When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features

Varshavsky, Roy; Fromer, Menachem; Man, Amit; Linial, Michal

doi:10.1007/978-3-540-74126-8_3

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4645)

Included in the following conference series:

International Workshop on Algorithms in Bioinformatics

Abstract

Sequence-derived structural and physicochemical features have been used to develop models for predicting protein families. Here, we test the hypothesis that high-level functional groups of proteins can be classified using a very small set of global features extracted directly from sequence alone. To this end, we represent each protein by a small number of normalized global sequence features and classify the proteins into functional groups using support vector machines (SVMs). We also thoroughly investigate the contribution of specific feature subsets to classification quality. The global-feature representation provides effective information for protein family classification, with results comparable to those obtained from a representation based on local sequence alignment scores. Moreover, combining global and local sequence features significantly improves classification performance.
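
As a rough illustration of what "normalized global sequence features" could look like in practice, the following Python sketch computes three whole-sequence features per protein and z-score normalizes each feature across the dataset. The particular features (length, hydrophobic fraction, charged fraction), the residue groupings, the function names, and the toy sequences are all assumptions made for this example; the abstract does not specify which features the authors used. The resulting vectors are the kind of input one might pass to an SVM classifier.

```python
from statistics import mean, pstdev

# Residue groupings below are illustrative assumptions, not taken from the paper.
HYDROPHOBIC = set("AVILMFWC")
CHARGED = set("DEKR")


def global_features(seq):
    """Return a small vector of whole-sequence features for one protein."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    return [
        float(n),                                  # sequence length
        sum(r in HYDROPHOBIC for r in seq) / n,    # fraction of hydrophobic residues
        sum(r in CHARGED for r in seq) / n,        # fraction of charged residues
    ]


def zscore_columns(rows):
    """Normalize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


# Toy sequences for demonstration only; they are not real proteins.
sequences = ["MKTLLVAAG", "DEKRDEKRGG", "AVILMFWCAV"]
X = zscore_columns([global_features(s) for s in sequences])
```

Each row of `X` is a fixed-length vector regardless of protein length, which is what distinguishes this kind of global representation from alignment-based local scores.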

Author information

Authors and Affiliations

School of Computer Science and Engineering, The Hebrew University of Jerusalem,
Roy Varshavsky, Menachem Fromer & Amit Man
Department of Biological Chemistry, The Hebrew University of Jerusalem,
Michal Linial

Editor information

Raffaele Giancarlo, Sridhar Hannenhalli

About this paper

Cite this paper

Varshavsky, R., Fromer, M., Man, A., Linial, M. (2007). When Less Is More: Improving Classification of Protein Families with a Minimal Set of Global Features. In: Giancarlo, R., Hannenhalli, S. (eds) Algorithms in Bioinformatics. WABI 2007. Lecture Notes in Computer Science, vol 4645. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74126-8_3

DOI: https://doi.org/10.1007/978-3-540-74126-8_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74125-1
Online ISBN: 978-3-540-74126-8
eBook Packages: Computer Science, Computer Science (R0)
