Hierarchical Motif Vectors for Prediction of Functional Sites in Amino Acid Sequences Using Quasi-Supervised Learning

Author:

Bilge Karacali

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 9, Issue 5

Pages 1432 - 1441

https://doi.org/10.1109/TCBB.2012.68

Published: 01 September 2012

Abstract

We propose hierarchical motif vectors that represent local amino acid sequence configurations, and use them to predict the functional attributes of amino acid sites on a global scale within a quasi-supervised learning framework. The motif vectors are constructed by wavelet decomposition of the variations of physico-chemical amino acid properties along the sequences. We then formulate a prediction scheme for the functional attributes of amino acid sites in terms of their motif vectors, using the quasi-supervised learning algorithm, which carries out predictions for all sites under consideration using only the experimentally verified sites. We carried out a comparative performance evaluation of the proposed method on the prediction of N-glycosylation at 55,184 sites possessing the consensus N-glycosylation sequon, identified over 15,104 human proteins, of which only 1,939 were experimentally verified N-glycosylation sites. In the experiments, the proposed method achieved better predictive performance than alternative strategies from the literature. In addition, the predicted N-glycosylation sites showed good agreement with existing potential annotations, while the novel predictions belonged to proteins known to be modified by glycosylation.
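As a rough illustration of the two ingredients the abstract names, the sketch below locates candidate sites carrying the consensus N-glycosylation sequon (N-X-S/T with X not proline) and builds a multi-scale vector from a physico-chemical property profile around each site. It is a minimal sketch under stated assumptions, not the paper's procedure: the Kyte-Doolittle hydropathy scale, the Haar wavelet, the 8-residue window, zero padding at the sequence ends, and the function names `find_sequons`, `haar_decompose`, and `motif_vector` are all illustrative choices that the abstract does not specify.

```python
import re

# Kyte-Doolittle hydropathy index. Assumption: the abstract does not say
# which physico-chemical properties the paper uses; this is one common scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def find_sequons(seq):
    """Return 0-based positions of N in every N-X-[S/T] sequon (X != P).

    A lookahead is used so that overlapping sequons are all reported.
    """
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", seq)]


def haar_decompose(signal):
    """Haar decomposition of a power-of-two-length signal.

    Uses pairwise averages and half-differences (not the orthonormal
    scaling). Output order: overall average, then detail coefficients
    from the coarsest scale to the finest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Prepend so coarser scales end up ahead of finer ones.
        details = [(a - b) / 2 for a, b in pairs] + details
        approx = [(a + b) / 2 for a, b in pairs]
    return approx + details


def motif_vector(seq, site, half_width=4):
    """Multi-scale property vector for the window [site - half_width, site + half_width).

    Positions outside the sequence, and unknown residue codes, contribute 0.0
    (an illustrative padding choice). 2 * half_width must be a power of two.
    """
    values = []
    for i in range(site - half_width, site + half_width):
        inside = 0 <= i < len(seq)
        values.append(KD.get(seq[i], 0.0) if inside else 0.0)
    return haar_decompose(values)


if __name__ == "__main__":
    protein = "MNGTNPSANAS"  # toy sequence, not from the paper's data set
    for pos in find_sequons(protein):
        print(pos, motif_vector(protein, pos))
```

In a pipeline of the kind the abstract describes, vectors like these would be computed for every sequon-bearing site and then passed to the learning step; that step is not sketched here because the abstract does not describe it in enough detail.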

Cited By

Karaçalı B (2016) An efficient algorithm for large-scale quasi-supervised learning, Pattern Analysis & Applications, 19:2 (311-323), 10.1007/s10044-014-0401-y. Online publication date: 1-May-2016
https://dl.acm.org/doi/10.1007/s10044-014-0401-y

Index Terms
1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
    2. Machine learning algorithms

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 9, Issue 5

September 2012

287 pages

ISSN:1545-5963

Publisher

IEEE Computer Society Press

Washington, DC, United States
