Hierarchical Motif Vectors for Prediction of Functional Sites in Amino Acid Sequences Using Quasi-Supervised Learning

Published: 01 September 2012

Abstract

We propose hierarchical motif vectors to represent local amino acid sequence configurations for predicting the functional attributes of amino acid sites on a global scale in a quasi-supervised learning framework. The motif vectors are constructed via wavelet decomposition of the variations of physico-chemical amino acid properties along the sequences. We then formulate a prediction scheme for the functional attributes of amino acid sites in terms of their respective motif vectors, using the quasi-supervised learning algorithm, which carries out predictions for all sites under consideration using only the experimentally verified sites. We carried out a comparative performance evaluation of the proposed method on the prediction of N-glycosylation at 55,184 sites possessing the consensus N-glycosylation sequon, identified across 15,104 human proteins, of which only 1,939 were experimentally verified N-glycosylation sites. In the experiments, the proposed method achieved better predictive performance than the alternative strategies from the literature. In addition, the predicted N-glycosylation sites showed good agreement with existing potential annotations, while the novel predictions belonged to proteins known to be modified by glycosylation.
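The abstract does not spell out how a motif vector is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the common N-X-S/T sequon definition (X not proline), a single property scale (Kyte-Doolittle hydropathy), an unnormalized Haar wavelet, and a 16-residue window; the function names `find_sequons`, `haar_levels`, and `motif_vector` are hypothetical.

```python
import re

# Kyte-Doolittle hydropathy scale. ASSUMPTION: the paper uses its own set of
# physico-chemical properties; this single scale is only for illustration.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# ASSUMPTION: sequon taken as Asn-X-Ser/Thr with X != Pro.
# The lookahead lets overlapping sequons be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(seq):
    """Return 0-based positions of Asn residues that start a sequon."""
    return [m.start() for m in SEQUON.finditer(seq)]


def haar_levels(signal):
    """Unnormalized Haar decomposition of a power-of-two-length signal.

    Returns (details, mean): detail coefficients per level, finest first,
    and the overall average of the signal.
    """
    details = []
    approx = list(signal)
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    return details, approx[0]


def motif_vector(seq, site, half_width=8, pad=0.0):
    """Coarse-to-fine wavelet coefficients of the property profile near `site`.

    The window covers 2 * half_width residues; half_width must be a power of
    two. Positions outside the sequence or unknown residues are set to `pad`.
    """
    window = [
        KD.get(seq[i], pad) if 0 <= i < len(seq) else pad
        for i in range(site - half_width, site + half_width)
    ]
    details, mean = haar_levels(window)
    vector = [mean]
    for level in reversed(details):  # coarsest level first
        vector.extend(level)
    return vector


if __name__ == "__main__":
    sequence = "MNGTANPSNKS"  # toy sequence, not from the paper's data set
    for pos in find_sequons(sequence):
        print(pos, motif_vector(sequence, pos))
```

Ordering the coefficients from coarse to fine is one way to obtain a hierarchical description in which the leading entries summarize the broad neighborhood of a site and later entries add local detail. Vectors of this kind, computed for every candidate site, would then be passed to a learner together with the labels of the experimentally verified sites.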

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 9, Issue 5
September 2012
287 pages

Publisher

IEEE Computer Society Press
Washington, DC, United States

Author Tags

1. Amino acids
2. Approximation methods
3. Databases
4. Functional attribute prediction
5. Humans
6. Prediction algorithms
7. Proteins
8. Vectors
9. hierarchical motif vectors
10. protein sequence analysis
11. quasi-supervised learning
