Abstract
In the post-genome era, protein domain structures are being published rapidly, but they have not yet been studied comprehensively. Recent research has turned to protein–DNA interactions as a way to interpret these domain structures and, through them, to understand cell function. Several machine-learning methods have been applied to this problem; however, they do not efficiently translate the tertiary-structure characteristics of proteins into features suitable for predicting DNA-binding proteins. In this work, we propose a novel machine-learning approach based on hidden Markov models that identifies the characteristics of DNA-binding proteins from their amino acid sequences and tertiary structures. Using the features extracted from DNA-binding proteins, a support vector machine classifier predicts general DNA-binding proteins with an accuracy of 88.45 % under five-fold cross-validation. Furthermore, we construct a response-element-specific classifier for predicting response-element-specific DNA-binding proteins; it achieves a precision of 96.57 % and a recall of 88.83 % on average. To verify the predictions, we applied the method to DNA-binding proteins from MCF-7 cells that are likely to bind estrogen response elements (ERE), and the results indicate that our methods are applicable in practice.
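The evaluation protocol named in the abstract (five-fold cross-validation, reported as accuracy, precision and recall) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `k_fold_indices`, `precision_recall_accuracy`, `cross_validate` and `train_threshold` are hypothetical, the one-dimensional threshold classifier merely stands in for the support vector machine, and the toy scores stand in for the HMM-derived features. Folds here are contiguous and unshuffled, which is an assumption made only to keep the example short.

```python
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[float], int]


def k_fold_indices(n_samples: int, k: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous (train, test) index pairs."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def precision_recall_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary metrics with label 1 as the positive (DNA-binding) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


def cross_validate(features: Sequence[float], labels: Sequence[int],
                   train_fn: Callable[[List[float], List[int]], Predictor],
                   k: int = 5) -> Tuple[float, float, float]:
    """Average (precision, recall, accuracy) over k folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        y_pred = [predict(features[i]) for i in test_idx]
        y_true = [labels[i] for i in test_idx]
        scores.append(precision_recall_accuracy(y_true, y_pred))
    return tuple(sum(s[j] for s in scores) / k for j in range(3))


def train_threshold(xs: List[float], ys: List[int]) -> Predictor:
    """Stand-in classifier (NOT an SVM): cut halfway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


if __name__ == "__main__":
    # Toy scores and labels, invented purely for illustration.
    toy_scores = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.05, 0.95, 0.3, 0.7]
    toy_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
    print(cross_validate(toy_scores, toy_labels, train_threshold, k=5))
```

In a real evaluation the samples would normally be shuffled or stratified before splitting so that each fold reflects the class balance of the full data set; whether the authors did so is not stated in the abstract.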
Additional information
Communicated by G. Acampora.
About this article
Cite this article
Hsu, YY., Chen, WJ., Chen, SH. et al. Using hidden Markov models to predict DNA-binding proteins with sequence and structure information. Soft Comput 18, 2365–2376 (2014). https://doi.org/10.1007/s00500-013-1210-8
DOI: https://doi.org/10.1007/s00500-013-1210-8