Abstract
We propose a new computational technique for protein classification, where the goal is to predict the functional family of a novel protein sequence from its motif composition. To improve on the results of existing approaches, we introduce Highest Subset Probability (HiSP), a data mining technique for protein classification based on Bayes' theorem. We evaluate the proposal on datasets extracted from Prosite, a curated protein family database. In these experiments, the proposed method outperforms other known methods on all tested datasets, and it appears promising for problems with characteristics similar to the one addressed here.
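The abstract does not spell out how HiSP computes its probabilities, so the following is not the authors' method. It is a minimal sketch of a generic Bayes-theorem classifier (naive Bayes with add-one smoothing) applied to the same kind of input, a protein represented as a set of motifs, to illustrate the task setting. The motif identifiers, family labels, and function names are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
import math


def train(samples):
    """Count families and motif occurrences.

    samples: list of (set_of_motif_ids, family_label) pairs.
    """
    class_counts = Counter(label for _, label in samples)
    motif_counts = defaultdict(Counter)  # family -> motif -> number of proteins
    vocabulary = set()
    for motifs, label in samples:
        motif_counts[label].update(motifs)
        vocabulary.update(motifs)
    return class_counts, motif_counts, vocabulary


def classify(motifs, model):
    """Return the family with the highest naive Bayes log-posterior score.

    Motifs in the query that never appeared in training are ignored.
    """
    class_counts, motif_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior P(family)
        for motif in vocabulary:
            # Add-one smoothed estimate of P(motif present | family).
            p = (motif_counts[label][motif] + 1) / (n + 2)
            score += math.log(p if motif in motifs else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: four training proteins in two made-up families.
training = [
    ({"M1", "M2"}, "famA"),
    ({"M1", "M3"}, "famA"),
    ({"M4"}, "famB"),
    ({"M4", "M5"}, "famB"),
]
model = train(training)
print(classify({"M1"}, model))
print(classify({"M4", "M5"}, model))
```

This baseline treats motifs as conditionally independent given the family. The abstract states only that HiSP is based on Bayes' theorem, so the independence assumption here should not be read as a description of HiSP.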
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Merschmann, L., Plastino, A. (2006). HiSP: A Probabilistic Data Mining Technique for Protein Classification. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds) Computational Science – ICCS 2006. ICCS 2006. Lecture Notes in Computer Science, vol 3992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11758525_115
DOI: https://doi.org/10.1007/11758525_115
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34381-3
Online ISBN: 978-3-540-34382-0
eBook Packages: Computer Science, Computer Science (R0)