HiSP: A Probabilistic Data Mining Technique for Protein Classification

Merschmann, Luiz; Plastino, Alexandre

doi:10.1007/11758525_115

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3992)

Included in the following conference series:

International Conference on Computational Science

Abstract

In this work, we propose a new computational technique for the protein classification problem, in which the goal is to predict the functional family of novel protein sequences from their motif composition. To improve on the results obtained with other known approaches, we introduce a data mining technique based on Bayes’ theorem, called Highest Subset Probability (HiSP). We evaluate the proposal on datasets extracted from Prosite, a curated protein family database. The computational results show that the proposed method outperforms other known methods on all tested datasets and looks very promising for problems with characteristics similar to the one addressed here.
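
For orientation, the general idea of predicting a protein family from motif composition with Bayes' theorem can be sketched as below. This is a plain Bernoulli naive Bayes classifier, not the HiSP algorithm itself, whose details are not given in this abstract. The function names, the Laplace smoothing, and the toy motif and family labels are all illustrative assumptions, not taken from the paper or from Prosite.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: each protein is a set of motif identifiers plus a family label.
TRAINING = [
    ({"motif_A", "motif_B"}, "family_1"),
    ({"motif_A", "motif_C"}, "family_1"),
    ({"motif_D", "motif_B"}, "family_2"),
    ({"motif_D"}, "family_2"),
]


def train(samples):
    """Count how often each family and each (family, motif) pair occurs."""
    class_counts = Counter()
    motif_counts = defaultdict(Counter)
    vocabulary = set()
    for motifs, family in samples:
        class_counts[family] += 1
        for motif in motifs:
            motif_counts[family][motif] += 1
        vocabulary.update(motifs)
    return class_counts, motif_counts, vocabulary


def classify(motifs, model):
    """Return the family with the highest posterior under naive Bayes.

    Uses P(family | motifs) proportional to P(family) * prod P(motif state | family),
    with Laplace smoothing (an assumption) so no probability is 0 or 1.
    """
    class_counts, motif_counts, vocabulary = model
    total = sum(class_counts.values())
    best_family, best_score = None, -math.inf
    for family, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for motif in vocabulary:
            p_present = (motif_counts[family][motif] + 1) / (n + 2)
            score += math.log(p_present if motif in motifs else 1.0 - p_present)
        if score > best_score:
            best_family, best_score = family, score
    return best_family


model = train(TRAINING)
print(classify({"motif_A"}, model))
print(classify({"motif_D", "motif_B"}, model))
```

The sketch treats motifs as conditionally independent given the family, which is the usual naive Bayes simplification; the abstract does not say whether HiSP makes the same assumption.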

Author information

Authors and Affiliations

Departamento de Ciência da Computação, Universidade Federal Fluminense, Niterói, Brazil
Luiz Merschmann & Alexandre Plastino

Editor information

Editors and Affiliations

Advanced Computing and Emerging Technologies Centre, The School of Systems Engineering, University of Reading, RG6 6AY, Reading, United Kingdom
Vassil N. Alexandrov
Department of Mathematics and Computer Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Geert Dick van Albada
Faculty of Sciences, Section of Computational Science, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, The Netherlands
Peter M. A. Sloot
Computer Science Department, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra

About this paper

Cite this paper

Merschmann, L., Plastino, A. (2006). HiSP: A Probabilistic Data Mining Technique for Protein Classification. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds) Computational Science – ICCS 2006. ICCS 2006. Lecture Notes in Computer Science, vol 3992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11758525_115

DOI: https://doi.org/10.1007/11758525_115
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-34381-3
Online ISBN: 978-3-540-34382-0
eBook Packages: Computer Science, Computer Science (R0)
