Community Detection-Based Feature Construction for Protein Sequence Classification

Tangirala, Karthik; Herndon, Nic; Caragea, Doina

doi:10.1007/978-3-319-19048-8_28

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 9096)

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

Abstract

Machine learning algorithms are widely used to annotate biological sequences, and low-dimensional, informative feature vectors can be crucial to their performance. In prior work, we proposed a community detection approach for constructing low-dimensional feature sets for nucleotide sequence classification. That approach uses the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and then applies community detection to identify groups of k-mers that appear frequently in a set of sequences. While this approach worked well for nucleotide sequence classification, it could not be applied directly to protein sequences, because the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extend our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
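
The pipeline outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the substitution matrix, the edge threshold, or the community detection algorithm, so everything below is an assumption made for illustration. The residue classes and the `score` function are a hypothetical toy scoring scheme (not BLOSUM or PAM values), `threshold` is an arbitrary cutoff, and a deterministic variant of label propagation stands in for whichever community detection method the paper uses. Each sequence is finally represented by how many of its k-mers fall into each community.

```python
from collections import Counter
from itertools import combinations

# ASSUMPTION: toy residue classes used only to fake substitution scores.
# A real pipeline would look scores up in a matrix such as BLOSUM62.
CLASS_OF = {}
for name, residues in {"hydrophobic": "AVLIM", "positive": "KRH", "negative": "DE"}.items():
    for r in residues:
        CLASS_OF[r] = name


def kmers(seq, k):
    """Return all overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def score(a, b):
    """Hypothetical substitution score between two equal-length k-mers:
    +2 per identical residue, +1 per same-class residue, -1 otherwise."""
    total = 0
    for x, y in zip(a, b):
        if x == y:
            total += 2
        elif x in CLASS_OF and CLASS_OF[x] == CLASS_OF.get(y):
            total += 1
        else:
            total -= 1
    return total


def build_network(vocab, threshold):
    """Connect two k-mers when their score reaches the (assumed) threshold."""
    adj = {km: set() for km in vocab}
    for a, b in combinations(vocab, 2):
        if score(a, b) >= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def label_propagation(adj, max_iter=100):
    """Deterministic label propagation (a stand-in community detector):
    nodes are visited in sorted order and ties go to the smallest label."""
    labels = {node: node for node in adj}
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue  # isolated k-mers keep their own label
            counts = Counter(labels[n] for n in adj[node])
            best = max(counts.values())
            new = min(lab for lab, c in counts.items() if c == best)
            if new != labels[node]:
                labels[node] = new
                changed = True
        if not changed:
            break
    return labels


def community_features(seq, k, labels):
    """Count, per community, how many k-mers of seq belong to it."""
    index = {c: i for i, c in enumerate(sorted(set(labels.values())))}
    vec = [0] * len(index)
    for km in kmers(seq, k):
        if km in labels:
            vec[index[labels[km]]] += 1
    return vec


if __name__ == "__main__":
    train = ["AVL", "AVI", "KRD", "KRE"]
    vocab = sorted({km for s in train for km in kmers(s, 3)})
    labels = label_propagation(build_network(vocab, threshold=4))
    print(community_features("AVLKRE", 3, labels))
```

The resulting vector has one entry per community rather than one per k-mer, which is where the dimensionality reduction comes from: similar k-mers such as `AVL` and `AVI` share a single feature.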

Author information

Authors and Affiliations

Department of Computing and Information Sciences, Kansas State University, Manhattan, KS, 66502, USA
Karthik Tangirala, Nic Herndon & Doina Caragea

Corresponding author

Correspondence to Karthik Tangirala.

Editor information

Editors and Affiliations

Georgia State University, Atlanta, USA
Robert Harrison
Old Dominion University, Norfolk, USA
Yaohang Li
University of Connecticut, Storrs, Connecticut, USA
Ion Măndoiu

About this paper

Cite this paper

Tangirala, K., Herndon, N., Caragea, D. (2015). Community Detection-Based Feature Construction for Protein Sequence Classification. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_28

DOI: https://doi.org/10.1007/978-3-319-19048-8_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19047-1
Online ISBN: 978-3-319-19048-8
eBook Packages: Computer Science, Computer Science (R0)
