Abstract
Machine learning algorithms are widely used to annotate biological sequences, and low-dimensional, informative feature vectors can be crucial to their performance. In prior work, we proposed a community detection approach for constructing low-dimensional feature sets for nucleotide sequence classification. That approach uses the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and then applies community detection to identify groups of k-mers that appear frequently in a set of sequences. While this approach worked well for nucleotide sequence classification, it could not be applied directly to protein sequences, because the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extend our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
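The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the residue groups and scores below are a toy stand-in for a real substitution matrix such as BLOSUM62 or PAM, the edge threshold `min_score` is a hypothetical parameter, and a simple deterministic label propagation is used in place of whichever community detection algorithm the paper employs. All function names are illustrative.

```python
from collections import Counter
from itertools import combinations

# Toy residue groups (an assumption for illustration only; a real run
# would look up scores in a substitution matrix such as BLOSUM62).
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "CGP"]
GROUP_OF = {aa: i for i, group in enumerate(GROUPS) for aa in group}


def sub_score(a, b):
    """Toy substitution score: identical > same group > different group."""
    if a == b:
        return 4
    if GROUP_OF[a] == GROUP_OF[b]:
        return 1
    return -2


def kmer_score(x, y):
    """Sum of per-position substitution scores for two equal-length k-mers."""
    return sum(sub_score(a, b) for a, b in zip(x, y))


def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_graph(all_kmers, min_score):
    """Connect two distinct k-mers when their score reaches min_score."""
    nodes = sorted(set(all_kmers))
    adj = {n: set() for n in nodes}
    for x, y in combinations(nodes, 2):
        if kmer_score(x, y) >= min_score:
            adj[x].add(y)
            adj[y].add(x)
    return adj


def label_propagation(adj, max_iters=20):
    """Deterministic label propagation (fixed node order, smallest-label ties)."""
    labels = {n: n for n in adj}
    for _ in range(max_iters):
        changed = False
        for n in sorted(adj):
            if not adj[n]:
                continue  # isolated k-mers keep their own label
            counts = Counter(labels[m] for m in adj[n])
            best = max(counts.values())
            new = min(lab for lab, c in counts.items() if c == best)
            if new != labels[n]:
                labels[n] = new
                changed = True
        if not changed:
            break
    return labels


def community_features(seq, k, labels):
    """Count how many k-mers of seq fall into each community."""
    communities = sorted(set(labels.values()))
    counts = Counter(labels[m] for m in kmers(seq, k) if m in labels)
    return [counts[c] for c in communities]
```

In this sketch, a sequence is represented by one count per community rather than one count per k-mer, which is what keeps the feature vector low-dimensional. K-mers that were not part of the graph are simply skipped when a new sequence is featurized.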
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Tangirala, K., Herndon, N., Caragea, D. (2015). Community Detection-Based Feature Construction for Protein Sequence Classification. In: Harrison, R., Li, Y., Măndoiu, I. (eds) Bioinformatics Research and Applications. ISBRA 2015. Lecture Notes in Computer Science, vol 9096. Springer, Cham. https://doi.org/10.1007/978-3-319-19048-8_28
DOI: https://doi.org/10.1007/978-3-319-19048-8_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19047-1
Online ISBN: 978-3-319-19048-8
eBook Packages: Computer Science (R0)