Abstract
As an important attribute of proteins, subcellular location can provide valuable information about protein function. Determining protein subcellular locations experimentally is usually expensive and time-consuming. Over the years, a variety of computational approaches have been developed to predict subcellular locations from proteins whose locations are already known. However, the problem is inherently hard, especially for proteins that can exist at multiple subcellular locations, and further study is still needed. In this paper, we propose an ensemble learning framework that uses a modified Weighted K-Nearest Neighbors (WKNN) algorithm as the base learner. Two types of features are extracted from the training data: one based on protein amino acid composition (Amphiphilic Pseudo Amino Acid Composition, or AmPseAAC) and one based on protein sequence similarity (Protein Similarity Measure, or PSM). Two classifiers are trained separately on these two feature types, and each assigns a query protein a probability distribution over the possible locations. Based on the outputs of the two base classifiers, we propose a novel ensemble strategy named Maximized Probability on Label (MPoL), which produces the final set of locations for each protein by integrating the base classifiers' predictions through an optimization procedure. Prediction quality is measured with two types of evaluation metrics: example-based and label-based. For an objective evaluation, we compare our results with those of another popular method, iLoc-Animal, on a benchmark dataset through cross-validation. In terms of absolute true success rate on multi-location prediction, MPoL achieves much better results than iLoc-Animal. This suggests that the proposed method has potential for a diverse set of multi-label learning problems.
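To make the pipeline described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes a plain inverse-distance weighting for the base classifier (the paper's modified WKNN is not specified in the abstract), and it replaces the MPoL optimization with a simple average of the two distributions followed by a relative threshold. All function names and the `k`, `eps`, and `ratio` parameters are hypothetical. The last function computes the absolute true rate, read here as the fraction of proteins whose predicted location set exactly matches the true set.

```python
import numpy as np


def wknn_label_distribution(X_train, Y_train, x, k=3, eps=1e-9):
    """Distance-weighted KNN vote, normalised into a distribution over labels.

    Assumption: inverse-distance weights; the paper's modified WKNN may differ.
    X_train: (n, d) features, Y_train: (n, n_labels) binary label matrix.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]           # indices of the k closest proteins
    weights = 1.0 / (dists[nearest] + eps)    # closer neighbours count more
    scores = weights @ Y_train[nearest]       # weighted vote per label
    total = scores.sum()
    if total == 0:                            # no neighbour carries any label
        return np.full(Y_train.shape[1], 1.0 / Y_train.shape[1])
    return scores / total


def combine_and_select(p_a, p_b, ratio=0.5):
    """Stand-in for MPoL (assumption): average the two base distributions and
    keep every label whose probability is at least `ratio` times the maximum."""
    p = 0.5 * (np.asarray(p_a) + np.asarray(p_b))
    return set(np.flatnonzero(p >= ratio * p.max()).tolist())


def absolute_true(pred_sets, true_sets):
    """Fraction of examples whose predicted label set equals the true set."""
    hits = sum(p == t for p, t in zip(pred_sets, true_sets))
    return hits / len(true_sets)


if __name__ == "__main__":
    # Toy data: three training proteins, two features, three locations.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
    Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
    query = np.array([0.0, 0.1])

    # In the paper the two classifiers use AmPseAAC and PSM features;
    # here the same toy features are reused with different k for illustration.
    p_first = wknn_label_distribution(X, Y, query, k=2)
    p_second = wknn_label_distribution(X, Y, query, k=3)
    print(combine_and_select(p_first, p_second))
```

In practice the two base classifiers would be given different feature matrices for the same proteins, and the averaging step would be replaced by whatever optimization MPoL actually performs.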
Acknowledgment
This work was supported by the National Natural Science Foundation of China (Grant No. 61302128) and the Science and Technology Foundation of University of Jinan (Grant No. XKY1402), and JL was supported in part by the National Science Foundation grant [III1162374] and the National Institutes of Health (HG008632).
Cite this article
Qiao, S., Yan, B. & Li, J. Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features. Appl Intell 48, 1813–1824 (2018). https://doi.org/10.1007/s10489-017-1029-6
DOI: https://doi.org/10.1007/s10489-017-1029-6