Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features

Qiao, Shanping; Yan, Baoqiang; Li, Jing

doi:10.1007/s10489-017-1029-6

Published: 09 September 2017

Volume 48, pages 1813–1824, (2018)
Applied Intelligence

Shanping Qiao¹,², Baoqiang Yan³ & Jing Li⁴

Abstract

As an important attribute of proteins, subcellular location can provide valuable information about protein function. Determining protein subcellular locations with experimental methods is usually expensive and time-consuming. Over the years, a variety of computational approaches have been developed to predict protein subcellular locations from the known locations of other proteins. However, the problem is inherently hard, especially for proteins that can exist at multiple subcellular locations, and further study is still greatly needed in this area. In this paper, we propose an ensemble learning framework that uses a modified Weighted K-Nearest Neighbors (WKNN) algorithm as the base learner. Two types of features are extracted from the training data: one based on protein amino acid composition (Amphiphilic Pseudo Amino Acid Composition, or AmPseAAC) and one based on protein sequence similarity (Protein Similarity Measure, or PSM). Two individual classifiers are trained separately on these two feature types, and each assigns a probability distribution over the locations to a query protein. Based on the outputs of the two base classifiers, a novel ensemble strategy named Maximized Probability on Label (MPoL) is proposed. The strategy produces a final set of locations for each protein by integrating the predictions of the base classifiers through an optimization procedure. To measure the prediction quality of the proposed approach, two types of evaluation metrics are used: example-based metrics and label-based metrics. To evaluate the performance of our approach objectively, we compare its results with those of another popular method, iLoc-Animal, on a benchmark dataset through cross-validation. The results show that, in terms of absolute true success rate on multi-location prediction, MPoL achieves much better results than iLoc-Animal. This suggests that the proposed method has potential for solving a diverse set of multi-label learning problems.
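The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the modified WKNN weighting or the MPoL optimization, so this sketch assumes inverse-distance weights, Euclidean distance on generic feature vectors, and a simple average-and-threshold rule as a stand-in for MPoL. The function names (`wknn_label_scores`, `combine_by_average`, `absolute_true`) and the parameters `k`, `eps`, and `threshold` are hypothetical. The last function computes the absolute true success rate as the fraction of proteins whose predicted location set matches the true set exactly.

```python
import math
from collections import defaultdict


def wknn_label_scores(query, train_X, train_Y, k=3, eps=1e-9):
    """Distance-weighted KNN for multi-label data.

    Returns {label: score}, where each score is the share of the total
    neighbour weight held by neighbours carrying that label (in [0, 1]).
    Inverse-distance weighting is an assumption, not the paper's scheme.
    """
    # Keep the k training examples closest to the query (Euclidean distance).
    neighbours = sorted(
        ((math.dist(query, x), labels) for x, labels in zip(train_X, train_Y)),
        key=lambda pair: pair[0],
    )[:k]
    weights = [1.0 / (d + eps) for d, _ in neighbours]
    total = sum(weights)
    scores = defaultdict(float)
    for w, (_, labels) in zip(weights, neighbours):
        for label in labels:
            scores[label] += w / total
    return dict(scores)


def combine_by_average(scores_a, scores_b, threshold=0.5):
    """Stand-in ensemble rule (NOT MPoL): average the two base classifiers'
    scores and keep every label whose average reaches the threshold.
    If no label qualifies, fall back to the single best-scoring label."""
    labels = set(scores_a) | set(scores_b)
    avg = {
        lab: 0.5 * (scores_a.get(lab, 0.0) + scores_b.get(lab, 0.0))
        for lab in labels
    }
    chosen = {lab for lab, s in avg.items() if s >= threshold}
    if not chosen and avg:
        chosen = {max(avg, key=avg.get)}
    return chosen


def absolute_true(predicted_sets, true_sets):
    """Example-based metric: fraction of examples whose predicted label set
    is exactly equal to the true label set."""
    hits = sum(1 for p, t in zip(predicted_sets, true_sets) if set(p) == set(t))
    return hits / len(true_sets)


if __name__ == "__main__":
    # Toy data: 2-D feature vectors with one or more location labels each.
    train_X = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    train_Y = [{"nucleus"}, {"nucleus", "cytoplasm"}, {"membrane"}]
    scores = wknn_label_scores((0.0, 0.2), train_X, train_Y, k=2)
    print(scores)
    print(combine_by_average(scores, scores))
```

In the setting of the abstract, one set of scores would come from a classifier trained on AmPseAAC features and the other from a classifier trained on PSM features; the averaging rule here only marks the place where the paper's optimization-based MPoL strategy would go.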

Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant No. 61302128) and the Science and Technology Foundation of University of Jinan (Grant No. XKY1402), and JL was supported in part by the National Science Foundation grant [III1162374] and the National Institutes of Health (HG008632).

Author information

Authors and Affiliations

School of Management Science and Engineering, Shandong Normal University, Jinan, 250014, China
Shanping Qiao
Shandong Provincial Key Laboratory of Network Based Intelligent Computing, School of Information Science and Engineering, University of Jinan, Jinan, 250022, China
Shanping Qiao
School of Mathematical Science, Shandong Normal University, Jinan, 250014, China
Baoqiang Yan
Department of Electrical Engineering and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH, 44106, USA
Jing Li

Corresponding author

Correspondence to Shanping Qiao.

About this article

Cite this article

Qiao, S., Yan, B. & Li, J. Ensemble learning for protein multiplex subcellular localization prediction based on weighted KNN with different features. Appl Intell 48, 1813–1824 (2018). https://doi.org/10.1007/s10489-017-1029-6

Issue Date: July 2018
DOI: https://doi.org/10.1007/s10489-017-1029-6
