Abstract
Accurate detection of linear B-cell epitopes (BCEs) is important for vaccine design, immunodiagnostic testing, antibody production, and disease prevention and treatment. Wet-lab experiments for determining linear BCEs are expensive and laborious, and they cannot keep pace with the volume of protein sequence data now available. Computational methods, by contrast, can identify linear BCEs efficiently and at low cost. Several computational methods exist, but their performance is still unsatisfactory. We therefore propose a new method, LBCE-XGB, which predicts linear BCEs using the XGBoost algorithm. To capture the biological information contained in peptide sequences, residue embeddings were obtained from a pre-trained, domain-specific BERT model. Five other feature types, including amino acid composition and an amino acid antigenicity scale, were also extracted. The best feature combination was selected according to the cross-validation results. Compared with models built with other deep learning and machine learning algorithms, LBCE-XGB achieves the best performance, with an AUROC of 0.845 in fivefold cross-validation. On the independent test set, our model attains an AUROC of 0.838, which is substantially higher than that of other state-of-the-art methods. These results indicate that BERT representations can be an effective feature for predicting linear BCEs, and we believe that LBCE-XGB could be a useful tool for detecting linear B-cell epitopes with high accuracy and at low cost.
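Two quantities named in the abstract, amino acid composition and AUROC, can be illustrated with a minimal sketch. This is not the LBCE-XGB implementation: the function names `amino_acid_composition` and `auroc` are hypothetical, the example peptide and scores are toy values, and the sketch assumes that composition means the fraction of each of the 20 standard residues and that AUROC is computed by pairwise ranking of positives against negatives.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)


def amino_acid_composition(peptide):
    """Return a 20-dimensional vector with the fraction of each standard residue.

    Residues outside the 20-letter alphabet count toward the length
    but have no dimension of their own.
    """
    peptide = peptide.upper()
    if not peptide:
        raise ValueError("peptide must not be empty")
    counts = Counter(peptide)
    return [counts[aa] / len(peptide) for aa in AMINO_ACIDS]


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    A tie between a positive and a negative score counts as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy peptide: half alanine, one lysine, one aspartate.
    print(amino_acid_composition("AAKD"))
    # Toy labels (1 = epitope) and classifier scores, for illustration only.
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In the toy AUROC call, three of the four positive-negative pairs are ranked correctly, so the value is 0.75. The pairwise loop is quadratic in the number of examples; for real datasets a library routine would be the practical choice.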
Data availability
All datasets and source code for LBCE-XGB are publicly available at https://github.com/liuyf-a/LBCE-XGB.
Acknowledgements
This work was supported by the National Natural Science Foundation of China [grant number: 21403002], the Young Wanjiang Scholar Program of Anhui Province, and the Research Program of the Education Department of Anhui Province (YJS20210223).
Ethics declarations
Conflict of interest
The authors have no competing interests.
About this article
Cite this article
Liu, Y., Liu, Y., Wang, S. et al. LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings. Interdiscip Sci Comput Life Sci 15, 293–305 (2023). https://doi.org/10.1007/s12539-023-00549-z
DOI: https://doi.org/10.1007/s12539-023-00549-z