LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Accurate detection of linear B-cell epitopes (BCEs) is important for vaccine design, immunodiagnostic testing, antibody production, and disease prevention and treatment. Wet-lab experiments for determining linear BCEs are expensive and laborious, and they cannot keep pace with the volume of protein sequence data now available. Computational methods, by contrast, can identify linear BCEs efficiently and at low cost. Although several computational methods are available, their performance is still not satisfactory. We therefore propose a new method, LBCE-XGB, which predicts linear BCEs using the XGBoost algorithm. To represent the biological information concealed in peptide sequences, residue embeddings were obtained from a pre-trained, domain-specific BERT model. In addition, five other types of features, including amino acid composition and an amino acid antigenicity scale, were extracted. The best feature combination was selected according to the cross-validation results. Compared with models built with other machine learning and deep learning algorithms, LBCE-XGB achieves the best performance, with an AUROC of 0.845 in fivefold cross-validation. On the independent test set, our model attains an AUROC of 0.838, which is substantially higher than that of other state-of-the-art methods. These results indicate that BERT representations can be an effective feature for predicting linear BCEs, and we believe that LBCE-XGB could be a useful tool for detecting linear B-cell epitopes with high accuracy and at low cost.
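The evaluation protocol named in the abstract (concatenated feature vectors, fivefold cross-validation, AUROC) can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration rather than the authors' pipeline: the peptides are synthetic Gaussian vectors, `make_peptide`, `fit_stand_in`, and the feature dimensions are hypothetical, and a simple centroid-difference scorer stands in for XGBoost so that the example runs without external packages.

```python
import random
from statistics import mean


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as 0.5 (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with a similar class ratio in each."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def fit_stand_in(X, y):
    """Stand-in classifier (NOT XGBoost): score a sample by its projection
    onto the difference between the positive and negative class centroids."""
    mu_pos = centroid([x for x, t in zip(X, y) if t == 1])
    mu_neg = centroid([x for x, t in zip(X, y) if t == 0])
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))


def cross_validated_auroc(X, y, k=5, seed=0):
    """Train on k-1 folds, score the held-out fold, and average the AUROCs."""
    aucs = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        score = fit_stand_in([X[i] for i in train], [y[i] for i in train])
        aucs.append(auroc([y[i] for i in held_out],
                          [score(X[i]) for i in held_out]))
    return mean(aucs)


# Synthetic data: each "peptide" is a hypothetical pooled embedding
# concatenated with hypothetical hand-crafted features.
rng = random.Random(42)


def make_peptide(label):
    embedding = [rng.gauss(0.8 * label, 1.0) for _ in range(8)]
    handcrafted = [rng.gauss(0.3 * label, 1.0) for _ in range(4)]
    return embedding + handcrafted  # feature concatenation


y = [1] * 100 + [0] * 100
X = [make_peptide(t) for t in y]
mean_auc = cross_validated_auroc(X, y)
print(f"mean fivefold AUROC on synthetic data: {mean_auc:.3f}")
```

To move closer to the setting described above, one would presumably replace `fit_stand_in` with a gradient-boosted tree classifier from the XGBoost library and replace the synthetic vectors with real residue embeddings; the fold construction and the AUROC computation would stay the same.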

Data availability

All datasets and source code for LBCE-XGB are publicly available at https://github.com/liuyf-a/LBCE-XGB.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant number 21403002), the Young Wanjiang Scholar Program of Anhui Province, and the Research Program of the Education Department of Anhui Province (YJS20210223).

Author information

Corresponding author

Correspondence to Xiaolei Zhu.

Ethics declarations

Conflict of interest

The authors have no competing interests.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 296 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Y., Liu, Y., Wang, S. et al. LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings. Interdiscip Sci Comput Life Sci 15, 293–305 (2023). https://doi.org/10.1007/s12539-023-00549-z

  • DOI: https://doi.org/10.1007/s12539-023-00549-z