LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings

Liu, Yufeng; Liu, Yinbo; Wang, Shuyu; Zhu, Xiaolei

doi:10.1007/s12539-023-00549-z

Original research article
Published: 16 January 2023

Volume 15, pages 293–305 (2023)

Interdisciplinary Sciences: Computational Life Sciences

Yufeng Liu¹,
Yinbo Liu¹,
Shuyu Wang¹ &
…
Xiaolei Zhu ORCID: orcid.org/0000-0002-1967-2806¹

Abstract

Accurate identification of linear B-cell epitopes (BCEs) is important for vaccine design, immunodiagnostic testing, antibody production, and disease prevention and treatment. Wet-lab experiments for determining linear BCEs are expensive and laborious, and they cannot keep pace with the volume of protein sequence data now available. Computational methods, by contrast, can identify linear BCEs efficiently and at low cost. Although several computational methods are available, their performance is still unsatisfactory. We therefore propose a new method, LBCE-XGB, which predicts linear BCEs using the XGBoost algorithm. To capture the biological information contained in peptide sequences, residue embeddings were obtained from a pre-trained, domain-specific BERT model. Five other feature types, including amino acid composition and the amino acid antigenicity scale, were also extracted. The best feature combination was selected according to the cross-validation results. Compared with models built with other deep learning and machine learning algorithms, LBCE-XGB achieves the best performance, with an AUROC of 0.845 in fivefold cross-validation. On the independent test set, our model attains an AUROC of 0.838, which is substantially higher than that of other state-of-the-art methods. These results indicate that BERT representations can be an effective feature for predicting linear BCEs, and we believe that LBCE-XGB could be a useful tool for detecting linear B-cell epitopes with high accuracy and at low cost.
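As a rough illustration of two ingredients named in the abstract, the sketch below computes an amino acid composition feature vector for a peptide and a rank-based AUROC for a set of scored predictions. It is not the authors' implementation: the function names, the restriction to the 20 standard residues, and the toy peptides and scores are all assumptions made for illustration, and the BERT embeddings and XGBoost classifier of LBCE-XGB are not reproduced here.

```python
from collections import Counter

# Assumption: only the 20 standard residues contribute to the composition.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(peptide):
    """Return a 20-dimensional vector of residue fractions for a peptide."""
    counts = Counter(peptide.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("peptide contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def auroc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Tied pairs count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical peptide and scores, for illustration only.
    features = amino_acid_composition("ACDKLVAA")
    print(len(features), round(sum(features), 6))
    print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a pipeline of the kind the abstract describes, vectors such as this would be concatenated with the other feature types before training, and the AUROC would be averaged over the five cross-validation folds. The pairwise AUROC above is quadratic in the number of samples, so it suits small checks better than large datasets.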

Data availability

All datasets and source code for LBCE-XGB are publicly available at https://github.com/liuyf-a/LBCE-XGB.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant number 21403002), the Young Wanjiang Scholar Program of Anhui Province, and the Research Program of the Education Department of Anhui Province (YJS20210223).

Author information

Authors and Affiliations

School of Sciences, Anhui Agricultural University, Hefei, 230036, Anhui, China
Yufeng Liu, Yinbo Liu, Shuyu Wang & Xiaolei Zhu

Corresponding author

Correspondence to Xiaolei Zhu.

Ethics declarations

Conflict of interest

The authors have no competing interests.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 296 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Y., Liu, Y., Wang, S. et al. LBCE-XGB: A XGBoost Model for Predicting Linear B-Cell Epitopes Based on BERT Embeddings. Interdiscip Sci Comput Life Sci 15, 293–305 (2023). https://doi.org/10.1007/s12539-023-00549-z

Received: 14 September 2022
Revised: 28 December 2022
Accepted: 03 January 2023
Published: 16 January 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s12539-023-00549-z
