Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model

Shen, Junbo; Yu, Qinze; Chen, Shenyang; Tan, Qingxiong; Li, Jingchen; Li, Yu

doi:10.1038/s43588-023-00576-2

Article
Published: 13 December 2023

Junbo Shen^1,2^ (ORCID: orcid.org/0009-0000-6259-7509),
Qinze Yu^1^,
Shenyang Chen^1,3,4^,
Qingxiong Tan^1^,
Jingchen Li^1^ &
Yu Li^1,3,5,6,7,8^ (ORCID: orcid.org/0000-0002-3664-6722)

Nature Computational Science volume 4, pages 29–42 (2024)

Abstract

Signal peptides (SPs) are essential to target and transfer transmembrane and secreted proteins to the correct positions. Many existing computational tools for predicting SPs disregard the extreme data imbalance problem and rely on additional group information of proteins. Here we introduce Unbiased Organism-agnostic Signal Peptide Network (USPNet), an SP classification and cleavage-site prediction deep learning method. Extensive experimental results show that USPNet substantially outperforms previous methods on classification performance by 10%. An SP-discovering pipeline with USPNet is designed to explore unprecedented SPs from metagenomic data. It reveals 347 SP candidates, with the lowest sequence identity between our candidates and the closest SP in the training dataset at only 13%. In addition, the template modeling scores between candidates and SPs in the training set are mostly above 0.8. The results showcase that USPNet has learnt the SP structure with raw amino acid sequences and the large protein language model, thereby enabling the discovery of unknown SPs.
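To make the sequence-identity figure concrete, the following is a minimal sketch of how the identity between a candidate and its closest training sequence could be computed. It is an illustration under stated assumptions, not the pipeline used in the paper: the alignment is a plain Needleman-Wunsch global alignment with unit scores, identity is defined as matches divided by alignment length (other definitions exist), and the function names `global_align`, `percent_identity` and `closest_identity` are hypothetical.

```python
from typing import List, Tuple


def global_align(a: str, b: str, match: int = 1,
                 mismatch: int = -1, gap: int = -1) -> Tuple[str, str]:
    """Needleman-Wunsch global alignment with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identity = identical aligned residues / alignment length, in percent."""
    x, y = global_align(a, b)
    if not x:
        return 0.0
    matches = sum(1 for p, q in zip(x, y) if p == q and p != "-")
    return 100.0 * matches / len(x)


def closest_identity(candidate: str, training: List[str]) -> Tuple[float, str]:
    """Return (identity, sequence) for the training sequence most similar to the candidate."""
    return max((percent_identity(candidate, t), t) for t in training)


if __name__ == "__main__":
    # Toy sequences for illustration only; they are not real signal peptides.
    identity, nearest = closest_identity("MKT", ["CCCC", "MKKT"])
    print(f"closest training sequence: {nearest} ({identity:.1f}% identity)")
```

A candidate whose best identity over the whole training set is low, as with the 13% case reported above, shares little with any training sequence at the residue level, which is why the structural comparison by template modeling score is reported alongside it.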

**Fig. 1: USPNet workflow for predicting SP and cleavage site.**

**Fig. 2: USPNet shows robust performance across different SP types and organism groups.**

**Fig. 3: Embedding and ablation study performance analysis of USPNet compared with alternative models.**

**Fig. 4: Performance of USPNet on domain-shift data.**

**Fig. 5: The exploration of metagenomics data for SP discovery.**

Data availability

All the datasets we used, including training data, benchmark data, independent test data, proteome-wide study results and metagenomic study results, are listed in Methods and are available at https://doi.org/10.17605/OSF.IO/NH3CF (ref. 49). Source data are provided with this paper.

Code availability

The open-source code of USPNet is available at https://github.com/ml4bio/USPNet and in the Code Ocean software capsule at https://doi.org/10.24433/CO.8184163.v1 (ref. 50).

References

1. von Heijne, G. Life and death of a signal peptide. Nature 396, 111–113 (1998).
2. Heijne, G. V. The signal peptide. J. Membr. Biol. 115, 195–201 (1990).
3. Bradshaw, N., Neher, S. B., Booth, D. S. & Walter, P. Signal sequences activate the catalytic switch of SRP RNA. Science 323, 127–130 (2009).
4. von Heijne, G. Patterns of amino acids near signal-sequence cleavage sites. Eur. J. Biochem. 133, 17–21 (1983).
5. Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
6. Frank, K. & Sippl, M. J. High-performance signal peptide prediction based on sequence alignment techniques. Bioinformatics 24, 2172–2176 (2008).
7. Petersen, T. N., Brunak, S., Von Heijne, G. & Nielsen, H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods 8, 785–786 (2011).
8. Savojardo, C., Martelli, P. L., Fariselli, P. & Casadio, R. DeepSig: deep learning improves signal peptide detection in proteins. Bioinformatics 10, 1690–1696 (2017).
9. Armenteros, J. J. A. et al. SignalP 5.0 improves signal peptide predictions using deep neural networks. Nat. Biotechnol. 37, 420–423 (2019).
10. Teufel, F. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 40, 1023–1025 (2022).
11. Juncker, A. S. et al. Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. 12, 1652–1662 (2003).
12. Bagos, P. G., Tsirigos, K. D., Liakopoulos, T. D. & Hamodrakas, S. J. Prediction of lipoprotein signal peptides in Gram-positive bacteria with a hidden Markov model. J. Proteome Res. 7, 5082–5093 (2008).
13. Bendtsen, J. D., Nielsen, H., Widdick, D., Palmer, T. & Brunak, S. Prediction of twin-arginine signal peptides. BMC Bioinformatics 6, 167 (2005).
14. Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023–1024 (2017).
15. Sczyrba, A. et al. Critical assessment of metagenome interpretation – a benchmark of metagenomics software. Nat. Methods 14, 1063–1071 (2017).
16. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
17. Rao, R. M. et al. MSA transformer. In Proc. 38th International Conference on Machine Learning, Proc. Machine Learning Research Vol. 139 (eds Meila, M. & Zhang, T.) 8844–8856 (PMLR, 2021).
18. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M. & Church, G. M. Low-N protein engineering with data-efficient deep learning. Nat. Methods 18, 389–396 (2021).
19. Thireou, T. & Reczko, M. Bidirectional long short-term memory networks for predicting the subcellular localization of eukaryotic proteins. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 441–446 (2007).
20. Cao, K., Wei, C., Gaidon, A., Arechiga, N. & Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Process. Syst. 32, 1567–1578 (2019).
21. Mnih, V. et al. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 27, 2204–2212 (2014).
22. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
23. Lin, T.-Y., Goyal, P., Girshick, R., He, K. & Dollár, P. Focal loss for dense object detection. In Proc. IEEE International Conference on Computer Vision 2980–2988 (IEEE, 2017).
24. Elnaggar, A. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2021).
25. Armenteros, J. J. A. et al. Detecting sequence signals in targeting peptides using deep learning. Life Sci. Alliance 2, e201900429 (2019).
26. Ma, Y. et al. Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nat. Biotechnol. 40, 921–931 (2022).
27. Han, S. et al. Novel signal peptides improve the secretion of recombinant Staphylococcus aureus alpha toxin_H35L in Escherichia coli. AMB Express 7, 93 (2017).
28. Consortium, T. U. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 51, D523–D531 (2022).
29. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
30. Consortium, U. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
31. Sigrist, C. J. et al. New and continuing developments at PROSITE. Nucleic Acids Res. 41, D344–D347 (2012).
32. Dobson, L., Lango, T., Reményi, I. & Tusnády, G. E. Expediting topology data gathering for the TOPDB database. Nucleic Acids Res. 43, D283–D289 (2015).
33. Gíslason, M. H., Nielsen, H., Armenteros, J. J. A. & Johansen, A. R. Prediction of GPI-anchored proteins with pointer neural networks. Curr. Res. Biotechnol. 3, 6–13 (2021).
34. Li, W. & Godzik, A. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
35. Youngblut, N. D. et al. Large-scale metagenome assembly reveals novel animal-associated microbial genomes, biosynthetic gene clusters, and other genetic diversity. mSystems 5, e01045-20 (2020).
36. Looft, T., Bayles, D., Alt, D. & Stanton, T. Complete genome sequence of Coriobacteriaceae strain 68-1-3, a novel mucus-degrading isolate from the swine intestinal tract. Genome Announc. 3, e01143-15 (2015).
37. Zhou, S. et al. Characterization of metagenome-assembled genomes and carbohydrate-degrading genes in the gut microbiota of Tibetan pig. Front. Microbiol. 11, 595066 (2020).
38. Chen, C. et al. Prevotella copri increases fat accumulation in pigs fed with formula diets. Microbiome 9, 175 (2021).
39. Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053–2067 (2021).
40. Tilocca, B. et al. Dietary changes in nutritional studies shape the structural and functional composition of the pigs' fecal microbiome – from days to weeks. Microbiome 5, 144 (2017).
41. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
42. Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).
43. Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
44. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics 20, 473 (2019).
45. McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. Preprint at arXiv https://doi.org/10.48550/arXiv.1802.03426 (2018).
46. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
47. DeLano, W. L. et al. PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 40, 82–92 (2002).
48. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
49. Shen, J. et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. OSF https://doi.org/10.17605/OSF.IO/NH3CF (2023).
50. Shen, J. et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. Code Ocean https://doi.org/10.24433/CO.8184163.v1 (2023).

Acknowledgements

Special thanks to the people who suggested that we evaluate models on the 40% cut-off benchmark set. The work described in this paper was supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (project number CUHK 24204023, to Y.L.) and a grant from the Innovation and Technology Commission of the Hong Kong Special Administrative Region, China (project number GHP/065/21SZ, to Y.L.). The work was partially supported by the National Key R&D Program of China (no. 2022ZD0160101).

Author information

These authors contributed equally: Junbo Shen, Qinze Yu, Shenyang Chen.

Authors and Affiliations

Department of Computer Science and Engineering, CUHK, Hong Kong SAR, China
Junbo Shen, Qinze Yu, Shenyang Chen, Qingxiong Tan, Jingchen Li & Yu Li
Department of Computer Science and Engineering, Washington University, St. Louis, MO, US
Junbo Shen
The CUHK Shenzhen Research Institute, Shenzhen, China
Shenyang Chen & Yu Li
Georgia Institute of Technology, Atlanta, GA, US
Shenyang Chen
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Yu Li
Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
Yu Li
Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, USA
Yu Li
Broad Institute of MIT and Harvard, Cambridge, MA, USA
Yu Li

Contributions

Y.L., J.S. and S.C. designed the computational method. J.S., Q.Y. and S.C. implemented the main algorithm. J.S., Q.Y., S.C., Q.T. and J.L. did the experiments. J.S. and Q.Y. performed the analysis. J.S., Q.Y. and S.C. wrote the paper. Y.L. supervised the project. All authors read and approved the paper.

Corresponding author

Correspondence to Yu Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Rita Casadio and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.

Additional information

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–7, Supplementary Figs. 1–6 and Tables 1–21.

Reporting Summary

Peer Review File

Source data

Source Data Fig. 2

Statistical source data.

Source Data Fig. 3

Statistical source data.

Source Data Fig. 4

Statistical source data.

Source Data Fig. 5

Statistical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Shen, J., Yu, Q., Chen, S. et al. Unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model. Nat Comput Sci 4, 29–42 (2024). https://doi.org/10.1038/s43588-023-00576-2

Received: 24 July 2023
Accepted: 22 November 2023
Published: 13 December 2023
Issue Date: January 2024
DOI: https://doi.org/10.1038/s43588-023-00576-2