Abstract
In de novo protein sequencing, top-down and bottom-up tandem mass spectrometry often yield only an incomplete protein sequence, called a scaffold. While most sections of a protein can be inferred from its homologous sequences, some sections are always missing, and the amino acids in these gaps of the scaffold are hard to predict. In this paper, we therefore focus on predicting only the gaps, using a probabilistic algorithm and machine learning models, instead of predicting the complete protein sequence with generative AI models. We study two versions of the protein scaffold filling problem: one with gaps of known size and one with gaps of known mass. For the known-size version, we develop several machine learning models based on random forest, k-nearest neighbors, decision tree, and fully connected neural network. For the known-mass version, we design a probabilistic algorithm to predict the missing amino acids in the gaps. Experimental results on both real and simulated data show that the proposed algorithms achieve promising accuracy of 100% and close to 100%.
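To illustrate why a gap of known mass is ambiguous, the following minimal Python sketch enumerates the amino acid compositions whose residue masses sum to a given gap mass. It is not the probabilistic algorithm of the paper, which the abstract does not detail. The function name `compositions_for_mass`, the tolerance `tol`, and the use of standard monoisotopic residue masses are assumptions made for illustration only.

```python
from typing import Dict, List

# Standard monoisotopic residue masses in daltons (assumed here for
# illustration; the abstract does not state which mass table is used).
RESIDUE_MASS: Dict[str, float] = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def compositions_for_mass(gap_mass: float, tol: float = 0.01) -> List[str]:
    """Return every multiset of residues whose total mass is within
    `tol` of `gap_mass`. Each multiset is reported once, as a string
    with residues ordered by increasing mass."""
    items = sorted(RESIDUE_MASS.items(), key=lambda kv: kv[1])
    results: List[str] = []

    def extend(start: int, remaining: float, chosen: List[str]) -> None:
        # A non-empty selection that uses up the mass is a valid filling.
        if chosen and abs(remaining) <= tol:
            results.append("".join(chosen))
            return
        for i in range(start, len(items)):
            residue, mass = items[i]
            if mass > remaining + tol:
                break  # all later residues are at least as heavy
            chosen.append(residue)
            extend(i, remaining - mass, chosen)  # reuse of residue i allowed
            chosen.pop()

    extend(0, gap_mass, [])
    return results


if __name__ == "__main__":
    # A gap of about 128.0586 Da fits both Gly+Ala and a single Gln.
    print(compositions_for_mass(128.05857))  # ['GA', 'Q']
```

Even for this small mass, two compositions match, and a composition of several residues can be ordered in more than one way. This is one reason the mass alone does not determine the filling, and why a method that ranks candidates by probability is useful.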
Acknowledgements
This work is supported by the NSF of the United States under Awards 2307571, 2307572 and 2307573. We also thank the anonymous reviewers for their insightful comments and input.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Badal, K., Qingge, L., Liu, X., Zhu, B. (2024). Probabilistic and Machine Learning Models for the Protein Scaffold Gap Filling Problem. In: Peng, W., Cai, Z., Skums, P. (eds) Bioinformatics Research and Applications. ISBRA 2024. Lecture Notes in Computer Science, vol 14956. Springer, Singapore. https://doi.org/10.1007/978-981-97-5087-0_3
DOI: https://doi.org/10.1007/978-981-97-5087-0_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-5086-3
Online ISBN: 978-981-97-5087-0
eBook Packages: Computer Science, Computer Science (R0)