Abstract
Recent, rapid advances in deep generative models for protein design have focused on small proteins with abundant sequence data. Such models perform poorly on large proteins with few natural sequences, for instance, the capsid proteins of adenoviruses and adeno-associated viruses, which are common delivery vehicles for gene therapy. Generating synthetic viral vector serotypes could overcome the potent pre-existing immune responses that most gene therapy recipients exhibit as a consequence of previous environmental exposure. We present a variational autoencoder (ProteinVAE) that can generate synthetic viral vector serotypes without epitopes for pre-existing neutralizing antibodies. A pre-trained protein language model was incorporated into the encoder to improve data efficiency, and deconvolution-based upsampling was used for decoding to avoid the degenerate repetition seen in long protein sequence generation. ProteinVAE is a compact generative model with just 12.4 million parameters and was efficiently trained on the limited natural sequences. The generated viral protein sequences were used to produce structures with thermodynamic stability and viral assembly capability indistinguishable from those of natural vector counterparts. ProteinVAE can be used to generate a broad range of synthetic serotype sequences without epitopes for pre-existing neutralizing antibodies in the human population, effectively addressing one of the major challenges of gene therapy. It could be used more broadly to generate different types of viral vector, and any large, therapeutically valuable protein for which available data are sparse.
Data availability
Sequences of all 711 natural hexons can be found at /data/hexon_711.fasta in the CodeOcean capsule (https://doi.org/10.24433/CO.2530457.v2 (ref. 91)). All natural hexon sequences were downloaded from the UniProtKB database (refs. 26,37). Source data are provided with this paper.
Code availability
The code is provided at https://doi.org/10.24433/CO.2530457.v2 (ref. 91). ProtBert is used for extracting embeddings, and its code can be accessed at https://huggingface.co/Rostlab/prot_bert.
References
Vokinger, K. N., Glaus, C. E. G. & Kesselheim, A. S. Approval and therapeutic value of gene therapies in the US and Europe. Gene Ther. 30, 756–760 (2023).
Mendell, J. R. et al. Single-dose gene-replacement therapy for spinal muscular atrophy. N. Engl. J. Med. 377, 1713–1722 (2017).
Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020).
Seregin, S. S. & Amalfitano, A. Overcoming pre-existing adenovirus immunity by genetic engineering of adenovirus-based vectors. Expert Opin. Biol. Ther. 9, 1521–1531 (2009).
Verdera, H. C., Kuranda, K. & Mingozzi, F. AAV vector immunogenicity in humans: a long journey to successful gene transfer. Mol. Ther. 28, 723–746 (2020).
Zhao, Z., Anselmo, A. C. & Mitragotri, S. Viral vector-based gene therapies in the clinic. Bioeng. Transl. Med. 7, e10258 (2022).
Bulcha, J. T., Wang, Y., Ma, H., Tai, P. W. & Gao, G. Viral vector platforms within the gene therapy landscape. Signal Transduct. Target. Ther. 6, 1–24 (2021).
Bouvet, M. et al. Adenovirus-mediated wild-type p53 tumor suppressor gene therapy induces apoptosis and suppresses growth of human pancreatic cancer. Ann. Surg. Oncol. 5, 681–688 (1998).
Chillon, M. et al. Group D adenoviruses infect primary central nervous system cells more efficiently than those from group C. J. Virol. 73, 2537–2540 (1999).
Stevenson, S. C., Rollence, M., Marshall-Neff, J. & McClelland, A. Selective targeting of human cells by a chimeric adenovirus vector containing a modified fiber protein. J. Virol. 71, 4782–4790 (1997).
Xiang, Z. et al. Chimpanzee adenovirus antibodies in humans, sub-Saharan Africa. Emerg. Infect. Dis. 12, 1596 (2006).
D'Ambrosio, E., Del Grosso, N., Chicca, A. & Midulla, M. Neutralizing antibodies against 33 human adenoviruses in normal children in Rome. Epidemiol. Infect. 89, 155–161 (1982).
Sumida, S. M. et al. Neutralizing antibodies to adenovirus serotype 5 vaccine vectors are directed primarily against the adenovirus hexon protein. J. Immunol. 174, 7179–7185 (2005).
Lee, C. S. et al. Adenovirus-mediated gene delivery: potential applications for gene and cell-based therapies in the new era of personalized medicine. Genes Dis. 4, 43–63 (2017).
Ogden, P. J., Kelsic, E. D., Sinai, S. & Church, G. M. Comprehensive AAV capsid fitness landscape reveals a viral gene and enables machine-guided design. Science 366, 1139–1143 (2019).
Sarkisyan, K. S. et al. Local fitness landscape of the green fluorescent protein. Nature 533, 397–401 (2016).
Castro, E. et al. Transformer-based protein generation with regularized latent space optimization. Nat. Mach. Intell. 4, 840–851 (2022).
Ding, X., Zou, Z. & Brooks, C. L. III. Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10, 5644 (2019).
Hawkins-Hooker, A. et al. Generating functional protein variants with variational autoencoders. PLoS Comput. Biol. 17, e1008736 (2021).
Nijkamp, E., Ruffolo, J. A., Weinstein, E. N., Naik, N. & Madani, A. ProGen2: exploring the boundaries of protein language models. Cell Syst. 14, 968–978 (2023).
Repecka, D. et al. Expanding functional protein sequence spaces using generative adversarial networks. Nat. Mach. Intell. 3, 324–333 (2021).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Sevgen, E. et al. ProT-VAE: Protein Transformer Variational AutoEncoder for functional protein design. Preprint at bioRxiv https://doi.org/10.1101/2023.01.23.525232 (2023).
Sinai, S., Jain, N., Church, G. M. & Kelsic, E. D. Generative AAV capsid diversification by latent interpolation. Preprint at bioRxiv https://doi.org/10.1101/2021.04.16.440236 (2021).
Dhingra, A. et al. Molecular evolution of human adenovirus (HAdV) species C. Sci. Rep. 9, 1039 (2019).
The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Bejani, M. M. & Ghatee, M. A systematic review on overfitting control in shallow and deep neural networks. Artif. Intell. Rev. 54, 6391–6438 (2021).
Montero, I., Pappas, N. & Smith, N. A. Sentence bottleneck autoencoders from transformer language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (2021).
Khandelwal, U., Clark, K., Jurafsky, D. & Kaiser, L. Sample efficient text summarization using a single pre-trained transformer. Preprint at https://arxiv.org/abs/1905.08836 (2019).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life's code through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 7112–7127 (2022).
Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The curious case of neural text degeneration. In International Conference on Learning Representations (2019).
Tan, B., Yang, Z., Al-Shedivat, M., Xing, E. P. & Hu, Z. Progressive generation of long text with pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2021).
Semeniuta, S., Severyn, A. & Barth, E. A hybrid convolutional variational autoencoder for text generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017).
Iandola, F. et al. DenseNet: implementing efficient ConvNet descriptor pyramids. Preprint at https://arxiv.org/abs/1404.1869 (2014).
Bahir, I., Fromer, M., Prat, Y. & Linial, M. Viral adaptation to host: a proteome-based analysis of codon usage and amino acid preferences. Mol. Syst. Biol. 5, 311 (2009).
Hanson, J., Paliwal, K., Litfin, T., Yang, Y. & Zhou, Y. Improving prediction of protein secondary structure, backbone angles, solvent accessibility and contact numbers by using predicted contact maps and an ensemble of recurrent and residual convolutional neural networks. Bioinformatics 35, 2403–2410 (2019).
Boutet, E. et al. UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view. In Plant Bioinformatics: Methods and Protocols Vol. 1374 (ed. Edwards, D.) (Humana Press, 2016).
Ferruz, N., Schmidt, S. & Höcker, B. ProtGPT2 is a deep unsupervised language model for protein design. Nat. Commun. 13, 4348 (2022).
Hsu, C. et al. Learning inverse folding from millions of predicted structures. In Proc. Int. Conf. Mach. Learn. (eds Chaudhuri, K. et al.) 8946–8970 (PMLR, 2022).
Jeliazkov, J. R., del Alamo, D. & Karpiak, J. D. ESMFold hallucinates native-like protein sequences. Preprint at bioRxiv https://doi.org/10.1101/2023.05.23.541774 (2023).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Sinai, S., Kelsic, E., Church, G. M. & Nowak, M. A. Variational auto-encoding of protein sequences. Preprint at https://arxiv.org/abs/1712.03346 (2017).
Santoni, D., Felici, G. & Vergni, D. Natural vs. random protein sequences: discovering combinatorics properties on amino acid words. J. Theor. Biol. 391, 13–20 (2016).
Zheng, L. et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Preprint at https://arxiv.org/abs/2306.05685 (2023).
Wang, Y. et al. How far can camels go? Exploring the state of instruction tuning on open resources. Preprint at https://arxiv.org/abs/2306.04751 (2023).
Li, R., Patel, T. & Du, X. PRD: peer rank and discussion improve large language model based evaluations. Preprint at https://arxiv.org/abs/2307.02762 (2023).
Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
Jorda, J., Xue, B., Uversky, V. N. & Kajava, A. V. Protein tandem repeats: the more perfect, the less structured. FEBS J. 277, 2673–2682 (2010).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Drew, E. D. & Janes, R. W. PDBMD2CD: providing predicted protein circular dichroism spectra from multiple molecular dynamics-generated protein structures. Nucleic Acids Res. 48, W17–W24 (2020).
Echave, J., Spielman, S. J. & Wilke, C. O. Causes of evolutionary rate variation among protein sites. Nat. Rev. Genet. 17, 109–121 (2016).
Franzosa, E. A. & Xia, Y. Structural determinants of protein evolution are context-sensitive at the residue level. Mol. Biol. Evol. 26, 2387–2395 (2009).
Madisch, I., Harste, G., Pommer, H. & Heim, A. Phylogenetic analysis of the main neutralization and hemagglutination determinants of all human adenovirus prototypes as a basis for molecular classification and taxonomy. J. Virol. 79, 15265–15276 (2005).
Youil, R. et al. Hexon gene switch strategy for the generation of chimeric recombinant adenovirus. Hum. Gene Ther. 13, 311–320 (2002).
Roberts, A., Engel, J., Raffel, C., Hawthorne, C. & Eck, D. A hierarchical latent vector model for learning long-term structure in music. In Proc. Int. Conf. Mach. Learn. (eds Dy, J. & Krause, A.) 4364–4373 (PMLR, 2018).
Wang, R. E., Durmus, E., Goodman, N. & Hashimoto, T. Language modeling via stochastic processes. In International Conference on Learning Representations (2021).
Russ, W. P. et al. An evolution-based model for designing chorismate mutase enzymes. Science 369, 440–445 (2020).
Goodfellow, I. et al. Generative adversarial networks. Commun. ACM 63, 139–144 (2020).
Madani, A. et al. Large language models generate functional protein sequences across diverse families. Nat. Biotechnol. 41, 1099–1106 (2023).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Bowman, S. R. et al. Generating sentences from a continuous space. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (ACL, 2016).
Shao, H. et al. ControlVAE: controllable variational autoencoder. In Proc. Int. Conf. Mach. Learn. (eds Daumé, H. III & Singh, A.) 8655–8664 (PMLR, 2020).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (2019).
Falcon, W. & The PyTorch Lightning team. PyTorch Lightning. Zenodo https://doi.org/10.5281/zenodo.3828935 (2019).
Biewald, L. Experiment tracking with weights and biases. Weights & Biases https://www.wandb.com/ (2020).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR '15) (2015).
Smith, L. N. & Topin, N. Super-convergence: very fast training of neural networks using large learning rates. In Proc. Vol. 11006. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications 369–386 (SPIE, 2019).
Detlefsen, N. S. et al. TorchMetrics: measuring reproducibility in PyTorch. J. Open Source Softw. 7, 4101 (2022).
Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145 (2018).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Etherington, T. R. Mahalanobis distances and ecological niche modelling: correcting a chi-squared probability error. PeerJ 7, e6678 (2019).
Mahalanobis, P. C. On the generalized distance in statistics. Proc. Natl Inst. Sci. India 2, 49–55 (1936).
Teich, J. Pareto-front exploration with uncertain objectives. In International Conference on Evolutionary Multi-Criterion Optimization (eds Zitzler, E., Thiele, L., Deb, K., Coello Coello, C. A. & Corne, D.) 314–328 (Springer, 2001).
Mitternacht, S. FreeSASA: an open source C library for solvent accessible surface area calculations. F1000Research 5, 189 (2016).
Zimmerman, D. W. A note on preliminary tests of equality of variances. Br. J. Math. Stat. Psychol. 57, 173–181 (2004).
Vallat, R. Pingouin: statistics in Python. J. Open Source Softw. 3, 1026 (2018).
Abdi, H. & Williams, L. J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 433–459 (2010).
Jelinek, F., Mercer, R. L., Bahl, L. R. & Baker, J. K. Perplexity: a measure of the difficulty of speech recognition tasks. J. Acoust. Soc. Am. 62, S63 (1977).
Lee, J. et al. CHARMM-GUI input generator for NAMD, GROMACS, AMBER, OpenMM, and CHARMM/OpenMM simulations using the CHARMM36 additive force field. J. Chem. Theory Comput. 12, 405–413 (2016).
Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983).
Darden, T., York, D. & Pedersen, L. Particle mesh Ewald: an N·log(N) method for Ewald sums in large systems. J. Chem. Phys. 98, 10089–10092 (1993).
Essmann, U. et al. A smooth particle mesh Ewald method. J. Chem. Phys. 103, 8577–8593 (1995).
Hess, B. P-LINCS: a parallel linear constraint solver for molecular simulation. J. Chem. Theory Comput. 4, 116–122 (2008).
Hoover, W. G. Canonical dynamics: equilibrium phase-space distributions. Phys. Rev. A 31, 1695 (1985).
Parrinello, M. & Rahman, A. Polymorphic transitions in single crystals: a new molecular dynamics method. J. Appl. Phys. 52, 7182–7190 (1981).
Huang, J. et al. CHARMM36m: an improved force field for folded and intrinsically disordered proteins. Nat. Methods 14, 71–73 (2017).
Lindahl, E., Abraham, M. J., Hess, B. & van der Spoel, D. GROMACS 2021.3 Source code. Zenodo https://doi.org/10.5281/zenodo.5053201 (2021).
Tomasello, G., Armenia, I. & Molla, G. The Protein Imager: a full-featured online molecular viewer interface with server-side HQ-rendering capabilities. Bioinformatics 36, 2909–2911 (2020).
Lyu, S., Sowlati-Hashjin, S. & Garton, M. ProteinVAE: variational autoencoder for design of synthetic viral vector serotypes. Code Ocean https://doi.org/10.24433/CO.2530457.v2 (2023).
Acknowledgements
We thank Z. Wen for engaging in discussions and sharing ideas pertaining to the application of protein language models in the context of this research. This work was supported by grants from the Canadian Institutes of Health Research (CIHR) and the Natural Sciences and Engineering Research Council of Canada. We also thank SciNet and the Digital Research Alliance of Canada for providing essential computing resources, without which this study could not have been conducted.
Author information
Authors and Affiliations
Contributions
M.G. and S.L. conceived the project. S.L. designed the generative model, performed all experiments and analysed the results. S.S.-H. conducted molecular dynamics simulations, analysed simulation results with S.L.'s assistance and contributed to the corresponding section. S.L. and M.G. wrote, revised and edited the paper. M.G. supervised the project.
Corresponding author
Ethics declarations
Competing interests
The University of Toronto is in the process of filing for a patent on this method. All authors declare that there are no competing interests aside from the patent pending.
Peer review
Peer review information
Nature Machine Intelligence thanks Jinwoo Leem and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Detailed Architecture of Encoder and Decoder CNN.
(a) The encoder CNN uses a series of dilated 3 × 3 convolution layers along the sequence-length dimension to reduce the dimensionality of the pretrained language model's amino-acid-level embeddings. The flattened matrix is then transformed to the same length as the latent size of the pretrained language model embeddings, to be used as the query in bottleneck attention. (b) The decoder CNN uses 8 UpBlocks to upsample the VAE latent vector from a length of 1 to the maximal sequence length. In each UpBlock, a 1 × 1 convolutional layer transforms the input to a lower dimension, which reduces the number of parameters needed in the following layer with large kernels. A dilated 3 × 3 deconvolutional layer with a stride of 2 then upsamples the low-dimensional input. To prevent vanishing gradients, the input is also passed through a linear layer to obtain an identity matrix (T) of the same length as the deconvolutional output (U). The upsampled matrix U and the identity matrix T are then concatenated as the input for the following UpBlock. The output of the final UpBlock is transformed to the decoder hidden dimension with another 1 × 1 convolutional layer.
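The UpBlock described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: dilation is omitted, and the module name, channel sizes and sequence lengths are hypothetical.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Sketch of one decoder upsampling block: a 1x1 convolution reduces
    channels, a stride-2 deconvolution doubles the sequence length, and a
    linear identity path is concatenated with the upsampled output."""

    def __init__(self, in_ch: int, mid_ch: int, out_ch: int, in_len: int):
        super().__init__()
        # 1x1 convolution to a lower channel dimension
        self.reduce = nn.Conv1d(in_ch, mid_ch, kernel_size=1)
        # stride-2 deconvolution doubles the length dimension
        self.deconv = nn.ConvTranspose1d(
            mid_ch, out_ch, kernel_size=3, stride=2,
            padding=1, output_padding=1)
        # identity path: map the input length to the upsampled length
        self.identity = nn.Linear(in_len, 2 * in_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, in_len)
        u = self.deconv(self.reduce(x))   # (batch, out_ch, 2 * in_len)
        t = self.identity(x)              # (batch, in_ch, 2 * in_len)
        return torch.cat([u, t], dim=1)   # concatenate along channels

x = torch.randn(2, 64, 8)
y = UpBlock(64, 16, 32, 8)(x)
print(y.shape)  # torch.Size([2, 96, 16])
```

Chaining 8 such blocks, as in the caption, would expand a length-1 latent vector by a factor of 2^8 = 256 along the sequence dimension.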
Extended Data Fig. 2 Helix-to-strand Ratio in Sequences Generated by Base and Final version of ProteinVAE.
(a) Helix-to-strand ratio for natural hexons (n = 711). (b) Helix-to-strand ratio for sequences generated using the final version of the ProteinVAE model (n = 1000). (c) Helix-to-strand ratio for sequences generated using the base version of the ProteinVAE model (n = 1000). The secondary-structure state was predicted directly from sequence using SPOT-1D.
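As a rough illustration of how such a ratio could be computed from a 3-state secondary-structure prediction (for example, SPOT-1D output collapsed to H/E/C states), here is a small Python sketch; the function name and example string are hypothetical.

```python
def helix_strand_ratio(ss: str) -> float:
    """Ratio of helix (H) to strand (E) residues in a 3-state
    secondary-structure string (H = helix, E = strand, C = coil)."""
    helix = ss.count("H")
    strand = ss.count("E")
    # avoid division by zero for all-helix / no-strand predictions
    return helix / strand if strand else float("inf")

# hypothetical 3-state prediction for a short fragment
ss = "CCHHHHHCCEEEECCHHHCC"
print(helix_strand_ratio(ss))  # 2.0 (8 helix vs 4 strand residues)
```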
Extended Data Fig. 3 Hypervariable regions in Natural and ProteinVAE-generated Sequences.
Sequence logos of all 7 hypervariable regions for the MSA of natural sequences and the MSA of ProteinVAE-generated sequences. Both MSAs have similar amino acid usage in the majority of columns.
Extended Data Fig. 4 Molecular Dynamics Representative Structures.
Each column shows a hexon homotrimer from one hexon sequence. Side, top and bottom views of all structures are shown in the first, second and third rows, respectively. Red, green and blue colouring represents the different subunits of the homotrimer. Column (a) is a wild-type structure. Columns (b–d) each display the structure of a ProteinVAE-generated sequence at 91.5%, 85.6% and 75.4% sequence identity with respect to its closest natural sequence.
Extended Data Fig. 5 RMSD for Simulated Sequences.
RMSD for all natural representative sequences, ProteinVAE-generated sequences, ProGen2-generated sequences (3 generated structures had structural clashes) and ProtGPT2-generated sequences (3 generated structures had structural clashes). Each box plot shows the first and third quartiles, the central line is the median, and the whiskers show the range of the data; outliers are omitted for readability. For each sample, the RMSD value for every picosecond from 5 ns to 100 ns was analysed (n = 950).
Extended Data Fig. 6 RMSF Aligned According to MSA with Gaps Preserved.
Top: average RMSF for ProteinVAE-generated sequences (blue) and natural representative sequences (pink). Middle: average RMSF for ProGen2-generated sequences (blue) and natural representative sequences (pink). Bottom: average RMSF for ProtGPT2-generated sequences (blue) and natural representative sequences (pink). ProGen2- and ProtGPT2-generated sequences contained long inserted fragments that are not homologous to any natural hexon. These fragments also show increased flexibility, which could reduce structural stability. Data are presented as mean values ± s.d.
Extended Data Fig. 7 Human AdV Classifier.
(a) Receiver operating characteristic (ROC) curve of the latent human adenovirus hexon classifier. The area under the ROC curve is 0.97. (b) Predicted human AdV hexon likelihood for all sequences generated from each cluster. Sequences predicted to be human AdV hexons are shown as red dots, and predicted non-human AdV hexons as blue dots. The percentage of human AdV among the corresponding natural sequences is labelled as Nat_HAd% in each cluster. Clusters with more than 90% natural human AdV hexons are coloured with a pink background. The predicted percentage of human AdV among generated sequences is labelled as Gen_HAd%. The decision threshold is shown as a dashed red line.
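An ROC analysis of this kind can be reproduced in outline with scikit-learn (cited in this work for machine learning utilities). The scores below are synthetic stand-ins drawn from two overlapping distributions, not the actual classifier outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# hypothetical classifier scores: label 1 = human AdV hexon, 0 = non-human
y_true = np.concatenate([np.ones(100), np.zeros(100)])
y_score = np.concatenate([
    rng.normal(0.8, 0.15, 100),   # scores for true human AdV hexons
    rng.normal(0.3, 0.15, 100),   # scores for non-human AdV hexons
])

# area under the ROC curve summarizes ranking quality across thresholds
auc = roc_auc_score(y_true, y_score)

# the full curve (FPR vs TPR) supports choosing a decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(f"AUC = {auc:.2f}")
```

With well-separated score distributions like these, the AUC approaches 1; a decision threshold is then chosen along the curve to trade off true- against false-positive rate.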
Supplementary information
Supplementary Information
Supplementary Notes 1–10, Figs. 1–10, Tables 1–9 and references.
Source data
Source Data Fig. 2a
Amino acid association score.
Source Data Fig. 2b
ProtGPT2 and ProGen2 perplexity for all groups of sequences.
Source Data Fig. 2c
Raw HMMER output.
Source Data Fig. 2d
Shannon entropy of each column in the filtered MSA.
Source Data Fig. 2e
Location and number of gaps in MSA with respect to AdV5 hexon.
Source Data Fig. 2f
Helix, strand and coil ratio in each individual sequence in each group.
Source Data Fig. 5
PhyML output for visualization with branch length and support.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lyu, S., Sowlati-Hashjin, S. & Garton, M. Variational autoencoder for design of synthetic viral vector serotypes. Nat. Mach. Intell. 6, 147–160 (2024). https://doi.org/10.1038/s42256-023-00787-2