DOI: 10.1145/3584371.3613013
Research Article

CodonBERT: Using BERT for Sentiment Analysis to Better Predict Genes with Low Expression

Published: 04 October 2023
    Abstract

    Synonymous codons, which encode the same amino acid in a protein, are known to be used unequally in organisms. Prior research has uncovered "preferred" codons that are often found in more highly expressed genes. This has enabled computational models that predict the expression of protein-coding genes; however, their performance is often affected by the more diverse gene expression of higher organisms, i.e., high expression in only specific tissues or cell types. In this paper, we use a Natural Language Processing (NLP) algorithm, Bidirectional Encoder Representations from Transformers (BERT), to develop a new framework for predicting gene expression. Notably, our model architecture relies on the idea of sentiment analysis, i.e., assigning an overall "emotion" (sentiment) to protein-coding sequences. Our new framework, CodonBERT, is a pre-trained model that better captures the intrinsic relationships between sequences and their expression, and we show that our model makes substantially better predictions for a diverse collection of model organisms. Additionally, we show that our model learns inherent patterns of codon usage that can be traced using explainable AI (XAI) algorithms.
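    The abstract's central framing, treating a protein-coding sequence as a "sentence" of codon "words" whose overall "sentiment" is its expression level, can be illustrated with a minimal sketch. This is not the paper's model: the codon-level tokenization, the `PREFERRED` codon set, and the threshold score below are all hypothetical stand-ins for the BERT-based classifier described above.

    ```python
    # Illustration only: recasting a coding sequence as an NLP "sentence"
    # of codon tokens, then assigning a toy high/low expression "sentiment".

    def codon_tokenize(cds: str) -> list[str]:
        """Split a protein-coding sequence into codon 'words' (assumed preprocessing)."""
        cds = cds.upper().replace("U", "T")  # accept mRNA or DNA alphabets
        if len(cds) % 3 != 0:
            raise ValueError("coding sequence length must be a multiple of 3")
        return [cds[i:i + 3] for i in range(0, len(cds), 3)]

    # Hypothetical "preferred" codon set; stands in for what a trained model
    # would learn from data, NOT a biological ground truth.
    PREFERRED = {"GCC", "AAG", "GAG", "CTG"}

    def toy_expression_sentiment(cds: str) -> str:
        """Label a sequence 'high' or 'low' by its fraction of preferred codons."""
        codons = codon_tokenize(cds)
        score = sum(c in PREFERRED for c in codons) / len(codons)
        return "high" if score >= 0.5 else "low"

    print(codon_tokenize("ATGGCCAAG"))            # ['ATG', 'GCC', 'AAG']
    print(toy_expression_sentiment("GCCAAGGAGCTG"))  # high
    ```

    In the paper's actual framework, the hand-picked codon set and fractional score would be replaced by a BERT encoder fine-tuned on codon-tokenized sequences, so the notion of "preferred" usage is learned rather than fixed.
    
    
    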


    Published In

    BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
    September 2023, 626 pages
    ISBN: 9798400701269
    DOI: 10.1145/3584371

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. CUB
    2. expression
    3. sentiment analysis
    4. transformers
    5. SHAP


    Conference

    BCB '23

    Acceptance Rates

    Overall Acceptance Rate: 254 of 885 submissions, 29%

