DOI: 10.1145/3584371.3613013
Research Article

CodonBERT: Using BERT for Sentiment Analysis to Better Predict Genes with Low Expression

Published: 04 October 2023
    Abstract

    Synonymous codons, which encode the same amino acid in a protein, are known to be used unequally in organisms. Prior research has uncovered "preferred" codons that are often found in more highly expressed genes. This has enabled computational models that predict the expression of protein-coding genes; however, their performance is often affected by the more diverse gene expression of higher organisms, i.e., high expression in only specific tissues or cell types. In this paper, we use a Natural Language Processing (NLP) algorithm, Bidirectional Encoder Representations from Transformers (BERT), to develop a new framework for predicting gene expression. Notably, our model architecture relies on the idea of sentiment analysis, i.e., assigning an overall "emotion" (sentiment) to protein-coding sequences. Our new framework, CodonBERT, is a pre-trained model that better captures the intrinsic relationships between sequences and their expression, and we show that our model makes substantially better predictions for a diverse collection of model organisms. Additionally, we show that our model learns inherent patterns of codon usage that can be traced using explainable AI (XAI) algorithms.
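    The abstract's central framing, treating a protein-coding sequence as a "sentence" of codon "words" whose overall "sentiment" is its expression level, can be illustrated with a minimal sketch. This is not the paper's model: the codon-level tokenization, the `PREFERRED` codon set, and the threshold score below are all hypothetical stand-ins for the BERT-based classifier described above.

    ```python
    # Illustration only: recasting a coding sequence as an NLP "sentence"
    # of codon tokens, then assigning a toy high/low expression "sentiment".

    def codon_tokenize(cds: str) -> list[str]:
        """Split a protein-coding sequence into codon 'words' (assumed preprocessing)."""
        cds = cds.upper().replace("U", "T")  # accept mRNA or DNA alphabets
        if len(cds) % 3 != 0:
            raise ValueError("coding sequence length must be a multiple of 3")
        return [cds[i:i + 3] for i in range(0, len(cds), 3)]

    # Hypothetical "preferred" codon set; stands in for what a trained model
    # would learn from data, NOT a biological ground truth.
    PREFERRED = {"GCC", "AAG", "GAG", "CTG"}

    def toy_expression_sentiment(cds: str) -> str:
        """Label a sequence 'high' or 'low' by its fraction of preferred codons."""
        codons = codon_tokenize(cds)
        score = sum(c in PREFERRED for c in codons) / len(codons)
        return "high" if score >= 0.5 else "low"

    print(codon_tokenize("ATGGCCAAG"))            # ['ATG', 'GCC', 'AAG']
    print(toy_expression_sentiment("GCCAAGGAGCTG"))  # high
    ```

    In the paper's actual framework, the hand-picked codon set and fractional score would be replaced by a BERT encoder fine-tuned on codon-tokenized sequences, so the notion of "preferred" usage is learned rather than fixed.
    
    
    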


    Published In

    BCB '23: Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics
    September 2023, 626 pages
    ISBN: 9798400701269
    DOI: 10.1145/3584371

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. CUB
    2. expression
    3. sentiment analysis
    4. transformers
    5. SHAP


    Conference

    BCB '23

    Acceptance Rates

    Overall Acceptance Rate: 254 of 885 submissions, 29%

