A Hitchhiker's Guide to Deep Chemical Language Processing for Bioactivity Prediction

Özçelik, Rıza; Grisoni, Francesca

Quantitative Biology > Biomolecules

arXiv:2407.12152 (q-bio)

[Submitted on 16 Jul 2024]

Abstract: Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP learns from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite the growing importance of predictive CLP models, training them is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP training to provide guidelines for newcomers and experts alike. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression. This 'hitchhiker's guide' not only underscores the importance of certain methodological choices, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
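
To make "learning from molecular strings" concrete, the following is a minimal sketch of how a SMILES string might be split into tokens and mapped to integer indices before being passed to an embedding layer. It is not taken from the paper: the regular expression, the function names (tokenize, build_vocab, encode), and the "<pad>" token are illustrative assumptions, and the pattern covers only common cases (bracket atoms, Cl, Br, two-digit ring closures).

```python
import re

# Illustrative token pattern (an assumption, not the paper's tokenizer):
# bracket atoms, the two-letter halogens Br and Cl, two-digit ring closures
# such as %12, and otherwise any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(corpus: list[str]) -> dict[str, int]:
    """Assign an integer index to every token seen in the corpus.

    Index 0 is reserved for a hypothetical padding token.
    """
    vocab = {"<pad>": 0}
    for smiles in corpus:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(smiles: str, vocab: dict[str, int], max_len: int) -> list[int]:
    """Map tokens to indices, then truncate or right-pad to max_len."""
    ids = [vocab[token] for token in tokenize(smiles)][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    corpus = ["CCO", "CC(=O)Cl", "C[C@H](N)C(=O)O"]
    vocab = build_vocab(corpus)
    for smiles in corpus:
        print(smiles, tokenize(smiles), encode(smiles, vocab, max_len=12))
```

The resulting index sequences are the kind of input a learnable embedding layer could consume; a one-hot encoding would instead expand each index into a vector of vocabulary size. Which embedding strategy works best is among the questions the study itself examines.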

Subjects:	Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Cite as:	arXiv:2407.12152 [q-bio.BM]
	(or arXiv:2407.12152v1 [q-bio.BM] for this version)
	https://doi.org/10.48550/arXiv.2407.12152

Submission history

From: Rıza Özçelik
[v1] Tue, 16 Jul 2024 20:13:31 UTC (229 KB)
