t-SMILES: A Scalable Fragment-based Molecular Representation Framework for De Novo Molecule Generation

Wu, Juan-Ni; Wang, Tong; Chen, Yue; Tang, Li-Juan; Wu, Hai-Long; Yu, Ru-Qin

Computer Science > Machine Learning

arXiv:2301.01829v2 (cs)

[Submitted on 4 Jan 2023 (v1), revised 23 Dec 2023 (this version, v2), latest version 21 May 2024 (v4)]

Abstract: Effective representation of molecules is a crucial factor affecting the performance of artificial intelligence models. This study introduces a flexible, fragment-based, multiscale molecular representation framework called t-SMILES (tree-based SMILES) with three code algorithms: TSSA (t-SMILES with Shared Atom), TSDY (t-SMILES with Dummy Atom) and TSID (t-SMILES with ID). The framework describes molecules using SMILES-type strings obtained by performing a breadth-first search on a full binary tree formed from a fragmented molecular graph. Systematic evaluations using JTVAE, BRICS, MMPA, and Scaffold demonstrate the feasibility of constructing a multilingual molecular description system in which the various descriptions complement each other and enhance overall performance. In addition, t-SMILES exhibits impressive performance on low-resource datasets, whether the model is original, data-augmented, or pre-trained and fine-tuned. It significantly outperforms classical SMILES, DeepSMILES, SELFIES and baseline models in goal-directed tasks. Furthermore, it surpasses state-of-the-art fragment-, graph-, and SMILES-based approaches on ChEMBL, ZINC, and QM9.
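
The string construction described in the abstract (a breadth-first search over a full binary tree of fragments) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the `Node` class, the `bfs_string` function, the `^` separator, the `&` placeholder for empty positions, and the toy fragments. Real fragmentation (e.g. with JTVAE, BRICS, MMPA, or Scaffold) and the TSSA/TSDY/TSID encodings are not modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

EMPTY = "&"  # hypothetical placeholder token for an empty tree position
SEP = "^"    # hypothetical separator between fragment tokens


@dataclass
class Node:
    """One fragment of a molecule, stored as a SMILES-like string."""
    fragment: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bfs_string(root: Node) -> str:
    """Serialize a fragment tree level by level.

    Missing children are emitted as EMPTY, so every real node contributes
    exactly two child slots. The padded tree is therefore a full binary
    tree: each node has either zero or two children.
    """
    tokens = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node is None:
            tokens.append(EMPTY)  # padding leaf, has no children
            continue
        tokens.append(node.fragment)
        queue.append(node.left)
        queue.append(node.right)
    return SEP.join(tokens)


# Toy tree with three made-up fragments; not a real decomposition.
example_root = Node("CC", left=Node("C1CCCCC1", right=Node("O")))
example_string = bfs_string(example_root)
```

With three real fragments the padded tree has seven positions, so `example_string` contains seven tokens. Padding is what makes the level-order string unambiguous: a reader can rebuild the tree shape by consuming two child slots for every non-empty token.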

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
Cite as:	arXiv:2301.01829 [cs.LG]
	(or arXiv:2301.01829v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2301.01829

Submission history

From: Juanni Wu
[v1] Wed, 4 Jan 2023 21:41:01 UTC (1,395 KB)
[v2] Sat, 23 Dec 2023 07:54:57 UTC (1,899 KB)
[v3] Wed, 10 Jan 2024 00:53:43 UTC (1,254 KB)
[v4] Tue, 21 May 2024 02:19:13 UTC (826 KB)
