Abstract
Molecular machine learning bears promise for efficient molecular property prediction and drug discovery. However, labelled molecular data can be expensive and time-consuming to acquire, and with such limited data it is challenging for supervised machine learning models to generalize across the vast chemical space. Here we present MolCLR (Molecular Contrastive Learning of Representations via Graph Neural Networks), a self-supervised learning framework that leverages large unlabelled datasets (~10 million unique molecules). In MolCLR pre-training, we build molecule graphs and develop graph-neural-network encoders to learn differentiable representations. Three molecule graph augmentations are proposed: atom masking, bond deletion and subgraph removal. A contrastive estimator maximizes the agreement between augmentations of the same molecule while minimizing the agreement between different molecules. Experiments show that our contrastive learning framework significantly improves the performance of graph-neural-network encoders on various molecular property benchmarks, including both classification and regression tasks. Benefiting from pre-training on the large unlabelled database, MolCLR even achieves state-of-the-art performance on several challenging benchmarks after fine-tuning. Further investigations demonstrate that MolCLR learns to embed molecules into representations that distinguish chemically reasonable molecular similarities.
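To make the pre-training objective concrete, the following is a minimal PyTorch sketch of a SimCLR-style NT-Xent contrastive loss together with a toy atom-masking augmentation. This is an illustration of the technique, not the authors' released implementation: the function names, tensor shapes and the use of a zero vector as the mask token are assumptions made for this example.

```python
# Illustrative sketch of SimCLR-style NT-Xent contrastive pre-training as
# described in the abstract. NOT the authors' released code: function names,
# shapes and the zero-vector "mask token" are assumptions for this example.
import torch
import torch.nn.functional as F


def nt_xent_loss(z_i: torch.Tensor, z_j: torch.Tensor,
                 temperature: float = 0.1) -> torch.Tensor:
    """NT-Xent loss for a batch of N molecules embedded twice.

    z_i, z_j: (N, D) embeddings of two augmented views of the same molecules.
    (z_i[k], z_j[k]) form positive pairs; the remaining 2N - 2 embeddings in
    the batch serve as negatives.
    """
    n = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # (2N, D), unit norm
    sim = z @ z.t() / temperature                         # scaled cosine similarity
    sim.fill_diagonal_(float("-inf"))                     # exclude self-similarity
    # The positive for row k is row k + N (and vice versa).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def mask_atoms(x: torch.Tensor, mask_rate: float = 0.25) -> torch.Tensor:
    """Atom-masking augmentation on a (num_atoms, num_features) node matrix.

    Zeroes the features of a random subset of atoms as a stand-in for a
    learned mask token.
    """
    x = x.clone()
    num_mask = max(1, int(mask_rate * x.size(0)))
    idx = torch.randperm(x.size(0))[:num_mask]
    x[idx] = 0.0
    return x


# Toy usage with random embeddings standing in for GNN outputs.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
print("NT-Xent loss:", nt_xent_loss(z1, z2).item())
```

Bond deletion and subgraph removal would act analogously on the edge list rather than the node features; in a full pipeline, the two views of each molecule are produced by independently sampled augmentations before being passed through the graph-neural-network encoder.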
Data availability
The pre-training data and molecular property prediction benchmarks used in this work are available in both the CodeOcean capsule at https://doi.org/10.24433/CO.8582800.v1 (ref. 49) and the GitHub repository at https://github.com/yuyangw/MolCLR.
Code availability
The code accompanying this work is available in both the CodeOcean capsule at https://doi.org/10.24433/CO.8582800.v1 (ref. 49) and the GitHub repository at https://github.com/yuyangw/MolCLR.
References
Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
Huang, B. & Von Lilienfeld, O. A. Communication: Understanding molecular representations in machine learning: the role of uniqueness and target similarity. J. Chem. Phys. 145, 161102 (2016).
David, L., Thakkar, A., Mercado, R. & Engkvist, O. Molecular representations in AI-driven drug discovery: a review and practical guide. J. Cheminform. 12, 56 (2020).
Oprea, T. I. & Gottfries, J. Chemography: the art of navigating in chemical space. J. Comb. Chem. 3, 157–166 (2001).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
Duvenaud, D. et al. Convolutional networks on graphs for learning molecular fingerprints. In Proc. 28th International Conference on Neural Information Processing Systems 2224–2232 (MIT Press, 2015).
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning 1263–1272 (PMLR, 2017).
Karamad, M. et al. Orbital graph convolutional neural network for material property prediction. Phys. Rev. Mater. 4, 093801 (2020).
Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun. 9, 3887 (2018).
Deringer, V. L. et al. Realistic atomistic structure of amorphous silicon from machine-learning-driven molecular dynamics. J. Phys. Chem. Lett. 9, 2879–2885 (2018).
Wang, W. & Gómez-Bombarelli, R. Coarse-graining auto-encoders for molecular dynamics. npj Comput. Mater. 5, 125 (2019).
Altae-Tran, H., Ramsundar, B., Pappu, A. S. & Pande, V. Low data drug discovery with one-shot learning. ACS Cent. Sci. 3, 283–293 (2017).
Magar, R., Yadav, P. & Farimani, A. B. Potential neutralizing antibodies discovered for novel corona virus using machine learning. Sci. Rep. 11, 5261 (2021).
Wang, Y., Cao, Z. & Farimani, A. B. Efficient water desalination with graphene nanopores obtained using artificial intelligence. npj 2D Mater. Appl. 5, 66 (2021).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comp. Sci. 28, 31–36 (1988).
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (selfies): a 100% robust molecular string representation. Mach. Learn. Sci. Technol. 1, 045024 (2020).
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (2017).
Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations (2019).
Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet – a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722 (2018).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Kirkpatrick, P. & Ellis, C. Chemical space. Nature 432, 823 (2004).
Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3–50 (1996).
Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108 (2019).
Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18, 463–477 (2019).
Unterthiner, T. et al. Deep learning as an opportunity in virtual screening. In Proc. Deep Learning Workshop at NIPS Vol. 27 (2014).
Ma, J., Sheridan, R. P., Liaw, A., Dahl, G. E. & Svetnik, V. Deep neural nets as a method for quantitative structure–activity relationships. J. Chem. Inf. Model. 55, 263–274 (2015).
Ramsundar, B. et al. Massively multitask networks for drug discovery. Preprint at https://arxiv.org/abs/1502.02072 (2015).
Kusner, M. J., Paige, B. & Hernández-Lobato, J. M. Grammar variational autoencoder. In International Conference on Machine Learning 1945–1954 (PMLR, 2017).
Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inf. 37, 1700111 (2018).
Xu, Z., Wang, S., Zhu, F. & Huang, J. Seq2seq fingerprint: an unsupervised deep molecular embedding for drug discovery. In Proc. 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics 285–294 (ACM, 2017).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Maziarka, Ł. et al. Molecule attention transformer. Preprint at https://arxiv.org/abs/2002.08264 (2020).
Feinberg, E. N. et al. PotentialNet for molecular property prediction. ACS Cent. Sci. 4, 1520–1530 (2018).
Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. Preprint at https://arxiv.org/abs/2003.03123 (2020).
Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).
Sterling, T. & Irwin, J. J. ZINC 15 – ligand discovery for everyone. J. Chem. Inf. Model. 55, 2324–2337 (2015).
Kim, S. et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 47, D1102–D1109 (2019).
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint at https://arxiv.org/abs/1810.04805 (2018).
Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. Preprint at https://arxiv.org/abs/2010.09885 (2020).
Wang, S., Guo, Y., Wang, Y., Sun, H. & Huang, J. SMILES-BERT: large scale unsupervised pre-training for molecular property prediction. In Proc. 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 429–436 (ACM, 2019).
Liu, S., Demirel, M. F. & Liang, Y. N-gram graph: simple unsupervised representation for graphs, with applications to molecules. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS, 2019).
Hu, W. et al. Strategies for pre-training graph neural networks. In International Conference on Learning Representations (2020).
You, Y. et al. Graph contrastive learning with augmentations. Adv. Neural Inf. Process. Syst. 33, 5812–5823 (2020).
van den Oord, A., Li, Y. & Vinyals, O. Representation learning with contrastive predictive coding. Preprint at https://arxiv.org/abs/1807.03748 (2018).
Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning 1597–1607 (PMLR, 2020).
Wang, Y., Wang, J., Cao, Z. & Farimani, A. B. MolCLR: molecular contrastive learning of representations via graph neural networks. CodeOcean https://doi.org/10.24433/CO.8582800.v1 (2021).
Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. Big self-supervised models are strong semi-supervised learners. Preprint at https://arxiv.org/abs/2006.10029 (2020).
Do, K., Tran, T. & Venkatesh, S. Graph transformation policy network for chemical reaction prediction. In Proc. 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 750–760 (ACM, 2019).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning 4839–4848 (PMLR, 2020).
Lu, C. et al. Molecular property prediction: a multilevel quantum interactions modeling perspective. In Proc. AAAI Conference on Artificial Intelligence Vol. 33, 1052–1060 (AAAI, 2019).
Van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
Yun, S., Jeong, M., Kim, R., Kang, J. & Kim, H. J. Graph transformer networks. In Advances in Neural Information Processing Systems Vol. 32 (eds. Wallach, H. et al.) (Curran Associates, 2019).
Pope, P. E., Kolouri, S., Rostami, M., Martin, C. E. & Hoffmann, H. Explainability methods for graph convolutional neural networks. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10772–10781 (IEEE, 2019).
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Process. Mag. 34, 18–42 (2017).
He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 9729–9738 (IEEE, 2020).
Gao, T., Yao, X. & Chen, D. SimCSE: simple contrastive learning of sentence embeddings. Preprint at https://arxiv.org/abs/2104.08821 (2021).
Wang, J., Lu, Y. & Zhao, H. CLOUD: contrastive learning of unsupervised dynamics. Preprint at https://arxiv.org/abs/2010.12488 (2020).
Landrum, G. RDKit: open-source cheminformatics (2006); https://www.rdkit.org/
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
Fey, M. & Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds (2019).
Ho, T. K. Random decision forests. In Proc. 3rd International Conference on Document Analysis and Recognition Vol. 1, 278–282 (IEEE, 1995).
Cortes, C. & Vapnik, V. Support-vector networks. Mach. Learn. 20, 273–297 (1995).
Acknowledgements
We acknowledge the start-up funding provided by the Department of Mechanical Engineering at Carnegie Mellon University. The work is also funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), US Department of Energy, under award no. DE-AR0001221.
Author information
Contributions
Y.W., J.W. and A.B.F. designed the research study. Y.W., J.W. and Z.C. developed the method, wrote the code and performed the analysis. All authors wrote and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Alán Aspuru-Guzik and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information: Supplementary Figs. 1–3, Discussion and Tables 1–5.
About this article
Cite this article
Wang, Y., Wang, J., Cao, Z. et al. Molecular contrastive learning of representations via graph neural networks. Nat. Mach. Intell. 4, 279–287 (2022). https://doi.org/10.1038/s42256-022-00447-x
This article is cited by
- Multi-channel learning for integrating structural hierarchies into context-dependent molecular representation. Nature Communications (2025)
- Coverage bias in small molecule machine learning. Nature Communications (2025)
- MultiGranDTI: an explainable multi-granularity representation framework for drug-target interaction prediction. Applied Intelligence (2025)
- Enhancing molecular property prediction with auxiliary learning and task-specific adaptation. Journal of Cheminformatics (2024)
- Drug-target interaction prediction with collaborative contrastive learning and adaptive self-paced sampling strategy. BMC Biology (2024)