
ACS SYMPOSIUM SERIES 1326

MACHINE LEARNING IN CHEMISTRY
Data-Driven Algorithms, Learning Systems, and Predictions
New Possibilities for Artificial Intelligence in Chemical Research

Artificial intelligence, and especially its application to chemistry, is an exciting
and rapidly expanding area of research. This volume presents groundbreaking
work in this field to facilitate researcher engagement and to serve as a solid
base from which new researchers can break into this exciting and rapidly
transforming field. This interdisciplinary volume will be a valuable tool for
those working in cheminformatics, physical chemistry, and computational
chemistry.

PUBLISHED BY THE American Chemical Society
SPONSORED BY THE ACS Division of Computers in Chemistry

PYZER-KNAPP & LAINO

Machine Learning in Chemistry: Data-Driven Algorithms, Learning Systems, and Predictions

Edward O. Pyzer-Knapp, Editor
IBM Research—UK
Daresbury, UK

Teodoro Laino, Editor
IBM Research—Zurich
Rueschlikon, Switzerland

Sponsored by the
ACS Division of Computers in Chemistry

American Chemical Society, Washington, DC


Library of Congress Cataloging-in-Publication Data
Names: Pyzer-Knapp, Edward O., editor. | Laino, Teodoro, editor. | American
Chemical Society. Division of Computers in Chemistry, sponsoring body.
Title: Machine learning in chemistry / Edward O. Pyzer-Knapp, editor ;
Teodoro Laino, editor.
Description: Washington, DC : American Chemical Society, [2019] | Series:
ACS symposium series ; 1326 | "Sponsored by the ACS Division of
Computers in Chemistry." | Includes bibliographical references and
index.
Identifiers: LCCN 2019048154 (print) | LCCN 2019048155 (ebook) | ISBN
9780841235052 (hardcover) | ISBN 9780841235045 (ebook other)
Subjects: LCSH: Chemistry--Data processing. | Machine learning.
Classification: LCC QD39.3.E46 M33 2019 (print) | LCC QD39.3.E46 (ebook)
| DDC 540.285/631--dc23
LC record available at https://lccn.loc.gov/2019048154
LC ebook record available at https://lccn.loc.gov/2019048155

The paper used in this publication meets the minimum requirements of American National Standard for Information
Sciences—Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.
Copyright © 2019 American Chemical Society
All Rights Reserved. Reprographic copying beyond that permitted by Sections 107 or 108 of the U.S. Copyright Act
is allowed for internal use only, provided that a per-chapter fee of $40.25 plus $0.75 per page is paid to the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. Republication or reproduction for sale of pages in
this book is permitted only under license from ACS. Direct these and other permission requests to ACS Copyright Office,
Publications Division, 1155 16th Street, N.W., Washington, DC 20036.
The citation of trade names and/or names of manufacturers in this publication is not to be construed as an endorsement or
as approval by ACS of the commercial products or services referenced herein; nor should the mere reference herein to any
drawing, specification, chemical process, or other data be regarded as a license or as a conveyance of any right or permission
to the holder, reader, or any other person or corporation, to manufacture, reproduce, use, or sell any patented invention or
copyrighted work that may in any way be related thereto. Registered names, trademarks, etc., used in this publication, even
without specific indication thereof, are not to be considered unprotected by law.
PRINTED IN THE UNITED STATES OF AMERICA
Foreword
The purpose of the series is to publish timely, comprehensive books developed from
ACS-sponsored symposia based on current scientific research. Occasionally, books are developed from
symposia sponsored by other organizations when the topic is of keen interest to the chemistry
audience.

Before a book proposal is accepted, the proposed table of contents is reviewed for appropriate
and comprehensive coverage and for interest to the audience. Some papers may be excluded to better
focus the book; others may be added to provide comprehensiveness. When appropriate, overview
or introductory chapters are added. Drafts of chapters are peer-reviewed prior to final acceptance or
rejection.

As a rule, only original research papers and original review papers are included in the volumes.
Verbatim reproductions of previously published papers are not accepted.

ACS Books Department


Contents

Preface ........................................................................................ ix

1. Atomic-Scale Representation and Statistical Learning of Tensorial Properties ................. 1
   Andrea Grisafi, David M. Wilkins, Michael J. Willatt, and Michele Ceriotti

2. Prediction of Mohs Hardness with Machine Learning Methods Using Compositional Features ..... 23
   Joy C. Garnett

3. High-Dimensional Neural Network Potentials for Atomistic Simulations ........................ 49
   Matti Hellström and Jörg Behler

4. Data-Driven Learning Systems for Chemical Reaction Prediction: An Analysis of Recent
   Approaches .................................................................................. 61
   Philippe Schwaller and Teodoro Laino

5. Using Machine Learning To Inform Decisions in Drug Discovery: An Industry Perspective ...... 81
   Darren V. S. Green

6. Cognitive Materials Discovery and Onset of the 5th Discovery Paradigm ....................... 103
   Dmitry Y. Zubarev and Jed W. Pitera

Editors’ Biographies ........................................................................... 121

Indexes

Author Index ................................................................................... 125

Subject Index .................................................................................. 127
Preface
Recent prominent successes in areas such as natural language processing and voice and image
analysis, enabled by the growth in accessible computational power driven by accelerators such as the
GPU and by access to larger and more complex datasets, have revived the excitement around
machine learning.
It is inevitable that these developments have affected disciplines outside of computer science,
and one of the front-runners has been the field of chemistry and chemical discovery. With several
successful applications, machine learning, and more generally the field of artificial intelligence (AI),
is increasingly considered the answer to accelerating the development of new molecules, materials,
formulations, and processes. By reasoning over the large volumes of data now available, practitioners
can uncover insight that would otherwise remain hidden, reducing the number of trial-and-error
experiments and thereby saving time and money.
Machine learning and artificial intelligence offer the possibility of training computers, using
the properties of materials that we already know, to describe and reason about complex physical
systems without the need for an analytical representation. Recently, we have seen an explosion in the
range of applications of machine learning in chemistry, including areas such as QSAR, chemical reaction
prediction, protein structure prediction, quantum chemistry, and inverse materials design. Given
the great amount of information contained within chemical databases arising from research and
industry, machine learning can ensure that useful, but often hidden, information contained in the
data is interpreted effectively and utilized to its fullest potential. When carefully applied, this has the
potential to drive a paradigm shift in research, by finding trends that a human researcher may miss
due to bias towards a given interpretation.
If machine learning and AI are the vehicles for such a transformation, then data is the fuel.
Key drivers of the application of these approaches in chemistry are the growth of open
databases and the quality of the data now being recorded, enabled by the introduction of
automated data-collection systems in the lab. To labour a metaphor, all the fuel in the world is useless
without a good engine to convert it into useful energy. In the case of machine learning, this role is
played by the data representation (also known as the descriptor). Recently, we have seen the extensive
development of a large number of molecular descriptors, some of which are learned directly from the
data, that combine chemical knowledge and domain expertise to represent
the complexity of chemistry in a way that is understandable to an AI system.
We are experiencing a paradigm change in how chemists will do research in the future: the use
of AI and machine learning in chemistry is evolving from a somewhat isolated area of research into
an integral part of the scientific method. Intelligent or cognitive modeling will enable the creation of
tools that can be easily implemented in laboratory equipment. With the capability of integrating
optimized software to carry out many tasks, these algorithms can be coupled with data-generating
systems and directly provide outputs for many purposes in the chemical fields. Furthermore, the
intrinsic nature of computational artificial intelligence allows the models to be updated and refreshed
as new data are generated, leading to more robust tools that cover larger windows of
operation and eliminate negative confounding factors.

Here, we review, in six chapters, cutting-edge applications of AI in chemistry and
materials science, without professing completeness. In these pages we try to capture the excitement of
a few selected contributions that show different domains of applicability of AI across the spectrum of
chemical research.
The chapters have been organized starting from methodological contributions connected to data
representation and moving towards high-level overviews highlighting industrial impact. In Chapter 1, we
present work demonstrating the importance of incorporating three-dimensional symmetries in the context
of Gaussian process regression models (statistical learning models) geared towards the interpolation
of the tensorial properties of atomic-scale structures. To provide an example of the use of machine,
or statistical, learning approaches to predict material properties, we present in Chapter 2 a work
focusing on the inference of hardness in naturally occurring ceramic materials, which integrates
atomic and electronic features derived directly from composition across a wide variety of mineral
compositions and crystal systems.
Machine learning is not confined to providing high-quality surrogate models, however. In
Chapter 3, we review high-dimensional neural network potentials (HDNNPs) for materials
simulations. These neural networks constitute a general-purpose reactive potential method that
can be used for simulations of an arbitrary number of atoms and can describe all types of chemical
interactions (e.g., covalent, metallic, and dispersion), including the breaking and forming of
chemical bonds.
A journey through the impact of AI in chemistry would not be complete without looking at
applications of neural networks and machine learning in the chemoinformatics space. In Chapter
4, we present a review of the state of the art of data-driven learning systems for forward chemical
reaction prediction, analyzing the reaction representations, the data, and the model architectures.
We discuss the advantages and limitations of the different AI modeling strategies and make
comparisons on standard open-source benchmark datasets. Chapter 5 shows how AI and machine
learning are impacting the pharmaceutical industry, one of the first chemical industries to embrace
these techniques. Here, the author provides an overview of how methods and
models are conceived, built, and validated, and how their benefits are quantified.
Finally, in Chapter 6, we present a new way of doing materials discovery by integrating natural
language processing, knowledge representation, and automated reasoning. The authors describe how
this revolution will take the entire chemical R&D enterprise from the current “4th paradigm” of discovery
driven by data science and machine learning to a “5th paradigm” era in which cognitive systems
seamlessly integrate information from human experts, experimental data, physics-based models, and
data-driven models to speed discovery.
We hope that this book will benefit graduate students and researchers in chemistry, computer
scientists interested in applications of AI and machine learning to chemistry, and scientists who are
interested in understanding the possible applications of AI and machine learning in chemistry in
environments ranging from universities to industrial companies.

Edward O. Pyzer-Knapp
IBM Research—UK
Daresbury, UK

Teodoro Laino
IBM Research—Zurich
Rueschlikon, Switzerland

Chapter 1

Atomic-Scale Representation and Statistical Learning of Tensorial Properties

Andrea Grisafi, David M. Wilkins, Michael J. Willatt, and Michele Ceriotti*

Laboratory of Computational Science and Modeling, IMX, École Polytechnique Fédérale de
Lausanne, 1015 Lausanne, Switzerland
*E-mail: michele.ceriotti@epfl.ch

This chapter discusses the importance of incorporating three-dimensional
symmetries in the context of statistical learning models geared towards the
interpolation of the tensorial properties of atomic-scale structures. We focus on
Gaussian process regression, and in particular on the construction of structural
representations, and the associated kernel functions, that are endowed with the
geometric covariance properties compatible with those of the learning targets. We
summarize the general formulation of such a symmetry-adapted Gaussian process
regression model, and how it can be implemented based on a scheme that
generalizes the popular smooth overlap of atomic positions representation. We
give examples of the performance of this framework when learning the
polarizability, the hyperpolarizability, and the ground-state electron density of a
molecule.

Introduction
The purpose of a statistical learning model is the prediction of regression targets by means
of simple and easily accessible input parameters (1). In chemistry, physics and materials science,
regression targets are usually scalars or tensors, including electronic energies (2–5), quantum-
mechanical forces (6–8), electronic multipoles (9–11), response functions and scalar fields like the
electron density (12–18). For ground-state properties, the regression input usually consists of all the
information connected with the atomic structure at a given point of the Born-Oppenheimer surface
(e.g., nuclear charges and atomic positions). A more or less complex manipulation of these primitive
inputs leads to what is usually called a structural descriptor, or representation (Figure 1).



Figure 1. Structural descriptors should identify unequivocally and concisely the geometry and composition of
a molecule or condensed phase.

It is widely recognized that an essential ingredient for maximizing the efficiency of machine
learning models is to use representations that mirror the properties one wants to predict. Here
we discuss an effective approach to build linear regression models for tensors. The notion that the
representation should mirror the property means that, when a symmetry operation is applied to an atomic
structure, the associated representation should transform in a way that mimics the transformation
of the properties of the structure. It should be stressed that it is entirely possible to build an ML
model that does not incorporate such transformation properties. The universal symmetries of the
property must then be learned by the model through exposure to data in the training set, making
the training process less efficient. A crucial focus of this chapter is the creation of symmetry-adapted
representations. Once one has a symmetry-adapted representation at hand, the linear regression
model is bound to fulfill the symmetry requirements imposed by the property (19–23). There is,
however, another important consideration when building a model for tensors, expressed in terms
of a Cartesian reference system. It is well known that any tensor can be decomposed into a set of
spherical components that transform independently under rotations (24, 25). Particularly for high-
order tensors, the irreducible spherical decomposition of a tensor simplifies greatly the learning task,
compared to the Cartesian representation, as we will discuss later on.
The process of symmetry-adapting a representation is general but rather abstract, and for it
to be practical one must choose the initial representation with care. For this purpose we use the
smooth overlap of atomic positions (SOAP) framework, which is based on the representation of
atom-centered environments constructed from a smooth atom density built up using Gaussians
centered on each neighbor of the central atom. This density-based representation can be adapted
to incorporate correlations between atoms to any order. It has been applied successfully to a vast
number of ML investigations for physical properties of atomic structures (26–28). After summarizing
the derivation and efficient implementation of an extension to SOAP, called λ-SOAP, which is
particularly well-suited to the learning of tensorial properties, we present a few examples to
demonstrate its effectiveness for this task.

Linear Regression
Suppose one wanted to build a linear regression model to predict a scalar property y(X) for an
input X,

y(X) = 〈w|X〉.
In this equation |w〉 represents the weight vector we wish to learn and |X〉 is a representation
of the input. The usual approach for learning the weight vector is to suppose the properties are
independently and normally distributed, that is, yn ∼ 𝒩(〈w|Xn〉, σ²) for observations yn of inputs Xn.
One then maximizes the log likelihood of a set of N observations {yn} with respect to the weight
vector. The log likelihood (loss or cost function) combines the sum-of-squares misfit over the training
observations with a regularizer α²〈w|w〉, which appears if one introduces a Gaussian prior on w with
variance α⁻². L(w) attains its maximum at a weight vector obtained by applying the inverse of a
regularized covariance operator Ĉ, built from the training representations, to the data, with the
regularization strength set by η = α/σ (a standard sketch of these expressions is given below).
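A minimal sketch of this primal (ridge-regression) construction, written in a conventional form
consistent with the quantities defined above (the exact prefactor conventions of the original equations
are not reproduced here):

\[
L(w) \propto -\frac{1}{2\sigma^2}\sum_{n=1}^{N}\bigl(y_n-\langle w|X_n\rangle\bigr)^2
-\frac{\alpha^2}{2}\langle w|w\rangle ,
\qquad
|w\rangle = \hat C^{-1}\sum_{n=1}^{N} y_n |X_n\rangle ,
\qquad
\hat C = \sum_{n=1}^{N}|X_n\rangle\langle X_n| + \eta^2\,\hat 1 .
\]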
The preceding linear regression scheme in which one handles the representation |X〉 explicitly is
often called the primal formulation. There is in fact another, complementary formulation called the
dual (kernel ridge regression [KRR] or Gaussian process regression [GPR]) in which the equations
take a slightly different form. In the dual, one does not handle the representation explicitly but rather
introduces a kernel function which, roughly speaking, measures the similarity between two inputs.
The link between the primal and dual lies in the observation that a positive-definite kernel k(X,X′)
can always be written as an inner product (1), k(X, X′) = 〈X|X′〉.

This means that given a kernel one can always construct a representation and vice versa. From
the perspective of GPR, the kernel is interpreted as the covariance between the properties of its two
arguments, k(X, X′) = Cov[y(X), y(X′)].

The properties are assumed to be normally distributed, which means one can straightforwardly
find the conditional distribution of the property y(X) given a set of observations in a training set {yn}.
The mean of this distribution is given by ȳ(X) = k(X)ᵀ K⁻¹ y (with a regularization term added to K in practice),
where the jth component of k(X) is k(X, Xj), Kjk = k(Xj, Xk), and y is a vector formed from {yn}.
When the feature space associated with a kernel is known explicitly, and finite-dimensional, the
primal and dual formulations are formally equivalent, and the choice of which to use is an important
but purely practical question. Constructing a primal model requires inversion of the covariance
matrix, while the dual requires inversion of the kernel matrix K. If the feature space (i.e., the space
occupied by the representation) is larger than the training set then the GPR approach is more
convenient. Of course, the real utility of the kernel trick becomes apparent when the kernel is a
complex, non-linear function for which the feature space is unknown and/or infinite-dimensional.
In these circumstances, working in the dual makes it possible to formulate regression as a linear
problem, where reference configurations (or a sparse set of representative states) are used to define a
basis for the target, as in the right hand side of eq 8. As such, all the complexity of the input space
representation is contained in the definition of the kernel function.
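To make the primal/dual choice concrete, the following minimal numerical sketch (not taken from the
chapter; the feature map and data are invented for illustration) verifies that ridge regression solved in
the primal and kernel ridge regression with the corresponding linear kernel give identical predictions:

import numpy as np

rng = np.random.default_rng(0)

# Toy data: N training structures, each described by a D-dimensional representation |X>.
N, D = 50, 8
X = rng.normal(size=(N, D))              # rows are feature vectors
y = X @ rng.normal(size=D) + 0.05 * rng.normal(size=N)
eta2 = 1e-3                              # regularization strength eta^2

# Primal: invert the D x D covariance matrix C = sum_n |X_n><X_n| + eta^2 * 1.
C = X.T @ X + eta2 * np.eye(D)
w = np.linalg.solve(C, X.T @ y)

# Dual (KRR/GPR): invert the N x N kernel matrix built from k(X, X') = <X|X'>.
K = X @ X.T + eta2 * np.eye(N)
alpha = np.linalg.solve(K, y)

# Predictions for new inputs agree up to numerical precision.
X_new = rng.normal(size=(5, D))
print(np.allclose(X_new @ w, (X_new @ X.T) @ alpha))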

Tensors, Symmetries and Correlations


The previous discussion defines the general architecture of regression models which can be
used to predict any scalar quantity associated with the molecular geometry. We now discuss the
implications of learning tensors, or, similarly, any quantity that is not invariant under a rigid rotation
or reflection of the atomic structure. In so doing, we will introduce a formalism which is general
enough to encompass both proper Cartesian tensors, such as molecular polarizabilities, and three-
dimensional scalar fields that can be conveniently decomposed in atom-centered contributions, such
as the ground-state charge density of a molecule.
Let us start by considering the prototypical case of a Cartesian tensor y ≡ yαβ... of rank r,
with the combination of indices {αβ...} running over a number of Cartesian components equal
to 3^r. Given any arbitrarily distorted atomic structure with no particular internal symmetry, we are
interested in characterizing the transformations of the tensor under only three families of symmetry
operations (viz., translations, rotations and reflections). Since these symmetry operations do not
affect the internal geometry of an atomic structure, we can think equivalently in terms of active
transformations, in which the system undergoes the symmetry operation and the reference frame
remains fixed, or in terms of passive transformations, in which the reference frame undergoes the
symmetry operation and the system remains fixed. In the following, we summarize the symmetry
operations by adopting an active picture and assume the system is not subjected to an external field.

Translations
Any physical property of an atomic structure X remains unchanged under a rigid translation t̂ of
atomic positions, that is, y(t̂X) = y(X).

Rotations
Under the application of a rigid rotation R̂ to an atomic structure X, we assume that each
Cartesian component of the tensor undergoes a covariant linear transformation. Using Einstein

notation for convenience, and representing by R the rotation matrix corresponding to R̂, the rotated
tensor is yαβ...(R̂X) = Rαα′ Rββ′ ⋯ yα′β′...(X).

Reflections
Applying a reflection operator Q̂ to an atomic structure X through any mirror plane leads to the
following reflected tensor, yαβ...(Q̂X) = Qαα′ Qββ′ ⋯ yα′β′...(X), with Q the matrix representing the reflection.

Covariant Descriptors
In general terms, a primitive representation that mirrors a tensor of a given rank r could formally
be built by considering the tensor product |Xαβ...〉 = |X〉 ⊗ |α〉 ⊗ |β〉 ⊗ ⋯,

where |X〉 is an arbitrary description of the system, while |α〉 represents a set of Cartesian axes which is
rigidly attached to the system. When using this primitive representation in a linear regression model,
the tensor component corresponding to αβ... would be predicted either with a single weight vector |w〉
shared by all components, or with a separate weight vector |wαβ...〉 for each component.
After maximizing the log likelihood, the former possibility leads to a model that predicts every
component to be the same, while the latter ignores the known correlations between the components
and is therefore likely to overfit. For example, consider a training set in which only one of the
tensor components is non-zero. All but one of the regression weights {|wαβ...〉} would be driven
towards zero to maximize the log likelihood, so the trained model would only predict a finite value
for the component it had been explicitly exposed to in the training set. The model would therefore
incorrectly predict the tensor components for a structure differing only by a rigid rotation from one
in the training set.
To address these problems, one should adapt the primitive descriptor so that it fulfills each of the
symmetries detailed in eqs (9-11). Since the Cartesian basis vectors are invariant under translations,
eq 9 implies the core representation should itself be invariant under translations. Using Haar
integration one can construct a core representation that is invariant under translations by integrating
an arbitrary representation over the translation operator t̂ (29). One can then proceed to consider
covariance under SO(3) group operations. Eq 10 implies that a covariant representation for |Xαβ...〉
should satisfy the invariance relationship |(R̂X)αβ...〉 = Rαα′ Rββ′ ⋯ |Xα′β′...〉
for any rotation R̂. Starting from the primitive definition of eq 12, there are a variety of ways to
enforce this invariance relationship. One possibility is to construct the representation after applying an
alignment operator R̂X→, defined to rotate X into a specified orientation which is common to all
the molecules of the dataset (Figure 2).

Figure 2. Provided that one can define a local reference system, it is possible to learn tensorial properties by
aligning each molecule (or environment) into a fixed reference frame.

This works under the assumption that it is always possible to define a unique (and therefore
unambiguous) internal reference frame to rotate X into a specified orientation, which might be
possible when the system involved has a particularly rigid internal structure. A more general strategy,
which does not require any assumption on the molecular geometry to be made, consists in
considering the covariant integration over the operator R̂ (Haar integration).

On top of this definition, the requirement that a representation be covariant in O(3),
including the reflection symmetry of the tensor as in eq 11, means that improper rotations must be
included (i.e., O(3) = SO(3) × {Î, Q̂}), with Q̂ representing a reflection operator. This is done
by a simple linear combination of the SO(3) descriptor with its reflected counterpart with respect to
an arbitrary mirror plane of the system.

Any other reflection operation can be automatically included by having made the descriptor
covariant under rotations.

Covariant Regression
Having shown how to build a symmetry-adapted representation of the system, let us see the
implications of this procedure for linear regression. Using a symmetry-adapted representation in a
linear regression model leads to the following solution for the regression weight,

where the covariance is

Note that the solution for the linear regression weight does not change when the training
structures and corresponding tensors simultaneously undergo a symmetry operation that the
representation has been adapted to. In other words, the same model results regardless of the arbitrary
orientation of structures in the training set.
When moving to the dual, the kernel is obtained as the inner product between two symmetry-adapted
representations, which corresponds to a double Haar integration of the core kernel over rotations.
As stressed earlier, performing the linear regression in the dual using this kernel leads to a
formally-equivalent model to that resulting from the primal formulation described above, yet this
kernel appears to be more complicated than a symmetry-adapted descriptor since it involves two
integrations over rotations. If, however, we assume the core representation |X〉 undergoes a unitary
transformation when the system is rotated,

the kernel reduces to

where k(X, X′) = (X|X′) is the kernel corresponding to the core representation. The requirement
that the core representation should undergo a unitary transformation when the system is rotated
is reasonable since, if it were not true, the autocorrelation k(X, X) would depend on the absolute
orientation of X, which is unphysical given our assumption of the absence of external fields. Note
that upon defining a collective tensorial index {αβ...}, a kernel matrix of size 3^rN × 3^rN can be
constructed by stacking together each of the 3^r × 3^r vector-valued correlation functions. Then, a
covariant tensorial prediction of the property of interest can eventually be carried out according to
the GPR prescription of eq 8. It should be noted that the symmetry-adapted kernel of eq 24 is just
a generalization of the covariant kernels that have been introduced in Glielmo et al. (7) to learn
forces. Taking scalar products of symmetry-adapted representations provides a route to design easy-
to-compute covariant kernels for tensors of arbitrary order.
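As an illustration of this structure, a sketch of the rank-1 case, written in the spirit of the covariant
kernels of Glielmo et al. (7) rather than as a verbatim reproduction of eq 24:

\[
k_{\alpha\alpha'}(X, X') \;=\; \int \mathrm{d}\hat R \; R_{\alpha\alpha'}\, k\!\left(\hat R X, X'\right),
\]

so that rotating either structure simply reshuffles the Haar integration variable and the kernel
transforms covariantly.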
It is instructive to compare the symmetry-adapted kernel definition of eq 24 to the kernel that
one gets from the aligned descriptors of eq 16. In this case, building a kernel function on the top of
this descriptor effectively means carrying out the structural comparison in a common reference frame
where the two molecules are mutually aligned. One can then conveniently learn the tensor of interest
component-by-component through a much simpler scalar regression framework. For the simple case
of rank-1 tensors, for instance, this amounts to evaluating the scalar kernel between the two structures after
applying a best-alignment operator that rotates one into the reference frame of the other. This strategy has
been successfully used in the learning of electronic multipoles of organic molecules as well as for
predicting optical response functions of water molecules in their liquid environments (10, 12). For
the latter example, a representation of the best-alignment structural comparison is reported in Figure
3.
This method for tensor learning has the clear drawback of relying on the definition of a rigid
molecular geometry, for which an internal reference frame can be effectively used to perform the
procedure of best alignment. Following this line of thought, the availability of a covariant kernel
function allows us to implicitly carry out both the structural comparison and the geometric alignment
of two molecules simultaneously, without requiring any prior consideration of the internal structure of
the molecule at hand.

Figure 3. Representation of the reciprocal alignment between water environments.

Spherical Representation
The family of Cartesian symmetry-adapted descriptors previously introduced can be effectively
used, in principle, to predict any Cartesian tensor of arbitrary order. However, we should notice
that having a tensor product for each additional Cartesian axis makes the cost of the regression
scale unfavorably with the tensor order, producing a global kernel matrix of dimension (3^r)². In
fact, it is well established that a more natural representation of Cartesian tensors is given by their
irreducible spherical components (ISC) (25). As described in Stone (25), the transformation matrix
from Cartesian to spherical tensors can be found recursively, starting from the known transformation
for rank-2 tensors.
Upon trivial manipulations, which might account for the non-symmetric nature of the tensor,
each ISC transforms separately, in the same way as a spherical harmonic of order λ. Spherical harmonics
form a complete basis set of the SO(3) group. In particular, each λ-component of the tensor spans an
orthogonal subspace of dimension 2λ + 1. For instance, the 9 components of a rank-2 tensor separate out
into a term (proportional to the trace) that transforms like a scalar (λ = 0), three terms that transform
like λ = 1, and five terms that transform like λ = 2. When using a spherical representation, the kernel matrix is
block diagonal, which greatly reduces the number of non-zero entries, and makes it possible to learn
separately the different components. An additional advantage is that the possible symmetry of the
tensor can be naturally incorporated by retaining only the spherical components λ that have the
same parity as the tensor rank r. For instance, the λ = 1 component of a symmetric rank-2 tensor
vanishes identically, meaning that only the 6 surviving elements of the tensor need to be considered
when doing the regression. Especially for high rank tensors, this property means that the number of
components can be cut down significantly.
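As a concrete illustration (a standard identity, not reproduced from the chapter), a rank-2 Cartesian
tensor y separates into its λ = 0, 1, 2 parts as

\[
y_{\alpha\beta} \;=\;
\underbrace{\tfrac{1}{3}\,\delta_{\alpha\beta}\, y_{\gamma\gamma}}_{\lambda=0}
\;+\;
\underbrace{\tfrac{1}{2}\bigl(y_{\alpha\beta}-y_{\beta\alpha}\bigr)}_{\lambda=1,\ 3\ \text{components}}
\;+\;
\underbrace{\tfrac{1}{2}\bigl(y_{\alpha\beta}+y_{\beta\alpha}\bigr)-\tfrac{1}{3}\,\delta_{\alpha\beta}\, y_{\gamma\gamma}}_{\lambda=2,\ 5\ \text{components}} ,
\]

and for a symmetric tensor the λ = 1 part vanishes, leaving the 6 surviving components mentioned above.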
In light of the discussion carried out for Cartesian tensors, it is straightforward to realize how a
symmetry-adapted descriptor that transforms covariantly with spherical harmonics of order λ should
look. Since each ISC is effectively a vector of dimension 2λ + 1, we can first write a primitive spherical
harmonic representation as |Xλµ〉 = |X〉 ⊗ |λµ〉, where |λµ〉 is an angular momentum state of order λ
that transforms under rotations according to the Wigner D-matrix of order λ. Its symmetry-adapted
counterpart, which is covariant in SO(3), is obtained by the same Haar integration over rotations used
in the Cartesian case.

Finally, since the parity of |λµ〉 with respect to the inversion operator î is determined by λ, a
spherical tensor descriptor that is covariant in O(3) can be obtained by combining it with its counterpart
transformed under î.
Note that a tensorial kernel function built on top of this descriptor would transform under
rotations as the Wigner D-matrix of order λ.

In addition to being the most natural strategy to perform the regression of Cartesian tensors,
using a representation like that of eq 28 comes in handy when building regression models for the
many physical properties that can be decomposed in a basis of atom-centered spherical harmonics. In
the following sections, we will give an example of this kind by predicting the ground-state electronic
charge density of molecular systems.

SOAP Representation
We now proceed to characterize the exact functional form of a symmetry-adapted representation
of order λ which can be used to carry out a covariant prediction of any property that transforms
as a spherical harmonic. In the section above, it was pointed out that, within a framework of linear
regression, both the primal and the dual formulation can be adopted to actually implement the
interpolation of a given tensorial property. In what follows, however, we will focus our attention
on the dual formulation, discussing in parallel the feature vector associated with the λ-SOAP
representation and the corresponding kernel function. This choice is justified by the greater flexibility
of the kernel formulation, allowing a non-linear extension of the framework as discussed below.
An atom-centered environment Xj describes the set of atoms that are included within a spherical
cutoff rcut around the central atom j. We will label as |Xj〉 the abstract vector which describes the local
structure. A convenient definition of |Xj〉 in real space can be obtained by writing a smooth probability
amplitude, for each atomic species α, as a superposition of Gaussians with spread σ that are centered
on the positions {ri} of the atoms that surround the central atom j, 〈r|Xj; α〉 ∝ Σi exp(−|r − ri|²/2σ²).

This definition descends naturally from the requirement of translational invariance of a
representation of the entire structure and corresponds to the construction that is used in Bartók et al.
(21) to define the SOAP kernel (29). Formally, one can then write |Xj〉 = Σα |Xj; α〉 ⊗ |α〉,

with the ket |α〉 tagging the identity of each species. Even though it might be convenient to use a
lower-dimensional chemical space (30), particularly when building models for datasets containing
many elements, in what follows we will assume that each element is associated with an orthogonal
subspace (i.e., 〈α|β〉 = δαβ). This implies that, when using this representation to define a scalar-
product kernel, only the density distributions of the same atomic type are overlapped.

With this choice, the two adjustable parameters rcut and σ determine respectively the range and
the resolution of the representation. To simplify the notation, we will omit the α labels, assuming
that a single element is present. The extension to the case with multiple chemical species follows
straightforwardly.
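As an illustration of this density-based construction (a minimal sketch, not the authors'
implementation; the atom positions and parameters below are invented), the smooth neighbor density of
a single environment can be evaluated at a set of points as follows:

import numpy as np

def neighbor_density(points, neighbors, sigma=0.3):
    """Smooth atom density of one environment: a sum of Gaussians of width sigma
    centered on the neighbor positions, evaluated at the given points."""
    points = np.asarray(points)          # (P, 3) evaluation points
    neighbors = np.asarray(neighbors)    # (M, 3) neighbor positions (one species)
    diff = points[:, None, :] - neighbors[None, :, :]
    r2 = np.sum(diff**2, axis=-1)        # squared distances, shape (P, M)
    return np.exp(-r2 / (2.0 * sigma**2)).sum(axis=1)

# Example: three neighbors around a central atom at the origin, within the cutoff.
neigh = [[1.0, 0.0, 0.0], [0.0, 1.2, 0.0], [-0.8, -0.5, 0.3]]
grid = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(neighbor_density(grid, neigh))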

λ-SOAP(1) Representation
To the first order in structural correlations, the local symmetry-adapted descriptor of order λ is
obtained by inserting the environmental state |Xj〉 into the symmetry-adapted construction introduced
above. The real-space representation of this descriptor can be understood as a rotational average of the
environmental density which is rigidly attached to a spherical harmonic of order λ.

A more concise, and easily computed, version of this representation results from projecting
it on a basis of spherical harmonics, in which case the integral over rotations can be
performed analytically.

It is clear that many of the indices in this representation are redundant, and would have no
effect when taking an inner product between two such representations. The most concise form that
produces the same scalar product kernel as eq 34 is written in terms of a spherical density component,
obtained by projecting the environmental density onto spherical harmonics on each radial shell.
This ket corresponds to a kernel
which is straightforward to calculate using a quadrature in r or an expansion on a radial basis.


It is insightful to consider the explicit expression for eq 36 in terms of the atom density. Take,
for instance, the case of λ = 1, µ = 0, for which the spherical harmonic is proportional to cos θ.
One sees that the 2-body λ-SOAP representation corresponds to moments of the smooth atom
density, resolved over different shells around the central atom. A linear model built on these features
can respond to changes in the atomic density at different distances, simultaneously adapting the
magnitude and geometric orientation of the target property.
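In other words (a schematic rendering of this statement, written here with an unnormalized spherical
harmonic), the λ = 1, µ = 0 component is, at each radial distance r, a dipole-like moment of the
neighbor density on that shell:

\[
\langle r;\,\lambda{=}1\,\mu{=}0 \,|\, X_j\rangle \;\propto\; \int \mathrm{d}\hat r \;\cos\theta \;\langle r\hat r\,|\,X_j\rangle .
\]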

λ-SOAP(2) Representation
Describing an atomic environment in a way that goes beyond the two-body structural
correlations (ν > 1) is of fundamental importance, because information on distances alone is not
sufficient to uniquely determine an atomic structure. Building on the definition of eq 33, and on the
symmetrized-atom-density framework of Willatt et al. (29), this can be achieved by introducing an
additional tensor product of the environmental state |Xj〉 within the rotational average.

By projecting on a real-space basis, the representation becomes a rotationally averaged correlation
between the environmental density evaluated at two points r and r′. Similarly to the ν = 1 case, one can
compute the ket without an explicit rotational average by projecting on a basis of spherical harmonics,
which contracts the two angular momentum channels with a Wigner 3j symbol. Just as for the λ-SOAP(1) case considered earlier,
it is clear that many of the indices in this expression are redundant. When taking an inner product
between two such representations, one can use orthogonality of Wigner 3j symbols to simplify to an
inner product between two objects with the following form,

The Clebsch-Gordan coefficient 〈lk, lk′|λµ〉 has the role of combining two angular momentum
components of the atomic environment Xj to be compatible with the spherical tensor order λ. This
object contains all the essential information of the abstract representation that is needed
for λ-SOAP(2) linear regression. Note that 〈lk, lk′|λµ〉 is zero unless k + k′ = µ, that the indices l, l′
and λ must satisfy the inequality |l − l′| ≤ λ ≤ l + l′, and that the representation is invariant under
transposition of r and r'.
Let us see how the representation changes under inversion. Given the parity of the spherical
harmonics, Yλµ(−r̂) = (−1)^λ Yλµ(r̂), it follows that under inversion each component of the
feature vector acquires a factor (−1)^(l+l′+λ).
This condition implies that a representation that is covariant in O(3), eq 28, can be easily
obtained by retaining only the components of the feature vectors for which l + l′ +λ is even.
In practice it is often more convenient to use real spherical harmonics instead of |λν〉 in the
representation. Using real spherical harmonics ensures the kernel is purely real, but the components
of the representation need not be because the phases are unimportant. In fact, what one finds upon
replacing |λµ〉 with a real spherical harmonic is that the components are either purely real or purely
imaginary, depending on whether l + l′ + λ is even or odd. The representation for µ > 0 becomes a
combination of the +µ and −µ components of the complex-harmonic representation, and the same holds
for the other real spherical harmonics (those for µ < 0, and |λ0〉 for µ = 0). One can therefore discard all imaginary
components of the representation to enforce inversion invariance.
Generalization of this procedure to higher orders of λ-SOAP is tedious but straightforward using
well-known formulae for integrals of products of Wigner-D matrices over rotations.

Non-linearity
As already mentioned in the introduction, a crucial aspect to improve regression performance is
to incorporate non-linearities in the construction of the representation. For instance, tensor products
of the scalar representation introduce higher body order correlations, in a way that can be easily
implemented in a kernel framework by raising the kernel to an integer power (29). When working
with tensorial representations, however, one has to be careful to avoid breaking the covariant
transformation properties of the feature vector. Taking products of the tensorial kets would
require re-projecting the product onto the irreducible representations of the group, which would be
as cumbersome as increasing the body order exponent ν. One obvious solution to this problem is

to multiply the spherical kernel of order λ by its scalar and rotationally invariant counterpart, which
can then be raised to an integer power ζ without breaking the tensorial nature of the kernel. For any
generic orders ν and ν′ in structural correlations, this procedure consists in taking the tensor product
of the tensorial ket with the scalar ket, which leads to a kernel given by the product of the tensorial
kernel of order λ and the scalar kernel raised to the power ζ − 1 (see the sketch below).

For ζ = 1, one recovers the original tensorial kernel, while a non-linear behavior is introduced
for ζ > 1. A considerable improvement of the learning power is usually obtained when using ζ = 2,
while negligible further improvement is observed for ζ > 2.
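In the notation used above, this construction can be sketched as (an illustrative form consistent with
the description; not a verbatim reproduction of eqs 48 and 49):

\[
k^{\lambda}_{\mu\mu'}(X, X') \;\longrightarrow\; k^{\lambda}_{\mu\mu'}(X, X')\,\bigl[k^{0}(X, X')\bigr]^{\zeta-1}.
\]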
These considerations also apply to the use of fully non-linear ML models like a neural network.
To guarantee that the prediction of the model is consistent with the group covariances, the tensorial
λ-SOAP features must enter the network at the last layer, and all the previous non-linear layers can
only contribute to different linear combinations of the tensorial features, with combination coefficients
frr′ll′ that can each be an arbitrary non-linear function of the scalar SOAP features. Similar
ideas have already been implemented in the context of generalizing the construction of spherical
convolutional neural networks (31).

Implementation
In the previous discussion it was pointed out that beyond the formal definition of the structural
descriptor in real space, the kernel evaluation eventually requires the computation of the SOAP
density power spectrum. In turn, computing this quantity requires the evaluation
of the density expansion coefficients 〈rlm|Xj〉. In practice, the continuous variable r can be replaced
by an expansion over a discrete set of orthogonal radial functions Rn(r) that are defined within the
spherical cutoff rcut. For this reason, we will refer, from now on, to the density expansion coefficients
as 〈nlm|Xj〉.
Having represented the environmental density distribution as a superposition of Gaussian
functions centered on each atom, the spherical harmonics projection can be carried out analytically
(32). In the resulting expression, the sum over i runs over the neighboring atoms of a given chemical
element, and ιl represents
a modified spherical Bessel function of the first kind. Under suitable choices of the functions Rn(r),
the radial integration can also be carried out analytically.
One possibility is to start with non-orthogonal Gaussian-type functions, R̃k(r), reminiscent of
Gaussian-type orbitals commonly used in quantum chemistry, each built as a power of r multiplied by a
Gaussian of width σk, with Nk a normalization factor. The set of Gaussian widths
{σk} can be chosen to effectively span the radial interval involved in the environment definition,
for instance so as to obtain functions that have equally-spaced peaks between 0 and rcut. The explicit
functional form of the primitive radial integrals can be written in closed form in terms of the Gamma
function Γ and the confluent hypergeometric function of the first kind, 1F1.
These primitive integrals can be finally orthogonalized by applying the orthogonalization matrix
S^(−1/2), with S representing the overlap matrix between primitive functions,

for which well-known analytical expressions exist (33).
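As a numerical illustration of this orthogonalization step (a minimal sketch; the specific radial-function
form and widths below are assumptions, not the chapter's exact choices), one can build Gaussian-type radial
functions on a grid, compute their overlap matrix S by quadrature, and apply S^(−1/2):

import numpy as np

rcut, nmax = 4.0, 6
r = np.linspace(0.0, rcut, 2000)                      # radial quadrature grid
sigmas = rcut * np.arange(1, nmax + 1) / nmax         # assumed (illustrative) widths

# Primitive, non-orthogonal GTO-like radial functions ~ r^k exp(-r^2 / 2 sigma_k^2).
R = np.array([r**k * np.exp(-r**2 / (2.0 * s**2)) for k, s in enumerate(sigmas, start=1)])
R /= np.sqrt(np.trapz(R**2 * r**2, r, axis=1))[:, None]    # normalization factors N_k

# Overlap matrix between primitives and its inverse square root (Loewdin orthogonalization).
S = np.trapz(R[:, None, :] * R[None, :, :] * r**2, r, axis=-1)
eigval, eigvec = np.linalg.eigh(S)
S_inv_sqrt = eigvec @ np.diag(eigval**-0.5) @ eigvec.T

R_orth = S_inv_sqrt @ R        # orthogonalized radial basis; rows are R_n(r)
print(np.round(np.trapz(R_orth[:, None, :] * R_orth[None, :, :] * r**2, r, axis=-1), 6))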

Examples
In this section, the effectiveness of a KRR model that is adapted to the fundamental physical
symmetries of the target is demonstrated, considering two very different quantities as examples. As a
first example, we consider the prediction of the dielectric response series of the Zundel cation, when
training a λ-SOAP(2) regression model on the ISC of the tensors at hand. In the second example,
we show how to predict the charge density ρ(r) of a small, yet flexible, hydrocarbon molecule like
butane, by decomposing ρ(r) into atom-centered spherical harmonics. In both cases, a comparison
of the prediction performance is carried out between λ-SOAP(2) descriptors that are covariant in
SO(3), which were used in previous work, and those that have been made fully O(3) compliant by
symmetrization over î.

Dielectric Response Series


Consider the dielectric response series of a molecule including the dipole µ, the polarizability
α and the hyperpolarizability β. The latter, for instance, is a rank-3 tensor describing the third-
order response of the molecular energy U with respect to an applied electric field E, with
components βαβγ = −∂³U/∂Eα∂Eβ∂Eγ. By construction this tensor is symmetric, meaning that it can be
decomposed into two spherical components, 3 of λ = 1 symmetry and 7 of λ = 3 symmetry. The
total number of components to be learned is thus 10, consistently with the number of non-equivalent
components of the Cartesian tensor. The dataset is made of 1000 configurations, of which 800
are randomly selected to train the regression model, while the remaining 200 are used to test the
prediction performances. λ-SOAP(2) kernels that are adapted to SO(3) and O(3) group symmetry
were constructed using a Gaussian smearing of σ = 0.3 Å and an environment cutoff of rcut = 4.0 Å.
The performance of each independent learning exercise (λ = 1, 2, 3) is reported in Figure
4. For all the spherical components we observe a systematic improvement of the regression when
endowing the kernel with the inversion symmetry about the atomic centers.

Figure 4. Learning curves of the Zundel cation dielectric response series µ, α, and β as decomposed in their
anisotropic (λ > 0) spherical tensor components. Full and dashed lines refer to predictions that are carried
out with λ-SOAP kernel functions that are covariant in SO(3) and O(3) respectively.

This improvement is particularly pronounced for few training points, while it becomes less
relevant for larger training set sizes, where the symmetry under inversion is eventually learned by the
SO(3)-kernel as well.

Electronic Charge Densities
Another learning task that can benefit from a symmetry-adapted regression scheme involves
the learning of scalar fields such as the electron charge density. ML models for the charge density
have been proposed based on the coefficients in a plane wave basis–this is convenient due to
orthogonality, but leads to poor transferability when considering flexible molecules, or learning
across different molecular species–or based on direct prediction of the density on a real-space grid
(16, 17, 34). By expanding the density on an atom-centred basis set, composed of radial functions
multiplied by spherical harmonics,

one obtains a model that is localized and transferable, concise, and easily integrated with the many
electronic structure codes that are based on atom-centered basis functions. The coefficients in the
expansion transform under rotations like spherical harmonics, and can therefore be learned
efficiently using a symmetry-adapted GPR model in which each coefficient is predicted as a sum of kernel contributions,

where the sum runs over a set of reference environments Zi centered around atoms of the same kind
as i, and the weights are computed by a regression procedure that is complicated by the fact that the
basis set is not orthogonal (18).
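Schematically (a sketch of the structure described here, following the construction of Grisafi et al. (18);
indices and normalization are illustrative), the density expansion and its symmetry-adapted prediction read

\[
\rho(\mathbf r) \;\approx\; \sum_{i}\sum_{n\lambda\mu} c^{\,i}_{n\lambda\mu}\,
R_n\!\left(|\mathbf r-\mathbf r_i|\right) Y_\lambda^\mu\!\left(\widehat{\mathbf r-\mathbf r_i}\right),
\qquad
c^{\,i}_{n\lambda\mu} \;\approx\; \sum_{j}\sum_{\mu'} w^{\,j}_{n\lambda\mu'}\,
k^{\lambda}_{\mu\mu'}\!\left(X_i, Z_j\right).
\]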

Figure 5. Learning curves of the predicted charge density of 200 randomly selected butane molecules, when
considering up to 800 reference molecules to train the model. The molecular geometries and computational
details are the same as in Grisafi et al. (18). The black full line refers to the prediction error as reported in
Grisafi et al. (18). Blue lines refer to the result obtained with the RI-cc-pV5Z basis, both with a λ-SOAP(2)
descriptor covariant in SO(3) (full) and O(3) (dashed). Dotted lines refer to the basis set error. In both
cases, 100 reference atomic environments have been used to define the problem dimensionality.

In Figure 5 we report the result obtained for a dataset of butane molecules (C4H10), for which
1000 reference pseudo-valence densities have been computed at the DFT/PBE level. The
dimensionality of the regression problem is defined by considering the 100 most diverse atomic
environments, out of a total of 14,000, selected by farthest point sampling through the 0-SOAP(2)
distance metric (35). Given that in our previous work the learning performance was essentially
limited by the basis set expansion error for the density, we decided to compare the optimized basis
set used in Grisafi et al. (18) with a resolution of the identity (RI) basis set, usually adopted in the
context of avoiding the computation of the four-center Hartree integral in electronic structure theory
(36). When considering in particular the RI-cc-pV5Z basis, which accounts for basis functions up
to l = 4 of angular momentum, we find that the basis set decomposition error is almost halved
(~0.6%) with respect to Grisafi et al. (18), as shown by the asymptotic convergence in Figure 5.
The figure also compares, in the case of the RI basis, the learning performances associated with λ-
SOAP(2) descriptors that have been made covariant in SO(3) and O(3) respectively. As seen for the
case of polarizability, the O(3) features improve, although only slightly, the prediction accuracy. The
improvement is more substantial at the smallest training set size, where the incorporation of prior
knowledge on the symmetries of the system can make up for the scarcity of data.

Conclusions
The previous examples show how statistical learning of a tensorial quantity across the
configurational space of atomic coordinates and composition represents a challenging
methodological task which requires considerable modifications to the architecture of more familiar
scalar learning models. The efficiency of a regression model benefits greatly from the incorporation
of symmetry, as it effectively reduces the dimensionality of the space in which the algorithm is
asked to interpolate the values of the target property. Symmetry of tensorial quantities should be
included in two distinct ways. First, one should decompose the tensor into ISC, so as to minimize the
amount of information that is needed to account for geometric covariance. Particularly for high-rank
Cartesian tensors, the matrix of correlations between tensor elements can be made block diagonal,
which is reflected in the size and complexity of the associated kernel matrices. Second, by constructing
representations of the molecular structure that are made isomorphic with the tensor of interest, one
can obtain a linear basis that satisfies the expected covariant transformations. An important aspect to
consider is that, in order to preserve the properties of the symmetry-matched basis, non-linearities
have to be treated with care. We discuss how it is possible to do so in the context of KRR models,
and how one should proceed to design a covariant neural network that can be used to efficiently
accomplish a symmetry-adapted regression task.
We discuss a practical implementation of these ideas within the framework of the SOAP
representations, that uses a spherical-harmonics representation of the atom density and is therefore
particularly well-suited to incorporate SO(3) covariance. We discuss an extension, that we refer to
as λ-SOAP, that provides a natural linear basis to regress quantities that transform like spherical
harmonics, and can be made to represent arbitrarily high body-order correlations between atomic
coordinates. As an original result of this work, we also discuss how to satisfy the inversion symmetry
of the tensor, showing that representations that incorporate the full O(3) covariances improve the
performance of the ML model, particularly in the limit of a small training set. We also show an
example of the use of λ-SOAP representations to learn a scalar field in three dimensions as a sum
of atom-centered contributions, choosing the electron density as a physically relevant example. We
believe that this strategy–although more complex than alternatives that use orthogonal basis
functions or a real-space grid–has the best promise to be transferable across different systems, and to
be combined with standard electronic structure packages.

Acknowledgments
The authors acknowledge support from the European Research Council under the European
Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 677013-
HBMAP).

Chapter 2

Prediction of Mohs Hardness with Machine Learning Methods Using Compositional Features
Joy C. Garnett*

Fisk University, Department of Life and Physical Sciences, Nashville, Tennessee 37208,
United States
Vanderbilt University, Department of Physics and Astronomy, Nashville, Tennessee 37212,
United States
*E-mail: jgarnett@fisk.edu, joy.garnett@vanderbilt.edu

Hardness, or the quantitative value of resistance to permanent or plastic deformation, plays a
crucial role in materials design for many applications, such as ceramic coatings and abrasives.
Hardness testing is an especially useful method because it is nondestructive and simple to
implement, and it gauges the plastic properties of a material. In this study, I propose a machine, or
statistical, learning approach to predict hardness in naturally occurring ceramic materials, which
integrates atomic and electronic features derived directly from composition, across a wide variety
of mineral compositions and crystal systems. First, atomic and electronic features, such as van der
Waals and covalent radii and the number of valence electrons, were extracted from composition.
The results show that the proposed method is very promising for predicting Mohs hardness, with
F1-scores >0.85. The dataset in this study covers a larger set of materials and hardness values than
has been modeled in previous studies. Next, feature importances were used to identify the strongest
contributions of these compositional features across multiple regimes of hardness. Finally, the
models trained on naturally occurring ceramic minerals were applied to synthetic, artificially grown
single-crystal ceramics.

Introduction
Hardness plays a key role in materials design for many industrial applications, such as drilling
(1, 2), boring (3, 4), abrasives (5–7), medical implants (8–10), and protective coatings (11–13).
Increased manufacturing demand fuels the drive for new materials of varying hardnesses, which
makes the fundamental understanding of the physical origin of this property necessary. Hardness
testing is a nondestructive measurement of a material’s resistance to permanent or plastic
deformation. One such hardness test is the Mohs scratch test, in which one material is scratched with
another of a specified hardness number between 1 and 10. Materials that are easily scratched, such
as talc, are given a low Mohs number (talc’s is 1) while materials that are highly resistant to plastic
deformation and difficult to scratch, such as diamond, are given a high Mohs number (diamond’s is
10).
In the 1950s, Tabor established that Mohs scratch hardness is associated with deformation
during the plastic indentation process and found that indentation hardness rises monotonically about
60% for each increment of the Mohs scale (14, 15). With this correlation, Tabor identified a
relationship between Mohs hardness and Vickers and Knoop indentation hardness. Tabor then
correlated the stress-strain characteristics of a material to the stress that produces plastic flow (16).
The relationship between indentation hardness (H) and Mohs scratch-hardness number (M) is

$$H \propto k^{M},$$

where k = 1.6, based on experimental data comparing the indentation hardness numbers found by
Vickers or Knoop measurements to the Mohs hardness value. It is unclear which atomic, electronic,
or structural factors contribute to k or hardness as a whole. So, identifying the key features of a
material that are involved in hardness can broaden our understanding of the mechanism of plastic
deformation, and therefore guide the design of novel materials.
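As a quick illustration of the exponential form written above (itself a reconstruction from Tabor's observation of a roughly 60% rise per Mohs increment), moving two increments up the Mohs scale would multiply the indentation hardness by

$$\frac{H(M+2)}{H(M)} = k^{2} = 1.6^{2} \approx 2.6.$$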
The Mohs hardness of a material is influenced by many factors. Material hardness for single-
crystal brittle materials like minerals can depend on the type of chemical bonding, which can affect
a material’s ability to start dislocations under stress (17–19). Materials low on the Mohs scale, such
as talc (M = 1) and gypsum (M = 2), exhibit van der Waals bonding between molecular chains or
sheets. Materials with ionic or electrostatic bonding have a larger Mohs hardness. Materials at the
top of the Mohs scale, such as boron nitride (M = 9) and diamond (M = 10), have large covalent
components. Covalent bonding restricts the start of dislocations under stress, producing a resistance
to plastic deformation. Hardness is also related to the correlation of composition and bond strength
(20–24). Light elements have extremely short and strong bonds, as do transition metals which have a
high number of valence bonds. Higher Mohs hardness is correlated with short average bond length, a high
number of bonds per unit volume, and a higher average number of valence electrons per atom.
Typically, calculations of hardness include multiple length scales to account for atomic
interactions that contribute to intrinsic hardness and microstructure which in turn contribute to
extrinsic hardness. Computational methodologies combined with high-performance computing
methods have been utilized to examine the deformation and compositional factors behind hardness on
multiple length scales (25). Specific methodologies include molecular dynamics (MD), density
functional theory, and machine learning (ML). MD calculations of indentation hardness involve
evolving the atomic configuration of an atomistic system of millions of atoms. By following the
evolution of the system and observing the nucleation of dislocations, MD allows the relation between
mechanical properties and microstructure to be investigated. Recent work commonly attributes
deformation processes to dislocation generation as a function of grain size (26, 27). This gives great
insight into how atomic interactions and microstructure contribute to deformation; however, a major
challenge remains in translating these observations into indentation hardness. Specifically, atomic
relaxations calculated using MD occur far faster than experimental indentation rates, and the lack of
suitable force fields produces unreliable interatomic potentials. To address these issues, MD
approaches have previously been combined with ML to compute more accurate atomic forces and
interatomic potentials (28–31). Even so, there would still remain the major challenge of the
computational cost of implementing a molecular dynamics methodology across hundreds of materials
in a varied chemical and structural space.
Modeling hardness has proven difficult, since hardness is not a fundamental property of a
material and cannot be directly evaluated from quantum mechanics across all crystal systems with
one model. Complications previously found in energy-based calculations of other properties include
sensitivity to bond length, overestimation by density functionals (20), finite-size supercell effects (32,
33), and the choice of exchange-correlation functional (32, 33), all of which lead to multiple issues in
large-scale implementation across different crystal structures and varied chemical spaces. These issues
include large computational costs when considering several materials at a time, especially the costs of
optimization and of energy determination for multiple deformations across one material class.
There have been multiple computational approaches to connect hardness to bond behavior.
For instance, Gao et al. introduced the concept that hardness of polar covalent crystals is related to
three factors (23): bond density or electronic density, bond length, and degree of covalent bonding.
This approach utilized first principles calculations to uncover a link between hardness and electronic
structure with multiple semiempirical relationships depending on the type of bonding in the
material. The advantage of this approach is that the link, demonstrated for 29 polar and nonpolar
covalent solids, could be extended across a broad class of materials. One disadvantage is that different
semiempirical relationships of microhardness apply depending on whether the material is a purely
covalent, polar covalent, or multicomponent solid.
Šimůnek and Vackář extended this concept of expressing the hardness of covalent and ionic
crystals through the bond strength (24), which is determined by the number of valence electrons
in the component atoms, the crystal valence electron density, the number of bonds, and the bond length
between pairs of atoms. They predicted the hardness of 30 covalent and ionic crystals including
binary AIII-BV and AII-BVI compounds and nitride spinel crystals (C3N4 and Si3N4). While their
results were close to experimental values for nitride spinels and AIII-BV materials, there was deviation
from experiment for the AII-BVI materials reported. One drawback of both methods is that they
depend on first principles calculations, which can become computationally expensive when
expanded to calculate all the bonds for hundreds of materials.
Mukhanov et al. circumvented ab initio calculations by utilizing thermodynamic properties to
find a simple quantitative dependence of hardness and compressibility of 9 materials (34). They
employ the standard Gibbs energy of formation and the Gibbs energy of atomization of elements
in the material. In addition, they introduce the factors of bond rigidity and bond covalency, which
are based on the electronegativities of the elements, as well as the ratio between the mean number
of valence electrons per atom and the number of bonds with neighboring atoms. One advantage is
that this method can be applied to a large number of compounds with various types of chemical
bonding and structures. The hardness predictions for refractory crystalline compounds agree within
7% of experimental values. Another advantage is the flexibility of this method to calculate hardness
as a function of temperature. However, some factors are estimated from experimental hardness values
of other materials. For instance, the coefficient of relative plasticity varies between elementary
substances, compounds of second-period elements, and compounds of elements from periods greater
than 3. This coefficient is taken to reflect the difference in bond strength between elements of different
periods, but it is unclear how directly atomic radii relate to it. While these relationships hold for
superhard high-pressure phases, they may not hold true for softer materials.

Li et al. also circumvented ab initio calculations by using chemophysical parameters to predict
the hardness of ionic materials (35, 36). The chemophysical parameters of electronegativity values of
elements in different valence states were used to relate the stiffness of the atoms, the electron-holding
energy of an atom, and bond ionicity indicators to hardness in 8 superhard materials. The calculated
hardness values are in good agreement with experimental data. However, it remains unclear if this
relationship of electronegativity and hardness is only applicable for superhard materials or if it can be
expanded to understand softer materials as well.
The thrust of this study is to combine all of these factors that have been theoretically connected
to hardness and understand how they may interact with each other and contribute to the hardness
of crystalline ceramic materials. Previously, these factors have been used to explain hardness across
a small range of crystal structures, bonding frameworks, and hardness values. In this study, I look
to expand these concepts to a large number of compounds with various types of chemical bonding
types, structures, and compositions. These chemophysical parameters may interact with each other
to predict a range of hardness values. These factors, specific to superhard bonding, may or may not
equally apply to other bonding frameworks, which are either noncubic or not purely covalent.
To circumvent the issues found in solely energy-based calculations, machine or statistical
learning offers a less computationally expensive method to improve predictions of material properties
and accelerate the design and development of new materials. Recently, ML methods applied to
existing data have been proven effective for predicting hard-to-compute material properties at
reduced time, cost, and effort (37–43). Predictive models based on experimental data have proven
to be extremely powerful in materials research. Examples include the prediction of the intrinsic
electrical breakdown field of insulators in extreme electric fields (44, 45), the crystal structure
classification of binary sp-block transition metal compounds (46–48), the prediction of Heusler
structures based on compositional features (49), and the prediction of band gaps of insulators
(50–54). A major advantage of ML methods for rapid material property prediction on past data
is their power to uncover quantitative structure-property relationships across varied compositional
spaces.
Previous studies predicting various properties of materials with ML have used a broad range
of chemo-structural descriptor fingerprints. Typically, the approach is to map a unique set of
descriptors that act as fingerprints connecting a material to a property of interest. These descriptors
range from composition to quantities based on quantum mechanical calculations. A set of materials
informatics methods built strictly on compositional descriptors and experimental hardness data may
be more effective at determining relationships concerning the hardness of materials than previous
approaches. This study implements an approach to establish a set of ML algorithms to uncover
connections between calculable atomic parameters and the Mohs hardness of single crystalline
ceramic materials.
The application of ML requires a dataset of feature descriptors that relate the chemical
composition of diverse crystals to their physical properties. In this study, the database is based
on compositional quantities. The aim of this study is to predict mechanical properties from
compositional features without the need for computationally heavy energy-based modeling. Along
this line, I wanted to test how well ML models can be utilized to predict a comparative material
property, such as Mohs hardness. Can one improve the prediction of hardness at less computational
expense and with greater fidelity than current methods? Are there identifiable atomic factors that
contribute to plastic deformation that can be applied across a variety of crystal structures and
compositions? Is there a simple formalism that allows one to track the importance of atomic
mechanisms as a function of hardness irrespective of the chemical complexity of the material?
In this study, ML is used to predict the hardness-related plastic properties of naturally occurring
ceramic minerals. Using compositional-based features, the ML approach is able to predict Mohs
hardness across a broad structural and chemical space. Specifically, 622 naturally occurring ceramic
minerals were screened using the random forests (RFs) ensemble-based ML method, as well as
support vector machines (SVMs). The results show that ML based purely on compositional features
of crystalline ceramics gives better results across a more varied chemical space than previous
methods. Moreover, the influence of atomic and electronic compositional features on the resulting
Mohs hardness prediction is evaluated. Finally, to demonstrate the efficiency of this model, it was
used to predict the Mohs hardness of 52 synthetic crystals with similar atomic and structural
characteristics. The resulting classification models accurately differentiate regimes of hardness by
identifying relevant and significant features that affect hardness, suggesting a connection between the
existence of a common panel of compositional markers and material hardness or resistance to plastic
deformation.

Methods

Datasets
In this study, the author trained a set of classifiers to understand whether compositional features
can be used to predict the Mohs hardness of minerals with different chemical compositions, crystal
structures, and crystal classes. The dataset for training and testing the classification models used
in this study originated from experimental Mohs hardness data, their crystal classes, and chemical
compositions of naturally occurring ceramic minerals reported in the Physical and Optical
Properties of Minerals CRC Handbook of Chemistry and Physics and the American Mineralogist
Crystal Structure Database (55, 56). The database is composed of 369 uniquely named minerals.
Due to the presence of multiple composition combinations for minerals referred to by the same
name, the first step was to perform compositional permutations on these minerals. This produced
a database of 622 minerals of unique compositions, comprising 210 monoclinic, 96 rhombohedral,
89 hexagonal, 80 tetragonal, 73 cubic, 50 orthorhombic, 22 triclinic, 1 trigonal, and 1 amorphous
structure. An independent dataset was compiled to validate the model performance. The validation
dataset contains the composition, crystal structure, and Mohs hardness values of 52 synthetic single
crystals reported in the literature. The validation dataset includes 15 monoclinic, 8 tetragonal, 7
hexagonal, 6 orthorhombic, 4 cubic, and 3 rhombohedral crystal structures. Both datasets were
processed by in-house Python scripts. The datasets for model development, evaluation, and
validation have been uploaded as a dataset onto Mendeley Data (57). Histograms of the distributions
indicating hardness values for both datasets are presented in Figures 1a and 1b.

Classes
The classification bins used in this study are based on relationships previously seen in the
literature from the studies of Gao et al. and Šimůnek and Vackář calculations of Vickers hardness
(58, 59). Gao et al. showed a correlation of calculated bond lengths to calculated Vickers hardness
values for binary and multicomponent oxides (58). The multicomponent oxides were broken down
into systems of pseudobinary compounds to reflect the nature of bonding in the material. These
calculations contain three groupings of hardness and bond length. For materials with bond lengths
greater than 2.5 Å, the Vickers hardness values were calculated to be under 5 GPa (Mohs value
(0.991, 4]). For materials with bond lengths between 2 and 2.5 Å, the Vickers hardness values were
calculated to be between 5 GPa and 12 GPa (Mohs value (4, 7]). For materials with bond lengths less
than 2 Å, the Vickers hardness values were calculated to be between 12 GPa and 40 GPa (Mohs value
(7, 10]). Similarly, Šimůnek and Vackář showed a correlation between bond length and calculated
Vickers hardness (59). However, it was more binarized. For materials with bond lengths greater than
2.4 Å, the Vickers hardness values were calculated to be less than 6.8 GPa (Mohs value (0.991, 5.5]).
For materials with bond lengths less than 2.4 Å, the Vickers hardness values were calculated to be
greater than 6.8 GPa (Mohs value (5.5, 10]).

Figure 1. Histograms of the Mohs hardnesses of the datasets of (a) 622 naturally occurring minerals and (b)
52 artificially grown single crystals.

Based on these groupings, the calculated Vickers hardness values from both studies were
converted to approximate Mohs hardness values and used as bins in this study. Minerals were
grouped according to their Mohs hardness values as shown in Table 1. Separate binary and ternary
classification groups were established as follows: Binary 0 (0.991, 5.5], Binary 1 (5.5, 10.0], Ternary
0 (0.991, 4.0], Ternary 1 (4.0, 7.0], and Ternary 2 (7.0, 10.0]. Thus, minerals of Mohs hardness
between 0.991 and 5.5 were assigned to the Binary 0 group, minerals with hardness between 5.5 and
10 were assigned to the Binary 1 group, and so on.
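The class assignment itself is a simple binning operation; the sketch below shows one way it could be done with pandas. The small DataFrame is a placeholder for the 622-mineral dataset, and only the bin edges are taken from this study.

```python
import pandas as pd

# Placeholder data; the real dataset contains 622 mineral compositions.
df = pd.DataFrame({"mineral": ["talc", "fluorite", "quartz", "corundum"],
                   "mohs_hardness": [1.0, 4.0, 7.0, 9.0]})

# pd.cut uses half-open intervals (a, b] by default, matching the bins in Table 1.
df["binary_class"] = pd.cut(df["mohs_hardness"],
                            bins=[0.991, 5.5, 10.0], labels=[0, 1])
df["ternary_class"] = pd.cut(df["mohs_hardness"],
                             bins=[0.991, 4.0, 7.0, 10.0], labels=[0, 1, 2])
print(df)
```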

Features
In this study, the author constructed a database of compositional feature descriptors that
characterize naturally occurring materials, which were obtained directly from the Physical and
Optical Properties of Minerals CRC Handbook (55). This comprehensive compositional-based
dataset allows us to train models that are able to predict hardness across a wide variety of mineral
compositions and crystal classes. Each material in both the naturally occurring mineral and artificial
single crystal datasets was represented by 11 atomic descriptors, numbered 0 to 10, listed in Table 2
below. The elemental features are number of electrons, number of valence electrons, atomic number,
Pauling electronegativity of the most common oxidation state, covalent atomic radii, van der Waals
radii, ionization energy (IE) of neutral atoms in the ground state (also known as the first IE), the
atomic number (Z) to mass number (A) ratio, and density. These features were collected for all
elements from the NIST X-ray Mass Attenuation Coefficients Database and the CRC Handbook of
Chemistry and Physics (55, 60).

Table 1. Binary and Ternary Classes in This Study Based on Mohs Hardness Values
Classes    Mohs Hardness
Binary (2-Class) Classification
0    (0.991, 5.5]
1    (5.5, 10.0]
Ternary (3-Class) Classification
0    (0.991, 4.0]
1    (4.0, 7.0]
2    (7.0, 10.0]

Table 2. List of All Primary Features


ID Name Feature Description
0 allelectrons_Total Total number of electrons
1 density_Total Total elemental density
2 allelectrons_Average Atomic average number of electrons
3 val_e_Average Atomic average number of valence electrons
4 atomicweight_Average Atomic average atomic weight
5 ionenergy_Average Atomic average first IE
6 el_neg_chi_Average Atomic average Pauling electronegativity of the most common
oxidation state
7 R_vdw_element_Average Atomic average van der Waals atomic radius
8 R_cov_element_Average Atomic average covalent atomic radius
9 zaratio_Average Atomic average atomic number to mass number ratio
10 density_Average Atomic average elemental density

The atomic averages of nine features were calculated for each mineral. The atomic average is the
sum of the compositional feature (f_i) over all atoms divided by the number of atoms (n) present in the
mineral's empirical chemical formula, or

$$\bar{f} = \frac{1}{n} \sum_{i=1}^{n} f_i.$$

Two additional feature descriptors were added based on the total number of electrons and the
total of the elemental densities for each compound, for a total of 11 features listed in Table 2.
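As a hedged illustration of how such atomic-average descriptors can be computed, the sketch below parses a simple empirical formula and averages a per-element property over all atoms in it. The tiny element-property table and the function names are hypothetical stand-ins; in the study, the elemental values come from the CRC Handbook and the NIST database.

```python
import re

# Hypothetical excerpt of an element-property table (values for illustration only;
# the study draws these from the CRC Handbook and the NIST database).
ELEMENT_DATA = {
    "Si": {"val_e": 4, "el_neg_chi": 1.90},
    "O":  {"val_e": 6, "el_neg_chi": 3.44},
}

def parse_formula(formula):
    """Return {element: count} for a simple empirical formula such as 'SiO2'."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts

def atomic_average(formula, prop):
    """Sum of the elemental property over all atoms divided by the number of atoms."""
    counts = parse_formula(formula)
    n_atoms = sum(counts.values())
    return sum(ELEMENT_DATA[el][prop] * n for el, n in counts.items()) / n_atoms

print(atomic_average("SiO2", "val_e"))       # (4 + 2*6) / 3 ~ 5.33
print(atomic_average("SiO2", "el_neg_chi"))  # (1.90 + 2*3.44) / 3 ~ 2.93
```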
The features for this study were chosen based on factors implemented in previous methods to
predict material hardness. The related factors from these studies were included as features that are
easily calculated from the number of atoms in the empirical formula and elemental characteristics.
The number of valence electrons per bond was included as a factor in Gao et al. (23), Šimůnek
et al. (24), and Mukhanov et al. (34). In this study, the effect of valence electrons on hardness is
considered by a simplified feature of atomic average of valence electrons. Atomic weight was included
in this study since it is used to calculate molar volume, which was a factor in the Mukhanov et al.
study as well (34). Atomic radii (covalent and van der Waals) were included as features in this study
since they are related to the bond length factor in Gao et al. and the molar volume in Mukhanov
et al. (23, 34). Electronegativity was included in the feature set as the atomic average of Pauling
electronegativity for all elements in a material’s empirical formula. This atomic average is a simplified
version of the electronegativity-derived factors of bond electronegativity, stiffness of atoms, and bond
ionicity factors in Li et al. used to predict hardness (35, 36).
In addition to features based on characteristics previously utilized in hardness calculations, three
more features are also included: the first IE, the total number of all electrons, and the atomic number
to mass ratio for each compound. Each of these have a connection to either the atomic radii or the
strength of bonds of these materials. The first IE, or the amount of energy to remove the most loosely
bound valence electron, is directly related to the nature of bonding in a material (61, 62). According
to Dimitrov and Komatsu (61), bond strength can be modeled as an electron binding force related to the
first IE through a Hooke's law potential energy relationship in which the effective ionic radius sets the
length scale. This has not been previously connected to hardness, so it is
included as a novel feature in this study. Since hardness has been previously connected to bond
strength, it makes sense that this could also be a related factor to mechanical properties like hardness.
The total number of electrons (both bonding and nonbonding) is also included in this study
as a feature due to its contribution to atomic radii. As the number of electrons in inner shells
increases, the repulsive force acting on the outermost-shell electrons also increases, in a process known
as shielding. This repulsive force increases the atomic radius, which could directly affect the bond length of a
material. The atomic number to mass number ratio (Z/A) is directly related to the total electron cross-
section, or the effective electronic energy-absorption cross section of an individual element. While it
is commonly used to describe X-ray attenuation, it may also help in this case to describe an effective
area of electronic activity that can contribute in a different context.

ML Models
In this study, nine supervised learning models were built and trained to classify hardness values
in naturally occurring minerals and artificial single crystals. Specifically, I implemented RF and SVMs
to predict Mohs hardness. This section reviews the models, optimization schema, feature importance
calculations, and evaluation criteria utilized in this study.

RFs and Gini Feature Importance
Decision trees use decision-making rules implemented on features or attributes of the data to
predict target properties. Major issues with decision trees are that they can be highly variable and
sensitive to overfitting. To resolve this issue, one can employ ensemble methods, which implement
multiple decision trees. Such methods include extremely randomized trees (extra trees) and RFs,
which is one of the methods employed in this study. In a RFs ensemble, each tree is built on a
bootstrap sample from the training dataset. At each node, a set of attributes is randomly selected
from among all possible attributes to test the best split. For RF classifiers, the best split is chosen by
minimizing node impurity, which leads to a reduction in misclassification. In the end, the value of a target
property is obtained by averaging the predictions of all of the trees.
This study not only predicts Mohs hardness based on feature descriptors, but also identifies
which of these descriptors are most important to making the predictions for several RF models.
To do this, the variable importance metric called Gini importance is employed to find the relative
importances of a set of predictors based on the Gini index. The Gini index is commonly used as
the splitting criterion in tree-based classifiers, as a function to measure the quality of a split. The
reduction of the Gini index brought on by a feature is called the Gini importance or the mean
decrease impurity. The node impurity here is the Gini index

$$G(t) = 1 - \sum_{j} p(j \mid t)^{2},$$

where p(j | t) is the class frequency for class j in node t; the Gini importance of a feature is the total decrease in G(t) produced by splits on that feature (63, 64).


To summarize, the Gini importance for a feature indicates that feature’s overall discriminative
value during the classification. If the decrease is low, then the feature is not important. An irrelevant
variable has an importance of zero. The sum of the importances across all features is equal to 1. In
this study, Gini feature importance is used to gauge the relative importance of a set of compositional-
based features on binary and ternary RF classifications of Mohs hardness values.
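As a minimal sketch (not the study's actual scripts), Gini importances of this kind can be read directly from a fitted scikit-learn random forest through its feature_importances_ attribute; the random arrays below merely stand in for the standardized 11-feature matrix and the hardness class labels, and the tree count is reduced from the 10,000 used later in the chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 622 minerals x 11 compositional features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(622, 11))
y = rng.integers(0, 2, size=622)

rf = RandomForestClassifier(n_estimators=1000, criterion="gini",
                            n_jobs=-1, random_state=0)
rf.fit(X, y)

# feature_importances_ holds the mean decrease in Gini impurity per feature,
# normalized so that the importances sum to 1.
ranked = sorted(enumerate(rf.feature_importances_), key=lambda t: t[1], reverse=True)
for fid, importance in ranked[:4]:
    print(f"Feature {fid}: {importance:.3f}")
```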

SVMs
An SVM is a supervised ML method that fits a boundary or separating hyperplane around
elements with similar features. The input features are mapped to a high-dimension feature space to
produce a linear decision surface or hyperplane. This decision surface is based on a core set of points
or support vectors. An SVM finds the linear hyperplane with the maximum margin in the higher-
dimensional feature space.
Considering a training dataset of n input feature-label pairs (x_i, y_i), i = 1…n, the SVM requires
the solution of the following optimization problem:

$$\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\, w^{T} w + C \sum_{i=1}^{n} \xi_i$$

subject to

$$y_i \left( w^{T} \phi(x_i) + b \right) \geq 1 - \xi_i, \qquad \xi_i \geq 0,$$

where w is the set of weights for each feature (the weight vector), φ is the function that maps the data
into a higher-dimensional space, b is the bias, ξ_i is a slack variable that allows some misclassifications
while maximizing the margin and minimizing the total error, and C is the soft-margin cost parameter
that controls the influence of each support vector (65, 66). Due to the construction of the hyperplane
on these support vectors, it is not sensitive to small changes to the data. This robustness to data
variation means the SVM can generalize quite well. Also, the construction of the hyperplane results
in complex boundaries, which resists overfitting. The SVM algorithms utilized in this study were
implemented by the Scikit-Learn Python package (67).
Feature spaces are not always readily linearly separable. In order to improve separability within
the feature space, one can map the feature space onto a higher dimensional space. By applying
a nonlinear kernel mapping function, SVMs are more easily able to be applied to classification
problems with this type of feature space. One common kernel function K that is utilized for feature
space transformation is the radial basis function (RBF). This study uses the radial basis kernel function
k_RBF given in the equation

$$k_{\mathrm{RBF}}(x, x') = \exp\left(-\gamma \, \lVert x - x' \rVert^{2}\right),$$

where γ is the parameter for the Gaussian RBF (65).


However, not all feature spaces are smooth. For data that may have discontinuities in the feature
space, a better option is to include a variable to adjust for the possible lack of smoothness of the
data. A generalization of the RBF called Matérn includes a variable to adapt to data roughness. This
variable increases the flexibility of the kernel by allowing adaptation of the kernel to properties of the
true underlying functional relation that may not be smooth. In addition to the RBF kernel, this study
employs the Matérn kernel function given in the equation

$$k_{\mathrm{Matern}}(x, x') = \frac{1}{\Gamma(\nu)\, 2^{\nu-1}} \left( \frac{\sqrt{2\nu}}{l} \lVert x - x' \rVert \right)^{\nu} K_{\nu}\!\left( \frac{\sqrt{2\nu}}{l} \lVert x - x' \rVert \right),$$

where Γ is the gamma function (68, 69), l is a scalar length-scale parameter, ν is the roughness factor,
and K_ν is the modified Bessel function of the second kind. As ν approaches infinity, the Matérn kernel
converges to the RBF kernel, which is smooth due to its infinite differentiability.
This lack of smoothness is relevant to the feature space in this study. Mohs is not a linear scale;
it is an ordinal scale in which each step corresponds to a jump in Vickers hardness. It is entirely possible that there are jumps or
discontinuities in feature contributions to the physical phenomena between adjacent Mohs hardness
values. To account for this possibility, RBF and Matérn kernels are employed separately as kernels in
SVMs to observe if there is a difference in performance and the underlying nature of the continuity of
the feature space.
Classifiers built with a SVM are referred to in this work as support vector classifiers (SVCs). The
support vector machine model in this study that was applied to the ternary classification problem
was constructed with a One-vs-One (OVO) decision-making function. An OVO classifier develops
multiple binary classifiers, one for each pair of classes. Specifically, it trains N(N-1)/2 binary
classifiers for an N-way classification problem. In the case of the ternary (3-way) classification
problem, three binary classifiers are generated: Class 0 versus 1, 1 versus 2, and 0 versus 2, each of
which would only distinguish between two classes at a time. To classify a given material, the material
is presented to all three classifiers and majority voting is used to assign a label to the material.
Each feature was standardized by individually centering to the mean and scaling to unit variance
or standard deviation. While RFs are less sensitive to absolute values, SVMs are sensitive to feature
scaling. This is due to the construction of the hyperplane on the distance between the nearest data
points with different classification labels, or support vectors. If one of the dimensions has a
drastically larger value range, it may dominate this distance and thereby affect the hyperplane. For
consistency, all models in this study used this standardized feature space.
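A minimal sketch of the standardized SVCs is given below. Scikit-learn's SVC provides the RBF kernel directly but has no built-in Matérn kernel; one possible workaround, which is an assumption and not necessarily the chapter's actual implementation, is to pass a Matern kernel object from sklearn.gaussian_process.kernels as a callable kernel. The arrays and hyperparameter values are placeholders.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.gaussian_process.kernels import Matern

# Placeholder data standing in for the 622 x 11 feature matrix and ternary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(622, 11))
y = rng.integers(0, 3, size=622)

# RBF-kernel SVC with standardized features (OVO multiclass decision function).
rbf_svc = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=10, gamma=1,
                            decision_function_shape="ovo"))

# Matérn-kernel SVC: the Matern object is callable and returns the kernel
# matrix K(X, Y), which SVC accepts as a custom kernel.
matern_svc = make_pipeline(StandardScaler(),
                           SVC(kernel=Matern(length_scale=1.0, nu=2.5), C=10,
                               decision_function_shape="ovo"))

rbf_svc.fit(X, y)
matern_svc.fit(X, y)
print(rbf_svc.score(X, y), matern_svc.score(X, y))
```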

Study Models
Nine ML models were implemented, which are listed in Table 3 below. Models 1 and 2 are
RBF-kernel SVMs applied to the binary and ternary classifications outlined in Table 1, respectively.
Models 3 and 4 are RFs applied to the binary and ternary classifications, respectively. Model 5 is
a binary RF in which Class 0 (0.991, 4.0] is classified against a combined superclass of Classes
1 (4.0, 7.0] and 2 (7.0, 10.0]. This model is employed to separate materials with low hardness
values from the rest of the dataset. Model 6 is similarly constructed in that it is a binary RF that
separates the medium hardness Class 1 (4.0, 7.0] against the superclass of Classes 0 and 2, (i.e.,
low and high hardness values). Model 7 is similarly constructed to classify materials with high Mohs
hardness values from the combined superclass of low and medium Mohs hardness values. Models
8 and 9 are Matérn-kernel SVMs applied to the binary and ternary classifications, respectively. The
implementations from Scikit-learn were used for all models.

Table 3. The Nine ML Models Utilized in This Study. The Acronyms Following a Class Type
Indicate the Type of Model Used
ID Model
1 Binary RBF SVC
2 Ternary RBF SVC – OVO
3 Binary RF
4 Ternary RF – multiclass
5 Ternary RF – OVR: 0 versus 1, 2
6 Ternary RF – OVR: 1 versus 0, 2
7 Ternary RF – OVR: 2 versus 0, 1
8 Binary Matérn SVC
9 Ternary Matérn SVC – OVO

These nine models allow several comparisons. First, we can compare the effectiveness of the
OVO nature of RBF-kernel SVCs (RBF SVCs) and Matérn-kernel SVCs (Matérn SVCs) to the
inherently multiclass decision-making scheme of RF ensemble methods for binary (Models 1, 3,
and 8) and ternary classifications (Models 2, 4, and 9). Second, we can compare the effectiveness
of an inherently multiclass RF scheme to the One-vs-Rest (OVR) ternary RF classification scheme
for Models 4–7 and determine the best way to classify materials as having low, medium, and high
Mohs hardness values. Finally, comparing feature importances for Models 5–7 tells us which features
contribute most to the classification between low, medium, and high Mohs hardness. This
information can highlight which material properties are most important for low, medium, and high
resistance to plastic deformation.

Grid Optimization for Binary and Ternary SVC: Models 1, 2, 8, and 9


The hyperparameters C and γ for RBF-based SVMs can drastically affect the performance of
a classifier. The hyperparameter C is the cost parameter that implements a penalty for
misclassification. If C is too low for the dataset, a simpler decision function is applied to the
model, but it may underfit the data. This represents a soft margin, which may allow training points to
be ignored or misclassified. If C is too high for the dataset, a more complicated decision function is
applied to the model, but it may overfit the training data.
The hyperparameter γ in the RBF is connected to the spread of influence of a single training
point. If γ is too low, the decision boundary is too broad and does not separate the data well. If γ
is too high, the RBF overfits the training data by forming decision-boundary islands. For any
dataset to be classified, a balance exists between these two hyperparameters. The hyperparameters
C and ν for Matérn-kernel SVMs can likewise drastically affect the performance of a classifier.
The hyperparameter C performs similarly for SVCs with a Matérn kernel or an RBF kernel. The
hyperparameter ν in the Matérn kernel is connected to the smoothness of the fitted function. If ν
is too high, the kernel may be too smooth to capture any underlying discontinuities in the feature
space.
For any dataset to be classified, a balance exists between these hyperparameters. One approach
to find optimal values for these hyperparameters is a grid search method. In a grid search, each
possible hyperparameter combination is applied to the dataset and the accuracy is reported to find
the combination that produces the highest accuracy without overfitting. To optimize the parameters
for the four SVC models in this study, a grid search was performed exploring the effects that various
hyperparameters have on model performance. For the binary and ternary SVCs built with the RBF
kernel, combinations of the hyperparameters C and γ were tested to observe their effects on accuracy.
For the binary and ternary SVCs built with the Matérn kernel, combinations of the hyperparameters
C and ν were tested to observe their effects on accuracy. The ranges for the RBF SVC used in this
paper are C between 10^-2 and 10^13 and γ between 10^-9 and 10^13. The ranges for the Matérn SVC
hyperparameters used in this paper are C between 10^-2 and 10^13 and ν between 0.5 and 6.0. These
should be adequate to find a suitable combination to prevent both under- and overfitting.
For each classifier undergoing grid optimization, the mineral dataset of 622 minerals (each
having 11 feature descriptors) was split into two shuffled stratified subsets: a development set
(66.7%) and an evaluation set (33.3%). These datasets were rearranged to ensure each subset was
representative of the whole with respect to the distribution of Mohs hardness values. The
development subset was used for training while the evaluation subset was used to test each classifier.
This process was repeated for all hyperparameter combinations to perform the two-dimensional grid
optimization.
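A hedged sketch of this grid optimization is shown below. The stratified, shuffled 66.7%/33.3% development/evaluation split follows the description above, but the grids are deliberately coarse, illustrative stand-ins for the much wider logarithmic ranges scanned in the study, and X and y are placeholder arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data for the 622 minerals with 11 feature descriptors each.
rng = np.random.default_rng(0)
X = rng.normal(size=(622, 11))
y = rng.integers(0, 2, size=622)

# Stratified, shuffled split into development (66.7%) and evaluation (33.3%) sets.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=1 / 3, stratify=y, shuffle=True, random_state=0)

grid = ParameterGrid({"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-4, 1, 6)})
best_score, best_params = -np.inf, None
for params in grid:
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", **params))
    model.fit(X_dev, y_dev)
    score = model.score(X_eval, y_eval)   # accuracy on the evaluation subset
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```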

Evaluation Criteria
In this study, all nine ML models are trained to predict Mohs hardness through binary or
ternary classification methods. Their performance is evaluated with five metrics based on the true
positives (Tp), true negatives (Tn), false positives (Fp), and false negatives (Fn) predicted by a given
classification model. The metrics used in this study are accuracy, specificity, precision, recall, and
the F1-score. Accuracy (A) gives the proportion of correct predictions over the whole population.
Precision (P) describes how many of the predicted positives are actually positive. Specificity (S) is the
probability that a classification model will identify true negative results; the higher the specificity, the
lower the probability of false positive results. Recall (R), or sensitivity, indicates the proportion of
actual positives that were predicted as positive; it is the probability that a classification model will
identify true positive results, and the higher the recall, the lower the probability of false negative
results. Typically, precision and recall are considered together through the F1-score (F1). F1 is the
harmonic average of precision and recall and gives equal importance to both. It is an important metric
for datasets with uneven class distribution. The closer F1 is to 1, the closer the model comes to perfect
recall and precision. Overall, these five metrics give great insight into the performance of the
classification models, and their equations are as follows:

$$A = \frac{T_p + T_n}{T_p + T_n + F_p + F_n}, \quad P = \frac{T_p}{T_p + F_p}, \quad S = \frac{T_n}{T_n + F_p}, \quad R = \frac{T_p}{T_p + F_n}, \quad F1 = \frac{2PR}{P + R}.$$
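As a minimal sketch, all of these metrics can be computed with scikit-learn; the label vectors below are placeholders, and specificity is derived from the confusion matrix because scikit-learn has no dedicated specificity function.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Placeholder true and predicted class labels for a binary classifier.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   ", accuracy_score(y_true, y_pred))
print("precision  ", precision_score(y_true, y_pred))
print("recall     ", recall_score(y_true, y_pred))
print("specificity", tn / (tn + fp))      # Tn / (Tn + Fp)
print("F1         ", f1_score(y_true, y_pred))
```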

Results and Discussion


In this study, several machine or statistical learning approaches are presented to quantitatively
study the relationship between material composition and Mohs hardness values, which is a complex
property relating to elasticity and plastic deformation of a material.

Grid Optimization Results for Binary and Ternary SVCs: Models 1, 2, 8, and 9
In this study, grid search optimization on the binary and ternary Matérn and RBF SVC models
was performed. For each classifier undergoing the grid optimization scheme, the mineral dataset of
622 minerals was split into two stratified subsets: a development set (66.7%) and an evaluation set
(33.3%). For the RBF SVC models (Models 1 and 2), grid search optimization was performed by
methodically building and evaluating a model for each hyperparameter combination of C between
10^-2 and 10^13 and γ between 10^-9 and 10^13. For the Matérn kernel SVC models (Models 8 and 9),
grid search optimization was performed by methodically building and evaluating a model for each
hyperparameter combination of C between 10^-2 and 10^13 and ν between 0.5 and 6.0.
The binary RBF SVC classifier Model 1 achieved an accuracy of 86.4% with C = 10 and γ = 1,
as shown in Figure 2a. The ternary RBF SVC Model 2 achieved an accuracy of 85.0% with C = 10
and γ = 1, as shown in Figure 2b. The binary Matérn SVC Model 8 achieved an accuracy of 86.4%
with hyperparameters C = 10 and ν = 2.5, as shown in Figure 2c. The ternary Matérn SVC Model
9 achieved an accuracy of 87.4% with hyperparameters C = 1 and ν = 1, as shown in Figure 2d.
There is a moderate gain in accuracy in classifiers employing the Matérn kernel compared to the RBF
kernel, but both are close. This may suggest that any discontinuities in the feature space are small
enough to approximate with a smooth function like the RBF. For the remainder
of the study, these grid-optimized hyperparameters were utilized for Models 1, 2, 8, and 9. These
prediction accuracies suggest that binary and ternary SVCs built on either RBF or Matérn kernels to
classify the Mohs hardness of ceramics can be helpful in materials discovery and development.

Figure 2. The cross-validation grid optimization accuracies for the (a) binary and (b) ternary RBF SVCs,
Models 1 and 2, respectively. The y axis is the value of C, or the soft-margin cost parameter. The x axis is the
value of γ, the parameter for the Gaussian RBF. The cross-validation grid optimization accuracies for the
(c) binary and (d) ternary Matérn SVCs, Models 8 and 9, respectively. The color represents the model
accuracy.

Figure 3. Performance from models trained under 500 stratified train-test splits: (a) workflow of model
performance; (b) the specificity and recall scores for all binary models; (c) the recall scores for ternary
models; (d) the specificity scores for ternary models. The bar height corresponds to the average respective
score. The black error bars correspond to the standard deviation of the respective metric.

Model Performance on Naturally Occurring Minerals Dataset


To determine the performance of the models utilized in this study, all models were constructed
with the naturally occurring mineral dataset, which was split 500 times into three-fold training and
test subsets. The workflow is shown in Figure 3a. Upon completion of all 500 splits, the F1-score,
precision, recall, specificity, and accuracy were calculated based on the predicted and known values
of the test subsets. Figure 3b shows the weighted F1, precision, and accuracy scores for all nine
models over 500 splits, along with their standard deviations. For Figure 3b and 3c, the hashed bars
represent the same performance metric across each model. For Figure 3d and 3e, the hashed bars
represent the same class across each model. The height of the bars is the magnitude of the respective
metric for that attribute. The x-axis labels for Figure 3b–3e correspond to the Model ID in Table 3.
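A minimal sketch of this repeated-split evaluation is shown below, under the assumption that each of the 500 splits is a stratified, shuffled 2/3 training / 1/3 test partition; the data, the choice of a random forest, and the tree count are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedShuffleSplit

# Placeholder data for the 622 minerals and ternary hardness classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(622, 11))
y = rng.integers(0, 3, size=622)

splitter = StratifiedShuffleSplit(n_splits=500, test_size=1 / 3, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X, y):
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])
    scores.append(f1_score(y[test_idx], y_pred, average="weighted"))

print(f"weighted F1: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```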
All of the classifiers were able to classify the vast majority of Mohs hardness values with weighted
F1, precision, and accuracy scores of 0.79 or higher. The ternary Matérn SVC (Model 9)
underperforms its ternary RBF SVC (Model 2) and ternary multiclass RF (Model 4) counterparts.
Also, the ternary OVR RF models (Models 5–7) appear to perform similarly to or better than the
ternary multiclass RF model (Model 4), with scores >0.82. For the specificity and recall of the binary
models shown in Figure 3c, the scores are good (>0.7) in all models except Model 7. In Model 7, there
is a drastic decrease in specificity to 0.2, which suggests that the model overpredicted false positives.
This is unlikely to be due to the model predicting from the features themselves; rather, it reflects the
bias in the training data. Materials in Class 2 (7.0, 10.0] are underrepresented, accounting for only 6%
of the entire dataset. This is due to the natural rarity of materials in that range.
Overall, Models 5 and 6 had the strongest prediction performance across all 5 metrics. Model 5
is a binary RF in which Class 0 (0.991, 4.0] is classified against a combined superclass of Classes 1
(4.0, 7.0] and 2 (7.0, 10.0]. This model is employed to separate materials with low hardness values
from the rest of the dataset. Model 6 is similarly constructed in that it is a binary RF that separates
the medium hardness Class 1 (4.0, 7.0] against the superclass of Classes 0 and 2 (i.e., low and high
hardness values). These one-vs-rest binary classification of ternary bins best captured the underlying
patterns of material hardness for naturally occurring ceramic minerals. However, given the small
size of the population in the Class 2 classification bins, both of these models could condense into a
pseudobinary classification task of Class 0 (0.991, 4.0] and Class 1 (4.0, 10.0], where the separation
is at Mohs value 4.0 instead of 5.5 as in the true binary bins in this study. The effect of the data bias that
plagued Model 7 is less pronounced in Models 5 and 6 because their class populations are closer to parity.
Models 5 and 6 may correspond more to the grouping presented in the correlation between
the Gao calculated bond lengths and calculated hardness values than Šimůnek and Vackář’s more
binarized grouping found in Model 3 on this dataset of naturally occurring minerals. Model 3 is a
binary RF in which Class 0 (0.991, 5.5] is classified against a combined superclass of Class 1 (5.5,
10.0]. This model is also employed to separate materials with low hardness values from the rest of
the dataset. Even with a specificity closer to 0.74, Model 3 still performs well, with a recall score ±
std. dev. of 0.8633 ± 0.02024. Overall, these models yield insight into the connection between bond
characteristics and hardness. According to the Gao et al. study (23), the three factors connecting the
hardness of polar covalent crystals to bond behavior are the bond density or electronic density, bond
length, and degree of covalent bonding. According to the Šimůnek and Vackář study (59), the factors
connecting the hardness of binary covalently bonded solids are crystal valence electron density,
number of bonds, and bond length between pairs of atoms. From the performance of Models 3, 5,
and 6, all of these factors are closely related to the hardness of naturally occurring ceramic minerals.
To further understand the impact of these factors in these RF models, feature importances can be
used to gauge their relative importances and increase understanding of the physical basis in the
hardness regimes of these ceramic minerals.
Next, the effectiveness of several binary classifiers was evaluated using the quantitative variables
of the true positive rate, which represents the fraction of Mohs hardness values in the positive class
that are correctly classified, and the false positive rate, which represents the fraction of negative-class
values incorrectly assigned to the positive class.
operating characteristic (ROC) curves were calculated. ROC curves plot the true positive rate for a
binary classifier as a function of its false positive rate to gauge model performance. The area under the
curve (AUC) is a quality measure of the classifier’s ability to correctly classify a given material. The
ideal AUC is unity, or 1. To compare the effectiveness of the binary (Models 1, 3, and 8) and OVR
(Models 5, 6, 7) superclass classifiers used in this study, the author implemented ROC curves and
calculated the areas under the curves. These curves are given in Figure 4 below.

Figure 4. ROC plots using the false positive rate and true positive rate are used to evaluate the ability of
classifiers to predict Mohs hardness values. (a) Comparison of binary RBF SVC, RF, Matérn SVC, Models
1, 3, and 8, respectively. (b) Comparison of the ternary OVR classifiers, Models 5, 6, and 7, which predict
low, medium, and high hardness values, respectively.
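For a single binary classifier, the ROC curve and its AUC can be computed as in the sketch below; the data and model are placeholders, and the predicted probability of the positive class is used as the ranking score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Placeholder data for a binary (or one-vs-rest) hardness classification task.
rng = np.random.default_rng(0)
X = rng.normal(size=(622, 11))
y = rng.integers(0, 2, size=622)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]           # probability of the positive class
fpr, tpr, thresholds = roc_curve(y_te, scores)   # false and true positive rates
print("AUC:", roc_auc_score(y_te, scores))
```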

Both the binary and ternary OVR superclass classifiers were able to discriminate the vast
majority of naturally occurring minerals with an AUC of 0.88 or greater. The ROC plots in Figure
4b illustrate the similar performance of ternary OVR superclass classifiers when applied to the same
set of compositional features. This suggests that compositional predictors
developed for these materials can be generally applied with reasonable reliability to other single
crystalline materials across a wide-ranging compositional and structural space.

Feature Importances
To determine feature importance for the RF-based models utilized in this study (Models 3–7),
10,000 trees were constructed for a single forest. Upon completion of each forest, the list of input
features with their representative Gini importances was returned. Figure 5a shows the relative
importance of different atomic features for binary (Model 3) and ternary (Model 4) RFs constructed
as inherently multiclass. Figure 5b shows the relative importances of different atomic features for
ternary RFs constructed as OVR classifiers (Models 5–7). The x-axis labels correspond to the feature
IDs outlined in Table 2. The heights of the color bars in Figures 5a and 5b indicate the Gini importance
of each feature. The colors represent the different models.
The most important features vary on three points: (1) whether the model is binary or ternary,
(2) whether the model is constructed as multiclass or binary OVR, and (3) the regime of hardness
classified. For the binary RF model (Model 3) in Figure 5a, the four most important features are
Features 5, 0, 3, and 6 with feature importances of 0.124, 0.129, 0.120, and 0.111, respectively.
These features correspond to the atomic average of IE, the total of the number of electrons in the
empirical formula, the atomic average of the valence electrons, and the atomic average of the Pauling
electronegativities, respectively. For the ternary multiclass model (Model 4) in Figure 5a, the four
most important features are Features 7, 8, 3, and 5 with feature importances of 0.129, 0.113, 0.112,
and 0.111, respectively. These features correspond to the atomic average of the van der Waals atomic
radii, the atomic average of the covalent atomic radii, the atomic average of the valence electrons, and
the atomic average of IE, respectively.

Figure 5. The Gini feature importances of a 10,000 tree RF for (a) binary (Model 3) and ternary multiclass
(Model 4) models and (b) OVR binary RF classifiers (Models 5–7) with low, medium, and high hardness
ternary class as a positive class, respectively.

For the ternary-bin, OVR binary RF in which Class 0 (0.991, 4.0] is classified against a
combined superclass of Classes 1 (4.0, 7.0] and 2 (7.0, 10.0] (Model 5) in Figure 5b, the four
most important features are Features 5, 8, 3, and 7 with feature importances of 0.118, 0.118, 0.116,
and 0.113, respectively. These features correspond to the atomic average of IE, the atomic average
of the covalent atomic radii, the atomic average of the valence electrons, and the atomic average
of the van der Waals atomic radii, respectively. For the ternary-bin, OVR binary RF in which the
medium hardness Class 1 (4.0, 7.0] is classified against the superclass of Classes 0 and 2 (Model 6) in Figure
5b, the four most important features are Features 7, 8, 3, and 5 with feature importances of 0.140,
0.117, 0.111, and 0.110, respectively. These features correspond to the atomic average of the van der
Waals atomic radii, the atomic average of the covalent atomic radii, the atomic average of the valence
electrons, and the atomic average of IE, respectively. For the ternary-bin, one-vs-rest binary RF in
which the high hardness Class 2 (7.0, 10.0] is classified against the superclass of Classes 0 and 1 (Model 7) in
Figure 5b, the four most important features are Features 10, 1, 0, and 5 with feature importances of
0.121, 0.108, 0.106, and 0.101, respectively. These features correspond to the atomic average of the
elemental densities, the total elemental density, the total number of electrons, and the atomic average
of IE, respectively.
Earlier in Figures 3b–3e, it was shown that Models 5 and 6 are similar and perform well on the
dataset of naturally occurring crystalline ceramic minerals. Here in Figure 5b, it can be found that
these two models share the top 4 features: 7, 8, 3, and 5. These features correspond to the atomic
average of the van der Waals atomic radii, the atomic average of the covalent atomic radii, the atomic
average of the valence electrons, and the atomic average of IE, respectively. These factors correspond
directly to material characteristics previously identified as contributors to material hardness. The
number of valence electrons per bond was included as a factor in Gao et al.
(23), Šimůnek et al. (24), and Mukhanov et al. (34). Atomic radii (both covalent and van der Waals)
are related to the bond length factor in Gao et al. and the molar volume in Mukhanov et al. (23,
34). The first IE is related to the bond strength of the material (61, 62), which Šimůnek and Vackář
attribute as a major factor in hardness (59). Three of the four importances (Features 8, 3, and 5) are
the same between Models 5 and 6. However, Feature 7, or the atomic average of the van der Waals
atomic radii, varies greatly between the two models. For Model 5, the importance of Feature 7 is
0.118. For Model 6, the importance of Feature 7 is 0.140. From Model 6 to Model 5, the importance
of this feature drops by 15.7%. Therefore, it may then follow that a major difference in performance
between these models on a validation dataset would likely depend on this feature.

Model Validation with Validation Set


To determine the generalizability of the models to artificial ceramic crystals, all models were
trained with the naturally occurring mineral dataset and tested on a validation dataset. The validation
dataset consists of 52 artificial single crystals, with 15 monoclinic, 8 tetragonal, 7 hexagonal, 6
orthorhombic, 4 cubic, and 3 rhombohedral crystal structures. The F1, precision, recall, specificity
and accuracy scores were calculated based on the validation set. This workflow is diagrammed in
Figure 6a. Figure 6b shows the weighted F1, precision, and accuracy scores for all nine models. For
Figure 6b and 6c, the hashed bars represent the same performance metric across each model. For
Figure 6d and 6e, the hashed bars represent the same class across each model. The height of the
bars is the magnitude of the respective metric for that attribute. The x-axis labels for Figure 6b–6e
correspond to the Model IDs in Table 3.
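A sketch of how such validation-set scores can be computed is given below; the data are random placeholders (not the mineral and artificial-crystal sets of this study), and specificity is derived from the confusion matrix because scikit-learn does not provide it directly.

# Illustrative sketch only: scoring a classifier on a held-out validation set,
# as in the workflow of Figure 6a; the data here are random placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

rng = np.random.default_rng(0)
X_train, y_train = rng.random((500, 11)), rng.integers(0, 2, size=500)   # training set (e.g., minerals)
X_val, y_val = rng.random((52, 11)), rng.integers(0, 2, size=52)         # validation set (e.g., artificial crystals)

model = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_val)

scores = {
    "accuracy": accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred, average="weighted"),
    "recall": recall_score(y_val, y_pred, average="weighted"),
    "f1": f1_score(y_val, y_pred, average="weighted"),
}
# Specificity (true-negative rate) is not built into scikit-learn; for a binary
# model it follows from the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
scores["specificity"] = tn / (tn + fp)
print(scores)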
All models shown in Figures 6b–6e have trends similar to those for the naturally occurring dataset but
with lower performance values. Such a dip in performance can be expected. In the growth of artificial
crystals, different crystal phases can be created that may produce crystals with higher or lower Mohs
hardness values than would occur naturally. Growth conditions were not factored into the feature set
in this study. To account for this source of variability, it may be useful to include crystal growth
conditions as a feature in future ML models.
According to Figure 6b, all models have F1, precision, and accuracy scores greater than 0.70.
Model 5 is the strongest performing model with F1, precision, and accuracy scores around 92%.
Model 7 also shows strong performance metrics. However, this model is affected by data bias due to
the small number of samples in the positive class. This is also seen in the low recall score in Figure 6c.
Model 5 is stronger than Model 7 due to the reduced data bias from the more evenly matched
populations of the positive and negative classes. Therefore, the author has strong confidence that the
models are predicting based on the feature set rather than on the data bias of imbalanced classes.
As discussed earlier, Models 5 and 6 are basically pseudo-inverses of each other due to the
underrepresentation of naturally occurring ceramic minerals with Mohs hardness values greater than
7.0. In this case, both of these models could condense into a pseudobinary classification task of Class
0 (0.991, 4.0] and Class 1 (4.0, 10.0], where the separation is at Mohs value 4.0 instead of 5.5 as
in the true binary bins in this study. However, on the validation dataset, Model 5 performs much
better than Model 6. A major reason for this may be found in the feature importance analysis. The
importance of Feature 7, the atomic average of the van der Waals atomic radii, is about 15.7% lower
in Model 5 than in Model 6. Model 5 appears to place enough importance on that factor to produce
reasonable predictions, whereas Model 6 appears to overrely on the atomic average van der Waals
atomic radius in predicting hardness values for artificial materials. This may be
due to the sensitivity of van der Waals bonding interactions at the solid-liquid interface to growth
conditions during the crystallization process, such as temperature and solvents (70, 71).

Figure 6. (a) Workflow of performance testing for model validation. (b) The specificity and recall scores for
all binary models. (c) The recall scores for ternary models. (d) The specificity scores for ternary models. The
bar height corresponds to the average respective score. All models were trained on the naturally occurring
mineral dataset and validated with the artificial single crystal dataset.

Considerations
Note that the prediction models in this study have only considered a small number of
composition-based factors and other easily accessible attributes for an extremely varied chemical
space. This is a reasonable first screening step that allows us to efficiently gauge important factors that
may contribute to material hardness. However, to make larger generalizations about the nature of
hardness, more features would have to be considered and then narrowed down with feature selection
methods. Specifically, feature selection methods such as principal component analysis, univariate
selection, and recursive feature elimination across a larger set of features would yield a deeper
understanding about the nature of factors that contribute to material hardness.
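As one concrete example of such a feature selection step, recursive feature elimination driven by a random forest can be set up in scikit-learn as sketched below; the data are placeholders, and this is not the analysis performed in this study.

# Illustrative sketch only: recursive feature elimination, one of the
# feature-selection strategies mentioned above, driven by a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.random((500, 11))                 # stand-in for composition-based features
y = rng.integers(0, 3, size=500)          # stand-in for the ternary hardness classes

rfe = RFE(estimator=RandomForestClassifier(n_estimators=500, random_state=0),
          n_features_to_select=4)
rfe.fit(X, y)
print("Retained feature IDs:", [i for i, keep in enumerate(rfe.support_) if keep])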
In addition, predicting the hardness of superhard (M > 9) materials would be particularly
problematic with the current dataset. Materials with a Mohs hardness of less than 2 or greater than
9 are underrepresented, which leads to an imbalanced dataset. The implementation of data
handling methods specifically constructed for handling imbalanced datasets, such as oversampling,
undersampling, or synthetic minority oversampling technique, may allow more accurate prediction
of the Mohs hardness in those regimes. This effect extends to ceramic materials in the 7–10 Mohs
range. While there are more minerals included in this range, there is still a great imbalance that
produces a data bias in the training of statistical and ML models. Therefore, to extend this application
to predict minerals in the 7–10 Mohs range in the future, more artificial materials would need to be
included. These approaches would allow possible design and prediction of novel superhard crystal
ceramics.
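A minimal sketch of one such resampling strategy is given below, assuming the separate imbalanced-learn package is available; the data are placeholders, and SMOTE is only one of the options mentioned above.

# Illustrative sketch only, assuming the imbalanced-learn package: oversampling
# the minority hardness classes with SMOTE before training. X and y are
# placeholders for the composition features and class labels.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 11))
y = rng.choice([0, 1, 2], size=500, p=[0.7, 0.25, 0.05])   # deliberately imbalanced classes

X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)     # synthetic minority oversampling
rf = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X_res, y_res)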
Please note that our prediction models have only considered single crystalline materials. For
other types of materials, different factors affect hardness. For metals, hardness is affected by structural
factors like dislocation entanglements (16). Also in metals, there is a connection between bulk
modulus, shear modulus, hardness, and ductility. This connection has previously been referenced by
Chen (72), Tabor (73), and Pugh (74), among others. Due to the delocalized nature of the bonding
in metals, plastic deformations locally accumulate before fracture, resulting in ductility and reduced
hardness. To explore this effect, the inclusion in the dataset of metals that span the ductile-to-brittle
transition could offer insight into this connection and the nature of the bonding strength in these
materials.
For plastics, elastic and plastic properties depend on chain length, degree of cross-linking, and
the degree of crystallinity of the material. Inclusion of nanomaterials into single-crystalline matrices
has also been shown to increase hardness. These effects were ignored in this study but may also be an avenue
to consider in future studies. The continued growth of data repositories based on experimental
characterization of materials is expected to enable the development of models for mechanical and
microstructural material properties not covered in this study, specifically fracture toughness, thermal
stability, bulk modulus, shear modulus, and work hardening, among others.

Conclusions
This study shows that comparative material properties like Mohs hardness can be modeled with
ML algorithms using features based solely on material composition. The results show that RFs and
SVMs are able to produce reasonable predictions of material properties. They also show that different
features are relatively important for predicting Mohs hardness values. These features include the
atomic average of the van der Waals atomic radii, the atomic average of the covalent atomic radii, the
atomic average of the valence electrons, and the atomic average of IE among others. These features
were previously included in separate studies but were combined into this study to further understand
their interrelated physical contributions to materials hardness (23, 34, 59). In conclusion, I have
demonstrated that an ML model can be useful in classifying comparative material properties. The
methodology described in this study could be applied to other types of materials for accelerated
design and materials science discovery of novel materials.

Acknowledgments
The author acknowledges the support provided by the National Science Foundation under grant
numbers HRD-1547757 (CREST-BioSS Center) and HRD-1647013. I would also like to
acknowledge and show my appreciation to the Vanderbilt Advanced Computing Center for Research
and Education (ACCRE) as well as Professor Kelly Holley-Bocklemann and Dr. Caleb Wheeler.

References
1. Plinninger, R. J.; Spaun, G.; Thuro, K. Prediction and Classification of Tool Wear in Drill and
Blast Tunnelling. In Proceedings of 9th Congress of the International Association for Engineering
Geology and the Environment [Online]; Engineering Geology for Developing Countries:
Durban, South Africa, 2002; pp 16–20. http://www.geo.tum.de/people/thuro/pubs/2002_
iaeg_durban_pli.pdf (accessed Jan 26, 2019).
2. Hoseinie, S. H.; Ataei, M.; Mikaiel, R. Comparison of Some Rock Hardness Scales Applied in
Drillability Studies. Arab. J. Sci. Eng. 2012, 37, 1451–1458.
3. Thuro, K.; Plinninger, R. J. Hard Rock Tunnel Boring, Cutting, Drilling and Blasting: Rock
Parameters for Excavatability. In 10th ISRM Congress; International Society for Rock Mechanics
and Rock Engineering: Sandton, South Africa, 2003.
4. Ellecosta, P.; Schneider, S.; Kasling, H.; Thuro, K. Hardness–A New Method for
Characterising the Interaction of TBM Disc Cutters and Rocks? In 13th ISRM International
Congress of Rock Mechanics; International Society for Rock Mechanics and Rock Engineering:
Montreal, Canada, 2015.
5. Moore, M. A. The Relationship between the Abrasive Wear Resistance, Hardness and
Microstructure of Ferritic Materials. Wear 1974, 28, 59–68.
6. Axén, N.; Jacobson, S.; Hogmark, S. Influence of Hardness of the Counterbody in Three-Body
Abrasive Wear — an Overlooked Hardness Effect. Tribol. Int. 1994, 27, 233–241.
7. Jefferies, S. R. Abrasive Finishing and Polishing in Restorative Dentistry: A State-of-the-Art
Review. Dent. Clin. North Am. 2007, 51, 379–397.
8. Balaceanu, M.; Petreus, T.; Braic, V.; Zoita, C. N.; Vladescu, A.; Cotrutz, C. E.; Braic, M.
Characterization of Zr-Based Hard Coatings for Medical Implant Applications. Surf. Coatings
Technol. 2010, 204, 2046–2050.
9. Parsons, J. R.; Lee, C. K.; Langrana, N. A.; Clemow, A. J.; Chen, E. H. Functional and
Biocompatible Intervertebral Disc Spacer Containing Elastomeric Material of Varying Hardness. U.S.
Patent 5,545,229, December 15, 1992.
10. Okazaki, Y.; Ito, Y.; Ito, A.; Tateishi, T. Effect of Alloying Elements on Mechanical Properties
of Titanium Alloys for Medical Implants. Mater. Trans. JIM 1993, 34, 1217–1222.
11. Kanyanta, V. Hard, Superhard and Ultrahard Materials: An Overview. In Microstructure-
Property Correlations for Hard, Superhard, and Ultrahard Materials; Springer International
Publishing: Cham, Switzerland, 2016; pp 1–23.
12. Hwang, D. K.; Moon, J. H.; Shul, Y. G.; Jung, K. T.; Kim, D. H.; Lee, D. W. Scratch Resistant
and Transparent UV-Protective Coating on Polycarbonate. J. Sol-Gel Sci. Technol. 2003, 26,
783–787.

13. Luber, J. R.; Bunick, F. J. Protective Coating for Tablet. Official Gazette of the United States
Patent & Trademark Office Patents 1249(3), August 21, 2001.
14. Tabor, D. Mohs’s Hardness Scale - A Physical Interpretation. Proc. Phys. Soc. Sect. B 1954, 67,
249–257.
15. Tabor, D. The Physical Meaning of Indentation and Scratch Hardness. Br. J. Appl. Phys. 1956,
7, 159–166.
16. Tabor, D. The Hardness of Solids. Rev. Phys. Technol. 1970, 1, 145–179.
17. Li, K.; Yang, P.; Niu, L.; Xue, D. Group Electronegativity for Prediction of Materials Hardness.
J. Phys. Chem. A 2012, 116, 6911–6916.
18. Broz, M. E.; Cook, R. F.; Whitney, D. L. Microhardness, Toughness, and Modulus of Mohs
Scale Minerals. Am. Mineral. 2006, 91, 135–142.
19. Gilman, J. J. Chemistry and Physics of Mechanical Hardness; John Wiley & Sons: Hoboken, NJ,
2009; Vol. 5.
20. Oganov, A. R.; Lyakhov, A. O. Towards the Theory of Hardness of Materials. Orig. Russ. Text
© A.R. Oganov, A.O. Lyakhov, J. Superhard Mater. 2010, 32, 3–8.
21. Li, K.; Yang, P.; Niu, L.; Xue, D. Hardness of Inorganic Functional Materials. Rev. Adv. Sci.
Eng. 2012, 1, 265–279.
22. Cohen, M. L. Predicting Useful Materials. Science 1993, 261, 307–309.
23. Gao, F.; He, J.; Wu, E.; Liu, S.; Yu, D.; Li, D.; Zhang, S.; Tian, Y. Hardness of Covalent
Crystals. Phys. Rev. Lett. 2003, 91, 015502.
24. Šimůnek, A.; Vackář, J. Hardness of Covalent and Ionic Crystals: First-Principle Calculations.
Phys. Rev. Lett. 2006, 96, 085501.
25. Inal, K.; Neale, K. W. High Performance Computational Modelling of Microstructural
Phenomena in Polycrystalline Metals. In Advances in Engineering Structures, Mechanics &
Construction; Springer Netherlands: Dordrecht, 2006; pp 583–593.
26. Vo, N. Q.; Averback, R. S.; Bellon, P.; Caro, A. Limits of Hardness at the Nanoscale: Molecular
Dynamics Simulations. Phys. Rev. B 2008, 78, 241402.
27. Van Swygenhoven, H. Grain Boundaries and Dislocations. Science 2002, 296, 66–67.
28. Botu, V.; Ramprasad, R. Adaptive Machine Learning Framework to Accelerate Ab Initio
Molecular Dynamics. Int. J. Quantum Chem. 2015, 115, 1074–1083.
29. Hansen, K.; Biegler, F.; Ramakrishnan, R.; Pronobis, W.; von Lilienfeld, O. A.; Müller, K.-R.;
Tkatchenko, A. Machine Learning Predictions of Molecular Properties: Accurate Many-Body
Potentials and Nonlocality in Chemical Space. J. Phys. Chem. Lett. 2015, 6, 2326–2331.
30. Behler, J. Perspective: Machine Learning Potentials for Atomistic Simulations. J. Chem. Phys.
2016, 145, 170901.
31. Li, Z.; Kermode, J. R.; De Vita, A. Molecular Dynamics with On-the-Fly Machine Learning of
Quantum-Mechanical Forces. Phys. Rev. Lett. 2015, 114, 096405.
32. Zeng, Y.; Li, Q.; Bai, K. Prediction of Interstitial Diffusion Activation Energies of Nitrogen,
Oxygen, Boron and Carbon in Bcc, Fcc, and Hcp Metals Using Machine Learning. Comput.
Mater. Sci. 2018, 144, 232–247.
33. Wu, H.; Mayeshiba, T.; Morgan, D. High-Throughput Ab-Initio Dilute Solute Diffusion
Database. Sci. Data 2016, 3, 160054.

34. Mukhanov, V. A.; Kurakevych, O. O.; Solozhenko, V. L. Thermodynamic Aspects of
Materials’ Hardness: Prediction of Novel Superhard High-Pressure Phases. High Press. Res.
2008, 28, 531–537.
35. Li, K.; Xue, D. Estimation of Electronegativity Values of Elements in Different Valence States.
J. Phys. Chem. A 2006, 110, 11332–11337.
36. Li, K.; Wang, X.; Zhang, F.; Xue, D. Electronegativity Identification of Novel Superhard
Materials. Phys. Rev. Lett. 2008, 100, 235504.
37. Rajan, K. Materials Informatics: The Materials “Gene” and Big Data. Annu. Rev. Mater. Res.
2015, 45, 153–169.
38. Butler, K. T.; Davies, D. W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine Learning for
Molecular and Materials Science. Nature 2018, 559, 547.
39. Ward, L.; Agrawal, A.; Choudhary, A.; Wolverton, C. A General-Purpose Machine Learning
Framework for Predicting Properties of Inorganic Materials. npj Comput. Mater. 2016, 2,
16028.
40. Ward, L.; Wolverton, C. Atomistic Calculations and Materials Informatics: A Review. Curr.
Opin. Solid State Mater. Sci. 2017, 21, 167–176.
41. Mueller, T.; Kusne, A. G.; Ramprasad, R. Machine Learning in Materials Science: Recent
Progress and Emerging Applications. Rev. Comput. Chem. 2016, 29, 186–273.
42. Liu, Y.; Zhao, T.; Ju, W.; Shi, S. Materials Discovery and Design Using Machine Learning. J.
Mater. 2017, 3, 159–177.
43. Curtarolo, S.; Hart, G. L. W.; Buongiorno Nardelli, M.; Mingo, N.; Sanvito, S.; Levy, O. The
High-Throughput Highway to Computational Materials Design. Nat. Mater. 2013, 12.
44. Kim, C.; Pilania, G.; Ramprasad, R. From Organized High-Throughput Data to
Phenomenological Theory Using Machine Learning: The Example of Dielectric Breakdown.
Chem. Mater. 2016, 28, 1304–1311.
45. Kim, C.; Pilania, G.; Ramprasad, R. Machine Learning Assisted Predictions of Intrinsic
Dielectric Breakdown Strength of ABX3 Perovskites. J. Phys. Chem. C 2016, 120,
14575–14580.
46. Ghiringhelli, L. M.; Vybiral, J.; Levchenko, S. V.; Draxl, C.; Scheffler, M. Big Data of Materials
Science: Critical Role of the Descriptor. Phys. Rev. Lett. 2015, 114, 105503.
47. Goldsmith, B. R.; Boley, M.; Vreeken, J.; Scheffler, M.; Ghiringhelli, L. M. Uncovering
Structure-Property Relationships of Materials by Subgroup Discovery. New J. Phys. 2017, 19,
013031.
48. Ghiringhelli, L. M.; Vybiral, J.; Ahmetcik, E.; Ouyang, R.; Levchenko, S. V; Draxl, C.;
Scheffler, M. Learning Physical Descriptors for Materials Science by Compressed Sensing. New
J. Phys. 2017, 19, 023017.
49. Oliynyk, A. O.; Antono, E.; Sparks, T. D.; Ghadbeigi, L.; Gaultois, M. W.; Meredig, B.; Mar,
A. High-Throughput Machine-Learning-Driven Synthesis of Full-Heusler Compounds. Chem.
Mater. 2016, 28, 7324–7331.
50. Dey, P.; Bible, J.; Datta, S.; Broderick, S.; Jasinski, J.; Sunkara, M.; Menon, M.; Rajan, K.
Informatics-Aided Bandgap Engineering for Solar Materials. Comput. Mater. Sci. 2014, 83,
185–195.

51. Ward, L.; Agrawal, A.; Choudhary, A.; Wolverton, C. A General-Purpose Machine Learning
Framework for Predicting Properties of Inorganic Materials. npj Comput. Mater.
2016 [Online]. https://www.nature.com/articles/npjcompumats201628
52. Lee, J.; Seko, A.; Shitara, K.; Tanaka, I. Prediction Model of Band-Gap for AX Binary Compounds
by Combination of Density Functional Theory Calculations and Machine Learning Techniques. 2015,
arXiv:1509.00973. arXiv.org e-Print archive. https://arxiv.org/abs/1509.00973 (accessed June
27, 2019).
53. Pilania, G.; Mannodi-Kanakkithodi, A.; Uberuaga, B. P.; Ramprasad, R.; Gubernatis, J. E.;
Lookman, T. Machine Learning Bandgaps of Double Perovskites. Nature Sci. Rep. 2015.
54. Pilania, G.; Gubernatis, J. E.; Lookman, T. Multi-Fidelity Machine Learning Models for
Accurate Bandgap Predictions of Solids. Comput. Mater. Sci. 2017, 129, 156–163.
55. CRC. CRC Handbook of Chemistry and Physics, 98th ed.; Rumble, J. R., Ed.; CRC Press/Taylor
& Francis: Boca Raton, FL, 2018.
56. Downs, R. T.; Hall-Wallace, M. The American Mineralogist Crystal Structure Database. Am.
Mineral. 2003, 88, 247–250.
57. Garnett, J. Prediction of Mohs Hardness with Machine Learning Methods Using Compositional
Features. 2019. https://data.mendeley.com/datasets/jm79zfps6b/1 (accessed Jan 26, 2019).
58. Gao, F. Hardness Estimation of Complex Oxide Materials. Phys. Rev. B 2004, 69, 094113.
59. Šimůnek, A.; Vackář, J. Hardness of Covalent and Ionic Crystals: First-Principle Calculations.
Phys. Rev. Lett. 2006, 96, 085501.
60. Berger, M. J.; Hubbell, J. H. NIST X-Ray and Gamma-Ray Attenuation Coefficients and Cross
Sections Database; U.S. Department of Commerce: Gaithersburg, MD 1990.
61. Dimitrov, V.; Komatsu, T. Correlation among Electronegativity, Cation Polarizability, Optical
Basicity and Single Bond Strength of Simple Oxides. J. Solid State Chem. 2012, 196, 574–578.
62. Plenge, J.; Kühl, S.; Vogel, B.; Müller, R.; Stroh, F.; von Hobe, M.; Flesch, R.; Rühl, E. Bond
Strength of Chlorine Peroxide. J. Phys. Chem. A 2005, 109, 6730–6734.
63. Nembrini, S.; König, I. R.; Wright, M. N. The Revival of the Gini Importance? Bioinformatics
2018, 34, 3711–3718.
64. Ishwaran, H. The Effect of Splitting on Random Forests. Mach. Learn. 2015, 99, 75–118.
65. Boser, B. E.; Guyon, I. M.; Vapnik, V. N. A Training Algorithm for Optimal Margin Classifiers.
In Proceedings of the fifth annual workshop on Computational learning theory - COLT ’92; ACM
Press: New York, NY, 1992; pp 144–152.
66. Yang, C.; Fernandez, C. J.; Nichols, R. L.; Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical
Guide to Support Vector Classification.
67. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.;
Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-Learn: Machine Learning in Python. J. Mach.
Learn. Res. 2011, 12, 2825–2830.
68. Rasmussen, C. E.; Williams, C. K. I. Gaussian Processes for Machine Learning; MIT Press:
Cambridge, MA; 2006.
69. Matérn, B. Spatial Variation. In Lecture Notes in Statistics; Springer-Verlag New York: New York,
NY, 2013; Vol. 36.

70. Stoica, C.; Verwer, P.; Meekes, H.; van Hoof, P. J. C. M.; Kaspersen, F. M.; Vlieg, E.
Understanding the Effect of a Solvent on the Crystal Habit. Cryst. Growth Des. 2004, 4,
765–768.
71. Liu, Y.; Lai, W.; Yu, T.; Ma, Y.; Kang, Y.; Ge, Z. Understanding the Growth Morphology of
Explosive Crystals in Solution: Insights from Solvent Behavior at the Crystal Surface. RSC Adv.
2017, 7, 1305–1312.
72. Chen, X.-Q.; Niu, H.; Li, D.; Li, Y. Modeling Hardness of Polycrystalline Materials and Bulk
Metallic Glasses. Intermetallics 2011, 19, 1275–1281.
73. Tabor, D. The Hardness of Metals; Oxford University Press: Oxford, United Kingdom, 2000.
74. Pugh, S. F. XCII. Relations between the Elastic Moduli and the Plastic Properties of
Polycrystalline Pure Metals. London, Edinburgh, Dublin Philos. Mag. J. Sci. 1954, 45, 823–843.

Chapter 3

High-Dimensional Neural Network Potentials for Atomistic Simulations
Matti Hellström1 and Jörg Behler*,2

1Software for Chemistry & Materials BV, De Boelelaan 1083, 1081HV Amsterdam,
The Netherlands
2Universität Göttingen, Institut für Physikalische Chemie, Theoretische Chemie,
Tammannstrasse 6, 37077 Göttingen, Germany
*E-mail: joerg.behler@uni-goettingen.de

Machine-learning methods have become increasingly popular for describing
potential energy surfaces for molecular and materials simulations, and they are
even beginning to challenge the present-day dominance of force fields for this task.
This chapter reviews high-dimensional neural network potentials (HDNNPs),
which are a general-purpose reactive potential method that can be used for
simulations of an arbitrary number of atoms, can describe all types of chemical
interactions (e.g., covalent, metallic, and dispersion), and includes the breaking
and forming of chemical bonds. Before an HDNNP can be applied, it must be
parameterized using electronic structure data, and great care must be taken at the
parameterization stage to ensure that all pertinent parts of the potential energy
surface are adequately covered. Typically, this is done iteratively through the
addition of more training data and refitting of parameters. This chapter illustrates
these points through the use of two case studies from our recent work for aqueous
NaOH solutions and the ZnO/water interface.

Introduction
In chemistry and materials science, a typical goal is to predict the properties of different
compounds. One way that computers have revolutionized the natural sciences is the ability to
perform computer simulations, or computational experiments, where the computer, based on some
predefined rule set, generates new, physically meaningful data. In lieu of an actual experiment,
simulations are used to answer the questions posed by the scientist.
One of the most important methods in computational chemistry for predicting the properties
of a molecule or material is to construct an atomistic model of the target compound, letting the
atomic positions evolve via either Monte Carlo or molecular dynamics (MD) simulations. Through
such simulations, many properties of the compound can be extracted, such as its stability at various

conditions like pH, ionic strength, temperature, and pressure. Many other properties can also be
computed from dynamical simulations, such as ionic or thermal conductivity, rates of reactions, and
conformational changes.
Even though these types of simulations are, in principle, possible, there are two main bottlenecks
that must be overcome for such simulations to have predictive value:

• The atomistic model must be representative of the actual chemical system, requiring many
atoms (often thousands to millions) and a method that is computationally inexpensive for
the simulation to be possible in practice.
• The atomic interactions predicted in the computer model should closely approximate the
true quantum-mechanical interactions.

Electronic-structure methods like density functional theory can often accurately predict atomic
interactions but are too computationally demanding to be applicable for modeling large-scale
systems. In contrast, empirical force fields are much faster to evaluate and can be used for large
systems, but they often lack the accuracy needed for truly predictive simulations.
Machine-learning potentials (MLPs) provide the accuracy of electronic structure methods at
approximately the computational cost of force fields. Unlike force fields, MLPs do not have a
functional form based on known physical interactions, but they instead exploit an extremely flexible
functional form with many parameters, which must be fitted to a typically very large amount of
training data. This process is the learning of machine learning. There are several different types of
MLPs, based on different machine-learning techniques; high-dimensional neural network potentials
(HDNNPs) (1, 2) are the focus of this chapter, but several other methods like Gaussian
approximation potentials (3), kernel ridge regression (4), and support vector machines can be used
to describe potential energy surfaces (5). Interested readers are referred to a review providing an
overview of these methods (6).
HDNNPs exploit the capability of artificial neural networks (NNs) to approximate complicated
nonlinear functions (e.g., a potential energy surface). NNs have been historically used for
representing potential energy surfaces, starting with the pioneering work of Doren et al. (7). Early
NNs could describe small molecular systems, but only with the development of HDNNPs was
this limitation eventually overcome (1, 8), allowing for simulations of thousands of atoms. Several
overviews of this method have been published (2, 9, 10).
HDNNP simulations have been employed for many different types of systems including the
high-pressure phase diagram of silicon (11), copper surfaces (12), water (13), N2 scattering at Ru
surfaces (14), and modeling the phase change material GeTe (15). Smith et al. constructed the ANI-
1 potential based on HDNNPs (16), which is a general-purpose potential for organic molecules. To
illustrate the capability of the method, this chapter will, in addition to describing the computational
method itself, briefly review some of our work on proton transfer reactions in aqueous solutions
(17–20) and at solid/liquid interfaces that we have carried out using HDNNPs (21–24).
The theory and practice of parameterizing NNs for arbitrary applications can be found in many
textbooks and online resources. In this chapter, we will only focus on how HDNNPs are typically
applied for calculating potential energy surfaces. The potential energy for a given set of atomic
positions is a single real-valued number, so the HDNNP is a function that takes a representation
of the atomic structure as input and provides a single real-valued number, the potential energy, as
output.

HDNNPs

Overview
In the HDNNP method, the potential energy E is calculated as a sum of atomic contributions Ei:

E = Σi Ei,

where the sum is taken over all atoms in the system. Each atomic energy is calculated by an individual
atomic NN, which takes a representation of the chemical environment around the atom as input. The
representation is a vector of numbers calculated using so-called symmetry functions, described in
more detail later (8). The number of symmetry functions (i.e., the number of elements in the input
vector to the atomic NN), as well as the functional forms and parameters of the symmetry functions,
must be decided before the NN can be parameterized. Typically, between 10 and 100 symmetry
functions are used as input to an atomic NN.
The symmetry function values are then fed into fully connected feed-forward NNs. Typically,
two hidden layers are used to compute the value of the single output node representing the atomic
energy Ei. The hidden layers contain a number of nodes; the values at the nodes are calculated
by taking linear combinations of the values in the preceding layer and adding biases, where the
coefficients for the linear combinations and the biases are parameters of the NN that need to be fitted.
Then, a nonlinear activation function is applied to all of the node values. An example of the activation
function could be the logistic function or the hyperbolic tangent function.
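To make the structure of an atomic NN concrete, the following NumPy sketch evaluates a single atomic energy from a symmetry-function vector using two hidden layers and a hyperbolic tangent activation; the weights and biases are random placeholders rather than fitted parameters, and this is not code from any particular HDNNP package.

# Illustrative NumPy sketch: forward pass of one atomic NN with two hidden
# layers, mapping a symmetry-function vector G to an atomic energy E_i; the
# total energy is the sum of such contributions over all atoms.
import numpy as np

def atomic_nn(G, weights, biases, activation=np.tanh):
    x = G
    for W, b in zip(weights[:-1], biases[:-1]):
        x = activation(W @ x + b)                  # hidden layer: linear combination + bias + activation
    return (weights[-1] @ x + biases[-1]).item()   # linear output node = atomic energy E_i

rng = np.random.default_rng(0)
n_sym, n_hidden = 30, 25                           # e.g., 30 symmetry functions, 2 hidden layers of 25 nodes
weights = [rng.normal(size=(n_hidden, n_sym)),
           rng.normal(size=(n_hidden, n_hidden)),
           rng.normal(size=(1, n_hidden))]
biases = [rng.normal(size=n_hidden),
          rng.normal(size=n_hidden),
          rng.normal(size=1)]

# Total energy as a sum of atomic contributions (here 64 atoms of one element,
# with random placeholder symmetry-function vectors).
E_total = sum(atomic_nn(rng.random(n_sym), weights, biases) for _ in range(64))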
The parameters of the NN are element dependent, meaning that, for example, the same NN
function (but with different inputs) is used to calculate the atomic energies for every H atom in liquid
water, but that another NN function is used to calculate the atomic energies for the O atoms. The
NNs for different elements need not necessarily have the same architecture defined by the number of
input features, hidden layers, and nodes per hidden layer.
As a concrete example, Morawietz et al. developed an HDNNP for liquid water (13). They used
30 and 27 symmetry functions describing the environment around O and H atoms, respectively. For
both elements, the atomic NNs had two hidden layers with 25 nodes each. In total, including the bias
weights, the HDNNP thus contained 31 × 25 + 26 × 25 + 26 × 1 + 28 × 25 + 26 × 25 + 26 × 1 =
2827 fitted parameters.

Symmetry Functions as Descriptors of the Local Chemical Environment


A symmetry function provides a characteristic structural fingerprint of the chemical
environment around an atom (8). By using several different symmetry functions for the same atom,
the goal is to provide a unique description of the chemical environment that is invariant with respect
to translation and rotation, as well as permutation of atoms of the same element. It is common to
apply a cutoff function fc that goes smoothly to zero in both value and slope at the cutoff radius Rc, so
that only the atoms within the cutoff radius contribute to the value of the symmetry function. An
example of such a cutoff function is

fc(Rij) = 0.5 [cos(π Rij / Rc) + 1] for Rij ≤ Rc, and fc(Rij) = 0 for Rij > Rc,

where a typical value for the cutoff radius Rc, which has to be determined by convergence tests, is in
the range from 6–10 Å.
Typically, two types of symmetry functions are used: radial functions and angular functions,
both of which are many-body functions depending on the positions of all atoms inside the atomic
cutoff spheres. The radial functions, evaluated for some central atom i of element I, take the form

Gi(I:J) = Σj exp[−η (Rij − Rs)²] fc(Rij),

where the sum runs over all atoms j of a particular element J. The radial symmetry function is a sum of
Gaussian functions centered at Rs and with a width controlled by the parameter η. Typically, several
symmetry functions, with different values of η and/or Rs, are used for each element J as input to the
atomic NN for atoms of element I.
The angular symmetry functions, for an atom i of element I, take the form

Gi(I:JK) = 2^(1−ζ) Σj,k (1 + λ cos θijk)^ζ exp[−η (Rij² + Rik² + Rjk²)] fc(Rij) fc(Rik) fc(Rjk),

where the sum runs over all unique combinations of atoms j of element J and atoms k of element
K, and λ and ζ determine the angular dependence of the function. Typically, several values of the
parameters η, λ, and ζ are used as input to the atomic NN for each combination of elements J and K,
which can be the same as or different from I.
There are also other possible definitions of symmetry functions and, more generally, other types
of descriptors for machine-learning potentials (6).
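As an illustration, the cosine cutoff function and a radial symmetry function of the form given above can be evaluated with a few lines of NumPy; the distances and parameter values below are arbitrary placeholders, not taken from any of the potentials discussed in this chapter.

# Illustrative NumPy sketch of the cutoff function and one radial symmetry
# function for a single central atom; production codes evaluate these for all
# atoms and many parameter sets at once.
import numpy as np

def f_cut(r, r_c):
    """Cosine cutoff: goes smoothly to zero in value and slope at r_c."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def radial_sf(r_ij, eta, r_s, r_c):
    """Radial symmetry function: sum of Gaussians multiplied by the cutoff function."""
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * f_cut(r_ij, r_c)))

# Distances (in Angstrom) from a central atom to all neighbors of one element:
r_ij = np.array([0.97, 1.85, 2.90, 4.10])
value = radial_sf(r_ij, eta=0.5, r_s=0.0, r_c=6.35)
print(value)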

Construction of HDNNP
The construction of an HDNNP for a given system is a procedure involving:

• Choosing a reference electronic structure method;


• Generating training and validation data;
• Deciding on a set of symmetry functions for each element;
• Deciding on an NN architecture for each element;
• Fitting the NN parameters; and
• Validating the final potential.

The resulting NN potential can only be as accurate as the reference method to which it has been
parameterized, so it is important to choose a reference electronic structure method that is suitable for
the problem of interest. Simultaneously, because one needs to construct a typically very large training
set, the reference electronic structure method should not be too computationally demanding.
The training set should contain atoms in environments that are representative for the intended
production simulations. These are typically obtained from (potentially high-temperature) MD
simulations, employing either the reference electronic structure method, some force field, or a
previous fit of the NN potential.
The set of symmetry functions for each element must also be chosen with care, since it is
the set of symmetry functions that allows an atomic NN to distinguish between different chemical
environments. Often, the set of symmetry functions is chosen in a rather empirical fashion, but more
systematic approaches have also been suggested (25). A good set of symmetry functions should cover
a range of chemically meaningful distances and angular environments. As a typical example, the
largest value of η for the radial symmetry functions GI:J should be chosen so that the function decays around
the shortest meaningful distance between atoms of elements I and J. Moreover, if the forces on two
atoms are very different but their symmetry function values are very similar, then that is an indication
that the set of symmetry functions should be augmented.
Before the NN weights can be fitted, they must be initialized. It is possible to just assign random
numbers for the initialization, but it is also possible to use various initialization algorithms, which can
speed up the parameterization process [e.g., the algorithm developed by Nguyen and Widrow (26)].
The NN weights can be fitted using a variety of typically gradient-based optimization algorithms.
The simplest one is standard gradient descent, which is often called backpropagation in this context
(27). However, other, usually more efficient algorithms can also be used, such as the extended
Kalman filter and the Levenberg–Marquardt algorithm (28–31). The fitting algorithm works by
minimizing a cost function, which is usually taken to be the sum of squared errors of the predicted
energies per atom relative to the reference energies. Often, one also includes the squared errors of
the predicted forces on individual atoms in the cost function.
The quality of a fitted NN potential is usually measured by the root-mean-squared errors for the
energies per atom and for the individual force components of the atoms. Typically, errors of about 1
meV per atom for energies and about 100 meV/Å for forces are obtained after fitting, although this
can of course depend on how difficult the potential energy surface is to fit. The errors are evaluated
not only on the training set, but also on a validation set. A validation set is a set of structures randomly
drawn from the total available data set, which is not used to optimize the NN parameters. The error
on the validation set thus gives a measure of how well the NN potential performs on structures to
which it has not been explicitly trained. A full hierarchy of validation steps is available, and more
details can be found in Behler (9).
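The following sketch illustrates, with placeholder numbers and assumed array shapes (not tied to any particular fitting code), the kind of energy-plus-force cost function minimized during fitting and the per-atom energy and force-component RMSEs used to judge the fit.

# Illustrative sketch: energy + force cost function and RMSE error measures.
import numpy as np

def cost(E_pred, E_ref, n_atoms, F_pred, F_ref, beta=1.0):
    """Sum of squared energy errors per atom plus (optionally weighted) force errors."""
    energy_term = np.sum(((E_pred - E_ref) / n_atoms) ** 2)
    force_term = beta * np.sum((F_pred - F_ref) ** 2)
    return energy_term + force_term

def rmse(pred, ref):
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(ref)) ** 2)))

# Dummy data: total energies for two structures and flattened force components.
E_pred, E_ref = np.array([-310.02, -620.11]), np.array([-310.00, -620.05])
n_atoms = np.array([64, 128])
F_pred, F_ref = np.zeros(10), np.full(10, 0.1)

print(cost(E_pred, E_ref, n_atoms, F_pred, F_ref))
print(rmse(E_pred / n_atoms, E_ref / n_atoms), rmse(F_pred, F_ref))   # per-atom energy RMSE, force RMSE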

Case Studies: NaOH Solutions and the ZnO/Water Interface


Here, we briefly summarize some of our findings for two systems for which we constructed
HDNNPs and performed atomistic MD simulations: aqueous NaOH solutions and the ZnO/water
interface.

Aqueous NaOH Solutions


Aqueous NaOH solutions are among the most widely used chemical reagents, but they are
difficult to model because of the proton-transfer reactivity of such solutions, in which hydroxide
ions diffuse via the Grotthuss mechanism (32). In order to describe the proton transfer reactions in
such solutions, one must perform either computationally demanding ab initio calculations or use a
reactive atomistic potential (e.g., an HDNNP), which would allow for a more thorough sampling
of the phase space and of slow events in such solutions. We have thus constructed an HDNNP for
the entire room-temperature solubility range of NaOH in water and used it to run MD simulations,
getting valuable information about:

• Hydrogen-bonding fluctuations around the local environment of an HOH...OH– pair,
which either aid or inhibit proton transfer (decrease or increase the proton transfer barrier)
(17);
• The most important presolvation step [i.e., what type of fluctuation (if any) around
HOH...OH– gives the greatest increase of the overall rate of proton transfer] (17);
• The structure of the solvation shell around Na+, with respect to the coordination number,
the extent of direct-contact ion pairing, and the shape of the coordination polyhedra (18);
• The water exchange mechanism in the Na+ hydration shell (i.e., the mechanism describing
how one water molecule leaves the hydration shell and another takes its place) for which
we showed that a proton transfer event in the first hydration shell increased the likelihood
of a water exchange event, which we called a proton-transfer-driven water exchange
mechanism (19); and
• The role of nuclear quantum effects (e.g., zero-point energy and tunneling) on the proton
transfer barriers and rates, the vibrational power spectra, and diffusion coefficients (20).

All of the above properties were evaluated for the full room-temperature solubility range of
NaOH in water, thus requiring significant amounts of simulation data and necessitating the use of a
reactive atomistic potential. In particular, NaOH solutions near the solubility limit (about 19 mol/
L, corresponding to about one NaOH formula unit for every two H2O molecules) are very viscous,
making dynamical events like proton transfer and diffusion quite slow and requiring long trajectories.
Moreover, the role of nuclear quantum effects was modeled using ring polymer MD, which required
many nearly identical replicas of the system to be modeled simultaneously, thus being even more
computationally demanding. For these reasons, we constructed an atomistic potential to be able to
investigate the aforementioned properties of NaOH solutions.
In order to construct the potential, we iteratively generated a dataset consisting of 17,899
structures of various concentrations of NaOH, of which 16,113 were used for training and 1786
for validation. We chose the revised Perdew–Burke–Ernzerhof density functional (33), together
with D3 type dispersion corrections (34), as the reference method. This method had previously
been shown to yield a good description of liquid water (13, 35). The following types of symmetry
functions were used: H:H, H:O, H:Na, H:HH, H:HO, H:HNa, H:OO, H:ONa, H:NaNa, O:H,
O:O, O:Na, O:HH, O:HO, O:HNa, O:OO, O:ONa, O:NaNa, Na:H, Na:O, Na:Na, Na:HH,
Na:HO, Na:HNa, Na:OO, Na:ONa, and Na:NaNa, where the species before the colon specifies
the element for which the NN atomic energy is calculated, and the species after the colon specifies
that the symmetry function gives a measure of the radial (if one species) and angular (if two species)
environments of those species around the central element. For each of the preceding types of
symmetry functions, several different values of the defining parameters η, Rs, ζ, and λ were used. The
full list is given in Hellström and Behler (18). The cutoff radius Rc was set to 6.35 Å. The NN
architectures were 36-35-35-1 for Na, 46-35-35-1 for O, and 48-35-35-1 for H. The parameters
were optimized using an extended Kalman filter.
First, we used short ab initio MD runs to construct the initial data set. We then fitted an HDNNP
and used it to run MD simulations, thus generating new data points to be included in the training and
validation sets. The process was repeated several times, until the potential was deemed satisfactory.
In addition, some structures were modified either by random displacements or displacements of
one or more hydrogen atoms along a proton transfer coordinate. Finally, the root-mean-squared
errors for the energies on the training and test sets were 1.25 and 1.58 meV/atom, respectively, and
those for the force components were 80 and 81 meV/bohr. We additionally performed validations of radial distribution
functions and proton-transfer barriers, which were directly compared to ab initio MD and found
overall satisfactory agreement between the NN predictions and the reference ab initio method (17).

The ZnO/Water Interface
ZnO crystallizes in the hexagonal wurtzite structure and primarily exposes two different
crystallographic surfaces: the (101̄0) and (112̄0) surfaces (where the bar is used to indicate a
negative Miller index). The surfaces are mixed terminated and expose an equal amount of surface
zinc and oxide ions to the surrounding medium. The interface between ZnO particles and water is of
particular interest for understanding photocatalytic water splitting on ZnO-based materials and the
dissolution mechanism of such particles, which is believed to be one of the causes of the mild toxicity
of ZnO. Modeling this interface using atomistic potentials is very challenging, since the employed
potential must simultaneously describe an ionic oxide, a molecular liquid, and the interface between
them, where frequent proton-transfer reactions occur (e.g., for hydroxylation of the surface oxide
ions). However, this can be achieved with an HDNNP. In our studies, we have elucidated:

• The types of hydrogen-bonding fluctuations that drive proton-transfer reactions near the
interface (21);
• The relative rate of proton transfer to and from the surface, as compared to transfer
“within” the adsorbed layer (21, 23);
• The effect of the local hydrogen-bonding environment around both adsorbed water and
hydroxide on OH stretching frequencies (22);
• The local structure of adsorbed water molecules and hydroxide ions near the surface (e.g.,
with respect to their angular orientation) (24);
• The relative rates of water exchange (i.e., the desorption of an adsorbed water molecule
into the bulk liquid) at the two surfaces (24); and
• The long-range proton transport properties of the two surfaces, where we found that
proton transport at ZnO(101̄0) is pseudo-one-dimensional, whereas it is two-dimensional
at the ZnO(112̄0) surface (23).

For the application exploring proton transport dimensionalities, it was especially useful to have
access to an atomistic potential with first-principles accuracy. The ZnO/water interface was modeled
under three-dimensional periodic boundary conditions, with the ZnO and water thicknesses in the
surface normal direction being about 2 and 3 nm, respectively, and the surface area of one side of
the ZnO slab being about 7 nm². The dynamics of such a system were modeled for more than 40
ns in order to obtain sufficient statistics about the different types of proton transfer reactions at the
interface.
The strategy employed to parameterize the HDNNP was similar to that of aqueous NaOH
solutions. The training set consisted of a mixture of structures of water, ZnO, and the ZnO/water
interface. In total, there were 17,031 structures, of which 15,319 were used for training and 1712 for
validation. The NN architectures were 60-25-25-1 for Zn, 45-25-25-1 for O, and 41-25-25-1 for H.
Also, some properties of the ZnO/water interface had previously been studied using another
reactive atomistic potential approach, namely the ReaxFF reactive force field (36, 37). Because the
ReaxFF force field functional form is largely based on physically reasonable and intuitive equations
(38), the force field used to describe the ZnO/water interface contained only 149 parameters. In
contrast, our HDNNP, based on machine-learning techniques, contained a total of 5753 fitted
parameters. This much larger number of parameters is necessary, since the HDNNP is not based
on any physical approximations, but on the other hand the resulting flexibility allows very good
agreement with electronic structure calculations. Consequently, a much larger training set is also
needed.
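The parameter counts quoted in this chapter follow directly from the per-element architectures: each layer contributes (number of inputs + 1) × (number of nodes) weights and biases. The short sketch below reproduces the 5753 parameters of the ZnO/water potential from the architectures listed above.

# Counting the fitted parameters (weights + biases) of an HDNNP from its
# per-element feed-forward architectures.
def count_parameters(architecture):
    """architecture: layer sizes, e.g. (60, 25, 25, 1); each layer adds (n_in + 1) * n_out parameters."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(architecture[:-1], architecture[1:]))

total = sum(count_parameters(a) for a in [(60, 25, 25, 1), (45, 25, 25, 1), (41, 25, 25, 1)])
print(total)  # 5753 for the Zn, O, and H networks of the ZnO/water potential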

Limitations and Strengths of HDNNPs


Although HDNNPs have been parameterized for a variety of molecular and condensed-phase
systems, there are still some limitations. Developing methods for overcoming these limitations is
currently a very active research field. Some limitations include:

• Limited number of elements. The environments around the atoms are typically transformed
via a set of symmetry functions before they enter the atomic NNs. With more than about
four elements, the number of possible environments around an atom becomes very large,
which results in an inordinately large NN.
• Poor generalizability. Although NNs, and machine-learning methods in general, are
extremely good at fitting the training set, predictions for atomic configurations sufficiently
far from any data point in the training set can be very inaccurate. Thus, the NN potential
can only reliably be applied to systems to which it has explicitly been trained.
• Slow fitting procedure. The large number of parameters, as well as the large number of
structures in the training set, can cause the training (fitting) to be quite slow.
• An iterative construction of large training sets. The training set must typically contain many
thousands of structures, all computed using the reference electronic structure method,
which is quite time consuming. Moreover, the training set must typically be expanded
in an iterative fashion, by adding new structures predicted by the current fit of the NN
parameters.

Even with these limitations, HDNNPs have many strengths:

• They can be fitted to describe the training data with only very small errors (typically about
1 meV per atom).
• They are by construction reactive and can describe the formation and breaking of chemical
bonds, as long as the relevant training data are provided.
• They can describe any type of chemical interaction (e.g., covalent, hydrogen-bonding,
dispersion, etc.).
• They can be evaluated quickly and can be applied to large-scale MD simulations.

Summary
An HDNNP is a type of atomistic potential that can be used in Monte Carlo and MD simulations.
It can be fitted to very accurately reproduce training data obtained in electronic structure calculations.
However, because it is not based on any physical functional form, many parameters, and
consequently many training data points, are required. The total energy is calculated by summing
up atomic contributions; each atomic contribution is calculated by an element-dependent NN.
The NNs take a representation of the chemical environment around the atoms as input. This
representation is typically calculated by means of symmetry functions. In order to fit an HDNNP, an
iterative approach is required, meaning that a fitted NN potential is used to generate more training
data, which are used to fit a new NN potential and so forth.

Two systems that have been investigated in-depth with HDNNPs are aqueous NaOH solutions
and the ZnO/water interface. Both systems are characterized by proton transfer reactions,
necessitating the use of a reactive atomistic potential.
Finally, although there are some drawbacks when using NN potentials, there are also many
benefits. In particular, NN potentials are reactive and can describe any type of chemical bonding, and
they are also capable of fitting the given training data with remarkable accuracy, which is a common
property of all machine-learning potentials.

Acknowledgments
This work was supported by the DFG Heisenberg professorship Be3264/11-2. M.H.
acknowledges funding from the European Union’s Horizon 2020 research and innovation
programme under grant agreement No 798129.

References
1. Behler, J.; Parrinello, M. Generalized Neural-Network Representation of High-Dimensional
Potential-Energy Surfaces. Phys. Rev. Lett. 2007, 98, 146401.
2. Behler, J. First Principles Neural Network Potentials for Reactive Simulations of Large
Molecular and Condensed Systems. Angew. Chem., Int. Ed. 2017, 56, 12828.
3. Bartók, A.; Payne, M. C.; Kondor, R.; Csányi, G. Gaussian Approximation Potentials: The
Accuracy of Quantum Mechanics, without the Electrons. Phys. Rev. Lett. 2010, 104, 136403.
4. Rupp, M.; Tkatchenko, A.; Müller, K.-R.; von Lilienfeld, O. A. Fast and Accurate Modeling of
Molecular Atomization Energies with Machine Learning. Phys. Rev. Lett. 2012, 108, 058301.
5. Balabin, R. M.; Lomakina, E. I. Support Vector Machine Regression (LS-SVM)-an Alternative
to Artificial Neural Networks (ANNs) for the Analysis of Quantum Chemistry Data? Phys.
Chem. Chem. Phys. 2011, 13, 11710.
6. Behler, J. Perspective: Machine Learning Potentials for Atomistic Simulations. J. Chem. Phys.
2016, 145, 170901.
7. Blank, T. B.; Brown, S. D.; Calhoun, A. W.; Doren, D. J. Neural Network Models of Potential
Energy Surfaces. J. Chem. Phys. 1995, 103, 4129–4137.
8. Behler, J. Atom-Centered Symmetry Functions for Constructing High-dimensional Neural
Network Potentials. J. Chem. Phys. 2011, 134, 074106.
9. Behler, J. Constructing High-Dimensional Neural Network Potentials: A Tutorial Review. Int.
J. Quantum Chem. 2015, 115, 1032–1050.
10. Hellström, M.; Behler, J. Neural Network Potentials in Materials Modeling. In Handbook of
Materials Modeling; Andreoni, W., Yip, S., Eds.; Springer: Cham, 2018; pp 1–20.
11. Behler, J.; Martonak, R.; Donadio, D.; Parrinello, M. Metadynamics Simulations of the High-
Pressure Phases of Silicon Employing a High-Dimensional Neural Network Potential. Phys.
Rev. Lett. 2008, 100, 185501.
12. Artrith, N.; Behler, J. High-dimensional Neural Network Potentials for Metal Surfaces: A
Prototype Study for Copper. Phys. Rev. B 2012, 85, 045439.
13. Morawietz, T.; Singraber, A.; Dellago, C.; Behler, J. How van der Waals Interactions
Determine the Unique Properties of Water. Proc. Natl. Acad. Sci. U. S. A. 2016, 113,
8368–8373.

14. Shakouri, K.; Behler, J.; Meyer, J.; Kroes, G.-J. Accurate Neural Network Description of
Surface Phonons in Reactive Gas-Surface Dynamics: N2+Ru(0001). J. Phys. Chem. Lett. 2017,
8, 2131–2136.
15. Sosso, G. C.; Miceli, G.; Caravati, S.; Behler, J.; Bernasconi, M. Neural Network Interatomic
Potential for the Phase Change Material GeTe. Phys. Rev. B 2012, 85, 174103.
16. Smith, J. S.; Isayev, O.; Roitberg, A. E. ANI-1: An Extensible Neural Network Potential with
DFT Accuracy at Force Field Computational Cost. Chem. Sci. 2017, 8, 3192–3203.
17. Hellström, M.; Behler, J. Concentration-Dependent Proton Transfer Mechanisms in Aqueous
NaOH Solutions: From Acceptor-Driven to Donor-Driven and Back. J. Phys. Chem. Lett.
2016, 7, 3302–3306.
18. Hellström, M.; Behler, J. Structure of Aqueous NaOH Solutions: Insights from Neural-
Network-Based Molecular Dynamics Simulations. Phys. Chem. Chem. Phys. 2017, 19, 82–96.
19. Hellström, M.; Behler, J. Proton-Transfer-Driven Water Exchange Mechanism in the Na+
Solvation Shell. J. Phys. Chem. B 2017, 121, 4184.
20. Hellström, M.; Ceriotti, M.; Behler, J. Nuclear Quantum Effects in Sodium Hydroxide
Solutions from Neural Network Molecular Dynamics Simulations. J. Phys. Chem. B 2018, 122,
10158–10171.
21. Quaranta, V.; Hellström, M.; Behler, J. Proton Transfer Mechanisms at the Water-ZnO
Interface: The Role of Presolvation. J. Phys. Chem. Lett. 2017, 8, 1476–1483.
22. Quaranta, V.; Hellström, M.; Behler, J.; Kullgren, J.; Mitev, P. D.; Hermansson, K. Maximally
Resolved Anharmonic OH Vibrational Spectrum of the Water/ZnO(1010) Interface from a
High-Dimensional Neural Network Potential. J. Chem. Phys. 2018, 148, 241720.
23. Hellström, M.; Quaranta, V.; Behler, J. One-Dimensional vs. Two-Dimensional Proton
Transport Processes at Solid–Liquid Zinc-Oxide–Water Interfaces. Chem. Sci. 2019, 10,
1232–1243.
24. Quaranta, V.; Behler, J.; Hellström, M. Structure and Dynamics of the Liquid–Water/Zinc-
Oxide Interface from Machine Learning Potential Simulations. J. Phys. Chem. C 2019, 123,
1293–1304.
25. Imbalzano, G.; Anelli, A.; Giofre, D.; Klees, S.; Behler, J.; Ceriotti, M. Automatic Selection of
Atomic Fingerprints and Reference Configurations for Machine-Learning Potentials. J. Chem.
Phys. 2018, 148, 241730.
26. Nguyen, D. H.; Widrow, B. Neural Networks for Self-Learning Control Systems. IEEE Control
Syst. Mag. 1990, 3, 18–23.
27. Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Learning Representations by Back-
Propagating Errors. Nature 1986, 323, 533–536.
28. Kalman, R. E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960,
82, 35–45.
29. Blank, T. B.; Brown, S. D. Adaptive, Global, Extended Kalman Filters for Training Feed-
Forward Neural Networks. J. Chemom. 1994, 8, 391–407.
30. Levenberg, K. A Method for the Solution of Certain Non-Linear Problems in Least Squares. Q.
Appl. Math. 1944, 2, 164–168.
31. Marquardt, D. W. An Algorithm for Least-Squares Estimation of Nonlinear Parameters. SIAM
J. Appl. Math. 1963, 11, 431–441.

32. Marx, D.; Chandra, A.; Tuckerman, M. E. Aqueous Basic Solutions: Hydroxide Solvation,
Structural Diffusion, and Comparison to the Hydrated Proton. Chem. Rev. 2010, 110,
2174–2216.
33. Hammer, B.; Hansen, L. B.; Nørskov, J. K. Improved Adsorption Energetics within Density-
Functional Theory using Revised Perdew-Burke-Ernzerhof functionals. Phys. Rev. B 1999, 59,
7413–7421.
34. Grimme, S.; Antony, J.; Ehrlich, S.; Krieg, H. A Consistent and Accurate ab initio
Parametrization of Density Functional Dispersion Correction (DFT-D) for the 94 Elements H-
Pu. J. Chem. Phys. 2010, 132, 154104.
35. Kühne, T.; Khaliullin, R. Z. Electronic Signature of the Instantaneous Asymmetry in the First
Coordination Shell of Liquid Water. Nat. Commun. 2013, 4, 1450.
36. Raymand, D.; van Duin, A. C. T.; Spångberg, D.; Goddard, W. A., III; Hermansson, K. Water
Adsorption on Stepped ZnO Surfaces from MD Simulation. Surf. Sci. 2010, 604, 741–752.
37. Raymand, D.; van Duin, A. C. T.; Goddard, W. A., III; Hermansson, K.; Spångberg, D.
Hydroxylation Structure and Proton Transfer Reactivity at the Zinc Oxide−Water Interface. J.
Phys. Chem. C 2011, 115, 8573–8579.
38. van Duin, A. C. T.; Dasgupta, S.; Lorant, F.; Goddard, W. A., III ReaxFF: A Reactive Force
Field for Hydrocarbons. J. Phys. Chem. A 2001, 105, 9396–9409.

Chapter 4

Data-Driven Learning Systems for Chemical Reaction Prediction: An Analysis of Recent Approaches
Philippe Schwaller*,1,2 and Teodoro Laino1

1IBM Research – Zurich, Rueschlikon 8803, Switzerland

2Department of Chemistry and Biochemistry, University of Berne, Berne 3012, Switzerland

*E-mail: phs@ibm.zurich.com

One of the critical challenges in efficient synthesis route design is the accurate
prediction of chemical reactivity. Unlocking it could significantly facilitate
chemical synthesis and hence, accelerate the discovery of novel molecules and
materials. With the current rise of artificial intelligence (AI) algorithms, access
to cheap computing power, and the wide availability of chemical data, it became
possible to develop entirely data-driven mathematical models able to predict
chemical reactivity. Similar to how a human chemist would learn chemical
reactions, these models learn the underlying patterns in the data by repeatedly
looking at examples. In this chapter, we compare the state-of-the-art data-driven learning
systems for forward chemical reaction prediction, analyzing the reaction
representations, the data, and the model architectures. We discuss the advantages
and limitations of the different AI model strategies and make comparisons on
standard open-source benchmark datasets. The intention is to provide a critical
assessment of the different data-driven approaches recently developed not only
for the cheminformatics community, but also for the AI models end-users, the
organic chemists, and for early adoption of such technologies.

Introduction
Approaching the universe of organic chemistry can be an ordeal for beginner students, who
typically have trouble in predicting the products of chemical reactions. It takes a certain amount of
practice and understanding to make the process more successful and efficient. Problems that may
appear as great challenges for an undergraduate student may be embarrassingly simple for a synthetic
organic chemist with more than 30 years of experience. However, the complexity of the molecular
space is such that the prediction of chemical reactions may become a difficult task even for expert
synthetic organic chemists.
Similar to the way humans created computer programs to confront expert players at Chess (1),
Jeopardy (2), and Go (3, 4), chemists also encoded a vast collection of instructions for making

molecules, available in a wide variety of chemistry literature, into computer software with the
purpose of creating an expert system to assist themselves in designing efficient routes to target
molecules for organic synthesis. This revolution began not long after the
pioneering work of Corey (5, 6). In fact, around 1967 groups started to attack this problem by
constructing three computer programs—LHASA by Corey et al. (7), SECS by Wipke and Dyott (8),
and SYNCHEM by Gelernter et al. (9, 10)—that were searching synthetic strategies in synthesizing
known and unknown compounds, using a chemical knowledge base, rather than performing a
reaction retrieval from a database of literature examples.
The idea at the base of the reaction prediction or synthesis planning was that by analyzing an
input molecule with a catalogue of retro-reactions (or transforms) encoded in memory, one could
retrieve the descriptions of all the possible changes which will occur in the course of a particular
reaction. It is inherent in such an approach that the planned syntheses will be based only on a
combination of encoded transforms. EROS was the first attempt to use the large chemical dataset
to cast the problem of reaction prediction into a mathematical framework (11, 12). Molecules and
reactions were represented by specific matrices (bond-electron matrix and reaction matrix,
respectively (13)) and the synthesis planning was cast as a pure matrix-matrix multiplication
problem. This mathematical model was used as a basis for a variety of deductive computer programs
for the solution of chemical problems, and EROS can be considered the first attempt to use artificial
intelligence (AI) for the reaction prediction problem (11, 12).
Since the mid-nineties, we have witnessed an increased interest in the development of different
approaches based on data, with CAMEO (14), WODCA (15), and SOPHIA (16) being the
pioneering technologies in this field exploiting advanced mathematical frameworks.
Similar to LHASA and SYNCHEM (7, 9), but with a bigger commitment of resources,
Chematica a few years later used human experts to extract chemical reactions from the literature and
to encode them with rules (17). The project started at the beginning of the year 2000 and went on
for more than a decade before it was publicly announced (17). Although the decision to encode the
broad knowledge of organic chemistry with rules was not new (12, 14), Chematica was the first to
achieve a high level of accuracy in reaction prediction (forward and retro) (17). This competitive
advantage was explainable with the multi-year efforts to codify the most extensive set of rules ever,
including reaction core, reactivity conflicts, substituents, and groups requiring protection during
multi-step synthesis. Despite the recent scientific and business successes (18), the approach is not
sustainable in the long-term, since manually extracting rules from literature is tedious work and
prone to human error. Rules also tend to be very brittle: for every new reaction outside the scope of the
current rules, a new rule that does not contradict the existing 86,000 rules has to be added. Finally,
the involvement of humans in the entire curation process makes the maintenance and development
of the software unscalable due to the ever-growing amount of data produced and published (Figure
1).
Starting in 2010, and thanks to advances in machine learning (ML) algorithms, more powerful
computational resources, and to the availability of a vast amount of open-source chemical data, we
witnessed the development of a multitude of different types of mathematical AI models that tried to
offer a valid alternative to the rule-based approaches. The advantage of these mathematical models
is that once trained on a dataset, they can infer the patterns hidden in the data in a few hundreds of
milliseconds. Similar to what a human chemist would do, data-driven models learn by repeatedly
looking at examples, ideally without having humans encoding domain (organic synthetic chemistry)
specific knowledge, such as reaction rules. The main difference is that a mathematical model can

analyze and incorporate the whole literature, millions of distinct chemical reactions, in a matter of
days, which would take more than a lifetime for a human.

Figure 1. The number of publications with topic "Organic Chemistry" on Web of Science from 1924 to 2019
(binned over 5 years) (19).

Organic chemistry synthesis is still mainly designed by human experts, who rely on their
personal experience, intuition, and years of training to come up with reasonable steps. Along
the way, the route is improved, typically, by trial-and-error. If one step fails, and no alternative route
is found to circumvent the failing step, the whole route is rethought with different initial steps. The
later the failing step, the higher the costs. Data-driven chemical reaction models could be used to
validate individual steps in a multistep synthesis. One goal is to estimate the risk of a specific reaction
and place the reactions that are more likely to fail at the beginning of the synthesis route. Data-driven
chemical models could also be used to predict side products and impurities, and as inexpensive cross-
validation of outcomes generated by time-consuming and computation-intensive simulations.
Therefore, it is no surprise that such models are believed to profoundly change the way chemists
will design syntheses in the near future. Similar to what happened after Deep Blue beat Garry Kasparov,
with computers now assisting human players in chess matches (centaur chess), we envisage scientific
assistants supporting human chemists by giving them access to the knowledge hidden in a much
wider variety of chemical reactions.
While the recent mathematical approaches are based on data (20–32), their architecture can be
very different with unique responses to specific data sets. For non-experts, it can be a real hurdle
to rationalize the subtleties of the different implementations and critically assess which models are
closer than others to being ready for everyday use. In this chapter, we will focus on
data-driven approaches for the problem of forward reaction prediction, describing artificial neural
network-based models that we and others developed in the last years and that can be trained on
previously published experimental data.
We will also discuss training data. While large reaction corpora (e.g., Reaxys
and SciFinder (33, 34)) became widely available in the last 15 years, their usage in model training
is still hindered by restrictions on access for data analysis and model training purposes. Innovation in
designing new data-driven models requires unconditional data availability. For organic chemistry
reaction prediction, the acceleration experienced was strongly correlated with the possibility of
accessing a large set of chemical reactions consisting of millions of tabulated examples, which are
extracted from the United States Patent and Trademark Office (USPTO) (35, 36). Therefore, our
discussion about AI-models will focus more on details for all those approaches that trained and
tested on the USPTO data set, comparing the performance and analyzing the details of the respective

implementation, including the model, known as IBM RXN for Chemistry, which we made freely available
in 2018 using a natural language approach for reaction prediction in organic chemistry (32, 37).
The hope is that more data-driven models will come within reach of experimental chemists, similar to
what happened for IBM RXN for Chemistry: designed with an intuitive user interface, it reached a
large pool of domain experts, with more than 40,000 reaction predictions performed by more than 6000
registered users since its debut in August 2018. The availability and consumption of such models
should be a significant community accomplishment, with the purpose to show the potential of AI
models, generate interest in the experimental community, and to finally drive chemical research and
development into a new era.

Representation and Formats


An essential aspect of data-driven models is the representation of the data used during the
training process. To uncover and better understand the highly nonlinear patterns in organic
chemistry and reaction prediction using ML, data should be made available in a machine-readable
format and as accurate and clean as possible. Currently, those highly complex chemical reactions are
simplified to quite abstract reaction diagrams, challenging to interpret with the use of a computer
program. Reaction diagrams consist of four main parts: (1) in the center is the arrow, which
points in the direction the reaction proceeds; (2) to the left are the starting materials; (3) above and
below the arrow are the additional reagents, agents, and spectator molecules (e.g., catalysts and solvents);
and (4) the products on the right.
As simple as this basic scheme seems, there are several nonobvious challenges. Firstly, the
distinction between a starting material and an agent can be vague. What one chemist
would call a reactant would be a reagent for another, and the placement in one of the two categories
often says more about what the chemist who drew the diagram focused their attention on
than about the actual role of the compound. Hence, having a molecule above
or below the arrow does not necessarily mean that it does not transfer part of its atoms to the
final product. Secondly, it is common that only the major target is reported and not the whole
product distribution. Trivial products like water or alcohol are often left out to simplify the diagram
representation, which can become even more cryptic in case there is the need to report enantiomers
and racemic mixtures. Finally, depending on the reaction conditions, the outcome of a reaction can
be different. Ideally, a chemical representation would contain information on the reaction conditions
(e.g., temperature, time, and pH), the reaction yield, and the enantiomeric excess. Because they
are not always added to reaction diagrams, the corresponding text must be consulted to get a full
picture. While a human expert can easily make the connections between the diagram and additional
information found in the supporting text, no reliable methods exist to date to extract all the
information from a reaction diagram and combine it with the textual information to generate a
machine-readable representation. Even then, it happens that some of the crucial details are not
disclosed.
An effort was made in the last decades to create different standards to codify reaction information
into machine-readable format to efficiently store, compare, and analyze chemical reactions. RXNfiles
and RDfiles are quite similar (38), with RXNfiles containing the molecular information of a single
reaction and RDfiles containing multiple reactions with additional information on the reaction
conditions, atom-mapping, and reaction center. Reaction SMILES or SMIRKS contain reactants
(39, 40), agents, and products, with each group separated by a ">" symbol. Although the format
supports atom-mapping, there are no extra fields for reaction conditions and reaction center
information. SMARTS, describing the molecular pattern, are extended with the “>” symbol to

encode reaction rules, which are also called reaction templates. Chemical Markup Language (CML)
is the equivalent to XML for chemical information (41, 42). Since this format is very flexible, it
allows for the most complete description of chemical reactions. However, no clear standards exist,
which makes the data exchange and comparison between research groups difficult. RInChI (43),
based on the IUPAC International Chemical Identifier (44), is a line notation describing groups of
reactants, agents, and products. As the aim of RInChI is to generate a unique and unambiguous
reaction descriptor to link and find chemical reactions, atom-mapping is not supported (45). While
the RInChI only contains standardized structural information, RAuxInfo stores the conformation
and orientation of the compounds used to generate the RInChI. Moreover, hashing algorithms
allow the generation of shorter keys for the reactions, which facilitate searching for reactions. To store
reaction conditions, stoichiometry of reactants and agents, as well as yields and conversion ratios,
a RInChI extension called ProcAuxInfo has been proposed (46). The information on the individual
molecules involved in a chemical reaction is represented either as fingerprints (e.g., ECFP (47)), line
notations (e.g., SMILES and InChI), or graphs. In contrast to the latter two, fingerprinting methods
are noninvertible hashes. In molecular graphs, the nodes usually correspond to the atoms and the
edges of the graph to the bonds. Molecular graphs are often hydrogen depleted. Line notations are
text-based representations of molecular graphs. Recently, two novel line notations have emerged:
DeepSMILES (48), an adaptation of SMILES, and SELFIES (49), SELF-referencing Embedded
Strings. Both aim to facilitate the construction of syntactically valid molecular graphs which could
improve the performance of data-driven models. For an extensive review of molecular descriptors,
we point the reader to the work of Sanchez-Lengeling and Aspuru-Guzik (50).
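As a brief illustration of these formats, the sketch below parses the reaction SMILES of the Grignard example from Figure 2 and computes a Morgan (ECFP-like) fingerprint for one reactant. The use of the open-source RDKit toolkit and the specific function calls are our own assumptions for illustration; none of the cited works prescribes this particular code.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Reaction SMILES of the form "reactants>agents>products"
# (the Grignard example of Figure 2, without atom-mapping or agents).
rxn_smiles = "O=C(c1ccccc1)c1ccccc1.Br[Mg]c1ccccc1>>OC(c1ccccc1)(c1ccccc1)c1ccccc1"
rxn = AllChem.ReactionFromSmarts(rxn_smiles, useSmiles=True)
print(rxn.GetNumReactantTemplates(),  # 2 reactant molecules
      rxn.GetNumAgentTemplates(),     # 0 agents in this example
      rxn.GetNumProductTemplates())   # 1 product molecule

# A noninvertible, fixed-size fingerprint (Morgan/ECFP-like) for one reactant.
ketone = Chem.MolFromSmiles("O=C(c1ccccc1)c1ccccc1")
fp = AllChem.GetMorganFingerprintAsBitVect(ketone, 2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())
```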

Chemical Reaction Data


The path that chemical reaction information must travel, from the laboratory where the reaction was
conducted, through an article or patent publication, to finally being extracted by whatever means and
stored in a database, is extremely lossy and error prone. As suggested before (46, 51), there should
be a standard on how to report chemical data, such that every data point supporting a publication
is submitted in a machine-readable format together with the manuscript. Such a shortcut, where the
reaction information would go from the author, using standardized electronic lab notebooks, directly
through an open database, would be ideal and allow the field to advance rapidly. To date, there are
few reaction dataset collections available, and most of them are commercial, closed-access, and come
with terms and conditions that do not allow training of open-access AI models. In this chapter, we
will focus on the largest open-access reaction dataset generated (35, 36). Originally, the text-mining
tool was developed at the University of Cambridge, but was later improved by NextMove and takes
advantage of the latest improvements and technologies in Natural Language Understanding and text-
mining in the field of chemistry (52). The dataset is available in two formats: SMILES and CML. The
reaction SMILES (“.rsmi”) file contains not only the reaction SMILES, but also the patent number,
paragraph, year, and the text-mined and calculated yields. The CML files are more complete, containing
the paragraph from which the reaction was extracted, the names of the compounds (which were
converted to SMILES), and action lists describing the steps taken during the procedure (heating,
cooling, stirring, etc.). To date, most of the data-driven models have taken into account the information in
the easily readable “.rsmi” file.
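A minimal, hedged sketch of how the ".rsmi" file can be consumed is shown below; the file name, the tab separator, and the column order are assumptions that may differ between releases of the dataset.

```python
import pandas as pd

# Load the text-mined reaction file (assumed tab-separated with a header row;
# adjust the path and separator to the actual release you downloaded).
df = pd.read_csv("1976_Sep2016_USPTOgrants_smiles.rsmi", sep="\t")
print(df.columns.tolist())  # inspect which fields this release provides

# Assume the first column holds the raw reaction SMILES strings.
reactions = df.iloc[:, 0].dropna().tolist()
print(len(reactions), "reactions loaded")
print(reactions[0][:80], "...")
```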
The reactions in the USPTO dataset are atom-mapped using Epam’s Indigo toolkit (53). While
often correct, the atom-maps can be wrong in many cases and hence should not be entirely relied
on (36). In fact, although the most recent atom-mapping approach is based on heuristics (54), it

is simple to draw reactions where the atom-mapping is ambiguous, as seen in Figure 2. The work
of Schneider et al. (55) has shown that between Indigo Toolkit and NameRXN (53, 56), two tools
able to generate atom-mapping, the derived sets of reactants matched in only 22% of 50,000 random
reactions from the USPTO dataset. Therefore, because of the inherent difficulty in determining the
precise mapping, all methods that are based on atom-mapping are fundamentally limited by the
underlying software that generates it.

Figure 2. A Bromo Grignard reaction with nontrivial atom-mapping, as any phenyl group in the product
could correspond to any phenyl in the reactants. There is more than one correct atom-mapping. The reaction
SMILES for this reaction is: “O=C(c1ccccc1)c1ccccc1.Br[Mg]c1ccccc1
>>OC(c1ccccc1)(c1ccccc1)c1ccccc1.”

While it is impressive how much information could be extracted from the U.S. patents, the
USPTO dataset is far from being perfect. It is not free from systematic extraction errors and contains
partly incomplete reactions with a preponderant tendency to misinterpret organometallic
compounds. In particular, the incomplete reactions are a severe problem for data-driven reaction
prediction methods. Despite the usage of the atom-mapping to check whether all atoms on the
product side were also present on the reactant side, there is no possibility to check if all the necessary
solvents and catalysts were correctly extracted. One reason for these errors is the incorrect spelling of
IUPAC names in patents. Consequently, models are trained on similar reactions that do not always
explicitly contain the catalysts and hence infer that the catalyst is not important for the reaction to take
place. For example, models trained with such data confidently predict a coupling reaction without
seeing the metal catalyst. There is another problem with organometallic compounds when using
SMILES. In fact, SMILES were designed to represent organic compounds only and there is no
obvious way to treat bonds within organo-metallic systems in SMILES. Moreover, for data-driven
reaction prediction models, it is not clear if the correct bond representation is crucial in attaining a
higher prediction accuracy. Acting as catalysts, their presence or absence is often more important
than the exact bonding description within the organo-metallic center.
Since the publication of the USPTO dataset, an entire family of reaction prediction benchmark
subsets with different flavors has appeared (35), as shown in Figure 3. All these subsets were made
available at publication time, including the correct splits between the training, validation, and
test sets. The publicly available data allows not only the reproduction of the scientific outcomes reported
in a publication, but also a direct, statistical comparison between the different approaches. The
first benchmark set was the USPTO_MIT set; during the filtering process, Jin et al. (27) removed
all reactions containing stereochemical information. As stereochemical information might be crucial
for the functionality of molecules, Schwaller et al. (28) generated another subset of the USPTO
dataset keeping stereochemistry, therefore referred to as USPTO_STEREO. Bradshaw et al. (29)
later removed all reactions without a so-called linear electron flow topology (excluding pericyclic
reactions), hence, simplifying the dataset. This subset of USPTO_MIT containing only 73% of the
reactions is referred to as USPTO_LEF. A summary of the benchmark datasets is found in Table 1.

Figure 3. The USPTO dataset family tree with the two versions of the text-mined dataset and the different
filtered subsets used to benchmark data-driven reaction prediction models (26–29, 35, 36).

Reaction Prediction Approaches


The idea of chemical reaction prediction models is not new. Pioneering examples are EROS
(11), CAMEO (14), WODCA (15), and SOPHIA (16), all of which were built on top of either
a rather small-scale reaction or knowledge database. Satoh and Funatsu (16) presented the first
approach not requiring the reaction type or class as input for the prediction and recognized the
potential of using reaction outcome prediction models for the validation of retrosynthesis steps in
synthesis planning tools. For a more extensive review of the history of computer-assisted synthesis
programs we refer the reader to Engkvist et al. (51), Ihlenfeldt et al. (57), Todd (58), Cook et al.
(59), Coley et al. (60), and Battaglia et al. (61). In this chapter, we focus on purely data-driven
chemical reaction prediction methods taking advantage of novel machine-learning techniques based
on artificial neural networks.
The recent data-driven approaches can be distinguished by analyzing the model, the data, the
input features, and the outputs (Table 1). There are several types of network architectures used
for reaction prediction. Feed-forward neural networks learn a function, which maps a fixed-sized
vector through the network to another fixed-sized vector. Sequence-2-sequence (seq-2-seq) and
transformer networks are auto-regressive encoder-decoder architectures and have the advantage that
they can handle inputs of varying lengths, as well as generate outputs of varying lengths. Graph neural
networks learn a function applied to a node in a graph and its neighbors. For an extensive discussion
of the different neural network architectures and their inductive biases, we point the reader to the
review of Battaglia et al. (61).
Kayala et al. (20) used a neural network to predict mechanistic steps through the identification
and ranking of electron sources and sinks. The inputs to the network contained a combination of the
reaction conditions, hand-crafted molecular features, and the local neighborhood of the individual
atoms. As a chemical reaction can consist of a sequence of mechanistic steps, multiple predictions
would be required to get the final product of a reaction. Building further on this idea, Kayala and Baldi
(21) developed the ReactionPredictor, which ranked the atomic interactions based on the output of
three separate feed-forward neural networks: the first trained for polar, the second for pericyclic, and

the last for radical reactions. The main drawback is that data on mechanistic steps is not readily available.
Therefore, Kayala et al. (20) generated their own data using their rule-based expert system (62). In a
more recent work by Fooshee et al. (25), the Baldi group extended their dataset from 5500 to 11,000
elementary reactions. Still applying a very similar approach for the prediction of mechanistic steps,
they showed that a bi-directional long short-term memory network using solely a SMILES string
as input nearly matches the electron source/sink identification performance of their feed-forward
neural network with more chemical inputs.

Table 1. Comparison of the Input, Output, Data, and Model Architecture of the Data-Driven Reaction Prediction Approaches Analyzed in this Chapter

Approach | Input | Output | Data | Model
Kayala et al. (20, 21); Fooshee et al. (25) | Atomic and molecular features, reaction conditions | Electron sources/sinks, mechanistic steps | Generated using rules | Feed-forward neural network
Wei et al. (22) | Neural fingerprint of 2 reactants + 1 reagent | Template ranking (16 rules) | Generated using rules | Feed-forward neural network
Segler and Waller (23) | Extended-connectivity fingerprints | Links in knowledge graph | Binary reactions from Reaxys | Graph reasoning
Segler and Waller (24) | Extended-connectivity fingerprints | Template ranking (8820 rules) | Extracted from Reaxys | Feed-forward neural network
Coley et al. (26) | Edit-based, applying templates | Product ranking | USPTO-15,000 (15,000 random reactions) | Feed-forward neural network
Jin et al. (27) | Molecular graph | Bond changes | USPTO_MIT dataset (480,000 reactions) | Graph convolutional neural network
Schwaller et al. (28) | SMILES, separated reagents | Product molecule generation | USPTO_MIT and USPTO_STEREO (1 million reactions) | Seq-2-seq model with attention
Bradshaw et al. (29) | Molecular graph, separated reagents | Bond changes | USPTO_LEF (350,000 reactions) | Gated graph neural networks
Do et al. (30) | Molecular graph, separated reagents | Bond changes | USPTO-15,000, USPTO_MIT | Graph transformation policy network
Coley et al. (31) | Molecular graph | Bond changes | USPTO_MIT | Graph convolutional neural network
Schwaller et al. (32) | SMILES | Product molecule generation | USPTO_MIT, USPTO_LEF, USPTO_STEREO | Transformer network

Wei et al. (22) used feed-forward neural networks to identify which SMARTS transformation
out of 16 reaction templates to apply to a set of two reactants plus one reagent. Their approach
was based on the concatenation of differentiable molecular fingerprints. Therefore, their network
could be trained end-to-end and did not require any hand-crafted features. In contrast, Segler and
Waller (23) modelled the reaction prediction task with a graph-reasoning model to find missing links
in a knowledge graph made of binary reactions from the Reaxys database (33). In another work,
Segler and Waller (24), used a neural network to rank reaction templates, which were automatically
extracted from the Reaxys database (33). Reactions were represented using traditional fingerprints,
which construct a fixed-sized vector based on the presence and absence of individual local motifs
in the molecules. Segler et al. (63) developed an in-scope filter to estimate the reaction feasibility
based on their fingerprint. As also pointed out by Coley et al. (60), a reaction template might match
different reactive sites in the reactants and, therefore, generate more than one product. Hence,
template ranking is not enough to predict the most likely product of a reaction. To overcome this
problem, Coley et al. (26) proposed a different approach. Instead of ranking the templates, they
applied all the templates matching the reactants in a first step to generate possible candidate products.
The products were then ranked by a neural network. Recognizing the drawbacks of hashing the
reactant molecules to a fixed-sized fingerprint, Coley et al. (26) designed an edit-based reaction
representation based on the atoms that had a change in bond type and hydrogen count. The inputs
to their model were augmented with structural information, as well as easily computable geometric
and electronic information. The method was tested on a rather small subset of the USPTO dataset
containing 15,000 reactions. In general, template-based methods are fundamentally limited by the
set of templates they are based on and cannot predict anything outside the scope of this set. While
automatically generated template sets scale well, it is still not straightforward to produce a good
set of templates (24, 26). Usually, the number of neighboring atoms or the distance around the
reaction center has to be specified. This leads to a trade-off between a large number of very specific
templates and a small number of overly generic templates. Moreover, the local environment near
the reaction center might not be enough to describe the reaction. Another drawback of automatic
template extraction is that the reaction center is typically identified using the atom-mapping, which
depending on the source might not be correct.
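To make the template-ranking idea concrete, the sketch below shows a minimal feed-forward network in PyTorch that maps a fixed-size reactant fingerprint to one score per template. It is an illustrative toy model under our own assumptions (layer sizes, dropout rate), not the architecture published by Segler and Waller (24) or Coley et al. (26).

```python
import torch
import torch.nn as nn

class TemplateRanker(nn.Module):
    """Toy feed-forward template ranker: fingerprint in, one logit per template out."""
    def __init__(self, fp_bits=2048, hidden=512, n_templates=8820):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(fp_bits, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_templates),  # one score per reaction template
        )

    def forward(self, fingerprint):
        return self.net(fingerprint)

model = TemplateRanker()
fake_fps = torch.rand(4, 2048)          # a batch of 4 hypothetical fingerprints
logits = model(fake_fps)
top5 = logits.topk(5, dim=-1).indices   # indices of the 5 highest-ranked templates
print(top5.shape)                       # torch.Size([4, 5])
```

In a template-based pipeline, the top-ranked templates would then be applied to the reactants to enumerate candidate products, which is exactly where the ambiguity of multiple matching reactive sites discussed above arises.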
All in all, before the end of 2017 most of the data-driven reaction prediction approaches were
either rule-based or small-scale. In the meantime, template-free large-scale approaches emerged,
which can be categorized into two main classes: bond change predictions and product molecule
generation (see Figure 4).
Jin et al. (27) presented the Weisfeiler-Lehman Network/Weisfeiler-Lehman Difference
Network approach, which uses a two-step process to predict bond changes within the reactants.
In the first step, a graph-convolutional neural network calculates the pair-wise reactivity between
atoms and identifies possible reaction centers. After the reaction centers are filtered, a Weisfeiler-
Lehman Difference network ranks the bonds most likely reacting. The final product molecule is
generated by applying the suggested bond changes to the reactants. Jin et al. (27) made their dataset,
training, validation, and test split publicly available, from here on referred to as USPTO_MIT. The
dataset contained no reactions with stereochemical information. Reaction SMILES containing
stereoisomers were previously filtered out, as this would have required a more sophisticated
approach, able to predict not only bond changes but also changes in atomic labels (e.g., specifying 3-
dimensional configuration at a tetrahedral carbon).

Figure 4. Timeline of the recent developments of large-scale data-driven reaction prediction models that can
be compared using the different USPTO reaction subsets. There are two main strategies: bond changes
predictions and product molecule generation.

The open-source USPTO_MIT dataset made it possible to compare with alternative methods
directly. In the same year, Schwaller et al. (28) published a SMILES-2-SMILES approach using a
seq-2-seq model with an attention layer. Seq-2-seq models generate product molecules, SMILES
token by SMILES token, using a recurrent neural network (64). While the usage of neural machine
translation models for reaction prediction had already been proposed by Nam and Kim (65) and
for retrosynthesis by Liu et al. (66), it was the first large-scale demonstration of a seq-2-seq model.
Schwaller et al. (28) showed that, representing reactants and reagents solely with SMILES, attention-
based seq-2-seq models could compete with graph-based models whose node features were
composed of more chemical information. The attention weights could be visualized and revealed
that the decoder focuses on one or more relevant atoms in the reactants while predicting each atom of
the product. Compared to the bond change prediction approaches, SMILES-2-SMILES approaches
construct the whole product molecule token-by-token. To solve the ambiguity of atomic order in
SMILES, Schwaller et al. (28) used the canonical SMILES to specify an order in which the atoms
have to be predicted. Besides predicting accuracies similar to the original work of Jin et al. (27) on the
USPTO_MIT set, Schwaller et al. (28) published the USPTO_STEREO dataset to compare models
able to predict stereoisomers (to the level they can be described in SMILES). Beyond the proof
of scaling seq-2-seq models with large datasets, Schwaller et al. (28) introduced a new metric for
measuring accuracy, by weakly separating reactants and reagents with a “>" token and representing
only the most common reagents. This metric was unfortunately endorsed by other groups, creating
a measure of comparison that steered the development of such models in the wrong direction (27,
29–31). Separating reactants and reagents leads to a simplification of the reaction prediction problem,
as one must already know the reacting molecules to do the separation, as pointed out by Griffiths et
al. (67). The prediction problem is then reduced to the prediction of the correct reactive sites. This
metric has been recently corrected (32).
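Since SMILES-2-SMILES models predict the product token by token, the input and output strings first have to be split into chemically meaningful tokens. The regular expression below is adapted from the tokenization scheme used in this family of models; the exact pattern published with the Molecular Transformer may differ slightly, so treat this as an approximation.

```python
import re

# Split a SMILES string into atom- and bond-level tokens (bracket atoms kept whole).
SMILES_TOKENIZER = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%\d{2}|\d)"
)

def tokenize(smiles: str):
    tokens = SMILES_TOKENIZER.findall(smiles)
    assert smiles == "".join(tokens), "tokenization must be lossless"
    return tokens

print(tokenize("Br[Mg]c1ccccc1"))
# ['Br', '[Mg]', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```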
Similar to the Baldi group (62), Bradshaw et al. (29) followed an approach inspired by textbook
organic chemistry and arrow pushing diagrams. They developed a model to predict electron paths.
To do so, they analyzed the graph-edits published by Jin et al. (27). Their method could only be

applied to USPTO_LEF, a subset of USPTO_MIT. In their paper, Bradshaw et al. (29) claim that
they predict not only the product, but also the “mechanism.” While they might get the mechanism
of simple reactions, the underlying mechanistic steps often involve more electron movements than
can be read out by comparing the final product with the starting material. Predicting the correct
product does not mean that the predicted electron path is correct, as graph-edits cannot be taken as
ground truth for mechanistic steps. For instance, a push to a catalyst in a coupling reaction could not
be represented in their method as they add the reagents (e.g., solvents and catalysts) only as global
features. The work of Bradshaw et al. (29) is interesting as they tackle the problem with new ML
approaches. Similarly, Do et al. (30) suggested a Graph Transformation Policy network to learn the
best policy for predicting bond changes. The model was not restricted to USPTO_LEF, but could also
be used on the USPTO_MIT dataset, where, after invalid product removal, it achieved a top-1
accuracy of 83.2%.

Figure 5. Visualization of the two chemical reaction representation settings. a) shows the separate reagents
setting, where the information of which molecule contributes atoms to the product and which molecule does
not is explicitly contained. Unfortunately, this requires knowing the atom-mapping and therefore, also
knowing the product before making the prediction. b), in contrast, shows the mixed setting, where no
distinction is made between reactants and reagents. The model must figure out by itself which molecules are the
most likely to react together. The mixed setting makes the reaction prediction problem more realistic, but also
more challenging.
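The two settings of Figure 5 translate directly into two ways of writing the same reaction SMILES string, sketched below with a made-up hydrolysis; the molecules and role assignments are hypothetical and only meant to show the formatting difference.

```python
# Hypothetical example: methyl acetate hydrolyzed under basic aqueous conditions.
reactants = "CC(=O)OC"         # contributes atoms to the product
reagents  = "O.[Na+].[OH-]"    # water and sodium hydroxide as agents
product   = "CC(=O)O"          # acetic acid (only the major product is reported)

# a) Separated setting: the agents sit between the two ">" symbols, so the model
#    is already told which molecules contribute atoms to the product.
separated = f"{reactants}>{reagents}>{product}"

# b) Mixed setting: all precursors are listed before ">>", and the model itself
#    has to work out which molecules are most likely to react together.
mixed = f"{reactants}.{reagents}>>{product}"

print(separated)
print(mixed)
```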

In late 2018, Coley et al. (31) improved their previous Weisfeiler-Lehman Network/Weisfeiler-
Lehman Difference Network approach presented in Jin et al. (27). The main difference is that they
changed the enumeration criterion in the first step. Instead of generating candidates using the top-
6 atom pairs, they allow up to 5 simultaneous bond changes out of the top-16 bond changes for
the enumeration. This change leads to higher coverage of products in the test set and hence, also
an improvement in the overall accuracy, reaching a considerable top-1 accuracy of 85.6% on the USPTO_MIT
with separated reagents. The approach is still a two-step process and therefore, not end-to-end.
Parameters, like the maximum number of bond changes to consider, have to be determined
empirically over the validation set and might change for another reaction dataset. The coverage of the
first step sets the upper bound for the accuracy of the second step.

Table 2. Top-3 Accuracies of the Recent Data-Driven Reaction Prediction Models on the Different USPTO Subsets (a)

Model | Top-1 [%] | Top-2 [%] | Top-3 [%]

USPTO_MIT, separated reagents
Jin et al. (27) | 79.6 | n/a | 87.7
Schwaller et al. (28) | 80.3 | 84.7 | 86.2
Do et al. (30) | 83.2 | n/a | 86.0
Coley et al. (31) | 85.6 | 90.5 | 92.8
Schwaller et al. (32) | 90.4 | 93.7 | 94.6

USPTO_MIT, mixed reagents
Jin et al. (27) | 74.0 | n/a | 86.7
Schwaller et al. (32) | 88.6 | 92.4 | 93.5

USPTO_STEREO, separated reagents
Schwaller et al. (28) | 65.4 | 71.8 | 74.1
Schwaller et al. (32) | 78.1 | 84.0 | 85.8

USPTO_STEREO, mixed reagents
Schwaller et al. (32) | 76.2 | 82.4 | 84.3

(a) Currently, only product generation models can consider stereochemical information and make predictions on the USPTO_STEREO dataset. For all the models, a significant accuracy increase is observed between top-1 and top-2.

Recently, Schwaller et al. (32) demonstrated for the first time accuracies of over 90% on the
USPTO_MIT dataset. They called their model the Molecular Transformer, as it was built on top
of the transformer architecture (68). This architecture is a seq-2-seq model, where the encoder
and decoder consist of multi-head attention layers instead of recurrent neural networks, as done
in traditional seq-2-seq models. To prevent the model from learning only from the canonical
representation, the training set inputs were augmented with noncanonical versions of the SMILES
(69). Schwaller et al. (32) not only show significant improvements in terms of top-1 accuracy on
the USPTO_MIT dataset, but also on the USPTO_STEREO and a time-split Pistachio reaction
test set containing stereochemical information. One major advantage of this approach is that the
Molecular Transformer outperforms all previous approaches, even when no distinction is made
between reactants and reagents in the input. Therefore, the approach is the first, which is completely
template and atom-mapping independent. The difference between a separated reagent and the so-
called mixed reagents reaction representation is visualized in Figure 5. It is also interesting to note
that SMILES-based linguistic approaches have often been discredited because of the possibility of
introducing syntactical errors during the SMILES inference process. Syntactical errors do occur, but
the capacity of an underlying AI model to learn the grammar rules behind the SMILES codification
depends very much on the architecture used. For instance, the work by Schwaller et al. (32) using
the Molecular Transformer clearly shows that less than 1% of the top-1 predictions are grammatically
invalid. Remarkably, the underlying AI model learns not only the domain knowledge (organic
chemistry), but also the SMILES grammar, to a level that can be considered close to perfection.
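The SMILES augmentation step mentioned above can be reproduced with a few lines of RDKit, as sketched here; the use of the doRandom option and the number of sampled variants are implementation details of our own choosing and not part of the published training recipe.

```python
from rdkit import Chem

# The same molecule written in several noncanonical atom orderings
# (requires an RDKit version that supports the doRandom flag).
mol = Chem.MolFromSmiles("OC(c1ccccc1)(c1ccccc1)c1ccccc1")  # product from Figure 2
canonical = Chem.MolToSmiles(mol)
variants = {Chem.MolToSmiles(mol, doRandom=True) for _ in range(20)}

print(canonical)
print(len(variants), "noncanonical variants sampled")
```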
In Table 2, we report the top-1, top-2, and top-3 accuracies of the different approaches on the
USPTO_MIT and USPTO_STEREO set, where top-N accuracy means that the reported product
could be found in the N most likely predictions of the model.
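For clarity, the top-N metric can be written down in a few lines; the helper below is a generic illustration (in practice, predicted and reference products are canonicalized before the comparison).

```python
def top_n_accuracy(ranked_predictions, references, n=3):
    """Fraction of reactions whose reported product appears in the n best candidates."""
    hits = sum(ref in preds[:n] for preds, ref in zip(ranked_predictions, references))
    return hits / len(references)

# Toy example: two reactions, three ranked candidate products each.
preds = [["CCO", "CCN", "CCC"], ["c1ccccc1", "C1CCCCC1", "CC"]]
refs = ["CCN", "CC"]
print(top_n_accuracy(preds, refs, n=1))  # 0.0 -> neither reference is ranked first
print(top_n_accuracy(preds, refs, n=3))  # 1.0 -> both are within the top 3
```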

Figure 6. Product-reactants attention generated by the Molecular Transformer for a Bromo Suzuki coupling
reaction (32). Attention weights show how important an input token (horizontal) was for the prediction of
an output token (vertical). The model focused on the corresponding molecule parts in the reactants, while
predicting the product.

As we have discussed above, there are two main methods to construct the major products of a
reaction: product generation and bond changes prediction methods. While the most recent product
generation methods are completely atom-mapping independent, the atom-mapping is required to
generate the ground-truth bond changes for the bond changes prediction methods. As atom-
mapping is still typically generated by rule-based approaches, the bond changes prediction methods
inherit the limitations of the underlying approach used for the atom-mapping.
In the work of Coley et al. (31) and Schwaller et al. (32), the attention weights are used to
enhance the explicability of their predictions and make the models more transparent, which is one
of the major criticisms of those data-driven black-box models. Coley et al. (31) calculate pair-wise
interactions between reactant and reagent atoms (source) during the first step of their approach.
The most reactive sites can be identified by selecting one atom and highlighting those interactions
with all the other source atoms. In the Molecular Transformer, instead, this would correspond to
a visualization of the self-attention in the encoder. When using the Molecular Transformer, not
only can the encoder and decoder self-attentions be visualized, but more interestingly, the decoder-
encoder attention can be visualized as well. The latter can be interpreted as how important source

atoms are to predict a specific product atom. Empirical evaluations of those attention weight maps
show that the model learned something similar to atom-mapping, as seen for a Bromo Suzuki
coupling reaction in Figure 6.
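An attention map like the one in Figure 6 can be rendered from the raw decoder-encoder weights with a few lines of matplotlib; in the sketch below the weights are random placeholders, since the real values would come from a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder attention matrix: rows = product (output) tokens,
# columns = reactant (input) tokens; real values would come from the model.
reactant_tokens = ["Br", "[Mg]", "c", "1", "c", "c", "c", "c", "c", "1"]
product_tokens = ["O", "C", "(", "c", "1", "c", "c", "c", "c", "c", "1", ")"]
attention = np.random.rand(len(product_tokens), len(reactant_tokens))
attention /= attention.sum(axis=1, keepdims=True)  # normalize per output token

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(attention, cmap="viridis")
ax.set_xticks(range(len(reactant_tokens)))
ax.set_xticklabels(reactant_tokens)
ax.set_yticks(range(len(product_tokens)))
ax.set_yticklabels(product_tokens)
ax.set_xlabel("reactant tokens (input)")
ax.set_ylabel("product tokens (output)")
fig.colorbar(im, ax=ax, label="attention weight")
plt.show()
```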
The models presented by Coley et al. (31) and Schwaller et al. (32) both make predictions in real
time when running on the USPTO_MIT test set. The timings reported in the articles are strongly
dependent on the hardware, the batching of the reactions, and the reaction input length. Beyond
the prediction time, which in both cases has a negligible impact on the user experience, the training
time and hyperparameter search represent major limiting factors. Improvements could be made by
optimizing the code for speed. Molecular Transformers were recently adopted as a chemical reaction
prediction model by Bradshaw et al. (70) to discover new molecules using chemical reactions.

Conclusion and Outlook


The availability of open-data and more powerful hardware enabled a rise in novel approaches
to tackle challenges in organic synthesis that went beyond simple regression problems. Prediction of
chemical reactivity, which long was reckoned to be an art only human experts could do, has become
within reach of data-driven learning systems.
In this chapter, we compared recent approaches and identified the two dominating strategies:
bond changes prediction and product molecule generation (27–32). Although bond changes
prediction approaches typically use a stronger inductive bias and more chemical information, they
are limited by the tools that are available to generate the ground truth bond changes. By applying
valency rules and filtering all invalid bond changes, the “validity” of the predicted product can
be guaranteed. Product generation models, in contrast, are less restricted and could technically
construct molecules that do not follow valency rules or even predict alchemic changes, where for
example a bromine atom in the reactants would turn into a fluorine atom in the products. The
filtering using syntax and valency rules can be done in a post-processing step. However, recent
models were able to almost perfectly infer such constraints from the training examples. Hence,
product molecule generation models are currently able to achieve higher prediction accuracies, even
when no distinction is made between reactants and reagents. Product molecule generation
approaches are the only ones that have been assessed on datasets containing stereochemical
information so far.
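The post-processing validity check mentioned above is straightforward with a cheminformatics toolkit; the sketch below uses RDKit, whose SMILES parser rejects both syntactically broken strings and standard valence violations. The candidate strings are invented for illustration.

```python
from rdkit import Chem

def is_valid(smiles: str) -> bool:
    """True if the SMILES parses into a chemically valid (sanitizable) molecule."""
    return Chem.MolFromSmiles(smiles) is not None

candidates = [
    "OC(c1ccccc1)(c1ccccc1)c1ccccc1",  # valid: triphenylmethanol
    "C1CC",                            # invalid: unclosed ring
    "F(F)(F)F",                        # invalid: fluorine valence violation
]
print([is_valid(s) for s in candidates])  # [True, False, False]
```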
Since they can leverage recent advancements in Natural Language Processing, text-based
chemical reaction representations are currently dominating. However, developing novel reaction
representations with a stronger inductive bias based on domain knowledge and adding reaction
condition information has the potential to lead to significant advancements. In any case, new reaction
representations should not rely on rule-generated atom-mapping.
The success of data-driven models heavily depends on the data they are trained on. Systematic
errors and incomplete reactions are challenging and will mislead the models. The hope for the
future is a shortened information flow between the data generator (e.g., the medicinal chemist) and
the data consumer (e.g., the cheminformatician). We hope that one day all open-access publications
in synthetic organic chemistry will be accompanied by all experimental data, including failed experiments.
Moreover, similar to what happened in the computational material science community with projects
like MaterialsCloud (71), NOMAD (72), and the Materials Genome Initiative (73), an effort is
needed by the Synthetic Organic Chemistry community to manage distributed solutions to store
data and to agree on chemical reaction representation standards. This would lead to the availability of
higher quality data, which will further enhance the performance of data-driven models with a great
return for the entire chemical community.

When this happens, one of the greatest challenges we will face will be the combination of data
from different sources with varying noise levels to best guide the exploration of chemical space. In
this respect, the rise of automation and robotic platforms that we are currently witnessing will have a
profound impact on the quality of the produced data (74). Better data will lead to better predictive
data-driven models. This feedback loop between the automation platform and data-driven models
has the potential to revolutionize the way chemistry will be done in the future.

References
1. Campbell, M.; Hoane, A. J.; Hsu, F. Deep Blue. Artif. Intell. 2002, 134, 57–83.
2. Ferrucci, D.; Brown, E.; Chu-Carroll, J.; Fan, J.; Gondek, D.; Kalyanpur, A. A.; Lally, A.;
Murdock, J. W.; Nyberg, E.; Prager, J.; Schlaefer, N.; Welty, C. Building Watson: An Overview
of the DeepQA Project. AI Mag. 2010, 31, 59–79.
3. Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; van den Driessche, G.;
Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.;
Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel,
T.; Hassabis, D. Mastering the Game of Go with Deep Neural Networks and Tree Search.
Nature 2016, 529, 484–489.
4. Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.;
Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche,
G.; Graepel, T.; Hassabis, D. Mastering the Game of Go without Human Knowledge. Nature
2017, 550, 354–359.
5. Corey, E. J. General Methods for the Construction of Complex Molecules. Pure Appl. Chem.
1967, 14, 19–38.
6. Corey, E. J.; Long, A. K.; Rubenstein, S. D. Computer-Assisted Analysis in Organic Synthesis.
Science 1985, 228, 408–418.
7. Corey, E. J.; Wipke, W. T.; Cramer, R. D.; Howe, W. J. Computer-Assisted Synthetic Analysis.
Facile Man-Machine Communication of Chemical Structure by Interactive Computer
Graphics. J. Am. Chem. Soc. 1972, 94, 421–430.
8. Wipke, W. T.; Dyott, T. M. Simulation and Evaluation of Chemical Synthesis. Computer
Representation and Manipulation of Stereochemistry. J. Am. Chem. Soc. 1974, 96,
4825–4834.
9. Gelernter, H. L.; Sanders, A. F.; Larsen, D. L.; Agarwal, K. K.; Boivie, R. H.; Spritzer, G. A.;
Searleman, J. E. Empirical Explorations of SYNCHEM. Science 1977, 197, 1041–1049.
10. Gelernter, H.; Rose, J. R.; Chen, C. Building and Refining a Knowledge Base for Synthetic
Organic Chemistry via the Methodology of Inductive and Deductive Machine Learning. J.
Chem. Inf. Comput. Sci. 1990, 30, 492–504.
11. Gasteiger, J.; Jochum, C. EROS A Computer Program for Generating Sequences of Reactions.
Org. Compounds 1978, 93–126.
12. Gasteiger, J.; Hutchings, M. G.; Christoph, B.; Gann, L.; Hiller, C.; Löw, P.; Marsili, M.;
Saller, H.; Yuki, K. A New Treatment of Chemical Reactivity: Development of EROS, an
Expert System for Reaction Prediction and Synthesis Design. In Organic Synthesis, Reactions and
Mechanisms; Topics in Current Chemistry; Springer: Berlin Heidelberg, 1987; pp 19–73.

13. Dugundji, J.; Ugi, I. An Algebraic Model of Constitutional Chemistry as a Basis for Chemical
Computer Programs. In Computers in Chemistry; Fortschritte der Chemischen Forschung;
Springer: Berlin Heidelberg, 1973; pp 19–64.
14. Jorgensen, W. L.; Laird, E. R.; Gushurst, A. J.; Fleischer, J. M.; Gothe, S. A.; Helson, H. E.;
Paderes, G. D.; Sinclair, S. CAMEO: A Program for the Logical Prediction of the Products of
Organic Reactions. Pure Appl. Chem. 1990, 62, 1921–1932.
15. Gasteiger, J.; Ihlenfeldt, W. D.; Röse, P. A Collection of Computer Methods for Synthesis
Design and Reaction Prediction. Recl. Trav. Chim. Pays-Bas 1992, 111, 270–290.
16. Satoh, H.; Funatsu, K. SOPHIA, a Knowledge Base-Guided Reaction Prediction System -
Utilization of a Knowledge Base Derived from a Reaction Database. J. Chem. Inf. Comput. Sci.
1995, 35, 34–44.
17. Grzybowski, B. A.; Szymkuć, S.; Gajewska, E. P.; Molga, K.; Dittwald, P.; Wołos, A.;
Klucznik, T. Chematica: A Story of Computer Code That Started to Think like a Chemist. Chem
2018, 4, 390–398.
18. Klucznik, T.; Mikulak-Klucznik, B.; McCormack, M. P.; Lima, H.; Szymkuć, S.; Bhowmick,
M.; Molga, K.; Zhou, Y.; Rickershauser, L.; Gajewska, E. P. Efficient Syntheses of Diverse,
Medicinally Relevant Targets Planned by Computer and Executed in the Laboratory. Chem
2018, 4, 522–532.
19. Web of Science. http://wcs.webofknowledge.com/RA/analyze.do?product=WOS&SID=
F5VcAxKl6LOVlxdQVQd&field=PY_PublicationYear_PublicationYear_en&yearSort=true
(accessed Sep 19, 2019).
20. Kayala, M. A.; Azencott, C.-A.; Chen, J. H.; Baldi, P. Learning to Predict Chemical Reactions.
J. Chem. Inf. Model. 2011, 51, 2209–2222.
21. Kayala, M. A.; Baldi, P. ReactionPredictor: Prediction of Complex Chemical Reactions at the
Mechanistic Level Using Machine Learning. J. Chem. Inf. Model. 2012, 52, 2526–2540.
22. Wei, J. N.; Duvenaud, D.; Aspuru-Guzik, A. Neural Networks for the Prediction of Organic
Chemistry Reactions. ACS Cent. Sci. 2016, 2, 725–732.
23. Segler, M. H. S.; Waller, M. P. Modelling Chemical Reasoning to Predict and Invent Reactions.
Chem.—Eur. J. 2017, 23, 6118–6128.
24. Segler, M. H. S.; Waller, M. P. Neural-Symbolic Machine Learning for Retrosynthesis and
Reaction Prediction. Chem.—Eur. J. 2017, 23, 5966–5971.
25. Fooshee, D.; Mood, A.; Gutman, E.; Tavakoli, M.; Urban, G.; Liu, F.; Huynh, N.; Vranken,
D. V.; Baldi, P. Deep Learning for Chemical Reaction Prediction. Mol. Syst. Des. Eng. 2018, 3,
442–452.
26. Coley, C. W.; Barzilay, R.; Jaakkola, T. S.; Green, W. H.; Jensen, K. F. Prediction of Organic
Reaction Outcomes Using Machine Learning. ACS Cent. Sci. 2017, 3, 434–443.
27. Jin, W.; Coley, C.; Barzilay, R.; Jaakkola, T. Predicting Organic Reaction Outcomes with
Weisfeiler-Lehman Network. In Advances in Neural Information Processing Systems 30; Guyon,
I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.;
Curran Associates, Inc., 2017; pp 2607–2616.
28. Schwaller, P.; Gaudin, T.; Lányi, D.; Bekas, C.; Laino, T. “Found in Translation”: Predicting
Outcomes of Complex Organic Chemistry Reactions Using Neural Sequence-to-Sequence
Models. Chem. Sci. 2018, 9, 6091–6098.

29. Bradshaw, J.; Kusner, M. J.; Paige, B.; Segler, M. H. S.; Hernández-Lobato, J. M. A Generative
Model For Electron Paths. 2018, arXiv:1805.10970. arXiv.org e-Print archive. https://arxiv.org/
abs/1805.10970.
30. Do, K.; Tran, T.; Venkatesh, S. Graph Transformation Policy Network for Chemical Reaction
Prediction. 2018, arXiv:1812.09441. arXiv.org e-Print archive. https://arxiv.org/abs/1812.
09441.
31. Coley, C. W.; Jin, W.; Rogers, L.; Jamison, T. F.; Jaakkola, T. S.; Green, W. H.; Barzilay, R.;
Jensen, K. F. A Graph-Convolutional Neural Network Model for the Prediction of Chemical
Reactivity. Chem. Sci. 2019, 10, 370–377.
32. Schwaller, P.; Laino, T.; Gaudin, T.; Bolgar, P.; Bekas, C.; Lee, A. A. Molecular Transformer:
A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci. 2019, 5,
1572–1583.
33. Reaxys. https://www.reaxys.com/#/login (accessed June 3, 2019).
34. SciFinder. CAS. https://www.cas.org/products/scifinder (accessed June 3, 2019).
35. Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. Doctoral Thesis,
University of Cambridge, 2012.
36. Lowe, D. Chemical Reactions from US Patents (1976-Sep2016). 2017. figshare Dataset.
https://doi.org/10.6084/m9.figshare.5104873.v1.
37. IBM RXN for Chemistry. https://rxn.res.ibm.com/
38. Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.;
Laufer, J. Description of Several Chemical Structure File Formats Used by Computer Programs
Developed at Molecular Design Limited. J. Chem. Inf. Comput. Sci. 1992, 32, 244–255.
39. Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to
Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.
40. Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for Generation of Unique
SMILES Notation. J. Chem. Inf. Comput. Sci. 1989, 29, 97–101.
41. Murray-Rust, P.; Rzepa, H. S. Chemical Markup, XML, and the Worldwide Web. 1. Basic
Principles. J. Chem. Inf. Comput. Sci. 1999, 39, 928–942.
42. Holliday, G. L.; Murray-Rust, P.; Rzepa, H. S. Chemical Markup, XML, and the World Wide
Web. 6. CMLReact, an XML Vocabulary for Chemical Reactions. J. Chem. Inf. Model. 2006,
46, 145–157.
43. Grethe, G.; Goodman, J. M.; Allen, C. H. International Chemical Identifier for Reactions
(RInChI). J. Cheminformatics 2013, 5, 45.
44. Heller, S. R.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. InChI, the IUPAC
International Chemical Identifier. J. Cheminformatics 2015, 7, 23.
45. Grethe, G.; Blanke, G.; Kraut, H.; Goodman, J. M. International Chemical Identifier for
Reactions (RInChI). J. Cheminformatics 2018, 10, 22.
46. Jacob, P.-M.; Lan, T.; Goodman, J. M.; Lapkin, A. A. A Possible Extension to the RInChI as a
Means of Providing Machine Readable Process Data. J. Cheminformatics 2017, 9, 23.
47. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50,
742–754.
48. O’Boyle, N.; Dalke, A. DeepSMILES: An Adaptation of SMILES for Use in Machine-Learning of
Chemical Structures. 2018. ChemRxiv. https://doi.org/10.26434/chemrxiv.7097960.v1.

49. Krenn, M.; Häse, F.; Nigam, A.; Friederich, P.; Aspuru-Guzik, A. SELFIES: A Robust
Representation of Semantically Constrained Graphs with an Example Application in Chemistry.
2019, arXiv:1905.13741. arXiv.org e-Print archive. https://arxiv.org/abs/1905.13741.
50. Sanchez-Lengeling, B.; Aspuru-Guzik, A. Inverse Molecular Design Using Machine Learning:
Generative Models for Matter Engineering. Science 2018, 361, 360–365.
51. Engkvist, O.; Norrby, P.-O.; Selmi, N.; Lam, Y.; Peng, Z.; Sherer, E. C.; Amberg, W.; Erhard,
T.; Smyth, L. A. Computational Prediction of Chemical Reactions: Current Status and
Outlook. Drug Discov. Today 2018, 23, 1203–1218.
52. Lowe, D. M.; Sayle, R. A. LeadMine: A Grammar and Dictionary Driven Approach to Entity
Recognition. J. Cheminformatics 2015, 7, S5.
53. Indigo Toolkit. https://lifescience.opensource.epam.com/indigo/index.html (accessed June 3,
2019).
54. Jaworski, W.; Szymkuć, S.; Mikulak-Klucznik, B.; Piecuch, K.; Klucznik, T.; Kaźmierowski,
M.; Rydzewski, J.; Gambin, A.; Grzybowski, B. A. Automatic Mapping of Atoms across Both
Simple and Complex Chemical Reactions. Nat. Commun. 2019, 10, 1434.
55. Schneider, N.; Stiefl, N.; Landrum, G. A. What’s What: The (Nearly) Definitive Guide to
Reaction Role Assignment. J. Chem. Inf. Model. 2016, 56, 2336–2346.
56. NextMove Software. NameRxn. https://www.nextmovesoftware.com/namerxn.html (accessed
June 3, 2019).
57. Ihlenfeldt, W.-D.; Gasteiger, J. Computer-Assisted Planning of Organic Syntheses: The
Second Generation of Programs. Angew. Chem., Int. Ed. Engl. 1996, 34, 2613–2633.
58. Todd, M. H. Computer-Aided Organic Synthesis. Chem. Soc. Rev. 2005, 34, 247–266.
59. Cook, A.; Johnson, A. P.; Law, J.; Mirzazadeh, M.; Ravitz, O.; Simon, A. Computer-Aided
Synthesis Design: 40 Years On. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2012, 2, 79–107.
60. Coley, C. W.; Green, W. H.; Jensen, K. F. Machine Learning in Computer-Aided Synthesis
Planning. Acc. Chem. Res. 2018, 51, 1281–1289.
61. Battaglia, P. W.; Hamrick, J. B.; Bapst, V.; Sanchez-Gonzalez, A.; Zambaldi, V.; Malinowski,
M.; Tacchetti, A.; Raposo, D.; Santoro, A.; Faulkner, R.; Gulcehre, C.; Song, F.; Ballard, A.;
Gilmer, J.; Dahl, G.; Vaswani, A.; Allen, K.; Nash, C.; Langston, V.; Dyer, C.; Heess, N.;
Wierstra, D.; Kohli, P.; Botvinick, M.; Vinyals, O.; Li, Y.; Pascanu, R. Relational Inductive
Biases, Deep Learning, and Graph Networks. 2018, arXiv:1806.01261. arXiv.org e-Print archive.
https://arxiv.org/abs/1806.01261.
62. Chen, J. H.; Baldi, P. No Electron Left Behind: A Rule-Based Expert System to Predict
Chemical Reactions and Reaction Mechanisms. J. Chem. Inf. Model. 2009, 49, 2034–2043.
63. Segler, M. H. S.; Preuss, M.; Waller, M. P. Planning Chemical Syntheses with Deep Neural
Networks and Symbolic AI. Nature 2018, 555, 604–610.
64. Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focussed Molecule Libraries
for Drug Discovery with Recurrent Neural Networks. 2017, arXiv:1701.01329. arXiv.org e-Print
archive. https://arxiv.org/abs/1701.01329.
65. Nam, J.; Kim, J. Linking the Neural Machine Translation and the Prediction of Organic Chemistry
Reactions. 2016, arXiv:1612.09529. arXiv.org e-Print archive. https://arxiv.org/abs/1612.
09529.

66. Liu, B.; Ramsundar, B.; Kawthekar, P.; Shi, J.; Gomes, J.; Luu Nguyen, Q.; Ho, S.; Sloane,
J.; Wender, P.; Pande, V. Retrosynthetic Reaction Prediction Using Neural Sequence-to-
Sequence Models. ACS Cent. Sci. 2017, 3, 1103–1113.
67. Griffiths, R.-R.; Schwaller, P.; Lee, A. Dataset Bias in the Natural Sciences: A Case Study in
Chemical Reaction Prediction and Synthesis Design. 2018. ChemRxiv. https://doi.org/10.26434/
chemrxiv.7366973.v1.
68. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.;
Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems
30; Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett,
R., Eds.; Curran Associates, Inc., 2017; pp 5998–6008.
69. Bjerrum, E. J. SMILES Enumeration as Data Augmentation for Neural Network Modeling of
Molecules. 2017, arXiv:1703.07076. arXiv.org e-Print archive. https://arxiv.org/abs/1703.
07076.
70. Bradshaw, J.; Kusner, M. J.; Paige, B.; Segler, M. H. S.; Hernández-Lobato, J. M. Generating
Molecules via Chemical Reactions. 2019, CUED Publications database. http://publications.eng.
cam.ac.uk/1119728/.
71. Team, T. M. C. Materials Cloud. https://www.materialscloud.org/ (accessed June 13, 2019).
72. HOME - NOMAD. https://www.nomad-coe.eu/ (accessed June 13, 2019).
73. Materials Genome Initiative. https://www.mgi.gov/ (accessed June 13, 2019).
74. Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G. Predicting Reaction
Performance in C–N Cross-Coupling Using Machine Learning. Science 2018, 360, 186–190.

79
Chapter 5

Using Machine Learning To Inform Decisions in Drug Discovery: An Industry Perspective

Darren V. S. Green*

Department of Molecular Design, GlaxoSmithKline, Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, United Kingdom
*E-mail: darren.vs.green@gsk.com

Modern machine-learning techniques have powered a wave of creative approaches that aim to solve or improve long-standing productivity and attrition problems in drug discovery. While industrial practitioners are keen to embrace new technology, it is important for the community to understand the need to produce actionable decisions for scientists in the field, and the implications this has for how methods and models are conceived, built, and validated, and for how their benefits are quantified.

Introduction and Scope


The discovery of new drugs is a topic of interest to everyone at some point in their lives, whether
as a concerned child, parent, spouse, friend, or medical patient. Unsurprisingly, many scientists are
dedicated to the research and development of new therapies and technologies that may enable drug
discovery. Considering the personal interest that many people take in particular therapy areas,
especially for life-threatening (e.g., oncology, heart disease, and chronic obstructive pulmonary
disease) or life-destroying diseases (e.g., Alzheimer’s disease, dementia, rheumatoid arthritis), one
can see how drug discovery would be very susceptible to the technology hype cycle (Figure 1) based
on Amara’s law and popularized by the Gartner group (1, 2). The hype cycle provides a graphical
and conceptual presentation of the maturity of emerging technologies through five phases, from the
initial invention to wide-scale adoption. Technologies such as computer-aided drug design, high-
throughput screening, combinatorial chemistry, genetics, and a plethora of omics approaches have
all been hailed as game changers for the industry. Yet the rate of discovery of new therapies has
remained stubbornly unchanged, while research and development costs have increased (3). It is
sensible to always ask of a new technology whether it is a better way of solving the problem or simply
a different approach.



Figure 1. The technology hype cycle.

Modern machine learning (ML) is the latest technology to enter this hype cycle. In order to
ensure that the community makes the most of this opportunity and does not languish for years in
the trough of disillusionment, it is important to focus on producing actionable decisions for scientists
facing the typical decisions that must be taken in order to progress drug discovery programs. There is
no need to produce a raft of predictions or ideas if the decision will ultimately be left to the intuition
of a single scientist, with all of the bias and poor outcomes that can result (4). While there are many
decisions and subject areas of drug discovery that ML can be applied to (5), this chapter focuses on
the discovery and optimization of small molecule drugs, from initial identification of a hit or lead
chemical series to the selection of a single development candidate. At each stage of the discovery
cycle, relevant publications will be discussed in the context of the opportunity for ML to add value
and the necessary criteria for material impact on the drug discovery process. The author has taken a
reasonably broad definition of what constitutes ML to enable inclusion of well-established statistical
methods and heuristic algorithms, which are often used in combination with ML.

Screening for Hit and Tool Molecules


The starting point for small molecule drug discovery is the identification of a target protein
(gene), pathway, or phenotype, the modulation of which is hypothesized to treat and ideally cure a particular
disease. A starting point, or hit, must be found. Due to the close similarities observed between
a starting point and the resulting drug (6), significant investment has been made in a number of
approaches and experimental platforms over the years, including high-throughput screening (HTS)
(7), encoded library technologies (8), and fragment-based drug design (9). These techniques are
centered on testing a diverse set of molecules to find which ones bind to the target of interest. The
two critical data analysis procedures are screen quality assurance (QA) and hit detection. When
monitoring the HTS assay, which takes place over sets of screening plates of diverse compounds, the
screening scientist needs to examine the actionable readout and ask: Is the quality of my assay within
the expected acceptable range? If not, is it so bad that it should be discarded and the experiment
repeated, or can some of it be used? Traditionally, statistical procedures are applied for this analysis,
based on well-understood principles of quality management developed by manufacturing industries
(10). Once the data are accepted, compounds must be identified that show the required biological
response (e.g., inhibition of an enzyme). Standard procedures involve application of statistical
techniques ranging from quite simple tests to advanced recognition and correction of response
patterns across screening plates (Figure 2) (11).
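
To make the preceding description concrete, the sketch below shows the flavor of calculation involved: a Z′-factor computed from plate controls to accept or reject the plate, followed by a simple percent-inhibition normalization and a mean-plus-three-standard-deviations hit call. The control layout, thresholds, and simulated numbers are illustrative assumptions, not a prescription for any particular assay.

```python
import numpy as np

def z_prime(pos_controls, neg_controls):
    """Z'-factor plate quality metric: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|.
    Values above roughly 0.5 are conventionally taken to indicate a robust assay."""
    pos, neg = np.asarray(pos_controls), np.asarray(neg_controls)
    return 1.0 - 3.0 * (pos.std(ddof=1) + neg.std(ddof=1)) / abs(pos.mean() - neg.mean())

def percent_inhibition(signal, pos_mean, neg_mean):
    """Normalize raw well signals to percent inhibition using the plate controls."""
    return 100.0 * (neg_mean - signal) / (neg_mean - pos_mean)

# Illustrative plate: simulated raw signals for controls and test wells.
rng = np.random.default_rng(0)
pos = rng.normal(100, 5, 32)       # fully inhibited control wells
neg = rng.normal(1000, 50, 32)     # uninhibited (neutral) control wells
wells = rng.normal(950, 60, 320)   # test compounds, mostly inactive

zp = z_prime(pos, neg)
if zp > 0.5:  # accept the plate, then call hits at mean + 3 SD of normalized activity
    activity = percent_inhibition(wells, pos.mean(), neg.mean())
    hits = np.where(activity > activity.mean() + 3 * activity.std(ddof=1))[0]
    print(f"Z' = {zp:.2f}; {len(hits)} putative hit wells")
```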

Figure 2. Typical plate patterns observed in an HTS experiment, which can be corrected algorithmically.
Wells that contain apparent hits are colored red, with clear gradient (left) and edge (right) effects on the
plates where false positives serve to obscure true signals in the data.

It is not difficult to imagine the replacement of these approaches with ML. For example, Shterev
et al. demonstrated an implementation of Bayesian techniques for hit identification (12). Indeed,
where the data analysis challenges are more complex (e.g., where noisy assays are tested with
compounds in pools rather than as singletons), there are already examples of effective ML approaches
(13). Cheminformatics approaches are routinely used to rescue false positives in HTS using ML
models or clustering methods (14).
However, what would prompt the application of ML to the routine screening of robust or simple
assays, thereby displacing well-understood and established statistical approaches? For screening QA, perhaps earlier detection of an assay beginning to deteriorate is needed, before poor data are
generated. Perhaps the assay equipment or reagents could be monitored for physical signs (e.g.,
viscosity of reagents or buildup of residues on dispensing tips), and image or video data that are more
suited to ML could be used. Nonetheless, cost benefits must come into play, and many questions
come to mind: how often is this a problem, what would be the cost of development and
implementation, and what decision gets changed? If the screener has to reorder and retest two plates
instead of four plates for an automated screening system, there is not a significant difference in the
amount of work being done, even if a press release or social media headline suggests that artificial intelligence can reduce rework by 50%.
It is more exciting to consider modern cell-based screens, particularly those where an image of a cell is the primary readout, the assay delivers a multiparametric response, or cells are cultured in a three-dimensional (3D) tissuelike construct (15). ML and informatics solutions are essential in order
to exploit the full capability of the assay techniques. Flow cytometry is a powerful, experimental
technique that is able to distinguish individual cells by means of the light scattering properties of the
cells, which have been exposed to fluorescently labeled antibodies that characterize the state of a cell
in response to disease or artificial activation and modulation by a small molecule (16). Traditional
practice involved plotting the data in two dimensions and having a scientist draw around populations
of cells in order to categorize them. With modern ML and informatics, this may be automated
through the use of tools such as FlowJo (17). The fully automated Novartis system is able to process
50,000 wells per day, with informatics systems to automate QA and process the data for operator
review (18).

Images of cells acquired through microscopy-based assays provide more information on
phenotypes induced by exposure to a small molecule. Solutions, such as CellProfiler (19), use feature
extraction of the cellular information by object segmentation, followed by selection, normalization,
and input into a classification algorithm. This approach works well for teams using a limited number
of assays but requires some customization for new assays. In addition, steps in the analysis pipeline
have parameters that can be adjusted, giving rise to a difficult challenge of joint parameter
optimization across the pipeline. Advances in deep convolutional neural networks (CNN) for image
analysis have been applied to this domain with impressive results on eight diverse data sets (20), even
where data sizes were relatively modest (<100 images). Although the computational time of using a CNN for model training is significant (roughly 1.5 days for one data set on an NVIDIA Tesla K80 graphics processing unit with 11.5 GB of memory), there is no need for
the empirical parameter tuning and experimentation required in the feature extraction approach, thus
saving time and removing human bias (Figure 3).
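
The sketch below illustrates, in outline, the end-to-end alternative described above: raw image pixels go straight into a small convolutional network trained on phenotype labels, with no hand-crafted feature extraction. The layer sizes, image dimensions, and number of classes are placeholders; this is not the multiscale architecture of reference (20).

```python
import torch
import torch.nn as nn

class PhenotypeCNN(nn.Module):
    """Small end-to-end CNN: raw image in, phenotype class scores out."""
    def __init__(self, n_classes=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling copes with any input size
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One illustrative training step on random tensors standing in for grayscale cell images.
model = PhenotypeCNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(16, 1, 96, 96)     # batch of 16 single-channel 96x96 crops
labels = torch.randint(0, 8, (16,))    # phenotype class labels
loss = nn.functional.cross_entropy(model(images), labels)
opt.zero_grad()
loss.backward()
opt.step()
```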

Figure 3. Comparison between a conventional image analysis pipeline and an approach based on multiscale
CNN. (a) Starting from the raw image data, a conventional pipeline workflow carries out a series of
independent data analysis steps that culminates in a prediction for the phenotype classes. Each step involves
method customization, as well as parameter adjustments. (b) The multiscale CNN approach instead
classifies the raw image data into phenotypes in one unbiased and automatic step. The parameters in the
approach correspond to the weights of the neural network, and these are automatically optimized based on
training images. Reproduced with permission from reference (20). Copyright 2017 Oxford University
Press.

In summary, ML approaches may provide incremental improvements in HTS processes and hit
identification, cost and time savings for high-content assays via automation, and improvements to
quality via consistent data analysis and the detection of patterns and signals that a scientist might
overlook.

Computational Approaches for Hit and Tool Molecules


Diversity-based screening is not the only approach that can generate hits. Computational
chemists and chemoinformaticians have applied and championed their own version of HTS: virtual
screening (VS) (21, 22). VS seeks to use available knowledge, such as protein structure or known
ligands or substrates of the protein, to select for testing a small subset of a compound collection with
a high probability of being active. The promise of VS is to make the discovery process faster and
less costly. However, particularly where a program is looking to modulate a novel target with little
knowledge, large companies prefer to apply their experimental techniques or run a virtual screen
in parallel (23). This is because of the false negative problem in VS: although the virtual screen
may find hits, it may miss interesting chemotypes or binding modes (e.g., previously undiscovered
allosteric binding sites) (24). The Shoichet group’s systematic study of their work to discover cruzain
inhibitors exemplifies this (25). Of course, the advantage of VS is that it can test molecules that are
not immediately available in a company inventory and access the vast number of molecules that could
be synthesized. In this instance, the big issue is the false positive rate, which for the cruzain example
was 97.5% on an initial selection rate of 0.1%. Extrapolating to the Enamine REAL database of 680
million synthetically accessible virtual compounds, the cruzain VS protocol would select 680,000
molecules with docking poses at least as good as those used in the literature experiment, the vast
majority of which would be false positives (26). A study by Lyu et al. illustrates that this might not
be an issue for all targets: docking of 170 million compounds against AmpC β-lactamase and the
D4 dopamine receptor yielded multiple hit series for each by synthesizing 44 and 549 compounds,
respectively (27). However, it remains likely that it will be essential to find scoring functions that
better separate real from false results if this approach is to be routinely applied across a portfolio of
diverse protein targets.
There has been a problem with the VS literature, whereby published success stories too often
report the finding of promiscuous chemotypes, or compounds that interfere with the assay (28).
The REPROVIS (reproducible virtual screens) database is an excellent initiative that seeks to provide
researchers with benchmarking data sets (29), including all published applications of VS methods.
However, of the 537 hits reported in the initial database, only 115 passed GlaxoSmithKline’s (GSK)
set of compound collection filters, and only 71 of these passed with no issues at all (30). This learning
is directly applicable to ML practitioners. The drug discovery literature contains misleading or poor-
quality information, and techniques that are not validated with molecules of sufficient quality will not
be considered credible.

Figure 4. An AL screening cycle. The experiment is initiated with a starting set of compounds (maybe a
diverse set) to test. An ML model is built from the results and is used to select the next compounds to test on
the basis of finding more active compounds (exploitation) or of finding information that will help the model
to learn (exploration).

Most VS methods are based on molecular recognition techniques such as docking, pharmacophore searching, and shape searching or similarity-based chemoinformatics methods. This
is normally due to the lack of relevant information available for programs at the start of the work,
and therefore ML approaches are not often employed. However, ML techniques that seek to guide
experimentation to deliver information required to build and improve models, such as active learning
(AL) (31), are becoming popular, particularly to enable experiments that would otherwise be too large to run in full (32, 33) or to
reduce the resources required to deliver a sufficient result. This can be seen in AL-based screening, in
which an initial random set of compounds is screened, followed by ML to build a predictive model
that is used to drive the next round of screening (Figure 4).
This approach has been enabled, along with screening automation, by the Eve system (34).
Although this is unlikely to replace full diversity screening in large companies, for groups with
restricted budgets, assays that are very expensive, or programs that simply seek a tool molecule,
it represents an interesting option. Improvements to AL algorithms will undoubtedly improve the
speed of learning and efficiency of the process and will have a direct impact on cost and time savings.
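
A minimal sketch of the cycle in Figure 4 is shown below, with a random forest standing in for the ML model, a simulated response standing in for the assay, and a purely greedy (exploitation-only) selection policy; the fingerprints, batch size, and number of cycles are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((5000, 256))                                  # stand-in fingerprints for the library
activity = X[:, :8].sum(axis=1) + rng.normal(0, 0.3, 5000)   # hidden "assay" response

tested = list(rng.choice(len(X), 96, replace=False))         # initial random plate
for cycle in range(5):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X[tested], activity[tested])                   # learn from everything screened so far
    untested = np.setdiff1d(np.arange(len(X)), tested)
    predicted = model.predict(X[untested])
    # Exploitation: the next plate is the 96 untested compounds predicted to be most active.
    # (An exploration policy would instead favour compounds with the most uncertain predictions.)
    batch = untested[np.argsort(predicted)[::-1][:96]]
    tested.extend(batch.tolist())
    print(f"cycle {cycle}: best measured activity so far = {activity[tested].max():.2f}")
```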
With high-content and phenotypic assays, many researchers desire to understand the underlying
mechanism of any compound hit, known as target deconvolution. Experimental methods for this are
available under the general banner of proteomics (35) but demand specialty equipment and highly
skilled scientists. ML techniques can be applied to match imaging output to the known screening
profiles of a compound collection (36), an exciting development that fully exploits the large legacy
screening data held in pharmaceutical companies.
In summary, ML approaches may provide incremental improvements in VS processes, cost and
time savings for experimentation via AL and automation, and improvements to quality via consistent
data analysis and the detection of patterns and signals that a scientist might miss.

Figure 5. The classic DMT cycle for iterative product design, as used in drug discovery and annotated with
methods used for chemical structure ideation and learning as introduced in Reker and Schneider (32).

Lead Optimization
Once a hit molecule is identified, it must be optimized to produce a compound that fits the target
compound profile: efficacy against the target (binding affinity, residence time, and on–off rate);
solubility; absorption, distribution, metabolism, and excretion (37); and safety and dose (38). This is
clearly a multiparameter optimization and traditionally requires significant resources to accomplish,
estimated at $10 million per program (39). This is the domain of medicinal chemistry (40) and a
classic design–make–test (DMT) cycle (Figure 5).
Existing structure-activity relationships are explored and considered alongside other
information, such as protein structures, synthetic chemistry, available reagents, legacy program
knowledge and available predictions from models that have been generated for off-target and
absorption, distribution, metabolism, and excretion endpoints. From this, new molecule structures are designed, synthesized, and tested, and the whole cycle is repeated until a compound is identified
that meets the target compound profile. The lead optimization process can take years, with multiple
cycles and synthesis of hundreds or thousands of compounds.
At the center of the DMT cycle is the medicinal chemist who must coordinate and execute the
process, as well as assimilate all the data generated, translate it into possible molecule designs, and
decide which designs to make next (Figure 6).

Figure 6. The chemist-centric lead optimization cycle. All data analysis, molecule design ideation, and prediction are performed or coordinated by one or more medicinal chemists in the team, who also need to
decide which designs to make next.

Given the cost, difficulty, and inefficiency of this process, it has been, and continues to be,
the target of new technologies. Of most relevance to this book are the disciplines of computational
chemistry and cheminformatics (41, 42). These disciplines have been involved with the design and
discovery of drugs for more than 50 years (43–45). They have also been early adopters of ML and
heuristic methods (e.g., neural networks (46), genetic algorithms (47), support vector machines and
other kernel methods (48, 49), decision trees and random forests (50, 51), and AL (52)).
Although many of these techniques are integrated into modern drug discovery, productivity
remains an issue, and lead optimization is clearly not an easy problem to solve. What, then, can
modern ML offer that will lead to significant improvements to the process?
For this chapter, it makes sense to start at the learning part of the DMT cycle, which is an obvious place for ML and where many techniques are applied to build statistical models of activity, known as quantitative structure–activity relationships (QSAR). This is a traditional and venerable occupation,
with the first QSAR equation being published in 1898 and a plethora of techniques invented and
applied (53, 54). There are particular problems for the QSAR practitioner in lead optimization, the
first being the amount of data available. Lead optimization data is small (sometimes starting with only
one hit compound and ending with hundreds or maybe thousands of compounds), slow to acquire
(weeks), sparse (some endpoints may have data for all compounds, and some may only be measured
for a handful), biased (some endpoints may be met for a majority of compounds, and some for only
a handful), and discontinuous (the phenomenon of the activity cliff where a small change in the
molecular structure produces a dramatic fall in affinity) (55). This is a very long way from big data
and explains why generic deep learning methods have yet to excite the QSAR community in the lead
optimization space (56, 57), although there are encouraging signs of progress with work on one-shot and few-shot/meta-learning techniques (58, 59).
ML methods can be used in this design step to assist or change the whole paradigm. It is
possible to generate QSAR models with a balance between prediction accuracy, simplicity, and
interpretability (60), while ML models can be interrogated to provide interpretation guidelines
(Figure 7) (61, 62).

Figure 7. Interpretative QSAR contributions using the universal approach for structural interpretation of
QSAR and quantitative structure–property relationships models and the Simplex representation of
molecular structure method for structure decomposition (61, 62). This analysis method is deployed at GSK
and can be applied to any ML model. Reproduced with permission from reference (61). Copyright 2013 John Wiley & Sons.

We shall continue to analyze the established DMT cycle before returning to disruptive ML
approaches that may change the paradigm completely. A set of ideas can be scored by whatever models are available, whether these are QSAR models or physical models, such as free energy perturbation (63).
At this point, the concept of the applicability domain (AD) comes into play. Since drug discovery
is focused on the generation of novel molecular structures, a model will not have seen the structure
before by definition. The AD seeks to define the chemical space where the model can be trusted, for example by comparing the new structure with the training set used to build the model (64); a minimal similarity-based check of this kind is sketched after the list below. It is not sufficient to consider only the AD, because the predictions will be used to make a decision. Suppose, for example, that a QSAR model has predicted that a compound, which looks excellent on all other endpoints, would hit an off target that is known to lead to unwanted pharmacology or toxicity (65). How much confidence in the prediction is required for the team to decide that this idea should not progress? A more formal definition of the AD, which integrates it into a decision-making construct, has been proposed (66) and is used in this chapter. This formalism separates the AD into three subdomains, assessing confidence at the model, prediction, and decision levels:

• Applicability. Can I apply my model to make a prediction for my use case?
• Reliability. Is my prediction reliable enough for my use case?
• Decidability. Can I make a clear decision based on the outcome of the prediction?
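
As referenced above, the simplest applicability checks compare a candidate structure with the model's training set. The following is a minimal sketch of one such check, assuming RDKit is available and using Morgan fingerprints with a maximum-Tanimoto cutoff; the cutoff value and the tiny training set are illustrative assumptions, not a validated AD definition.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_fp(smiles, radius=2, n_bits=2048):
    """Morgan (circular) fingerprint as an RDKit bit vector."""
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), radius, nBits=n_bits)

def in_applicability_domain(candidate_smiles, training_smiles, cutoff=0.35):
    """Crude AD check: is the candidate close enough (max Tanimoto) to any training compound?"""
    cand = morgan_fp(candidate_smiles)
    nearest = max(DataStructs.TanimotoSimilarity(cand, morgan_fp(s)) for s in training_smiles)
    return nearest >= cutoff, nearest

training_set = ["CCOC(=O)c1ccccc1", "CC(=O)Nc1ccc(O)cc1", "c1ccc2[nH]ccc2c1"]
ok, nearest = in_applicability_domain("CC(=O)Oc1ccccc1C(=O)O", training_set)
print(f"within AD: {ok} (nearest-neighbour Tanimoto = {nearest:.2f})")
```

In practice the cutoff would be tuned against the model's observed error profile, and richer definitions (descriptor ranges, leverage, or the conformal approaches discussed below) are often layered on top.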

In our team’s case, the acceptable confidence might vary according to practical considerations:
How easy is the compound to make? Is there an available assay to test this predicted liability?

Ideally the model builder should take the practical environment into account when building the
model. For example, the GSK panel of off-target QSAR models (known as enhanced cross-screening
panel) is designed to work in combination with the actual assay panel. The sensitivity and specificity
of the models are tuned to minimize false negatives and direct compound series to be checked in the
assay panel. From there, compound series with confirmed liabilities can use the existing model to
design away from the problem, while compounds that were false positives can be used to improve
the model or have a series-specific model built for them. Off-target profiling panels are a prime
subject for ML methods, since they have good throughput, see a wide diversity of compounds
from all programs, and provide dense (not quite full-matrix) data sets with good quality data from
robust assays. In particular, multitask deep neural networks have been advantageous, exploiting the
similarities between assay targets and compounds to share SAR signals (67).
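
A minimal sketch of the multitask idea is shown below in PyTorch: a shared trunk feeds one output head per assay, and unmeasured (compound, assay) entries are masked out of the loss so the sparse panel matrix can be used directly. The layer sizes, task count, and random data are placeholders rather than any published panel architecture.

```python
import torch
import torch.nn as nn

class MultitaskQSAR(nn.Module):
    """Shared trunk with one output per assay, so related targets share SAR signal."""
    def __init__(self, n_bits=2048, n_tasks=50, hidden=512):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(n_bits, hidden), nn.ReLU(), nn.Dropout(0.25),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.Linear(hidden, n_tasks)   # one pIC50-style output per task

    def forward(self, x):
        return self.heads(self.trunk(x))

def masked_mse(pred, target, mask):
    """Mean squared error over measured (compound, assay) pairs only."""
    diff = (pred - target) * mask
    return diff.pow(2).sum() / mask.sum().clamp(min=1)

# Illustrative training step on random data standing in for fingerprints and sparse assay results.
model = MultitaskQSAR()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 2048).round()            # binary fingerprints
y = torch.randn(64, 50)                     # activities (unmeasured entries arbitrary)
mask = (torch.rand(64, 50) < 0.3).float()   # only ~30% of the matrix actually measured
loss = masked_mse(model(x), y, mask)
opt.zero_grad()
loss.backward()
opt.step()
```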
Conformal prediction is emerging as an alternative to ADs and an effective method to quantify
prediction confidence (68). Conformal prediction provides a measure of confidence (between 0 and
1) for each prediction (compound), utilizing the prediction performance in the near neighbors of the
compound (for continuous models) or the class into which the compound is predicted to fall (for
categorical models). It may be applied alongside any ML method, including deep learning models
(69).
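
For illustration, the sketch below implements the simplest (unnormalized) inductive conformal predictor for a regression model: a held-out calibration set provides absolute-error nonconformity scores, and their 90th percentile becomes the half-width of every prediction interval. It omits the near-neighbor error scaling mentioned above, and the data are random stand-ins for fingerprints and potencies.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Illustrative data standing in for fingerprints (X) and pIC50 values (y).
rng = np.random.default_rng(1)
X = rng.random((500, 128))
y = X[:, :5].sum(axis=1) + rng.normal(0, 0.2, 500)

# Hold out a calibration set that the model never trains on.
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Nonconformity scores: absolute errors on the calibration set.
alphas = np.abs(model.predict(X_cal) - y_cal)
half_width = np.quantile(alphas, 0.9)   # 90% confidence level

# Prediction interval for a new compound: point prediction +/- calibrated half-width.
x_new = rng.random((1, 128))
pred = model.predict(x_new)[0]
print(f"{pred:.2f} +/- {half_width:.2f} (90% conformal interval)")
```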
With a set of chemical structures and predictions of the relevant endpoints (predictions of
synthetic tractability will be covered in the next section), a decision needs to be made: which
compounds should be made next? The scientific method often permeates the cycle: analysis of
data yields a hypothesis, and ideation creates structures that might test the hypothesis. In a
multiparameter optimization, the other endpoints must be considered alongside a specific
hypothesis (e.g., a solubility improvement hypothesis). Common approaches are to use desirability
functions as a way to combine predictions (70). Popular search algorithms for selecting the next set of
compounds are the aforementioned AL and emerging methods such as Bayesian optimization (71).
The promise of these methods is to use the DMT cycle to not just progress the optimization, but
to improve the performance of the predictive models at the same time, thus reducing the number
of DMT cycles that are needed. Emergent closed-loop experimental systems are exploiting these
methods and producing proof of concept optimizations (72).
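
As a concrete illustration of the desirability-function idea, the sketch below maps each predicted endpoint onto a 0–1 desirability and combines them as a geometric mean before ranking design ideas; the endpoints, target ranges, and example predictions are invented for illustration.

```python
import numpy as np

def desirability_high(x, low, high):
    """0 below `low`, 1 above `high`, linear ramp in between (larger-is-better endpoint)."""
    return float(np.clip((x - low) / (high - low), 0.0, 1.0))

def desirability_low(x, low, high):
    """1 below `low`, 0 above `high` (smaller-is-better endpoint)."""
    return 1.0 - desirability_high(x, low, high)

def overall_desirability(predictions):
    """Geometric mean of individual desirabilities; any zero vetoes the compound."""
    d = [
        desirability_high(predictions["pIC50"], 6.0, 8.0),      # potency
        desirability_high(predictions["logS"], -6.0, -4.0),     # aqueous solubility
        desirability_low(predictions["hERG_pIC50"], 4.5, 5.5),  # off-target liability
    ]
    return float(np.prod(d)) ** (1.0 / len(d))

# Rank a set of design ideas by the overall desirability of their model predictions.
ideas = [
    {"id": "A", "pIC50": 7.6, "logS": -4.6, "hERG_pIC50": 4.2},
    {"id": "B", "pIC50": 8.2, "logS": -6.3, "hERG_pIC50": 4.0},
]
ranked = sorted(ideas, key=overall_desirability, reverse=True)
print([i["id"] for i in ranked])
```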
With such emphasis placed on improving the lead optimization process, it is surprising that
there is so little information published on the baseline performance of the chemist-centric model and
very few direct comparisons of mind versus machine, even though there is an example of machine-
only optimization (73). Retrospectives are available (74), in which an algorithm is able to work
through the molecules that were made and select a different, hopefully more efficient path through
the compound space to find the best molecules. One such experiment is the design of benzimidazole-
based inhibitors of the bacterial gyrase enzyme (75). Starting with a baseline set of inhibitors and
assay data, iterative selections and model building were able to find the most active molecules very
efficiently. This study is of particular interest in that it compares a statistical ML model approach to
a 3D molecular model approach. An interesting finding is that although both approaches optimized
with good efficiency, the 3D method sampled a wider diversity of molecular structures due to its
ability to correctly direct the sampling of new chemical space beyond the knowledge of the current
model (Figure 8). This efficient extrapolation is a major advantage of 3D methods if the protein
structure is available.

Figure 8. An example of efficient exploration of novel chemical space using 3D information. Adapted with permission from reference (75). Copyright 2012 American Chemical Society.

The traditional ideate and predict process previously described is not an optimal one. Ideally,
with a trusted predictive model, one would use the model to generate structures that fit the model,
which is the “inverse QSAR” approach. As with much in life, the concept is not new, with functional
algorithms published nearly 30 years ago (76, 77). The ML community has rediscovered this use
case and applied recurrent neural network (78), variational autoencoder (79), generative adversarial
network (80), graph convolutional policy network (81), and deep reinforcement learning models
to the problem (82). It is very clear that ML techniques can generate hundreds of thousands of
molecular structures, but the key question is whether they are useful. The generated structures need
to satisfy the demands of the target compound profile, which will hopefully be dealt with by the ML
model. However, the generative molecules will also need to be synthetically tractable, chemically
stable, and without undesirable functional groups and a host of other constraints that are generally
avoided through the in cerebro and standard cheminformatics methods. To become mainstream
and fulfill the evident potential, there is work to do on molecule quality and the efficiency with
which the techniques produce fit for purpose set of structures. There are encouraging signs that the
community has recognized this (83). Additional encouragement can be found in the application of
ML to problems that cheminformatics methods have struggled with (e.g., the generation of novel
structures that fit a pharmacophore or reduced graph of a single exemplar) (84). Nonetheless, it
remains to be seen if generative models, utilizing models trained on limited QSAR data, can advance
the design process by efficient extrapolation into novel molecular structures with improved profiles.
Returning to the question in the introduction (i.e., Are they better, or just different?), regardless of
the approach to structure ideation, one theme is constant: a good scoring function (e.g., ML model)
is the key to success.
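
One pragmatic, if partial, response to the molecule-quality problem is to post-filter generated structures with standard cheminformatics checks before any scoring or human review. The sketch below, using RDKit with a couple of illustrative SMARTS alerts and an arbitrary molecular-weight window, shows the idea; real compound-collection filters are far more extensive.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Illustrative structural alerts; real compound-collection filters are far larger.
ALERTS = [Chem.MolFromSmarts(s) for s in ["[N+](=O)[O-]", "C(=O)Cl", "[SH]"]]

def passes_quality_filter(smiles):
    """Keep a generated structure only if it parses, falls in a crude size window,
    and matches none of the structural alerts."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                                    # unparsable output from the generator
        return False
    if not (100.0 <= Descriptors.MolWt(mol) <= 600.0): # crude size window
        return False
    return not any(mol.HasSubstructMatch(alert) for alert in ALERTS)

generated = [
    "CC(=O)Nc1ccc(O)cc1",        # passes
    "O=[N+]([O-])c1ccc(N)cc1",   # nitroaromatic: rejected by an alert
    "C1=CC=CC=C1C(",             # invalid SMILES: rejected
]
print([s for s in generated if passes_quality_filter(s)])
```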
In summary, ML approaches may provide incremental improvements in single target QSAR
models and multitarget profiling panels, improvements to quality via objective inclusion of
appropriate off-target predictions, cost and time savings for reducing the number of optimization
cycles via AL, and disruption of the structure ideation process through generative models.

Synthetic Tractability
The prediction of chemical reactivity and reactions has long been a goal of organic chemists.
The initial promise, encouraged by simple and elegant rules for pericyclic reactions and the first
computational approaches (85, 86), did not quickly lead to effective reaction prediction or route
planning systems. Only in the last decade has progress accelerated to the point where computational
systems can compete with trained synthetic chemists (87). This has been enabled by the curation,
digitization, and availability of large chemical databases, such as Reaxys (88). Traditional
chemoinformatics methods, driven by reaction transform representations (89), are able to encode
and apply specific reactions alongside known inclusion or exclusion rules to deconstruct target
molecules and create new ones. Example systems include Chematica (90), Synthia (90), ICSynth
(91), ARChem (92), and ChemPlanner (92), with published examples of practical synthetic routes
that were designed entirely by an algorithm (93).
ML approaches have emerged that exploit the many variants of ML developed for social media,
such as text translation (94). State-of-the-art ML methods are able to predict plausible reaction
routes or give an indication of synthetic complexity within seconds (95, 96).
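
Alongside full route prediction, a widely used lightweight indicator of synthetic complexity is the fragment-contribution synthetic accessibility (SA) score distributed with RDKit's contrib area; the sketch below shows a typical way of calling it (the contrib import path is the usual convention but may vary between installations).

```python
import os
import sys

from rdkit import Chem
from rdkit.Chem import RDConfig

# The SA scorer lives in RDKit's contrib area rather than the core API.
sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # noqa: E402

for smiles in ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "C1CCC2(CC1)CCCCC2"]:
    mol = Chem.MolFromSmiles(smiles)
    # Scores run from roughly 1 (easy to make) to 10 (very hard),
    # which is often enough to prune obviously intractable generated ideas.
    print(smiles, round(sascorer.calculateScore(mol), 2))
```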
In this case, we can consider the same criteria as for QSAR models (e.g., applicability, reliability,
and decidability) when judging whether a prediction is fit for purpose. For a chemist facing the
synthesis of a novel compound, the suggestion of one or more synthetic routes could be useful, but
metadata on the route prediction will be required in order for them to enter the lab and start work.
The chemist might ask:

• Are there examples of that reaction working on compounds or reagents similar to mine?
• Is the prediction (that the reaction will work) robust?
• How can I decide between this route and the others proposed?

If the system cannot provide answers, the chemist may fall back on their own ideas or use their
own bias to make the decision, and then the value of having the predictions in the first place becomes
debatable.
Beyond route prediction, there is the possibility of reagent and condition suggestions and
utilization of the larger data sets that will emerge from reaction scanning automation (97). There are
successful examples of ML models that use such data to enable accurate prediction of high-yielding
conditions for untested substrates (98).
For the case of a team wishing to refine lists of molecule ideas (e.g., ideas from a generative
model), it may be enough to prune the list to remove real no-hopers. However, as with QSAR
models, the prediction needs to be tuned to fit the desired specificity and sensitivity for the use case.
Synthetic organic chemistry is an unusual source of data, in that a reaction that is considered
difficult by literature precedent can suddenly become considered easy simply by a chemist looking
at the problem and trying something that has not been considered before. Therefore, there is a use
case out there to predict reactions that should be possible but have not been reported or to combine
a mechanistic organic chemistry approach with ML in order to problem solve.

In summary, ML approaches may provide cost and time savings by finding shorter synthetic routes to target molecules and by reducing the time taken to identify them, as well as synthetic tractability predictions that enable the use of generative models for molecule design.

Risk Management
As lead optimization matures, the program will focus on potential risks associated with the
compound series that may persist due to features of the structure that they are unable to change
(e.g., a particular binding motif that appears essential for affinity at the target). Returning to our
lead optimization team with the off-target liability, let us imagine that there is no in vitro assay to
confirm the prediction. The team has managed to reduce the problem (e.g., by increasing the possible
therapeutic index by improving the on-target affinity). At this point, predictive models begin to be
used in a risk management setting and the implications of decisions become serious, creating the
need for stronger evidence for a prediction.
The Organization for Economic Co-operation and Development (OECD) QSAR toolbox is a
suite of models intended to be used in a regulatory setting for risk assessments of organic compounds
(99). The quality of these assessments is important because they must protect the public and the environment, where incorrect predictions could have potentially catastrophic consequences, and because they can prevent products from entering the market, where incorrect predictions may harm the public or cause significant economic damage to the company that developed the new product. In this context, convincing
decision makers to rely on a QSAR model is not a trivial challenge. In a similar vein to the
applicability, reliability, and decidability formalism, “to enhance the likelihood of acceptance, it
was critical that the toolbox first gets the chemistry correct, second gets the biology correct and
thirdly, when appropriate, adds statistical assurance (99).” According to the OECD guidance (100),
similarity is context dependent. The context is not only the similarity in a chemical structure, but also
the similarity of a biological profile, as well as toxicokinetic and toxicodynamic properties.
Our lead optimization team will likely think in a similar way to the OECD. There are limited
choices available to them: stop the progression of the chemical series, progress the chemical series, or initiate experimental work to prove that the risk exists. What evidence might support stopping
the series?

• The team would need to be satisfied and certain that the compounds are bound to the off
target.
• This off target is part of an adverse outcome pathway (AOP) that would produce
pharmacology that is unacceptable in the proposed patient population (101).
• The compound would distribute to the tissue where the AOP operates at sufficient
concentration.
• The pharmacology would be induced at the proposed therapeutic dose.

This is an extremely high bar for a QSAR prediction, but it has been achieved for certain
endpoints, such as genetic toxicology (102). For the vast majority of the proteome, predictions of
off-target activity are unlikely to reach this level in the near future, and pan-proteome predictions are
unlikely to drive decision making where the off-target is not part of an established AOP (103). An
alternate use case (e.g., investigative toxicology) is more appealing for the use of ML. For example, suppose there is a finding from an in vivo study indicating testicular toxicity. What might be causing it?
What is the predicted off-target profile of the compound? What pathways does it modulate? Can a
credible hypothesis be produced that can be tested by the design of a new compound that does not hit
that off target? Can we use all of our legacy data and ML to inform that decision? Identification and
codification of new AOPs would be a very welcome addition to the predictive toxicology toolbox.
In summary, ML approaches may have only incremental value in predictive toxicology, owing to our limited codification of AOPs, but may enable more effective investigative toxicology and AOP construction.

Mechanistic Models vs ML
Drug discovery has always used both mechanistic and physical models and statistical and QSAR
models. ML techniques are clearly becoming very powerful and tend to dominate competitions that
are judged on predictive power. However, much of drug discovery is concerned with extrapolation,
not interpolation (i.e., based on what is known, what molecule should I make next?). The metrics
used to validate QSAR models have been challenged on precisely this basis (104): in the pursuit of
more potent or soluble compounds, why worry about the ability to predict the inactive or insoluble
ones with precision, when the most pressing question is where to add novelty to the molecular series?
ML techniques are susceptible to this criticism, and the gyrase case study illustrates why physical
and mechanistic models are very good at pointing scientists in the right direction or constructing
hypotheses. This is what drives the desire for interpretable ML models, which enable a “what if”
mode of thinking. As mentioned in the “Synthetic Tractability” section, mechanistic and causal
thinking is a powerful technique for working beyond a knowledge base, and integration of reasoning
abilities is an exciting future direction (105). Metrics for generative ML models need to evolve to
incorporate discovery power for efficient extrapolation.

The Human–Machine Interface


No discussion of the application of ML can ignore the human–machine interface. Although
products such as Amazon’s Alexa enjoy popularity, the introduction of ML systems into creative
cultures (e.g., medicinal and synthetic chemistry) is not straightforward. Although the logic of
replacing human judgment with statistical models has been beautifully articulated (4), human
aversion to using predictions from computational models is well documented (106). For many ML
advocates, the solution might be companies that are founded on the application of ML. Vertex (107),
for example, was founded to rely on physical modeling and structural biology. Wider acceptance and
adoption requires a parallel approach: change management to digitize and quantify expert-opinion-
dominated disciplines alongside demystification of ML models. Hopefully this can be accelerated
by research into trust, which is becoming increasingly important (108). For example, the European
Union has agreed to provide a legal right to explanation (109), which extends to ML-driven
healthcare applications. It is likely that successful implementations range from full artificial
intelligence-driven automation to systems that take inspiration from the human–machine advanced
chess game (110).

Future Challenges
This chapter has outlined many opportunities for ML to improve the drug discovery process and has highlighted the potential difficulties. There are several reasons to be optimistic that ML can
have a positive impact on drug discovery, the most important being that the industry is full of
computational scientists with experience in drug discovery who understand what will really move
the needle. Other important factors are the possibility of feeding off much better-funded and better-
resourced research fields (e.g., social media, online commerce) that share code via open source and
the continued growth in our ability to harvest and generate high-quality, relevant experimental data
through laboratory automation.
In order for the community to move as quickly as possible from the current peak of inflated expectations to the plateau of productivity (hopefully avoiding the trough of disillusionment), the following areas are particularly important:

• Learning on small data sets. This is especially the case for drug design and the DMT cycle.
• Extrapolation. For all of the applications in chemistry, interpolation from existing data is
not enough; design and development of novel chemical structures requires extrapolation
beyond current data and knowledge. Incorporation or integration of mechanistic or causal
approaches may be required.
• Trust. The applicability, reliability, and decidability framework is a good one to guide
thinking. Although full interpretability may be difficult and not wholly necessary, enough
context should be available in those cases where some additional human judgment and
reasoning can assist the machine.

References
1. Roy Amara. Wikipedia. https://en.wikipedia.org/wiki/Roy_Amara (accessed Jan 31, 2019).
2. Hype Cycle. Wikipedia. https://en.wikipedia.org/wiki/Hype_cycle (accessed Mar 28, 2019).
3. Scannell, J. W.; Blanckley, A.; Boldon, H.; Warrington, B. Diagnosing the Decline in
Pharmaceutical R&D Efficiency. Nat. Rev. Drug Discov. 2012, 11, 191–200.
4. Kahneman, D. Thinking, Fast and Slow; Penguin Books: London, 2012.
5. Ching, T.; Himmelstein, D. S.; Beaulieu-Jones, B. K.; Kalinin, A. A.; Do, B. T.; Way, G. P.;
Ferrero, E.; Agapow, P. M.; Zietz, M.; Hoffman, M. M.; Xie, W.; Rosen, G. L.; Lengerich, B.
J.; Israeli, J.; Lanchantin, J.; Woloszynek, S.; Carpenter, A. E.; Shrikumar, A.; Xu, J.; Cofer, E.
M.; Lavender, C. A.; Turaga, S. C.; Alexandari, A. M.; Lu, Z.; Harris, D. J.; DeCaprio, D.; Qi,
Y.; Kundaje, A.; Peng, Y.; Wiley, L. K.; Segler, M. H. S.; Boca, S. M.; Swamidass, S. J.; Huang,
A.; Gitter, A.; Greene, C. S. Opportunities and Obstacles for Deep Learning in Biology and
Medicine. J. R. Soc. Interface 2018, 15, 20170387.
6. Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D. Is There a Difference Between Leads
and Drugs? A Historical Perspective. J. Chem. Inf. Comput. Sci. 2001, 41, 1308–1315.
7. Macarron, R.; Banks, M. N.; Bojanic, D.; Burns, D. J.; Cirovic, D. A.; Garyantes, T.; Green,
D. V.; Hertzberg, R. P.; Janzen, W. P.; Paslay, J. W.; Schopfer, U.; Sittampalam, G. S. Impact
of High-Throughput Screening in Biomedical Research. Nat. Rev. Drug Discov. 2011, 10,
188–195.
8. Clark, M. A.; Acharya, R. A.; Arico-Muendel, C. C.; Belyanskaya, S. L.; Benjamin, D. R.;
Carlson, N. R.; Centrella, P. A.; Chiu, C. H.; Creaser, S. P.; Cuozzo, J. W.; Davie, C. P.;
Ding, Y.; Franklin, G. J.; Franzen, K. D.; Gefter, M. L.; Hale, S. P.; Hansen, N. J. V.; Israel,
D. I.; Jiang, J.; Kavarana, M. J.; Kelley, M. S.; Kollmann, C. S.; Li, F.; Lind, K.; Mataruse,
S.; Medeiros, P. F.; Messer, J. A.; Myers, P.; O’Keefe, H.; Oliff, M. C.; Rise, C. E.; Satz, A.
L.; Skinner, S. R.; Svendsen, J. L.; Tang, L.; van Vloten, K.; Wagner, R. W.; Yao, G.; Zhao,
B.; Morgan, B. A. Design, Synthesis and Selection of DNA-Encoded Small-Molecule Libraries.
Nat. Chem. Biol. 2009, 5, 647–654.
9. Rees, D. C.; Congreve, M.; Murray, C. W.; Carr, R. Fragment-Based Lead Discovery. Nat.
Rev. Drug Discov. 2004, 3, 660–672.
10. Malo, N.; Hanley, J. A.; Cerquozzi, S.; Pelletier, J.; Nadon, R. Statistical Practice in High-
Throughput Screening Data Analysis. Nat. Biotechnol. 2006, 24, 167–175.
11. Coma, I.; Clark, L.; Diez, E.; Harper, G.; Herranz, J.; Hofmann, G.; Lennon, M.; Richmond,
N.; Valmaseda, M.; Macarron, R. Process Validation and Screen Reproducibility in High-
Throughput Screening. J. Biomol. Screen. 2009, 14, 66–76.
12. Shterev, I. D.; Dunson, D. B.; Chan, C.; Sempowski, G. D. Bayesian Multi-Plate High-
Throughput Screening of Compounds. Sci. Rep. 2018, 8, 9551.
13. Glick, M.; Klon, A. E.; Acklin, P.; Davies, J. W. Enrichment of Extremely Noisy High-
Throughput Screening Data Using a Naive Bayes Classifier. J. Biomol. Screen. 2004, 9, 32–36.
14. Kümmel, A.; Parker, C. N. The Interweaving of Cheminformatics and HTS. In
Chemoinformatics and Computational Chemical Biology; Bajorath, J., Ed.; Humana Press:
Totowa, NJ, 2010; pp 435–457.
15. Ramaiahgari, S. C.; den Braver, M. W.; Herpers, B.; Terpstra, V.; Commandeur, J. N.; van
de Water, B.; Price, L. S. A 3D in vitro Model of Differentiated HepG2 Cell Spheroids with
Improved Liver-Like Properties for Repeated Dose High-Throughput Toxicity Studies. Arch.
Toxicol. 2014, 88, 1083–1095.
16. Macey, M. G. Flow Cytometry Principles and Application; Humana Press: Totowa, NJ, 2007.
17. FlowJo. Tree Star, Inc. https://www.flowjo.com/ (accessed Mar 28, 2019).
18. Joslin, J.; Gilligan, J.; Anderson, P.; Garcia, C.; Sharif, O.; Hampton, J.; Cohen, S.; King,
M.; Zhou, B.; Jiang, S.; Trussell, C.; Dunn, R.; Fathman, J. W.; Snead, J. L.; Boitano, A.
E.; Nguyen, T.; Conner, M.; Cooke, M.; Harris, J.; Ainscow, E.; Zhou, Y.; Shaw, C.; Sipes,
D.; Mainquist, J.; Lesley, S. A Fully Automated High-Throughput Flow Cytometry Screening
System Enabling Phenotypic Drug Discovery. SLAS Discovery 2018, 23, 697–707.
19. Jones, T. R.; Kang, I. H.; Wheeler, D. B.; Lindquist, R. A.; Papallo, A.; Sabatini, D. M.;
Golland, P.; Carpenter, A. E. CellProfiler Analyst: Data Exploration and Analysis Software for
Complex Image-Based Screens. BMC Bioinf. 2008, 9, 482.
20. Godinez, W. J.; Hossain, I.; Lazic, S. E.; Davies, J. W.; Zhang, X. A Multi-Scale Convolutional
Neural Network for Phenotyping High-Content Cellular Images. Bioinformatics 2017, 33,
2010–2019.
21. Lavecchia, A.; Giovanni, C. Virtual Screening Strategies in Drug Discovery: A Critical Review.
Curr. Med. Chem. 2013, 20, 2839–2860.
22. Valler, M. J.; Green, D. V. S. Diversity Screening Versus Focused Screening in Drug Discovery.
Drug Discovery Today 2000, 5, 286–293.
23. Leveridge, M.; Chung, C. W.; Gross, J. W.; Phelps, C. B.; Green, D. Integration of Lead
Discovery Tactics and the Evolution of the Lead Discovery Toolbox. SLAS Discovery 2018, 23,
881–897.
24. Lewis, J. A.; Lebois, E. P.; Lindsley, C. W. Allosteric Modulation of Kinases and GPCRs:
Design Principles and Structural Diversity. Curr. Opin. Chem. Biol. 2008, 12, 269–280.
25. Ferreira, R. S.; Simeonov, A.; Jadhav, A.; Eidam, O.; Mott, B. T.; Keiser, M. J.; McKerrow,
J. H.; Maloney, D. J.; Irwin, J. J.; Shoichet, B. K. Complementarity Between a Docking and
a High-Throughput Screen in Discovering New Cruzain Inhibitors. J. Med. Chem. 2010, 53,
4891–4905.
26. Shivanyuk, A. N.; Ryabukhin, S.; Bogolyubsky, A.; Mykytenko, D. M.; Chupryna, A. A.;
Heilman, W.; Kostyuk, A. N.; Tolmachev, A. Enamine Real Database: Making Chemical
Diversity Real. Chem. Today 2007, 25, 58–59.
27. Lyu, J.; Wang, S.; Balius, T. E.; Singh, I.; Levit, A.; Moroz, Y. S.; O’Meara, M. J.; Che, T.;
Algaa, E.; Tolmachova, K.; Tolmachev, A. A.; Shoichet, B. K.; Roth, B. L.; Irwin, J. J. Ultra-
Large Library Docking for Discovering New Chemotypes. Nature 2019, 566, 224–229.
28. Thorne, N.; Auld, D. S.; Inglese, J. Apparent Activity in High-Throughput Screening: Origins
of Compound-Dependent Assay Interference. Curr. Opin. Chem. Biol. 2010, 14, 315–324.
29. Ripphausen, P.; Wassermann, A. M.; Bajorath, J. REPROVIS-DB: A Benchmark System for
Ligand-Based Virtual Screening Derived from Reproducible Prospective Applications. J. Chem.
Inf. Model. 2011, 51, 2467–2473.
30. Green, D. V. S. GlaxoSmithKline. Unpublished work, 2011.
31. Angluin, D. Queries and Concept Learning. Mach. Learn. 1988, 2, 319–342.
32. Reker, D.; Schneider, G. Active-Learning Strategies in Computer-Assisted Drug Discovery.
Drug Discovery Today 2015, 20, 458–465.
33. Naik, A. W.; Kangas, J. D.; Sullivan, D. P.; Murphy, R. F. Active Machine Learning-Driven
Experimentation to Determine Compound Effects on Protein Patterns. Elife 2016, 5, e10047.
34. Sparkes, A.; Aubrey, W.; Byrne, E.; Clare, A.; Khan, M. N.; Liakata, M.; Markham, M.;
Rowland, J.; Soldatova, L. N.; Whelan, K. E.; Young, M.; King, R. D. Towards Robot Scientists
for Autonomous Scientific Discovery. Automated Experimentation 2010, 2, 1–11.
35. Rix, U.; Superti-Furga, G. Target Profiling of Small Molecules by Chemical Proteomics. Nat.
Chem. Biol. 2009, 5, 616–624.
36. Simm, J.; Klambauer, G.; Arany, A.; Steijaert, M.; Wegner, J. K.; Gustin, E.; Chupakhin, V.;
Chong, Y. T.; Vialard, J.; Buijnsters, P.; Velter, I.; Vapirev, A.; Singh, S.; Carpenter, A. E.;
Wuyts, R.; Hochreiter, S.; Moreau, Y.; Ceulemans, H. Repurposing High-Throughput Image
Assays Enables Biological Activity Prediction for Drug Discovery. Cell. Chem. Biol. 2018, 25,
611–618 e3.
37. Kerns, E. H., Di, L., Eds. Drug-like Properties: Concepts, Structure Design and Methods: From
ADME to Toxicity Optimization, 2nd ed.; Academic Press: London, 2016.
38. McEuen, K.; Borlak, J.; Tong, W.; Chen, M. Associations of Drug Lipophilicity and Extent of
Metabolism with Drug-Induced Liver Injury. Int. J. Mol. Sci. 2017, 18, 1335–1345.
39. Paul, S. M.; Mytelka, D. S.; Dunwiddie, C. T.; Persinger, C. C.; Munos, B. H.; Lindborg, S.
R.; Schacht, A. L. How to Improve R&D Productivity: The Pharmaceutical Industry’s Grand
Challenge. Nat. Rev. Drug Discovery 2010, 9, 203–214.
40. Davis, A., Ward, S. E., Eds. The Handbook of Medicinal Chemistry; Royal Society of Chemistry:
London, 2014.
41. Leach, A. R. L. Molecular Modelling: Principles and Applications, 2nd ed.; Pearson/Prentice Hall:
Harlow, United Kingdom, 2009.
42. Bajorath, J., Ed. Cheminformatics for Drug Discovery; John Wiley & Sons: Hoboken, NJ, 2014.
43. Fujita, T. Recent Success Stories Leading to Commercializable Bioactive Compounds with the
Aid of Traditional QSAR Procedures. Quant. Struct.-Act. Relat. 1997, 16, 107–112.
44. Willett, P. From Chemical Documentation to Chemoinformatics: Fifty Years Of Chemical
Information Science. J. Infor. Sci. 2008, 34, 477–499.
45. Ghosh, A. K.; Gemma, S. Structure‐Based Design of Drugs and Other Bioactive Molecules: Tools
and Strategies; Wiley‐VCH Verlag GmbH: Weinheim, 2014.
46. Gasteiger, J.; Zupan, J. Neural Networks in Chemistry. Angew. Chem., Int. Ed. Engl. 1993, 32,
503–527.
47. Glenn, R. C.; Payne, A. W. R. A Genetic Algorithm for the Automated Generation of Molecules
Within Constraints. J. Comput.-Aided Mol. Des. 1995, 9, 181–202.
48. Harper, G.; Bradshaw, J.; Gittins, J. C.; Green, D. V. S.; Leach, A. R. Prediction of Biological
Activity for High-Throughput Screening Using Binary Kernel Discrimination. J. Chem. Inf.
Comput. Sci. 2001, 41, 1295–1300.
49. Czerminski, R.; Yasri, A.; Hartsough, D. Use of Support Vector Machine in Pattern
Classification: Application to QSAR Studies. Quant. Struct.-Act. Relat. 2001, 20, 227–240.
50. Jones-Hertzog, D. K.; Mukhopadhyay, P.; Keefer, C. E.; Young, S. S. Use of Recursive
Partitioning in the Sequential Screening of G-Protein–Coupled Receptors. J. Pharmacol.
Toxicol. 1999, 42, 207–215.
51. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random
Forest: A Classification and Regression Tool for Compound Classification and QSAR
Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958.
52. Warmuth, M. K.; Liao, J.; Ratsch, G.; Mathieson, M.; Putta, S.; Lemmen, C. Active Learning
with Support Vector Machines in the Drug Discovery Process. J. Chem. Inf. Comput. Sci. 2003, 43, 667–673.
53. Overton, E. Osmotic Properties of Cells in the Bearing of Toxicity and Pharmacology. Z. Phys.
Chem. 1898, 22, 189–209.
54. Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden,
J.; Gramatica, P.; Martin, Y. C.; Todeschini, R.; Consonni, V.; Kuz’min, V. E.; Cramer,
R.; Benigni, R.; Yang, C.; Rathman, J.; Terfloth, L.; Gasteiger, J.; Richard, A.; Tropsha, A.
QSAR Modeling: Where Have You Been? Where Are You Going to? J. Med. Chem. 2014, 57,
4977–5010.
55. Vogt, M.; Huang, Y.; Bajorath, J. R. From Activity Cliffs to Activity Ridges: Informative Data
Structures for SAR Analysis. J. Chem. Inf. Model. 2011, 51, 1848–1856.
56. Luscombe, C. In Our Hands, in Agreement with Conference Presentations from Other Groups, Current Deep Learning Methods Offer Little or No Advantage Over Faster and Simpler Methods such as XGBoost (57) for Typical Lead Optimisation Data Sets. GlaxoSmithKline, unpublished work, 2018.
57. Sheridan, R. P.; Wang, W. M.; Liaw, A.; Ma, J.; Gifford, E. M. Extreme Gradient Boosting
as a Method for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2016, 56,
2353–2360.
58. Altae-Tran, H.; Ramsundar, B.; Pappu, A. S.; Pande, V. Low Data Drug Discovery with One-
Shot Learning. ACS Cent. Sci. 2017, 3, 283–293.
59. Finn, C.; Abbeel, P.; Levine, S. Model-Agnostic Meta-Learning for Fast Adaptation of Deep
Networks. 2017, arXiv:1703.03400v3. arXiv.org e-Print archive. https://arxiv.org/abs/1703.
03400?context=cs (accessed Apr 23, 2019).
60. Nicolotti, O.; Gillet, V. J.; Fleming, P. J.; Green, D. V. S. Multiobjective Optimization in
Quantitative Structure Activity Relationships: Deriving Accurate and Interpretable QSARs. J.
Med. Chem. 2002, 45, 5069–5080.
61. Polishchuk, P. G.; Kuz’min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for
Structural Interpretation of QSAR/QSPR Models. Mol. Inform. 2013, 32, 843–853.
62. Simplex Representation of Molecular Structure – A Chemoinformatic Tool for Calculation of Simplex
Descriptors. GitHub, Inc. https://github.com/DrrDom/sirms (accessed Mar 28, 2019).
63. Wang, L.; Wu, Y.; Deng, Y.; Kim, B.; Pierce, L.; Krilov, G.; Lupyan, D.; Robinson, S.;
Dahlgren, M. K.; Greenwood, J.; Romero, D. L.; Masse, C.; Knight, J. L.; Steinbrecher, T.;
Beuming, T.; Damm, W.; Harder, E.; Sherman, W.; Brewer, M.; Wester, R.; Murcko, M.;
Frye, L.; Farid, R.; Lin, T.; Mobley, D. L.; Jorgensen, W. L.; Berne, B. J.; Friesner, R. A.; Abel,
R. Accurate and Reliable Prediction of Relative Ligand Binding Potency in Prospective Drug
Discovery by Way of a Modern Free-Energy Calculation Protocol and Force Field. J. Am. Chem.
Soc. 2015, 137, 2695–703.
64. Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M. T. D.; Gramatica, P.;
Jaworska, J. S.; Kahn, S.; Klopman, G.; Marchant, C. A.; Myatt, G.; Nikolova-Jeliazkova,
N.; Patlewicz, G. Y.; Perkins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de
Sandt, J. J. M.; Tong, W.; Veith, G.; Yang, C. Current Status of Methods for Defining the
Applicability Domain of (Quantitative) Structure–Activity Relationships. Altern. Lab. Anim.
2005, 33, 1–19.
65. Bowes, J.; Brown, A. J.; Hamon, J.; Jarolimek, W.; Sridhar, A.; Waldron, G.; Whitebread, S.
Reducing Safety-Related Drug Attrition: The Use of In Vitro Pharmacological Profiling. Nat.
Rev. Drug Discov. 2012, 11, 909–922.
66. Hanser, T.; Barber, C.; Marchaland, J. F.; Werner, S. Applicability Domain: Towards a More
Formal Definition. SAR QSAR Environ. Res. 2016, 27, 893–909.
67. Xu, Y.; Ma, J.; Liaw, A.; Sheridan, R. P.; Svetnik, V. Demystifying Multitask Deep Neural
Networks for Quantitative Structure-Activity Relationships. J. Chem. Inf. Model. 2017, 57,
2490–2504.
68. Norinder, U.; Carlsson, L.; Boyer, S.; Eklund, M. Introducing Conformal Prediction in
Predictive Modeling. A Transparent and Flexible Alternative to Applicability Domain
Determination. J. Chem. Inf. Model. 2014, 54, 1596–1603.
69. Cortes-Ciriano, I.; Bender, A. Deep Confidence: A Computationally Efficient Framework for
Calculating Reliable Prediction Errors for Deep Neural Networks. J. Chem. Inf. Model. 2019,
59, 1269–1281.
70. Harrington, E. C. The Desirability Function. Industrial Quality Control 1965, 21, 494–498.
71. Jasrasaria, D.; Pyzer-Knapp, E. O. Dynamic Control of Explore/Exploit Trade-Off in Bayesian
Optimization. In Advances in Intelligent Systems and Computing; Springer: Cham, 2019; Vol. 858,
pp 1–15.
72. Schneider, G. Automating Drug Discovery. Nat. Rev. Drug Discov. 2018, 17, 97–113.
73. Besnard, J.; Ruda, G. F.; Setola, V.; Abecassis, K.; Rodriguiz, R. M.; Huang, X. P.; Norval, S.;
Sassano, M. F.; Shin, A. I.; Webster, L. A.; Simeons, F. R.; Stojanovski, L.; Prat, A.; Seidah,
N. G.; Constam, D. B.; Bickerton, G. R.; Read, K. D.; Wetsel, W. C.; Gilbert, I. H.; Roth, B.
L.; Hopkins, A. L. Automated Design of Ligands to Polypharmacological Profiles. Nature 2012,
492, 215–220.
74. Borrotti, M.; De March, D.; Slanzi, D.; Poli, I. Designing Lead Optimisation of MMP-12
Inhibitors. Comput. Math. Methods Med. 2014, 258627.
75. Varela, R.; Walters, W. P.; Goldman, B. B.; Jain, A. N. Iterative Refinement of a Binding
Pocket Model: Active Computational Steering of Lead Optimization. J. Med. Chem. 2012, 55,
8926–8942.
76. Kier, L. B.; Hall, L. H. The Generation of Molecular Structures from a Graph-Based QSAR
Equation. Quant. Struct.-Act. Relat. 1993, 12, 383–388.
77. Gordeeva, E. V.; Molchanova, M. S.; Zefirov, N. S. General Methodology and Computer
Program for the Exhaustive Restoring of Chemical Structures by Molecular Connectivity
Indexes. Solution of the Inverse Problem in QSAR/QSPR. Tetrahedron Comput. Methodol.
1991, 3, 389–415.
78. Segler, M. H. S.; Kogej, T.; Tyrchan, C.; Waller, M. P. Generating Focussed Molecule Libraries
for Drug Discovery with Recurrent Neural Networks. 2017, arXiv:1701.01329v1. arXiv.org e-Print
archive. https://arxiv.org/abs/1701.01329 (accessed Apr 23, 2019).
79. Lim, J.; Ryu, S.; Kim, J. W.; Kim, W. Y. Molecular Generative Model Based on Conditional
Variational Autoencoder for De Novo Molecular Design. 2018, arXiv:1806.05805v1. arXiv.org e-
Print archive. https://arxiv.org/abs/1806.05805 (accessed Apr 23, 2019).
80. Kadurin, A.; Nikolenko, S.; Khrabrov, K.; Aliper, A.; Zhavoronkov, A. druGAN: An Advanced
Generative Adversarial Autoencoder Model for De Novo Generation of New Molecules with
Desired Molecular Properties in Silico. Mol. Pharm. 2017, 14, 3098–3104.
81. You, J.; Liu, B.; Ying, R.; Pande, V.; Leskovec, J. Graph Convolutional Policy Network for Goal-
Directed Molecular Graph Generation. 2018, arXiv:1806.02473v2. arXiv.org e-Print archive.
https://arxiv.org/abs/1806.02473 (accessed Apr 23, 2019).
82. Olivecrona, M.; Blaschke, T.; Engkvist, O.; Chen, H. Molecular De-Novo Design Through
Deep Reinforcement Learning. J. Cheminf. 2017, 9, 48–62.
83. Polykovskiy, D.; Zhebrak, A.; Sanchez-Lengeling, B.; Golovanov, S.; Tatnov, O.; Belyaev,
S.; Kurbanov, R.; Artamonov, A.; Aladinskiy, V.; Veselov, M.; Kadurin, A.; Nikolenko, S.;
Aspuru-Guzik, A.; Zhavoronkov, A. Molecular Sets (MOSES): A Benchmarking Platform for
Molecular Generation Models. 2018, arXiv:1811.12823. arXiv.org e-Print archive. https://arxiv.
org/abs/1811.12823 (accessed Apr 23, 2019).
84. Pogany, P.; Arad, N.; Genway, S.; Pickett, S. D. De Novo Molecule Design by Translating from
Reduced Graphs to SMILES. J. Chem. Inf. Model. 2019, 59, 1136–1146.
85. Woodward, R. B.; Hoffmann, R. The Conservation of Orbital Symmetry. Angew. Chem., Int.
Ed. 1969, 8, 781–932.
86. Corey, E. J.; Wipke, W. T. Computer-Assisted Design of Complex Organic Syntheses. Science
1969, 166, 178–192.
87. Cook, A.; Johnson, A. P.; Law, J.; Mirzazadeh, M.; Ravitz, O.; Simon, A. Computer-Aided
Synthesis Design: 40 Years On. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 79–107.
88. Reaxys. https://www.elsevier.com/solutions/reaxys (accessed Mar 28, 2019).

99
89. Leach, A. R.; Bradshaw, J.; Green, D. V. S.; Hann, M. M. H.; Delany, J. J., III. Implementation
of a System for Reagent Selection and Library Enumeration, Profiling, and Design. J. Chem. Inf.
Comput. Sci. 1999, 39, 1161–1172.
90. Szymkuc, S.; Gajewska, E. P.; Klucznik, T.; Molga, K.; Dittwald, P.; Startek, M.; Bajczyk, M.;
Grzybowski, B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew.
Chem., Int. Ed. Engl. 2016, 55, 5904–5937.
91. Bøgevig, A.; Federsel, H.-J.; Huerta, F.; Hutchings, M. G.; Kraut, H.; Langer, T.; Löw, P.;
Oppawsky, C.; Rein, T.; Saller, H. Route Design in the 21st Century: The ICSYNTH Software
Tool as an Idea Generator for Synthesis Prediction. Org. Process Res. Dev. 2015, 19, 357–368.
92. Law, J.; Zsoldos, Z.; Johnson, A. P.; Simon, A.; Major, S.; Reid, D.; Wade, R. A.; Liu, Y.;
Khew, S. Y.; Ando, H. Y. Route Designer: A Retrosynthetic Analysis Tool Utilizing Automated
Retrosynthetic Rule Generation. J. Chem. Inf. Comput. Sci. 2009, 49, 593–602.
93. Klucznik, T.; Mikulak-Klucznik, B.; McCormack, M. P.; Lima, H.; Szymkuć, S.; Bhowmick,
M.; Molga, K.; Zhou, Y.; Rickershauser, L.; Gajewska, E. P.; Toutchkine, A.; Dittwald, P.;
Startek, M. P.; Kirkovits, G. J.; Roszak, R.; Adamski, A.; Sieredzińska, B.; Mrksich, M.; Trice,
S. L. J.; Grzybowski, B. A. Efficient Syntheses of Diverse, Medicinally Relevant Targets Planned
by Computer and Executed in the Laboratory. Chemistry 2018, 4, 522–532.
94. Schwaller, P.; Gaudin, T.; Lanyi, D.; Bekas, C.; Laino, T. “Found in Translation”: Predicting
Outcomes of Complex Organic Chemistry Reactions Using Neural Sequence-to-Sequence
Models. Chem. Sci. 2018, 9, 6091–6098.
95. Segler, M. H. S.; Waller, M. P. Neural-Symbolic Machine Learning for Retrosynthesis and
Reaction Prediction. Chemistry 2017, 23, 5966–5971.
96. Coley, C. W.; Rogers, L.; Green, W. H.; Jensen, K. F. SCScore: Synthetic Complexity Learned
from a Reaction Corpus. J. Chem. Inf. Model. 2018, 58, 252–261.
97. Lin, S.; Dikler, S.; Blincoe, W. D.; Ferguson, R. D.; Sheridan, R. P.; Peng, Z.; Conway, D.
V.; Zawatzky, K.; Wang, H.; Cernak, T.; Davies, I. W.; DiRocco, D. A.; Sheng, H.; Welch,
C. J.; Dreher, S. D. Mapping the Dark Space of Chemical Reactions with Extended Nanomole
Synthesis and MALDI-TOF MS. Science 2018, 361, 569–576.
98. Nielsen, M. K.; Ahneman, D. T.; Riera, O.; Doyle, A. G. Deoxyfluorination with Sulfonyl
Fluorides: Navigating Reaction Space with Machine Learning. J. Am. Chem. Soc. 2018, 140,
5004–5008.
99. Schultz, T. W.; Diderich, R.; Kuseva, C. D.; Mekenyan, O. G. The OECD Toolbox Starts
its Second Decade. In Computational Toxicology, Methods in Molecular Biology (Methods and
Protocols); Nicolotti, O., Ed.; Humana Press: Totowa, NJ, 2018; Vol. 1800.
100. Organisation for Economic Cooperation and Development. OECD Environmental Health and
Safety Series on Testing and Assessment No. 102. In Guidance Document for Using the OECD
(Q)SAR Application Toolbox to Develop Chemical Categories According to the OECD Guidance on
Grouping of Chemicals. https://read.oecd-ilibrary.org/environment/the-guidance-document-
for-using-the-oecd-q-sar-application-toolbox-to-develop-chemical-categories-according-to-
the-oecd-guidance-on-grouping-chemicals_9789264221482-en#page1 (accessed Apr 23,
2019).
101. Ankley, G. T.; Bennett, R. S.; Erickson, R. J.; Hoff, D. J.; Hornung, M. W.; Johnson, R.
D.; Mount, D. R.; Nichols, J. W.; Russom, C. L.; Schmieder, P. K.; Serrrano, J. A.; Tietge,

100
J. E.; Villeneuve, D. L. Adverse Outcome Pathways: A Conceptual Framework to Support
Ecotoxicology Research And Risk Assessment. Environ. Toxicol. Chem. 2010, 29, 730–741.
102. Matthews, E. J.; Kruhlak, N. L.; Benz, R. D.; Contrera, J. F.; Marchant, C. A.; Yang, C.
Combined Use of MC4PC, MDL-QSAR, BioEpisteme, Leadscope PDM, and Derek for
Windows Software to Achieve High-Performance, High-Confidence, Mode of Action–Based
Predictions of Chemical Carcinogenesis in Rodents. Toxicol. Mech. Methods 2008, 18,
189–206.
103. Zhou, H.; Gao, M.; Skolnick, J. Comprehensive Prediction of Drug-Protein Interactions and
Side Effects for the Human Proteome. Sci. Rep. 2015, 5, 11090.
104. Watson, O.; Cortes-Ciriano, I.; Taylor, A.; Watson, J. A. A Decision Theoretic Approach to
Model Evaluation in Computational Drug Discovery. 2018, arXiv:1807.08926v1. arXiv.org e-
Print archive. https://arxiv.org/abs/1807.08926 (accessed Apr 23, 2019).
105. Pearl, J. Theoretical Impediments to Machine Learning with Seven Sparks from the Causal Revolution.
2018, arXiv:1801.04016v1. arXiv.org e-Print archive. https://arxiv.org/abs/1801.04016
(accessed 23rd April 2019).
106. Dietvorst, B. J.; Simmons, J. P.; Massey, C. Algorithm Aversion: People Erroneously Avoid
Algorithms After Seeing Them Err. J. Exp. Psychol. Gen. 2015, 144, 114–126.
107. Werth, B. The Billion-Dollar Molecule: The Quest for the Perfect Drug; Simon and Schuster: New
York, 1995.
108. Vayena, E.; Blasimme, A.; Cohen, I. G. Machine Learning in Medicine: Addressing Ethical
Challenges. PLoS Med. 2018, 15, e1002689.
109. Goodman, B.; Flaxman, S. European Union Regulations on Algorithmic Decision-Making and a
“Right To Explanation”. 2016, arXiv:1606.08813v3. arXiv.org e-Print archive. https://arxiv.
org/abs/1606.08813v3 (accessed Apr 23, 2019).
110. Advanced Chess. Wikipedia. https://en.wikipedia.org/wiki/Advanced_Chess (accessed Apr
23, 2019).

101
Chapter 6

Cognitive Materials Discovery and Onset of the 5th Discovery Paradigm
Dmitry Y. Zubarev and Jed W. Pitera*

IBM Research, Almaden Research Center, 650 Harry Road, San Jose, California 95120-6099,
United States
*E-mail: pitera@us.ibm.com

The discovery of novel materials can generate immense technological, economic, and social benefits. However, such discoveries are the product of slow, challenging, expert-intensive efforts.
Our thesis is that new capabilities of cognitive computing—particularly natural
language processing, knowledge representation, and automated reasoning—are
poised to transform the process of materials discovery and take us from our current
“4th paradigm” of discovery driven by data science and machine learning to a
“5th paradigm” era where cognitive systems seamlessly integrate information from
human experts, experimental data, physics-based models, and data-driven models
to speed discovery. We discuss the key bottlenecks to discovery that need to be
removed to enable this new approach and illustrate progress towards this cognitive
future with examples from IBM research efforts as well as the broader literature.

Introduction
Materials define humanity’s ability to transform the world around us. From structural materials
like steel and composites to functional materials like single-crystal silicon and
polytetrafluoroethylene, materials innovation creates new capabilities and drives new industries.
Many current global challenges such as the lack of drinkable water, the need for energy efficient
transportation, the task of feeding a growing population, and the need for strategies to address carbon
capture and climate change involve a search for new materials as a viable solution strategy (1). It has
been estimated that the full materials lifecycle from discovery through development to market impact
is 10–20 years, though progress has been made in speeding up that timeline (2). Despite the economic and societal driving forces for materials innovation, it remains a relatively slow and expert-intensive process.
It is instructive to compare and contrast the process of pharmaceutical drug discovery with
materials discovery (3). While both fields require sustained effort to accelerate discovery and share some useful similarities, there are significant differences. Drug discovery is most easily compared with the discovery of functional materials—in both cases there is a specific function (e.g., inhibition, excitation, modulation, or transduction) that is being sought. However, the elemental variety of pharmaceuticals is much smaller than that of materials, since pharmaceuticals are largely confined to the first few rows of the periodic table while materials can draw on a much wider range.
This makes the combinatorial design space of materials discovery much broader than that of drug
discovery. A second, and more important, distinction is the targeted operating environment. The vast majority of drugs are intended for use in the human body, which provides a common “operating environment” in terms of the ranges of temperature, pH, solvent, and oxidative stress they will be exposed to. In contrast, the operating environments for materials are far more diverse.
They range from the same drug-like human body environment (e.g., for structural biomaterials used
in bone reconstruction) to much harsher conditions (e.g., the >1500 °C temperatures endured
by thermal barrier coating materials in a jet engine) (4, 5). Finally, most drugs are well-defined
collections of atoms either covalently or ionically bonded together, while materials cover a diverse
range of physical and electronic structures, from simple chemicals to networks of covalently bonded
structures (e.g., epoxy polymers) all the way to metals and quantum materials with exotic band
structures (6, 7), as well as in-between cases like metal-organic frameworks composed of both
covalent and organometallic bonds (8).
Historically, there have been four successive paradigms of materials discovery. The first was
based purely on human observation of the natural world, leading to the development of the first
metals and ceramics. The second was based on directed experimentation, where an application
is known and systematic testing is used to search for materials that can serve the application
requirements. This real-world testing is replaced by in silico testing using predictive first-principles
computational models in the 3rd paradigm. Finally, in the 4th and current paradigm, the focus shifts
towards data-centric models as the driver of materials search and discovery (9).

Figure 1. Relevant paradigms of discovery. A) Fundamentally, the components of the discovery remain the
same; however, their relative importance varies. B) Dominance of the generalizable first-principles models
rooted in differential equations and supported by advances in personal and high-performance computing is
characteristic of the 3rd discovery paradigm; C) shift towards data-centric models rooted in statistical
learning is an attribute of the 4th paradigm; D) elimination of the bottlenecks in the information/knowledge
flows between components of the discovery and fusion enabled by domain-specific AI is perceived as a
signature of the 5th paradigm.

At the same time that materials innovation moves at one pace, the field of artificial intelligence
(AI) and machine learning (ML) is undergoing an explosive renaissance. The availability of large
labeled datasets has enabled the training of complex many-parameter models such as deep neural
networks for complex tasks like handwriting and image recognition (10, 11). In fact, it is no longer
unusual for targeted AI/ML models to exceed human performance on well-defined tasks like face
recognition (12). Our thesis is that large datasets of materials information, combined with AI/
ML innovations and improved tools for capturing the knowledge of human experts, will enable a
new paradigm of materials discovery—the 5th paradigm. Figure 1 illustrates this vision of the fifth
paradigm of discovery.
Cognition is defined as the mental action or process of acquiring knowledge and understanding
through thought, experience, and the senses. Projecting this definition onto the information
technology domain, cognitive systems can be defined as a broad class of information technology
systems, often composed of multiple components or services, that embody a set of five core
capabilities (13):

1. They create deeper human engagement.
2. They scale and elevate expertise.
3. They infuse products and services with capabilities to acquire knowledge, perform inference, make decisions, and generate new knowledge.
4. They enable cognitive processes and operations.
5. They enhance exploration and discovery.

How might some of these capabilities manifest in cognitive materials discovery? The first,
creating deeper human engagement, can be brought about by presenting large materials data sets
in rich, contextual forms with mechanisms for rapidly integrating human expert knowledge. The
second capability is also clear—natural language processing and other tools for automated data
collection and extraction will enable materials researchers to work with a scale of data they are
currently unable to handle. Their expertise can also be scaled by computational models that encapsulate
subject matter expertise and knowledge and can then be applied to larger data sets than a human is
capable of reviewing.
The third capability eliminates or significantly reduces the entry barrier for human interactions
with highly specialized and technically demanding algorithms, for example, in data analysis and
exploration. The fourth capability follows from the previous one, where multiple cognitive-infused
products and services form workflows.
Capability number five is self-evident—after all, the goal of cognitive materials discovery is
to speed the exploration, design, and discovery of materials. In this discovery context, capability
number four, the enablement of cognitive processes and operations, is essentially redundant with
capability number five as the process we are focused on is discovery itself.
Though the topic of this chapter is cognitive materials discovery, it is important to recognize
that there are actually two different use cases for the development of new materials—design and
discovery. In this context, design refers to the optimization of an existing materials category for a
targeted application (e.g., the tuning of tensile yield strength, elongation, and corrosion resistance in
a structural alloy by adjusting composition, processing, and annealing parameters). For design use
cases, the model—or design space—of the material is known and the task is one of interpolation and
optimization. In contrast, discovery refers to the identification of a totally novel materials category or
design paradigm, one that is outside of, and not predicted by, existing models. The task in discovery
is one of extrapolation, exploration outside of known data, and development of new models or
concepts.
This distinction between design and discovery can be somewhat abstract, so it is useful to
illustrate it with some examples from IBM’s polymer research and development efforts. An example
of polymer design is the work of Allen et al. (14), who systematically explored a design space of
acid-labile monomer composition in an acrylate random heteropolymer photoresist to optimize
imaging performance and chemical stability. A contrasting example is the discovery of high-strength poly(hexahydrotriazine) (PHT) polymers by Garcia et al. (15). A key component
was left out of the synthesis of a traditional condensation polymer, and the resulting product was
a dense black mass stuck at the bottom of the reaction vessel. The material was so tough it was
impossible to scrape off a sample, and it had to be extracted by shattering the reaction vessel. A
combination of nuclear magnetic resonance and computational analysis was used to identify the
tough material as a highly crosslinked PHT polymer, a new class of high toughness polymers able to
make composites tougher than bone (15). These two examples represent limiting cases. We recognize
that design/discovery modalities are continuum-like; this leads to semantically confusing verbiage
(e.g., discovery by design) that we prefer to avoid in order to preserve clarity of the text. Clearly,
the emergence of new modalities of design/discovery does not eliminate old ones—the PHT example would be associated with the 3rd discovery paradigm, even though it occurred within the period of time associated with the 4th paradigm.
The considerations discussed in this chapter are based both on observed trends and on internal IBM Research efforts to accelerate the discovery of polymer materials relevant to a broad range of
industrial applications. We identify the following factors as the major drivers of the onset of the 5th
Paradigm:

1. Acceleration of materials discovery and development is contingent on the elimination of the bottlenecks inherited from the preceding paradigms.

- Earlier paradigm shifts were mostly defined by drastic improvements in one particular component of the scientific process (e.g., the explosive improvement in the accessibility of ab initio computer simulations following the maturation of computing hardware, software, theory, and algorithms).

2. Human subject matter experts (SMEs) will remain the driving force of discovery in the foreseeable future.

- AI-driven technology will enable super-human capabilities of human SMEs rather than replace them. AI discovery tools need to be designed with human cognitive bias and limitations in mind.

3. Data-centric discovery and development requires an influx of hard data with high novelty and low redundancy.

- Historical data are limited in volume and content; they are exhaustible and cannot sustain the 5th paradigm. The new paradigm will grow out of advanced experimental data acquisition interacting seamlessly with AI models, advanced design of experiments (DOE), and human SME guidance.

Onset of the 5th Paradigm: Bottlenecks
The earlier paradigms of scientific discovery were associated with unbalanced growth of one of
the aspects or tools used to drive discovery. The onset of the next paradigm starts with the fusion
of these components via the elimination of critical bottlenecks. The bottlenecks that we expect
to be resolved upon the transition to the 5th paradigm are data availability, reflective of the finite
number and limited diversity of the data points that can be extracted from the existing body of
knowledge; cognition, caused by limited capabilities of human SMEs to process the fast-growing
volume of scientific knowledge; and actionability, indicative of the inefficient transfer of
computationally generated hypotheses to experimental validation.

Data Availability Bottleneck


Decades of attempts by the computational branches of chemistry, physics, and materials science to change the pace of discovery have led to a thorough blending of the “discovery” and “design”
concepts. Earlier we provided an operational definition of these concepts for the purposes of this
contribution; here, we want to underscore the underlying incompleteness of the scientific knowledge
accumulated up to this point in time.
The published scientific literature has a significant bias for positive results, which are naturally
the most interesting and most likely to be reported. In ML terms, this heavy class imbalance, arising from the preferred disclosure of positive results and the withholding of negative ones, constitutes the most acknowledged yet trivial form of the data availability bottleneck. Let us assume that the collective
psychology of the scientific community has changed, and both positive and negative data classes
are available for mining and analysis. Our practical experience shows that the volume of the data
immediately relevant to the design/discovery targets of any given industrial customer is 1–2 orders
of magnitude smaller than the overall volume of the knowledge in the pertinent domain of polymer
materials. It is typical for the design/discovery task to target “outliers,” which are materials with
properties far from the average that are statistically unlikely, and “unicorns,” which are materials
with performance characteristics that are unattainable simultaneously. Examples of such objects are
either rare in the available data sets or nonexistent. This means that the focus of the design/discovery
process shifts from balancing interpolation/extrapolation performance of the models to the search
for rare objects that, by definition, are poorly described by the existing models. To find high-value
materials, novel ideas are required, but they are by definition rare in the supporting data.
Where is the source of the novelty that has the potential to break the models capturing these
data (e.g., the regression models capturing structure-property relations of materials)? The explosive
development of quantum mechanics did not follow from the beauty of classical mechanics equations.
It took several problematic paradigm-breaking experimental results to trigger the revolution. The
modus operandi of the current discovery paradigm sets its sights on the generalizability and
transferability of statistical learning models, instead of finding new hard data that break them. Are
there key pieces of disruptive knowledge hidden in the papers, patents, curated datasets, and such?
Whether the answer is “yes” or “no”, mining the existing data cannot be seen as a sustainable
alternative to the acquisition of new, conceptually puzzling hard data.
The new discovery paradigm implies a thorough exploration of the limits of existing knowledge
that might manifest as a low diversity of available data points, the fundamental scarcity of outliers,
or the biases of human SMEs to focus experiments around known designs. Breaking through these
limits involves finding the optimal combination of the deployment of accurate modeling (e.g., in
virtual data-augmentation strategies), statistical learning on the available data that translates into
DOE, and advanced experimental data acquisition via semi- or fully-automated experimental
platforms capable of rudimentary cognitive functions.

Cognition Bottleneck
The predominant form of the knowledge representation and transfer is a legacy one, via
unstructured sources such as publications, patents, and technical reports. The volume of scientific
publications is growing (see Figure 2). In the polymer materials domain, the number of relevant publications ranges from several hundred to several thousand per year. At these rates, the traditional way of knowledge extraction—human SMEs’ reading and comprehension of written
reports—is already unsustainable. This cognition bottleneck leads to the fragmentation of the
scientific domains into sub-fields where each sub-field is mostly driven by personal and historical
biases and lacks an influx of the ideas from adjacent sub-fields. Cognition bottlenecks trap SMEs
within the limits of their graduate and postgraduate research. They limit exposure of SMEs to the
new ideas, concepts, and solutions developed in related areas of research, stimulate “reinvention” of
previously known solutions, and hinder adoption of the most efficient practices. Further discussion
of the cognition bottleneck can be framed in the context of search strategies, such as depth first vs.
breadth first (16), or in the context of the general intelligence, such as fluid vs. crystallized intelligence
(17). A discussion of such nature is beyond the scope of this contribution.

Actionability Bottleneck
The current paradigm of scientific discovery is strongly driven by advanced methods of statistical
learning, such as ML and deep learning (DL) techniques. Since these approaches target evaluation
of properties via regression or classification, they effectively play the role of advanced DOE that
happens to be abstracted from the experiment itself. Training of a statistical learning model requires
compilation of the appropriate reference dataset. However, there is no explicit requirement that
all the materials in any given dataset can be prepared using compatible synthetic methodologies/
experimental protocols. Data from very different sources and synthetic approaches might be
combined in a training dataset, limiting its predictive capability. In other words, statistical models
might be trained on the data that have been produced by different labs and make unrealistic
predictions confounded by the practical operational constraints of those diverse labs.
One possible strategy to quantify the practical impact of computational research on materials
development is to compare the volume of scientific publications in the materials domain with the
volume of patents in the polymer material area (Figure 2). First, we obtained the year-to-year
publication count from the Web of Knowledge in computational fields associated with the 3rd and
4th discovery paradigms (Figure 2A). The former were obtained using the query “material” AND
“density functional theory” and the latter were obtained using the query "material" AND ("machine
learning" OR "deep learning" OR "artificial intelligence"). Second, we obtained the patent count
from the IBM patent server CIRCA using the query ("polymer material") AND reaction_count:([2
TO 200]); this query targeted patents that contain description of chemical reactions and pertain to
material synthesis/preparation. Finally, we evaluated the year-to-year differential volume of inventions based on the material patent data. Considering the year-to-year differential volume of inventions as a proxy for the marginal product and given the monotonic growth of the publication volume, we see no dramatic effect of the latter on the former (Figure 2D). It appears that discovery and innovation in
the polymer materials domain has routinely gone through phases of diminishing returns during the
existing discovery paradigms.
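To make the bookkeeping behind this analysis explicit, the following minimal Python sketch computes the publication ratio of Figure 2B and the year-to-year differential invention volume of Figure 2D; the CSV file names and the year/count column layout are hypothetical placeholders for exported query results, not an actual IBM workflow.

```python
# Minimal sketch of the Figure 2 bookkeeping. The CSV file names and the
# "year"/"count" column layout are hypothetical placeholders for the exported
# Web of Knowledge and CIRCA query results.
import pandas as pd

pubs_ml = pd.read_csv("query_A_ml_publications.csv", index_col="year")    # 4th-paradigm proxy
pubs_dft = pd.read_csv("query_B_dft_publications.csv", index_col="year")  # 3rd-paradigm proxy
patents = pd.read_csv("query_C_polymer_patents.csv", index_col="year")    # production-output proxy

# Figure 2B: ratio of 3rd- to 4th-paradigm publication volume per year.
ratio = pubs_dft["count"] / pubs_ml["count"]

# Figure 2D: year-to-year differential volume of inventions, used as a proxy
# for the marginal product of research labor.
marginal_product = patents["count"].diff()

print(pd.DataFrame({"dft_to_ml_ratio": ratio,
                    "marginal_inventions": marginal_product}).dropna())
```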

In general, the preparation of a polymer material is a multistep process involving multiple
components, such as molecular components and polymers, and elaborate experimental protocols,
from the synthesis of monomers, through polymerization, formulation, and postprocessing,
eventually rendering the polymer material with targeted properties. These time- and resource-
consuming procedures are often prohibitively expensive for medium/high throughput
experimentation. Given the cost of the experimental validation and testing, the burden of decision
making typically falls to the human SME who personally executes the work. Therefore, the ability
of data-driven statistical learning models to produce a high volume of hypotheses that appear to
be promising does not necessarily lead to acceleration of the discovery process as long as these
hypotheses still have to go through the SME-driven scientific triage.

Figure 2. Volume of publications in the materials science domain employing different computational
approaches and volume of patents describing inventions in polymer materials domain (18, 19). A) Query A
"material" AND ("machine learning" OR "deep learning" OR "artificial intelligence") (blue dots) captures
the contribution of the methods associated with the 4th discovery paradigm; query B “material” AND
“density functional theory” (orange dots) captures the contribution of the methods associated with the 3rd
discovery paradigm; B) ratio of the number of publications in query B to query A for each year shows that
the research activities are still dominated by the methods of the 3rd paradigm. C) Query C ("polymer
material") AND reaction_count:([2 TO 200]) captures the volume of the inventions in the area of polymer
materials synthesis. D) Year-to-year differential volume of inventions based on Figure 2C data. Queries A
and B offer a proxy for the amount of labor and query C serves as a proxy for the production output.
Considering year-to-year differential volume of inventions as a proxy for the marginal product and given
monotonic growth of the publications volume, we notice that inventorship in polymer materials space
regularly encounters periods of diminishing returns and doesn’t appear to be affected by the transition from
the 3rd to 4th discovery paradigm.

Discovery Bottlenecks in Materials Science


In the spirit of the industrial perspective discussing Cognition as a Service, we evaluate how these
bottlenecks affect the tasks performed by practitioners of the occupation “material scientist” (20).
The definitions of the tasks can be found at O*Net Online resource (21). Out of the fourteen tasks
describing the occupation “material scientist,” we consider the first ten as being closely associated
with research. The remaining tasks involve testing and management and are not related to the scientific aspect of the occupation. We have also taken the liberty of generalizing some of the tasks that mention metals/alloys to their counterparts related to polymer materials:

1. Conduct research on the structures and properties of materials, such as metals, alloys,
polymers, and ceramics, to obtain information that could be used to develop new products
or enhance existing ones.

- This task requires reading and critical assessment of the increasing volume of the
unstructured sources of scientific and engineering knowledge and, therefore, is
hindered by the cognition bottleneck.

2. Prepare reports, manuscripts, proposals, and technical manuals for use by other scientists
and requestors, such as sponsors and customers.

- This task requires summarization of the pre-existing and acquired knowledge; these activities constitute a broader context of the cognition bottleneck because they typically require providing adequate context for the reported results.

3. Perform experiments and computer modeling to study the nature, structure, and physical
and chemical properties of metals and their alloys, and their responses to applied forces.

- This task is equally relevant for polymer materials and is equally if not more
challenging from the technical point of view; it involves resolution of the data
availability bottleneck.

4. Plan laboratory experiments to confirm feasibility of processes and techniques used in the
production of materials with special characteristics.

- This task interfaces DOE with data acquisition and requires resolution of the
actionability bottleneck.

5. Determine ways to strengthen or combine materials or develop new materials with new or
specific properties for use in a variety of products and applications.

- Closely related to the previous task, this one requires generation of multiple
hypotheses and selection of a few that are worth validating, facing the actionability
bottleneck.

6. Teach in colleges and universities.

- Even this seemingly benign task is heavily affected by the cognition bottleneck that needs to be addressed in order to keep the curriculum current.

7. Devise testing methods to evaluate the effects of various conditions on particular materials.

- Diversity of the operational conditions in applications of polymer materials almost certainly guarantees the lack of the necessary tests and the difficulty of the direct acquisition of the required measurements, exposing this task to the data availability bottleneck.
8. Research methods of processing, forming, and firing materials to develop such products as
ceramic dental fillings, unbreakable dinner plates, and telescope lenses.

- This task is adjacent to number five and requires resolution of the actionability
bottleneck affecting hypothesis generation and validation; critical assessment of
the reported methods suffers from the cognition bottleneck.

9. Confer with customers to determine how to tailor materials to their needs.

- It is easy to misjudge the challenge and complexity of this purely communicative task. In fact, it deals with a form of the cognition bottleneck—human SMEs of the customer have unique knowledge of the practical constraints associated with the business model, technological processes, historical directions, etc., but this knowledge is unstructured and might not be obvious to the SMEs themselves. In practice, resolution of the cognition bottleneck often exposes the deeper data availability and actionability bottlenecks.

10. Recommend materials for reliable performance in various environments.

- This is a complex task encapsulating multiple stages of the research process, from scoping the existing solutions, to producing practical hypotheses, to data acquisition and interpretation; depending on the specifics, all three bottlenecks (cognition, actionability, and data availability) contribute to the challenges.

It is clear from this task-based analysis that all ten of the characteristic materials science research
tasks we reviewed are impacted by one or more of the bottlenecks we have identified as limiting research productivity and materials discovery.

To the 5th Paradigm via Cognitive Systems


The outlined bottlenecks require complex solutions over a broad range of technological,
organizational, and psychological aspects of the research. Much promising work is being done along these lines. Data augmentation strategies in polymer materials discovery exemplify the fusion between fundamental, transferable, but resource-consuming computational approaches, such as quantum chemistry, on the one hand, and fast, accessible statistical learning methods that lack generalizability and depend on the data, on the other. Experimental automation attracts growing attention with the
advances of DL techniques and development of chemical systems suitable for medium-to-high
throughput experimentation (22, 23).
In this section, we will concentrate on signature efforts within the Accelerated Materials
Discovery program of IBM Research that aim to resolve the aforementioned bottlenecks with the
goal of augmenting the capabilities of human SMEs to enable super-human materials discovery. This
research program is part of a broader IBM initiative in Cognitive Discovery that aims to accelerate
scientific discovery more generally (24). Instead of picking some specific framework, such as DL or
semantic embedding, and identifying it with the next discovery paradigm, we start by recognizing the
centrality of human SMEs to the 5th paradigm and select practically relevant tool sets and examples,
regardless of how well-known they are.
The resolution of the cognition bottleneck, quite literally, comes down to teaching computers
to read scientific articles and patents like a human SME. The formal task at hand is domain-specific
knowledge extraction; while complex, it leverages many existing technologies in natural language
processing, semantic asset enrichment, and knowledge representation (25–27). Domain-specificity
is embodied by the development of semantic assets and annotators that capture knowledge relevant
to a specific (polymer materials) domain with the help of IBM research SMEs. Using this technology,
the current IBM polymer annotator extracts measured properties reported in the tables,
specifications of the measurements, chemical entities and their roles, and chemical reactions (Figure
3). The annotator is built using capabilities of a rule-based information extraction system, System
T (27). System T’s basic design removes the expressivity and performance limitations of current
systems based on cascading grammars. System T uses a declarative rule language, annotation query
language, and an optimizer that generates high-performance algebraic execution plans for annotation
query language rules.
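To convey the flavor of such rule-based extraction in a self-contained way, the minimal Python sketch below matches a toy entity dictionary and two property patterns against an example sentence; the patterns, roles, and example text are illustrative assumptions and are not the rules of System T or of the IBM polymer annotator.

```python
# Minimal illustration of rule-based property/entity extraction. This is not
# System T or AQL; the patterns, dictionary, and example text are illustrative.
import re

TEXT = ("The copolymer of styrene and maleic anhydride exhibited a glass "
        "transition temperature of 157 °C and an elongation at break of 12%.")

PROPERTY_RULES = {  # property name -> pattern capturing a numeric value
    "glass_transition_temperature": r"glass transition temperature of\s*([\d.]+)\s*°C",
    "elongation_at_break": r"elongation at break of\s*([\d.]+)\s*%",
}
ENTITY_DICTIONARY = {"styrene": "monomer", "maleic anhydride": "monomer"}

def annotate(text):
    """Return dictionary-matched entities with roles and regex-matched properties."""
    result = {"entities": [], "properties": []}
    lowered = text.lower()
    for name, role in ENTITY_DICTIONARY.items():
        if name in lowered:
            result["entities"].append({"text": name, "role": role})
    for prop, pattern in PROPERTY_RULES.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            result["properties"].append({"property": prop, "value": float(match.group(1))})
    return result

print(annotate(TEXT))
```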

Figure 3. Example of knowledge extraction in the polymer materials domain with IBM Polymer Annotator
leveraging System T capabilities. The extraction emphasizes not only tagging the entities but recognizing
their role in the text and establishing relations. The quality metrics of the annotator performance are shown
in the bottom panel.

Currently, discovery in the domain of polymer materials is enabled by specialized curated datasets, often recast as databases (28). Compilation of such datasets is strongly affected by the cognition bottleneck and the availability of data. As the focus of the discovery paradigm moves from enabling data-centric statistical modeling towards the fusion and augmentation of SME capabilities via cognitive
technologies, the relevance of the curated specialized databases is expected to decrease. The data that
constitute curated datasets will be subsumed by the entire volume of the knowledge extracted from
the unstructured data sources. This knowledge should become available at the fingertips of the SME
who should interact with the data in the same way as two SMEs converse with each other. Such
interactions are entity-centered, rather than being document-centered; they develop along the line
of progressive refinement of the questions asked; and, finally, go beyond trivial data reporting to
producing actionable hypotheses that aim to resolve scientific problems. At this point, content equivalent to that of contemporary curated datasets will be exploited without being explicitly exposed.
Application of the knowledge extraction technologies in the polymer materials domain
described above to patents and publications can produce the necessary content. The next step to
resolve the cognition bottleneck is the development of data models and tools enabling efficient
access to the extracted knowledge from the content. The concept of knowledge graphs (KGs) offers
the desired functionality (29). General technologies, including AI-based, enable automatic creation
of custom KGs from unstructured data in the business analytics domain and solve tasks of entity
and relationship disambiguation (30), relationship enrichment, and relevance-based ranking of the
results. These KGs provide functionality for search, summarization, recommendation engines, and
general decision-making processes. There are now some public examples of KGs and KG software
infrastructure specifically targeted at the materials domain (31, 32). Proliferation and adoption of
KGs in materials science beyond these early examples critically depends on the development of tools that enable processing of the fast-growing volume of unstructured data, such as scientific papers and patents. Systems that can ingest these documents at scale and make the contained knowledge discoverable, such as the Corpus Conversion Service, are expected to play an increasingly important role in
the KG ecosystem (33).
The main challenge in the development of KGs for materials knowledge comes from establishing
the potential workload as a set of use cases and finding appropriate measures of KG efficiency. The
development of the scientific KGs is in its infancy and the perception of the relevance of scientific
workloads suitable for KGs will be evolving. We recognize that KGs for scientific applications will
inevitably address trivia-like use cases, but these use cases are just a stepping stone to the capability
of helping to answer questions that do not have known answers. Therefore, we identify three general
categories of the conceptual and technological maturity of scientific KGs:

1. Human SMEs ask questions. Requires advanced search and browsing capabilities,
including conversational use cases.
2. KG is used to automatically generate questions. Requires active learning capabilities,
including advanced conversational use cases.
3. KG autonomously interrogates the data, generates and selects hypotheses, acts upon
hypotheses. Requires domain-specific AI capable of passing a subject matter expert Turing
test, a.k.a. Feigenbaum test, potentially fused with data acquisition capabilities.

The entity-centric nature of KGs dictates that the entities should be exposed to the user
according to models specific to the use cases. For example, a human who is a company founder has
business associates as relevant entities, whereas a human who is an artist has their artwork as the most
relevant entities. The models can be defined as ontologies that account for the specifics of the domain
represented by the KG and capture the internal expertise of the KG developer. Ontologies ultimately
determine what questions can be answered by the KG. There are no established ontologies for
materials science in general and the polymer materials domain in particular, though we expect these
to evolve over time. Here, we discuss ontology as a representation of the knowledge in the domain of
polymer materials. There are examples of structural ontologies in chemometrics that involve forming
sets of labels of the structural motifs encountered within organic molecules (34, 35). Figure 4 shows
an example of a basic polymer material ontology motivated by the recent progress in systems for
natural language querying over relational data stores (36). The presented version of the ontology
captures hierarchical relations between chemical entities, from the level of molecular precursors to
the formulation of polymer materials. Accounting for the factual vagueness of the polymer materials
description, it captures both the case when detailed information about the chemical entity is available
(e.g., the formulation is explicitly described or the repeat unit is unambiguously established) and the
case when the chemical entity is represented as a “bag of constituent components” (e.g., polymer is
described as a backbone with a variety of pendant groups or only precursors and synthetic protocol
are provided). Extended relations in the ontology (omitted from Figure 4 for readability) include
the impact of a chemical entity on a property, such as improvement of the elongation at break via
addition of soft-segment polymer, known performance bottlenecks of the materials, and relevant
products and application domains of the materials.
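A minimal sketch of how a fragment of such an ontology could be encoded and queried in an entity-centric way is shown below; the entity and relation names are illustrative assumptions rather than the actual ontology of Figure 4.

```python
# Minimal sketch of a polymer-materials ontology fragment as triples.
# Entity and relation names are illustrative, not the actual IBM ontology.
from collections import defaultdict

TRIPLES = [
    # hierarchical relations: precursor -> polymer -> formulation
    ("diamine_monomer_A", "is_precursor_of", "polyimide_backbone_B"),
    ("polyimide_backbone_B", "is_constituent_of", "formulation_C"),
    ("soft_segment_polyether", "is_constituent_of", "formulation_C"),
    # extended relations: impact of a constituent on a property, applications
    ("soft_segment_polyether", "improves", "elongation_at_break"),
    ("formulation_C", "has_application_domain", "lithography"),
]

outgoing = defaultdict(list)          # entity-centric index over the triples
for subj, pred, obj in TRIPLES:
    outgoing[subj].append((pred, obj))

def describe(entity):
    """Answer a simple entity-centric question: what does the graph assert about it?"""
    for pred, obj in outgoing.get(entity, []):
        print(f"{entity} --{pred}--> {obj}")

describe("soft_segment_polyether")
```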
The combination of knowledge extraction and development of a polymer-focused KG offers
a path to resolving the cognition bottleneck via automated processing of the unstructured data.
In general, any scientific KG is a form of the knowledge representation that parallels the internal
knowledge representation in the brains of human SMEs. At the basic level, the produced domain-
specific KG can be leveraged for the disambiguation of polymer materials, recommendation tasks,
and construction of constrained datasets by projection.

Figure 4. Example of a basic ontology for the polymer materials domain. Extended relations include impact
of a chemical entity on a property, such as control of elongation at break via addition of soft-segment
polymer. The described ontology captures both the cases when detailed information about the chemical
entity is available (e.g., the formulation is explicitly described) and the cases when the chemical entity is
represented as a “bag of components” (e.g., polymer is described as a backbone with a variety of pendant
groups).

The actionability bottleneck is the signature of the 4th paradigm. Statistical learning models
achieve an outstanding rate of evaluation of new data points in regression and classification tasks;
their promise is that they can be used to produce large volumes of results at a fraction of the cost of the
underlying physics-based fundamental models. However, a large volume of generated hypotheses
does not necessarily translate into a comparable volume of experimental validation. There are very
concrete reasons why this is the case in the polymer material domain. One major problem is poor
accounting for the synthetic accessibility of the proposed polymers; even if the proposed polymer
can be synthesized in principle, the question remains about synthetic accessibility of the respective
monomers. Unlike in pharmaceutical chemistry, the cost of both monomer synthesis and polymerization has to remain low, consistent with the low margins typical of common polymer applications. This drives design towards high availability of the synthesis components and simple,
efficient synthetic routes. The existing methods of synthetic accessibility analysis are not granular
enough for these purposes. In general, preparation procedures, treatment protocols, and evaluation
of the properties are complex in polymer materials domain.
For example, investigation of a single photosensitized polyimide for use in lithography in the
semiconductor industry consumes up to a week, including the preparation of the monomers,
polymerization, formulation, dissolution studies, mechanical tests, and lithography performance
assessment. Investing a week of labor of a trained Ph.D.-level professional per candidate
material effectively means that the number of hypotheses—computational or human-
generated—accepted for validation remains in single digits. At this volume, any computationally
generated hypotheses face strong competition with hypotheses generated by human SMEs. The factors shaping the selection process include:

1. Comprehensible scope. The human SME will generally be willing to critically assess as
many computational hypotheses as they can generate on their own. In practice, this means
that proposing a thousand candidates is not any better than proposing ten.
2. Overlap with the SME’s knowledge, experience, and intuition. If there is no clear
understanding of how to act upon a computational hypothesis, it will be rejected.
3. Confidence in the quality of computational hypotheses. This is particularly hard to achieve
with data-driven models, given the low transferability of model performance.

Automation of experimental data acquisition is tremendously helpful in resolving the actionability bottleneck, since it increases the number of feasible experiments by orders of magnitude. A
complementary strategy involves injecting the human SME’s expertise in the process of
computational generation of candidate polymer structures with target performance. The workflows
of high-throughput computational screening are well-documented (37). The underlying exploration
strategy of the set of chemical structures (e.g., combinatorial enumeration, metaheuristic, generative
models, etc.) is well understood. In an internal research project at IBM, high-Tg polymers were
pursued via high-throughput inverse design. Computationally generated structures with promising
properties were enumerated and reviewed by an experimental polymer chemist. The human SME
estimated the volume of the actionable hypotheses in the read-out as a fraction of a percent. As for
actual synthetic accessibility, the major family of proposed structures was synthetically impossible,
and the second largest family was synthetically possible but impractical. The next step involved
introduction of a set of constraints on the structure of the generated polymers. These constraints
reflected both the expertise of the human SME as well as their preferences dictated by the current
priorities and available lab capabilities. Postselection of the generated structures informed by these
constraints enriched the volume of the actionable predictions in the read-out by two orders of
magnitude, to tens of percent. Clearly, the injection of the human SME’s judgement improved
actionability in this case but did not improve the efficiency of the generative process. Analysis of
the future directions in the exploration of the fusion between human SMEs and computational
hypothesis generation points towards human-in-the-loop models of interaction, where input from
the human SME biases the generative process as opposed to filtering its outcome. As long as
actionability means “selection by the specific human SME,” computationally generated hypotheses
have higher actionability if they bear an intellectual stamp of the SME and remain within or close to
the bounds of the SME’s knowledge, experience, and intuition. There is a natural concern that the
limitations imposed by the finite and biased nature of the human knowledge will have a detrimental
effect. In fact, in other fields such as semantic asset enrichment, human-in-the-loop approaches have
demonstrated a capability to expose these limitations, serving as a driver for the evolution of the
SME’s knowledge, experience, and intuition as a part of the cognitive discovery process (38).
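The following minimal sketch illustrates this kind of constraint-informed postselection, assuming the open-source RDKit toolkit is available; the example SMILES, the SMARTS constraints, and the molecular-weight cutoff are hypothetical placeholders rather than the constraints used in the high-Tg project.

```python
# Minimal sketch of SME-informed postselection of generated monomer candidates.
# The SMILES, SMARTS constraints, and molecular-weight cutoff are hypothetical.
from rdkit import Chem
from rdkit.Chem import Descriptors

GENERATED_SMILES = ["c1ccccc1C=C", "O=C(O)c1ccc(C(=O)O)cc1", "C1CC1N=[N+]=[N-]"]

FORBIDDEN_SMARTS = ["[N-]=[N+]=N"]     # e.g., the SME rejects azides as impractical
REQUIRED_SMARTS = ["C=C", "C(=O)O"]    # e.g., at least one polymerizable handle
MAX_MONOMER_MW = 400.0                 # keep precursors cheap and simple

def sme_postselect(smiles_list):
    accepted = []
    forbidden = [Chem.MolFromSmarts(p) for p in FORBIDDEN_SMARTS]
    required = [Chem.MolFromSmarts(p) for p in REQUIRED_SMARTS]
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:                          # unparseable structure
            continue
        if Descriptors.MolWt(mol) > MAX_MONOMER_MW:
            continue
        if any(mol.HasSubstructMatch(p) for p in forbidden):
            continue
        if not any(mol.HasSubstructMatch(p) for p in required):
            continue
        accepted.append(smi)
    return accepted

print(sme_postselect(GENERATED_SMILES))
```

In a human-in-the-loop setting, the same constraints would instead bias the generative process itself rather than filter its output after the fact.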
The data availability bottleneck is a fundamental one in scientific pursuit. Questions and
hypotheses that push the boundaries of knowledge necessarily require the collection of new data,
particularly in domains with large combinatorial complexity like chemistry and materials science.
Regardless of theoretical developments, experiment remains the source of ground truth data. Both
at IBM and as a field, we are moving in the direction of a fusion between statistical modeling and
experimental platforms, not just at the stage of carrying out experiment but also in the evaluation
of the experimental outcome. Along with the necessary advancements of the computational
components, such as ML/DL algorithms, software, and hardware, autonomous experimental data
acquisition in the polymer materials domain requires co-development of automated synthesis
systems and chemistries suitable for such platforms. The chemistries should be robust, transferable,
and enable access to rich classes of materials; urea-catalyzed ring-opening polymerization (39) is one such chemistry. Endowing experimental platforms with cognitive capabilities will be the next powerful step. The
hope is to depart from blind screening methodologies and conduct data acquisition in a targeted
active learning manner, fully informed of the deficiencies and shortcomings of the available
knowledge/data.
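A minimal sketch of such targeted, uncertainty-driven acquisition is shown below; the random descriptors and property values are placeholders, and the per-tree spread of a random forest regressor stands in for whatever calibrated uncertainty estimate an actual platform would use.

```python
# Minimal sketch of targeted, uncertainty-driven data acquisition. The data,
# descriptors, and model choice are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_measured = rng.random((20, 8))       # descriptors of already-measured materials
y_measured = rng.random(20)            # measured property values (placeholder)
X_candidates = rng.random((500, 8))    # descriptors of untested candidates

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_measured, y_measured)

# Spread of per-tree predictions serves as a crude epistemic-uncertainty proxy.
per_tree = np.stack([tree.predict(X_candidates) for tree in model.estimators_])
uncertainty = per_tree.std(axis=0)

# Propose the most informative candidates for the next round of experiments,
# i.e., the ones the current knowledge describes worst.
next_batch = np.argsort(uncertainty)[::-1][:5]
print("Candidates proposed for synthesis/measurement:", next_batch)
```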
We illustrate the exploration of the intrinsic limitations of the available polymer data by
discussing label propagation on a similarity network connecting chemical entities (Figure 5). The
primary task in this project was to start with a set of labeled entities and assign labels to the unlabeled
entities, identifying entities of interest. As an example, we consider the case of a binary classification
that aims to split the entities into two classes—“relevant” or “irrelevant” per human SMEs’
judgement. Analysis of this type emerges at multiple stages of the projects in polymer materials
discovery, from the early steps dedicated to the assessment of the existing knowledge, to hypothesis
generation, to the selection of hypotheses for the experimental validation. In our experience, label
propagation accomplishes another task as a byproduct. It helps to establish the set of entities that
cannot be reliably labeled and thus reveals the bounds of the transferability of the initial human
SMEs’ judgement.

Figure 5. Binary label propagation on a similarity network of chemical entities. Nodes are chemical entities
(e.g., polymer materials). The nodes are connected via edges if an operationally defined measure of similarity
of the respective entities exceeds some threshold motivated by the task at hand. Red represents an
“irrelevant” class, blue represents a “relevant” class, and grey color represents an unknown class in the
beginning of the process, and an unknown/uncertain class at the end of the process. A) Initial label
assignment, the seeding labels acquired from human SMEs or other suitable sources. As the labels propagate
along the edges, they become progressively less reliable due to the dissimilarity of linked entities and
separation of the labeled and unlabeled entities on the network. B) Label assignment at the end of the
process. The stopping criteria depend on the problem at hand and the label propagation algorithm. The
entities with low confidence of the inferred labels are in grey color. The cutoff for the confidence of the label
assignment is defined operationally. In practice, the final grey labeled nodes are often interesting, as they
represent the frontier interface between known “relevant” and “irrelevant” materials.

Construction of a similarity graph for chemical entities (e.g., molecules, polymer materials,
reactions, or even scientific papers) can be accomplished via a variety of embedding strategies (e.g.,
structural or semantic similarity of the entities in question) (40). The initial labels are assigned to
a set of the reference entities using judgement of human SMEs (Figure 5A, red and blue nodes).
Multiple label propagation algorithms are available to carry out label inference, ranging from k-
nearest neighbor to diffusion on graphs. We notice that for the immediate neighbors, the confidence
of the label inference depends on their similarity, and for the distant neighbors both similarities along
the path and the length of the path contribute to the confidence of the label assignment. Therefore,
the uncertainty of the inferred labels grows with each iteration of label propagation as a reflection
of the structure of the data. After the inference process has finished, introduction of an uncertainty
cutoff helps to identify the set of nodes whose labels cannot be inferred reliably from the initial set of
labels given the structure of the dataset expressed as the connectivity of the similarity graph (Figure
5B, grey nodes).
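The following minimal numpy sketch illustrates the procedure: seed labels are spread over a thresholded similarity graph, and an uncertainty cutoff marks the nodes that remain unresolved; the similarity matrix, seed labels, and all thresholds are illustrative placeholders.

```python
# Minimal numpy sketch of binary label propagation on a similarity graph with
# an uncertainty cutoff (in the spirit of Figure 5). All inputs are placeholders.
import numpy as np

rng = np.random.default_rng(1)
n = 30
S = rng.random((n, n)); S = (S + S.T) / 2          # symmetric pairwise similarities
np.fill_diagonal(S, 0.0)
W = np.where(S > 0.8, S, 0.0)                       # keep edges above a similarity threshold

# Seed labels from human SMEs: +1 "relevant", -1 "irrelevant", 0 unknown.
y = np.zeros(n); y[:3] = 1.0; y[3:6] = -1.0

# Row-normalized propagation: F <- alpha * P F + (1 - alpha) * Y
deg = W.sum(axis=1); deg[deg == 0] = 1.0
P = W / deg[:, None]
alpha, F = 0.9, y.copy()
for _ in range(100):
    F = alpha * P @ F + (1 - alpha) * y

# Nodes whose propagated score stays near zero cannot be labeled reliably from
# the seeds; they mark the frontier between "relevant" and "irrelevant".
confidence_cutoff = 0.05
labels = np.where(np.abs(F) < confidence_cutoff, 0, np.sign(F)).astype(int)
print("uncertain (grey) nodes:", np.flatnonzero(labels == 0))
```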
In practical application, this approach helped to identify a high-novelty set of chemical objects
as the set with the most uncertain labels. Further analysis of this set via clustering and construction
of its representative set led to the selection and experimental confirmation of several entities with
target novel properties. The overall strategy of the exploration of the most uncertain data points is
well-documented within active learning frameworks (41, 42). What makes this particular example
important in the context of the cognitive resolution of the data availability bottleneck is the injection
of the human SME judgement at the earliest stage of the process. We notice that the relevance classes
defined in this manner combine objective and subjective aspects of the polymer materials research
expressed as knowledge, experience, and intuition of human SMEs. The consequences are two-fold.
On one hand, the produced results are within the framework of the intuition and domain knowledge
of the specific human SME who contributed the initial labels, and, therefore, require the least amount
of conceptualization or new thinking which improves actionability of the results. On the other hand,
the individual biases of multiple human SMEs can be efficiently identified and accounted for over
an ensemble of inference processes. Informing the human SMEs of their biases opens the possibility for
them to improve their individual problem perception and re-evaluate research directions.
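The uncertainty-driven selection step described in this paragraph can be sketched as follows; the embedding X, the confidence cutoff, and the use of k-means to pick one representative per cluster are illustrative assumptions rather than the workflow actually used. The returned representatives of the grey set would then be handed back to the human SMEs for evaluation, closing the active-learning loop noted in refs (41, 42).

import numpy as np
from sklearn.cluster import KMeans

def representatives_of_uncertain(X, confidence, cutoff=0.1, n_clusters=5, seed=0):
    # Hypothetical sketch; parameters are illustrative and set operationally in practice.
    # X: (n, d) embedding of the entities (the same one used to build the similarity graph);
    # confidence: per-node scores from the label-propagation step.
    uncertain = np.where(confidence < cutoff)[0]
    if len(uncertain) <= n_clusters:
        return uncertain
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(X[uncertain])
    representatives = []
    for c in range(n_clusters):
        members = uncertain[km.labels_ == c]
        dist = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        representatives.append(members[np.argmin(dist)])  # entity closest to the cluster center
    return np.array(representatives)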

Conclusion
There is a need for materials innovation to address fundamental challenges in technology and
society. Our current era of ML innovation offers great opportunity, but the tools of AI and ML
need to be effectively integrated into the human-driven materials discovery process to accelerate
practical materials innovation. We have identified and explored a trio of bottlenecks that can restrict
this human-machine collaboration—the data availability, cognitive, and actionability bottlenecks.
Regardless of the specific implementation, we are confident that the human SME remains the central
element of the discovery process in the 5th paradigm we have described. Fully enabling that SME in
this new paradigm will require technological and practical approaches that eliminate the bottlenecks
between the components of scientific discovery, in particular cognitive and AI/ML technologies that
support a broad range of capabilities, from knowledge extraction from unstructured sources
to human-in-the-loop computational hypothesis generation. Looking forward, we anticipate that
the legacy system of the representation of scientific knowledge in the form of publications and
curated datasets will start transforming towards entity-centered data models, motivated by KGs.
These models have the potential to lead to the democratization of science, the development of novel
metrics of evaluation of scientific contributions, and the implementation of automated scientific
reasoning. Finally, we envision a world of materials discovery that works towards a seamless
partnership between man and machine, in which human insight is used to define complex problems,
embody subtle practical constraints, and inject creative ideas, while machines and algorithms
enable systematic exploration, statistical rigor, and the identification of novel concepts outside the
human expert’s experience.

Acknowledgments
The authors would like to thank the following colleagues in the Science & Technology
organization at IBM Research – Almaden: Dr. V. Piunova, Dr. J. Dennis, Dr. K. Schmidt, C. Dai,
and Dr. D. Sanders. This research was supported by the IBM Corporation.

References
1. Materials Science and Technology: Challenges for Chemical Sciences in the 21st Century; Consensus
Study Report for National Research Council; National Academies Press: Washington, DC,
2003.
2. Materials Innovation Case Study: QuesTek’s Ferrium® M54® Steel for Hook Shank Application; NAVAIR Public Release #2016-639, Distribution Statement A: Approved for public release; distribution is unlimited. https://www.thermocalc.com/media/42839/Materials-Genome-Case-Study-on-Ferrium-M54.pdf (accessed Jan 2018).
3. Drews, J. Drug Discovery: A Historical Perspective. Science 2000, 287, 1960–1964.
4. Matassi, F.; Nistri, L.; Chicon Paez, D.; Innocenti, M. New Biomaterials for Bone
Regeneration. Clin. Cases Miner. Bone Metab. 2011, 8, 21–24.
5. Clarke, D. R.; Philpott, S. R. Thermal Barrier Coating Materials. Mater. Today 2005, 8, 22–29.
6. Yarovsky, I.; Evans, E. Computer Simulation of Structure and Properties of Crosslinked
Polymers: Application to Epoxy Resins. Polymer 2002, 43, 963–969.
7. Basov, D. N.; Averitt, R. D.; Hsieh, D. Towards Properties on Demand in Quantum Materials.
Nat. Mater. 2017, 16, 1077–1088.
8. Li, H.; Eddaoudi, M.; O’Keeffe, M.; Yaghi, O. M. Design and Synthesis of an Exceptionally
Stable and Highly Porous Metal-Organic Framework. Nature 1999, 402, 276–279.
9. Hey, T.; Tansley, S.; Tolle, K. The Fourth Paradigm: Data-Intensive Scientific Discovery.
Microsoft Research 2009. https://www.microsoft.com/en-us/research/publication/fourth-
paradigm-data-intensive-scientific-discovery/ (accessed June 27, 2019).
10. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. In ImageNet: A Large-Scale
Hierarchical Image Database, 2009 IEEE Conference on Computer Vision and Pattern
Recognition, Miami, FL, June 20–25, 2009; IEEE: Piscataway, NJ, pp 248–255.
11. LeCun, Y.; Cortes, C.; Burges, C. J. C. The MNIST database of handwritten digits. http://yann.
lecun.com/exdb/mnist/ (accessed June 27, 2019).
12. Taigman, Y.; Yang, M.; Ranzato, M. A.; Wolf, L. In DeepFace: closing the gap to human-level face
verification; 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus,
OH, June 23–28, 2014; IEEE: Piscataway, NJ, pp 1701–1708.
13. Kelly, J. E. Computing, Cognition, and the Future of Knowing. Computing Research News 2016,
28.
14. Allen, R. D.; Sooriyakumaran, R.; Opitz, J.; Wallraff, G. M.; Breyta, G.; DiPietro, R. A.; Hofer,
D. C.; Okoroanyanwu, U.; Willson, C. G. Progress in 193 nm Positive Resists. J. Photopolym.
Sci. Technol. 1996, 9, 465–474.
15. García, J. M.; Jones, G. O.; Virwani, K.; McCloskey, B. D.; Boday, D. J.; ter Huurne, G.
M.; Horn, H. W.; Coady, D. J.; Bintaleb, A. M.; Alabdulrahman, A. M. S.; Alsewailem, F.;
Almegren, H. A. A.; Hedrick, J. L. Recyclable, Strong Thermosets and Organogels via
Paraformaldehyde Condensation with Diamines. Science 2014, 344, 732–735.
16. Mitchell, T. M. Generalization as Search. Artificial Intelligence 1982, 18, 203–226.
17. Cattell, R. B. The Measurement of Adult Intelligence. Psychological Bulletin 1943, 40, 153–193.
18. Clarivate Analytics. Web of Science. http://www.webofknowledge.com (accessed Dec 18,
2018).
19. According to IBM Research patent server CIRCA (accessed Dec 18, 2018).
20. Spohrer, J.; Banavar, G. Cognition as a Service: An Industry Perspective. AI Magazine 2015,
36, 71–86.
21. Summary Report for: 19-2032.00 - Materials Scientists, 2018. O*Net Online. https://www.
onetonline.org/link/summary/19-2032.00 (accessed Dec 18, 2018).
22. Yadav, M. K. On the Synthesis of Machine Learning and Automated Reasoning for an Artificial
Synthetic Organic Chemist. New J. Chem. 2017, 41, 1411–1416.
23. Perera, D.; Tucker, J. W.; Brahmbhatt, S.; Helal, C. J.; Chong, A.; Farrell, W.; Richardson, P.;
Sach, N. W. A Platform for Automated Nanomole-Scale Reaction Screening and Micromole-
Scale Synthesis in Flow. Science 2018, 359, 429–434.
24. Cognitive Discovery: The Next Frontier in R & D. https://www.zurich.ibm.com/
cognitivediscovery/ (accessed July 23, 2019).
25. Krishnamurthy, R.; Li, Y.; Raghavan, S.; Reiss, F.; Vaithyanathan, S.; Zhu, H. SystemT: A
System for Declarative Information Extraction. ACM SIGMOD Record 2008, 37, 7–13.
26. Coden, A.; Gruhl, D.; Lewis, N.; Mendes, P. N.; Nagarajan, M.; Ramakrishnan, C.; Welch, S.
In Semantic Lexicon Expansion for Concept-Based Aspect-Aware Sentiment Analysis, Semantic Web
Evaluation Challenge, SemWebEval 2014: Semantic Web Evaluation Challenge, Crete, Greece,
May 25–29, 2014; Presutti, V., Stankovic, M., Cambria, E., Cantador, I., Di Iorio, A., Di Noia,
T., Lange, C., Reforgiato Recupero, D., Tordai, A. Eds.; Springer International Publishing:
Switzerland, pp 34–40.
27. Chiticariu, L.; Danilevsky, M.; Li, Y.; Reiss, F. R.; Zhu, H. In SystemT: Declarative Text
Understanding for Enterprise; The 2018 Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language Technologies, New Orleans, LA,
June 1–6, 2018; Association for Computational Linguistics: Stroudsburg, PA.
28. Otsuka, S.; Kuwajima, I.; Hosoya, J.; Xu, Y.; Yamazaki, M. In PoLyInfo: Polymer Database for
Polymeric Materials Design; 2011 International Conference on Emerging Intelligent Data and
Web Technologies, Tirana, Albania, Sept 7−9, 2011; IEEE: Piscataway, NJ, pp 22–29.
29. Fargues, J.; Landau, M.; Dugourd, A.; Catach, L. Conceptual Graphs for Semantics and
Knowledge Processing. IBM J. Res. Dev. 1986, 30, 70–79.
30. Bhatia, S.; Jain, A. In Context Sensitive Entity Linking of Search Queries in Enterprise Knowledge
Graphs; The Semantic Web, ESWC 2016 Satellite Events, Lecture Notes in Computer Science,
Crete, Greece, May 29–June 2, 2016; Sack, H., Rizzo, G., Steinmetz, N., Mladenić, D., Auer,
S., Lange, C., Eds.; Springer: Cham, Switzerland.
31. Zhang, X.; Liu, X.; Li, X.; Pan, D. MMKG: An Approach to Generate Metallic Materials
Knowledge Graph Based on DBpedia and Wikipedia. Comput. Phys. Commun. 2017, 211,
98–112.
32. Propnet. https://propnet.lbl.gov/ (accessed July 23, 2019).
33. Staar, P. W. J.; Dolfi, M.; Auer, C.; Bekas, C. In Corpus Conversion Service: A Machine Learning
Platform to Ingest Documents at Scale; Proceedings of the 24th ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining, London, United Kingdom, Aug 19-23,
2018; Association for Computing Machinery: New York, NY, pp 774–782.
34. Bobach, C.; Böhme, T.; Laube, U.; Püschel, A.; Weber, L. Automated Compound
Classification Using a Chemical Ontology. J. Cheminf. 2012, 4, 40.
35. Feunang, Y. D.; Eisner, R.; Knox, C.; Chepelev, L.; Hastings, J.; Owen, G.; Fahy, E.;
Steinbeck, C.; Subramanian, S.; Bolton, E.; Greiner, R.; Wishart, D. S. ClassyFire: Automated
Chemical Classification with a Comprehensive, Computable Taxonomy. J. Cheminf. 2016, 8,
61.
36. Saha, D.; Floratou, A.; Sankaranarayanan, K.; Minhas, U. F.; Mittal, A. R.; Özcan, F.
ATHENA: An Ontology-Driven System for Natural Language Querying over Relational Data
Stores. Proceedings of the VLDB Endowment 2016, 9, 1209–1220.
37. Pyzer-Knapp, E. O.; Suh, C.; Gómez-Bombarelli, R.; Aguilera-Iparraguirre, J.; Aspuru-Guzik,
A. What is High-Throughput Virtual Screening? A Perspective from Organic Materials
Discovery. Ann. Rev. Mater. Res. 2015, 45, 195–216.
38. Coden, A.; Danilevsky, M.; Gruhl, D.; Kato, L.; Nagarajan, M. In A Method to Accelerate Human
in the Loop Clustering; 2017 SIAM International Conference on Data Mining, Houston, TX, Apr
27–29, 2017; Chawla, N., Wang, W., Eds.; Society for Industrial and Applied Mathematics:
Philadelphia, PA, pp 237–245.
39. Lin, B.; Waymouth, R. M. Urea Anions: Simple, Fast, and Selective Catalysts for Ring-Opening
Polymerizations. J. Am. Chem. Soc. 2017, 139, 1645–1652.
40. Spangler, S.; Wilkins, A. D.; Bachman, B. J.; Nagarajan, M.; Dayaram, T.; Haas, P.;
Regenbogen, S.; Pickering, C. R.; Comer, A.; Myers, J. N.; Stanoi, I.; Kato, L.; Lelescu,
A.; Labrie, J. J.; Parikh, N.; Lisewski, A. M.; Donehower, L.; Chen, Y.; Lichtarge, O. In
Automated Hypothesis Generation Based on Mining Scientific Literature; The 20th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, New York, NY, Aug
24–27, 2014; ACM: New York, NY, pp 1877–1886.
41. Lewis, D. D.; Catlett, J. In Heterogeneous Uncertainty Sampling for Supervised Learning; Machine
Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, July
10–13, 1994; Cohen, W., Hirsh, H., Eds.; Morgan Kaufmann Publishers: San Francisco, CA,
pp 148–156.
42. Das, S.; Wong, W.; Dietterich, T.; Fern, A.; Emmott, A. In Incorporating Expert Feedback into
Active Anomaly Discovery; 2016 IEEE 16th International Conference on Data Mining (ICDM),
Barcelona, Spain, Dec 12–15, 2016; IEEE: Piscataway, NJ, pp 853–858.

Editors’ Biographies
Edward O. Pyzer-Knapp
Edward O. Pyzer-Knapp (PhD, University of Cambridge) is the Lead for AI and Machine
Learning for IBM Research in the UK. He obtained his PhD from the University of Cambridge,
using state-of-the-art computational techniques to accelerate materials design. He then moved to
Harvard to work with Alan Aspuru-Guzik, where he was in charge of the day-to-day running of
the Harvard Clean Energy Project, a collaboration with IBM that combined massive distributed
computing, quantum-mechanical simulations, and machine learning to accelerate the discovery of the
next generation of organic photovoltaic materials. He made the trip back across the pond to lead
the large-scale machine learning effort at IBM Research UK, with the goal of generating step changes
in the way industries do research through the use of high-performance AI and machine learning
techniques such as Bayesian optimization and deep learning. Research within his group covers AI
applications to materials discovery and inference on highly multi-modal data, alongside fundamental
method development in reinforcement learning and Bayesian inference. He holds the title of Visiting
Professor of Industrially Applied AI at the University of Liverpool, where he also co-supervises PhD
students in the areas of machine learning for materials discovery and method development around
scalable Bayesian inference.

Teodoro Laino
T. Laino (Ph.D., Scuola Normale Superiore di Pisa) is currently a principal research staff
member in the department of Cognitive Computing and Industry Solutions at the IBM Research
Laboratory in Zurich (Switzerland), where he works on complex materials simulations for industry-
related problems and on the application of machine learning/artificial intelligence technologies to
chemistry and materials science problems, with the purpose of developing customized industrial
solutions. He received his MSc in Chemistry in 2001 (University of Pisa and Scuola Normale Superiore
di Pisa) and his doctorate in Chemistry in 2006 from the Scuola Normale Superiore di Pisa, Italy. Prior
to joining IBM in 2008, he worked as a post-doctoral researcher in the research group of Prof. Dr.
Jürg Hutter at the University of Zurich, where he developed algorithms for ab initio and classical
molecular dynamics simulations. He is the technical leader for molecular simulations at IBM Research
Zurich and has coauthored more than 50 papers and 5 patent publications, and has co-organized several
European workshops and tutorials on molecular modelling.

© 2019 American Chemical Society


Indexes
Author Index
Behler, J., 49
Ceriotti, M., 1
Garnett, J., 23
Green, D., 81
Grisafi, A., 1
Hellström, M., 49
Laino, T., x, 61
Pitera, J., 103
Pyzer-Knapp, E., x
Schwaller, P., 61
Wilkins, D., 1
Willatt, M., 1
Zubarev, D., 103

Subject Index
A Molecular Transformer for a Bromo Suzuki
coupling reaction, product-reactants
Atomistic simulations, high-dimensional neural attention, 73f
network potentials, 49 reactants and reagents, chemical reaction
HDNNPs representation settings, 71f
HDNNP, construction, 52 template extraction, automatic, 69
local chemical environment, symmetry USPTO reaction subsets, large-scale data-
functions as descriptors, 51 driven reaction prediction models, 70f
overview, 51 representation and formats, 64
HDNNPs, limitations and strengths, 56
introduction, 49 D
machine-learning potentials, 50
NaOH solutions and ZnO/water interface, Decisions in drug discovery, using machine
case studies learning, 81
aqueous NaOH solutions, 53 future challenges, 93
nuclear quantum effects, 54 laboratory automation, 93
ZnO/water interface, 55 hit and tool molecules, computational
summary, 56 approaches, 84
NN potentials, 57 AL screening cycle, 85f
iterative product design, classic DMT cycle,
C 86f
hit and tool molecules, screening, 82
Chemical reaction prediction, data-driven conventional image analysis pipeline and
learning systems, 61 multiscale CNN, comparison, 84f
chemical reaction data, 65 HTS experiment, typical plate patterns, 83f
data-driven reaction prediction models, human-machine interface, 93
USPTO dataset family tree, 67f introduction and scope, 81
nontrivial atom-mapping, Bromo Grignard technology hype cycle, 82f
reaction, 66f lead optimization, 86
conclusion and outlook, 74 conformal prediction, 89
data-driven models, 75 3D information, efficient exploration of
introduction, 61 novel chemical space, 90f
reaction prediction problem, artificial lead optimization cycle, chemist-centric,
intelligence, 62 87f
Web of Science from 1924 to 2019, number structural interpretation of QSAR and
of publications with topic "Organic quantitative structure-property
Chemistry", 63f relationships models, interpretative QSAR
reaction prediction approaches, 67 contributions, 88f
data-driven reaction prediction approaches, mechanistic models vs ML, 93
comparison of input, output, data, and risk management, 92
model architecture, 68t
synthetic tractability, 91
different USPTO subsets, top-3 accuracies
of recent data-driven reaction prediction
models, 72t

M validation set, model validation, 41

Machine learning methods using compositional S


features, prediction of Mohs hardness, 23
conclusions, 43 Statistical learning of tensorial properties,
introduction, 23 atomic-scale representation, 1
covalent and ionic crystals, hardness, 25 conclusions, 18
ionic materials, hardness, 25 covariant descriptors, 5
molecular dynamics, 24 aligning each molecule into a fixed reference
methods frame, learn tensorial properties, 6f
binary and ternary SVC, grid optimization, covariant regression, 7
34 water environments, reciprocal alignment,
classes, 27 8f
datasets, 27 dielectric response series, 16
evaluation criteria, 34 Zundel cation dielectric response series,
features, 28 learning curves, 16f
importance, RFs and Gini feature, 31 electronic charge densities, 17
ML models, 30 butane molecules, learning curves of
naturally occurring mineral and artificially predicted charge density, 17f
grown single crystals, histogram of the examples, 15
Mohs hardnesses of the datasets, 28f implementation, 14
primary features, list, 29t introduction, 1
radial basis function, 32 geometry and composition of a molecule or
study, nine ML models utilized, 33t condensed phase, structural descriptors, 2f
study based on Mohs hardness values, λ-SOAP(1) representation, 11
binary and ternary classes, 29t λ-SOAP(2) representation, 12
study models, 33 linear regression, 2
SVMs, 31 Gaussian process regression, 3
results and discussion non-linearity, 13
binary and ternary RBF SVCs, cross- SOAP representation, 10
validation grid optimization accuracies, 36f spherical representation, 8
binary and ternary SVCs, grid optimization Cartesian tensors, regression, 9
results, 35 tensors, symmetries and correlations
binary models, performance from models reflections, 5
trained under 500 stratified train-test splits, rotations, 4
37f translations, 4
considerations, 42
feature importances, 39 T
model validation, workflow of performance
testing, 42f 5th discovery paradigm, cognitive materials
Mohs hardness values, ROC plots using discovery, 103
false positive ratio and false negative ratio, bottlenecks, 5th paradigm
39f actionability bottleneck, 108
naturally occurring minerals dataset, model cognition bottleneck, 108
performance, 37 data availability bottleneck, 107
ternary RBF SVC, 38 inventions in polymer materials domain,
10,000 tree RF for binary and ternary computational approaches and volume of
multiclass models, Gini feature patents, 109f
importances, 40f conclusion, 117

introduction, 103 polymer materials domain, basic ontology,
discovery, relevant paradigms, 104f 114f
machine learning, 105 polymer materials domain, knowledge
PHT polymer, crosslinked, 106 extraction, 112f
material science, discovery bottlenecks, 109 scientific applications, KGs, 113
polymer materials, 109 similarity network of chemical entities,
5th paradigm via cognitive systems, 111 binary label propagation, 116f
human SME, 115
