The RDKit Book - The RDKit 2020.03.1 Documentation PDF
Instead of using patterns to match known aromatic systems, the aromaticity perception code in the RDKit uses a
set of rules. The rules are relatively straightforward.
Aromaticity is a property of atoms and bonds in rings. An aromatic bond must be between aromatic atoms, but a
bond between aromatic atoms does not need to be aromatic.
For example, the fusing bonds here are not considered to be aromatic by the RDKit:
The RDKit supports a number of different aromaticity models and allows the user to define their own by providing a
function that assigns aromaticity.
A ring, or fused ring system, is considered to be aromatic if it obeys the 4N+2 rule. Contributions to the electron
count are determined by atom type and environment. Some examples:
Notice that exocyclic bonds to electronegative atoms “steal” the valence electron from the ring atom and that
dummy atoms contribute whatever count is necessary to make the ring aromatic.
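As a minimal illustration of the 4N+2 test, here is a pure-Python sketch. It is not the RDKit's perception code: the per-atom electron contributions are assigned by hand, and the function names are hypothetical.

```python
def obeys_4n_plus_2(n_electrons):
    """Return True if n_electrons equals 4N+2 for some integer N >= 0."""
    return n_electrons >= 2 and (n_electrons - 2) % 4 == 0


def ring_is_aromatic(contributions):
    """contributions: hand-assigned electron count donated by each ring atom."""
    return obeys_4n_plus_2(sum(contributions))


# benzene: six carbons contributing one electron each
print(ring_is_aromatic([1, 1, 1, 1, 1, 1]))  # True
# pyrrole: four carbons (one each) plus the N-H nitrogen (two)
print(ring_is_aromatic([1, 1, 1, 1, 2]))     # True
# cyclobutadiene: four electrons in total
print(ring_is_aromatic([1, 1, 1, 1]))        # False
```

In the RDKit itself the contributions come from atom type and environment, as described above; only the final electron count test is shown here.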
The use of fused rings for aromaticity can lead to situations where individual rings are not aromatic, but the fused
system is. An example of this is azulene:
An extreme example, demonstrating both fused rings and the influence of exocyclic double bonds:
>>> m=Chem.MolFromSmiles('O=C1C=CC(=O)C2=C1OC=CO2')
>>> m.GetAtomWithIdx(6).GetIsAromatic()
True
>>> m.GetAtomWithIdx(7).GetIsAromatic()
True
>>> m.GetBondBetweenAtoms(6,7).GetIsAromatic()
False
As a special case, heteroatoms with radicals are not considered candidates for aromaticity:
>>> m = Chem.MolFromSmiles('C1=C[N]C=C1')
>>> m.GetAtomWithIdx(0).GetIsAromatic()
False
>>> m.GetAtomWithIdx(2).GetIsAromatic()
False
>>> m.GetAtomWithIdx(2).GetNumRadicalElectrons()
1
Charged carbons with radicals are also not considered:
>>> m = Chem.MolFromSmiles('C1=CC=CC=C[C+]1')
>>> m.GetAtomWithIdx(0).GetIsAromatic()
False
>>> m.GetAtomWithIdx(6).GetIsAromatic()
False
>>> m.GetAtomWithIdx(6).GetFormalCharge()
1
>>> m.GetAtomWithIdx(6).GetNumRadicalElectrons()
1
Neutral carbons with radicals, however, are still considered:
>>> m = Chem.MolFromSmiles('C1=[C]NC=C1')
>>> m.GetAtomWithIdx(0).GetIsAromatic()
True
>>> m.GetAtomWithIdx(1).GetIsAromatic()
True
>>> m.GetAtomWithIdx(1).GetNumRadicalElectrons()
1
The Simple aromaticity model is straightforward: only five- and six-membered simple rings are considered candidates for aromaticity. The
same electron-contribution counts listed above are used.
The MDL aromaticity model isn’t well documented (at least not publicly), so we tried to reproduce what’s provided in the oechem
documentation (https://docs.eyesopen.com/toolkits/python/oechemtk/aromaticity.html)
Note: For reasons of computational expediency, aromaticity perception is only done for fused-ring systems where
all members are at most 24 atoms in size.
Aromaticity
>>> m = Chem.MolFromSmiles('OC(=O)c1[te]ccc1')
>>> m.GetAtomWithIdx(4).GetIsAromatic()
True
Dative bonds
<- and -> create a dative bond between the atoms; the direction matters.
Dative bonds have the special characteristic that they don’t affect the valence on the start atom, but do affect the
end atom. So in this case, the N atoms involved in the dative bond have the valence of 3 that we expect from bipy,
while the Cu has a valence of 4:
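>>> # bipy-copper complex: N atoms 4 and 7 donate to Cu (atom 12)
>>> bipycu = Chem.MolFromSmiles('c1cccn->2c1-c1n->3cccc1.[Cu]23(Cl)Cl')
>>> bipycu.GetAtomWithIdx(4).GetTotalValence()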
3
>>> bipycu.GetAtomWithIdx(12).GetTotalValence()
4
Ring closures
%(N) notation is supported for ring closures, where N can be anything from a single digit, %(N), up to five digits, %(NNNNN). Here is an
example:
>>> m = Chem.MolFromSmiles('C%(1000)OC%(1000)')
>>> m.GetAtomWithIdx(0).IsInRing()
True
>>> m.GetAtomWithIdx(2).IsInRing()
True
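To make the notation concrete, the sketch below pulls %(N) labels out of a SMILES string with a regular expression and checks that each label is used an even number of times. This is an illustration only, not how the RDKit SMILES parser works; the helper names are hypothetical.

```python
import re
from collections import Counter

# %(N) with N between one and five digits
RING_CLOSURE = re.compile(r"%\((\d{1,5})\)")


def extended_ring_closures(smiles):
    """Return the %(N) ring-closure labels in the order they appear."""
    return [int(label) for label in RING_CLOSURE.findall(smiles)]


def closures_are_paired(smiles):
    """A ring is opened and closed, so every label must occur an even number of times."""
    counts = Counter(extended_ring_closures(smiles))
    return all(n % 2 == 0 for n in counts.values())


print(extended_ring_closures('C%(1000)OC%(1000)'))  # [1000, 1000]
print(closures_are_paired('C%(1000)OC%(1000)'))     # True
print(closures_are_paired('C%(1000)OC'))            # False
```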
CXSMILES extensions
The RDKit supports parsing and writing a subset of the extended SMILES functionality introduced by ChemAxon
[4].
The features which are parsed include:
atomic coordinates
atomic values
atomic labels
atomic properties
coordinate bonds (these are translated into double bonds)
radicals
enhanced stereo (these are converted into StereoGroups)
The features which are written by rdkit.Chem.rdmolfiles.MolToCXSmiles() (note the specialized writer
function) include:
atomic coordinates
atomic values
atomic labels
atomic properties
radicals
enhanced stereo
>>> m = Chem.MolFromSmiles('OC')
>>> m.GetAtomWithIdx(0).SetProp('p1','2')
>>> m.GetAtomWithIdx(1).SetProp('p1','5')
>>> m.GetAtomWithIdx(1).SetProp('p2','A1')
>>> m.GetAtomWithIdx(0).SetProp('atomLabel','O1')
>>> m.GetAtomWithIdx(1).SetProp('atomLabel','C2')
>>> Chem.MolToCXSmiles(m)
'CO |$C2;O1$,atomProp:0.p1.5:0.p2.A1:1.p1.2|'
Here’s the (hopefully complete) list of SMARTS features that are not supported:
Hybridization queries
>>> Chem.MolFromSmiles('CC=CF').GetSubstructMatches(Chem.MolFromSmarts('[^2]'))
((1,), (2,))
Dative bonds
<- and -> match the corresponding dative bonds; the direction matters.
>>> Chem.MolFromSmiles('C1=CC=CC=N1->[Fe]').GetSubstructMatches(Chem.MolFromSmarts('
((5, 6),)
>>> Chem.MolFromSmiles('C1=CC=CC=N1->[Fe]').GetSubstructMatches(Chem.MolFromSmarts('*
((6, 5),)
Heteroatom neighbor queries
The atom query z matches atoms that have the specified number of heteroatom (i.e. not C or H)
neighbors. For example, z2 would match the second C in CC(=O)O.
The atom query Z matches atoms that have the specified number of aliphatic heteroatom (i.e.
not C or H) neighbors.
>>> Chem.MolFromSmiles('O=C(O)c1nc(O)ccn1').GetSubstructMatches(Chem.MolFromSmarts('
((1,), (3,), (5,))
>>> Chem.MolFromSmiles('O=C(O)c1nc(O)ccn1').GetSubstructMatches(Chem.MolFromSmarts('
((1,),)
>>> Chem.MolFromSmiles('O=C(O)c1nc(O)ccn1').GetSubstructMatches(Chem.MolFromSmarts('
((5,),)
Range queries
Ranges of values can be provided for many query types that expect numeric values. The query types that currently
support range queries are: D, h, r, R, v, x, X, z, Z, +, -
D{2-4} matches atoms that have between 2 and 4 (inclusive) explicit connections.
D{-3} matches atoms that have less than or equal to 3 explicit connections.
D{2-} matches atoms that have at least 2 explicit connections.
>>> Chem.MolFromSmiles('CC(=O)OC').GetSubstructMatches(Chem.MolFromSmarts('[z{1-}]'))
((1,), (4,))
>>> Chem.MolFromSmiles('CC(=O)OC').GetSubstructMatches(Chem.MolFromSmarts('[D{2-3}]'))
((1,), (3,))
>>> Chem.MolFromSmiles('CC(=O)OC.C').GetSubstructMatches(Chem.MolFromSmarts('[D{-2}]'))
((0,), (2,), (3,), (4,), (5,))
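The three range forms shown above ({lo-hi}, {-hi} and {lo-}) can be read as inclusive bounds with an optional open end. The following pure-Python sketch shows that interpretation; it is not the RDKit's SMARTS parser, and the function names are hypothetical.

```python
import re

_RANGE = re.compile(r"^\{(\d*)-(\d*)\}$")


def parse_range(text):
    """Parse '{lo-hi}', '{-hi}' or '{lo-}' into (lo, hi); None marks an open end."""
    m = _RANGE.match(text)
    if m is None or (m.group(1) == "" and m.group(2) == ""):
        raise ValueError(f"not a range: {text!r}")
    lo = int(m.group(1)) if m.group(1) else None
    hi = int(m.group(2)) if m.group(2) else None
    return lo, hi


def in_range(value, bounds):
    """Inclusive membership test against the parsed bounds."""
    lo, hi = bounds
    return (lo is None or value >= lo) and (hi is None or value <= hi)


print(parse_range("{2-4}"))  # (2, 4)
print(parse_range("{-3}"))   # (None, 3)
print(parse_range("{2-}"))   # (2, None)
print(in_range(3, parse_range("{2-4}")))  # True
```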
SMARTS Reference
Note that the text versions of the tables below include some backslash characters to escape special characters.
This is a wart from the documentation system we are using. Please ignore those characters.
Atoms
Bonds
Ring Finding and SSSR
[Section taken from “Getting Started” document]
As others have ranted about with more energy and eloquence than I intend to, the definition of a molecule’s
smallest set of smallest rings is not unique. In some high symmetry molecules, a “true” SSSR will give results that
are unappealing. For example, the SSSR for cubane only contains 5 rings, even though there are “obviously” 6.
This problem can be fixed by implementing a small (instead of smallest) set of smallest rings algorithm that returns
symmetric results. This is the approach that we took with the RDKit.
Because it is sometimes useful to be able to count how many SSSR rings are present in the molecule, there is a
GetSSSR function, but this only returns the SSSR count, not the potentially non-unique set of rings.
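The count of 5 for cubane follows from a standard graph-theory result: the number of rings in any SSSR equals the cyclomatic number of the molecular graph (bonds minus atoms plus connected fragments). The sketch below just evaluates that formula; the function name is hypothetical and it does not call the RDKit.

```python
def sssr_size(n_atoms, n_bonds, n_fragments=1):
    """Cyclomatic number: the number of rings in a smallest set of smallest rings."""
    return n_bonds - n_atoms + n_fragments


print(sssr_size(8, 12))   # cubane (8 carbons, 12 C-C bonds): 5
print(sssr_size(6, 6))    # benzene: 1
print(sssr_size(10, 11))  # naphthalene: 2
```

Cubane has six faces, but any five of them already span the cycle space, which is why a strict SSSR reports 5 rather than the "obvious" 6.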
For situations where you just care about knowing whether or not atoms/bonds are in rings, the RDKit provides the
function rdkit.Chem.rdmolops.FastFindRings(). This does a depth-first traversal of the molecule graph and
identifies atoms and bonds that are in rings.
Some features
Mapped dummy atoms in the product template are replaced by the corresponding atom in the reactant:
“Any” bonds in the products are replaced by the corresponding bond in the reactant:
Intramolecular reactions can be expressed flexibly by including reactants in parentheses. This is demonstrated in
this ring-closing metathesis example [5]:
Chirality
This section describes how chirality information in the reaction definition is handled. A consistent example,
esterification of secondary alcohols, is used throughout [6].
If no chiral information is present in the reaction definition, the stereochemistry of the reactants is preserved, as is
membership in enhanced stereo groups:
You get the same result (retention of stereochemistry) if a mapped atom has the same chirality in both reactants
and products:
A mapped atom with different chirality in reactants and products leads to inversion of stereochemistry:
If a mapped atom has chirality specified in the reactants, but not in the products, the reaction destroys chirality at
that center:
>>> rxn = AllChem.ReactionFromSmarts('[C@H1:1][OH:2].[OH][C:3]=[O:4]>>[C:1][O:2][C:3
>>> ps=rxn.RunReactants((alcohol1,acid))
>>> Chem.MolToSmiles(ps[0][0],True)
'CC(=O)OC(C)CCN'
>>> ps=rxn.RunReactants((alcohol2,acid))
>>> Chem.MolToSmiles(ps[0][0],True)
'CC(=O)OC(C)CCN'
>>> ps=rxn.RunReactants((alcohol3,acid))
>>> Chem.MolToSmiles(ps[0][0],True)
'CC(=O)OC(C)CCN'
And, finally, if chirality is specified in the products, but not the reactants, the reaction creates a stereocenter with the
specified chirality:
Note that this doesn’t make sense without including a bit more context around the stereocenter in the reaction
definition:
Note that the chirality specification is not being used as part of the query: a molecule with no chirality specified can
match a reactant with specified chirality.
In general, the reaction machinery tries to preserve as much stereochemistry information as possible. This works
when a single new bond is formed to a chiral center:
In this case, there’s just not sufficient information present to allow the information to be preserved. You can help by
providing mapping information:
So if you want to copy the bond order from the reactant, use an “Any” bond:
Chemical Features
Chemical features are defined by a Feature Type and a Feature Family. The Feature Family is a general
classification of the feature (such as “Hydrogen-bond Donor” or “Aromatic”) while the Feature Type provides
additional, higher-resolution, information about features. Pharmacophore matching is done using Feature Families.
Each feature type contains the following pieces of information:
A SMARTS pattern that describes atoms (one or more) matching the feature type.
Weights used to determine the feature’s position based on the positions of its defining atoms.
AtomType definitions
An AtomType definition allows you to assign a shorthand name to be used in place of a SMARTS string defining an
atom query. This allows FDef files to be made much more readable. For example, defining a non-polar carbon atom
like this:
creates a new name that can be used anywhere else in the FDef file that it would be useful to use this SMARTS. To
reference an AtomType, just include its name in curly brackets. For example, this excerpt from an FDef file defines
another atom type - Hphobe - which references the Carbon_NonPolar definition:
Note that {Carbon_NonPolar} is used in the new AtomType definition without any additional decoration (no
square brackets or recursive SMARTS markers are required).
Repeating an AtomType results in the two definitions being combined using the SMARTS “,” (or) operator. Here’s
an example:
AtomType d1 [N&!H0]
AtomType d1 [O&!H0]
This is equivalent to:
AtomType d1 [N&!H0,O&!H0]
which can be written more compactly as:
AtomType d1 [N,O;!H0]
Note that these examples tend to use SMARTS’s high-precedence and operator “&” and not the low-precedence
and “;”. This can be important when AtomTypes are combined or when they are repeated. The SMARTS “,”
operator is higher precedence than “;”, so definitions that use “;” can lead to unexpected results.
Negative AtomType queries can also be defined:
AtomType d1 [N,O,S]
AtomType !d1 [H0]
The negative query gets combined with the first to produce a definition identical to this:
AtomType d1 [!H0;N,O,S]
Note that the negative AtomType is added to the beginning of the query.
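The two combination rules described above (repeats joined with ",", negative queries prepended with ";") can be mimicked with plain string handling. This is a sketch under simplifying assumptions, not the RDKit FDef parser: each definition is a single bracketed atom expression, negative definitions contain a single primitive, and combine_atom_types is a hypothetical name.

```python
def combine_atom_types(definitions):
    """definitions: list of (name, smarts) pairs in file order.

    Returns {name: combined_smarts}. Names starting with '!' are
    treated as negative queries for the AtomType of the same name.
    """
    positive = {}
    negative = {}
    for name, smarts in definitions:
        body = smarts.strip()[1:-1]  # drop the enclosing square brackets
        if name.startswith("!"):
            # assumes a single primitive, so a leading '!' negates all of it
            negative.setdefault(name[1:], []).append("!" + body)
        else:
            positive.setdefault(name, []).append(body)

    combined = {}
    for name, parts in positive.items():
        query = ",".join(parts)  # repeated definitions: SMARTS "or"
        if name in negative:
            # negative queries go first, joined with the low-precedence "and"
            query = ";".join(negative[name]) + ";" + query
        combined[name] = "[" + query + "]"
    return combined


print(combine_atom_types([("d1", "[N&!H0]"), ("d1", "[O&!H0]")]))
# {'d1': '[N&!H0,O&!H0]'}
print(combine_atom_types([("d1", "[N,O,S]"), ("!d1", "[H0]")]))
# {'d1': '[!H0;N,O,S]'}
```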
Feature definitions
A feature definition is more complex than an AtomType definition and stretches across multiple lines:
The first line of the feature definition includes the feature type and the SMARTS string defining the feature. The next
two lines (order not important) define the feature’s family and its atom weights (a comma-delimited list that is the
same length as the number of atoms defining the feature). The atom weights are used to calculate the feature’s
location based on a weighted average of the positions of the atoms defining the feature. More detail on this is
provided below. The final line of a feature definition must be EndFeature. It is perfectly legal to mix AtomType
definitions with feature definitions in the FDef file. The one rule is that AtomTypes must be defined before they are
referenced.
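The weighted-average location mentioned above can be sketched in a few lines of plain Python. This assumes the weights are normalized by their sum, which is what "weighted average" usually means; whether the RDKit normalizes in exactly this way is not stated here, and feature_location is a hypothetical name.

```python
def feature_location(positions, weights):
    """Weighted average of 3D atom positions.

    positions: list of (x, y, z) tuples, one per atom defining the feature
    weights:   list of atom weights, same length as positions
    """
    if len(positions) != len(weights):
        raise ValueError("need exactly one weight per defining atom")
    total = float(sum(weights))
    return tuple(
        sum(w * p[axis] for p, w in zip(positions, weights)) / total
        for axis in range(3)
    )


# two atoms on the x axis, equal weights: the feature sits at the midpoint
print(feature_location([(0, 0, 0), (2, 0, 0)], [1, 1]))  # (1.0, 0.0, 0.0)
# weighting the second atom more heavily pulls the feature towards it
print(feature_location([(0, 0, 0), (2, 0, 0)], [1, 3]))  # (1.5, 0.0, 0.0)
```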
Any line that begins with a # symbol is considered a comment and will be ignored.
A backslash character, \, at the end of a line is a continuation character: it indicates that the data from that line
is continued on the next line of the file. Blank space at the beginning of these additional lines is ignored. For
example, this AtomType definition:
Atom weights and feature locations
In this case both definitions of the HDonor1 feature type will be active. This is functionally identical to:
However, the formulation of this feature definition with a duplicated feature type is considerably less efficient
and more confusing than the simpler combined definition.
Figure 1: Bit numbering in pharmacophore fingerprints
The general rule used in the RDKit is that if you don’t specify a property in the query, then it’s not used as part of the
matching criteria and that Hs are ignored. This leads to the following behavior:
Demonstrated here:
>>> Chem.MolFromSmiles('CCO').HasSubstructMatch(Chem.MolFromSmiles('CCO'))
True
>>> Chem.MolFromSmiles('CC[O-]').HasSubstructMatch(Chem.MolFromSmiles('CCO'))
True
>>> Chem.MolFromSmiles('CCO').HasSubstructMatch(Chem.MolFromSmiles('CC[O-]'))
False
>>> Chem.MolFromSmiles('CC[O-]').HasSubstructMatch(Chem.MolFromSmiles('CC[O-]'))
True
>>> Chem.MolFromSmiles('CC[O-]').HasSubstructMatch(Chem.MolFromSmiles('CC[OH]'))
True
>>> Chem.MolFromSmiles('CCOC').HasSubstructMatch(Chem.MolFromSmiles('CC[OH]'))
True
>>> Chem.MolFromSmiles('CCOC').HasSubstructMatch(Chem.MolFromSmiles('CCO'))
True
>>> Chem.MolFromSmiles('CCC').HasSubstructMatch(Chem.MolFromSmiles('CCC'))
True
>>> Chem.MolFromSmiles('CC[14C]').HasSubstructMatch(Chem.MolFromSmiles('CCC'))
True
>>> Chem.MolFromSmiles('CCC').HasSubstructMatch(Chem.MolFromSmiles('CC[14C]'))
False
>>> Chem.MolFromSmiles('CC[14C]').HasSubstructMatch(Chem.MolFromSmiles('CC[14C]'))
True
>>> Chem.MolFromSmiles('OCO').HasSubstructMatch(Chem.MolFromSmiles('C'))
True
>>> Chem.MolFromSmiles('OCO').HasSubstructMatch(Chem.MolFromSmiles('[CH]'))
False
>>> Chem.MolFromSmiles('OCO').HasSubstructMatch(Chem.MolFromSmiles('[CH2]'))
False
>>> Chem.MolFromSmiles('OCO').HasSubstructMatch(Chem.MolFromSmiles('[CH3]'))
False
>>> Chem.MolFromSmiles('OCO').HasSubstructMatch(Chem.MolFromSmiles('O[CH3]'))
True
>>> Chem.MolFromSmiles('O[CH2]O').HasSubstructMatch(Chem.MolFromSmiles('C'))
True
>>> Chem.MolFromSmiles('O[CH2]O').HasSubstructMatch(Chem.MolFromSmiles('[CH2]'))
False
Molecular Sanitization
The molecule parsing functions all, by default, perform a “sanitization” operation on the molecules read. The idea is
to generate useful computed properties (like hybridization, ring membership, etc.) for the rest of the code and to
ensure that the molecules are “reasonable”: that they can be represented with octet-complete Lewis dot structures.
1. clearComputedProps: removes any computed properties that already exist
on the molecule and its atoms and bonds. This step is always performed.
The individual steps can be toggled on or off when calling MolOps::sanitizeMol or Chem.SanitizeMol.
Implementation Details
“Magic” Property Values
The following property values are regularly used in the RDKit codebase and may be useful to client code.
Atom
Property Name           Use
_CIPCode                the CIP code (R or S) of the atom
_CIPRank                the integer CIP rank of the atom
_ChiralityPossible      set if an atom is a possible chiral center
_MolFileRLabel          integer R group label for an atom, read from/written to CTABs.
_ReactionDegreeChanged  set on an atom in a product template of a reaction if its degree changes in the reaction
_protected              atoms with this property set will not be considered as matching reactant queries in reactions
dummyLabel              (on dummy atoms) read from/written to CTABs as the atom symbol
molAtomMapNumber        the atom map number for an atom, read from/written to SMILES and CTABs
molfileAlias            the mol file alias for an atom (follows A tags), read from/written to CTABs
molFileValue            the mol file value for an atom (follows V tags), read from/written to CTABs
molFileInversionFlag    used to flag whether stereochemistry at an atom changes in a reaction, read from/written to CTABs, determined automatically from SMILES
molRxnComponent         which component of a reaction an atom belongs to, read from/written to CTABs
molRxnRole              which role an atom plays in a reaction (1=Reactant, 2=Product, 3=Agent), read from/written to CTABs
smilesSymbol            determines the symbol that will be written to a SMILES for the atom
Known Problems
InChI generation and (probably) parsing. This seems to be a limitation of the IUPAC InChI code.
In order to allow the code to be used in a multi-threaded environment, a mutex is used to
ensure that only one thread is using the IUPAC code at a time. This is only enabled if the RDKit
is built with the RDK_TEST_MULTITHREADED option enabled.
The MolSuppliers (e.g. SDMolSupplier, SmilesMolSupplier) change their internal state when a
molecule is read. It is not safe to use one supplier on more than one thread.
Substructure searching using query molecules that include recursive queries. The recursive
queries modify their internal state when a search is run, so it’s not safe to use the same query
concurrently on multiple threads. If the code is built using the RDK_BUILD_THREADSAFE_SSS
argument (the default for the binaries we provide), a mutex is used to ensure that only one
thread is using a given recursive query at a time.
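The mutex approach described above can be pictured with a small Python analogue: a lock ensures that only one thread at a time enters code that mutates shared state. This is an illustration of the pattern only; the RDKit's mutexes live in its C++ code, and LockedCall and not_thread_safe are hypothetical stand-ins.

```python
import threading


class LockedCall:
    """Serialize calls to a function that is not safe to run concurrently."""

    def __init__(self, func):
        self._func = func
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        with self._lock:  # only one thread may be inside func at a time
            return self._func(*args, **kwargs)


# stand-in for code that modifies internal state while it runs
state = {"calls": 0}


def not_thread_safe(x):
    state["calls"] += 1
    return x * 2


guarded = LockedCall(not_thread_safe)
threads = [threading.Thread(target=guarded, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state["calls"])  # 8
```

For objects such as suppliers, where no lock is provided, the simpler option is to give each thread its own instance.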
The RDKit’s TPSA implementation only includes, by default, contributions from N and O atoms. Table 1 of the
TPSA publication, however, includes parameters for polar S and P in addition to N and O. What’s going on?
In the RDKit implementation, we chose to reproduce the behavior of the tpsa.c Contrib program and what is
provided in Table 3 of the paper, so polar S and P are ignored. Based on a couple of user requests, for the
2018.09 release of the RDKit we added the option to include S and P contributions:
Here’s a sample block from an SDF that demonstrates all of the features, they are explained below:
property_example
RDKit 2D
3 3 0 0 0 0 0 0 0 0999 V2000
0.8660 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.4330 0.7500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.4330 -0.7500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
3 1 1 0
M END
> <atom.dprop.PartialCharge> (1)
0.008 -0.314 0.008
$$$$
Every atom property list should contain a number of space-delimited elements equal to the number of atoms.
Missing values are, by default, indicated with the string n/a. The missing value marker can be changed by
beginning the property list with a value in square brackets. So, for example, the property PartiallyMissing is
set to “one” for atom 0, “three” for atom 2, and is not set for atom 1. Similarly the property PartiallyMissingInt
is set to 2 for atom 0, 2 for atom 1, and is not set for atom 2.
This behavior is enabled by default and can be turned on/off with the
rdkit.Chem.rdmolfiles.SetProcessPropertyLists method.
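The property-list rules above (one space-delimited element per atom, n/a for missing values, an optional leading marker in square brackets) can be expressed as a short pure-Python parser. This is a sketch of the format as described, not the RDKit's reader; parse_atom_property_list is a hypothetical name and the "one n/a three" and "[?] 2 2 ?" inputs are assumed encodings of the PartiallyMissing and PartiallyMissingInt examples.

```python
def parse_atom_property_list(text, n_atoms, convert=str):
    """Return {atom_index: value} for the atoms that have a value set."""
    tokens = text.split()
    missing = "n/a"  # default missing-value marker
    if tokens and tokens[0].startswith("[") and tokens[0].endswith("]"):
        missing = tokens[0][1:-1]  # custom marker given in square brackets
        tokens = tokens[1:]
    if len(tokens) != n_atoms:
        raise ValueError("expected one element per atom")
    return {i: convert(tok) for i, tok in enumerate(tokens) if tok != missing}


# the PartialCharge list from the sample block above
print(parse_atom_property_list("0.008 -0.314 0.008", 3, float))
# {0: 0.008, 1: -0.314, 2: 0.008}

# assumed encodings of the partially missing examples
print(parse_atom_property_list("one n/a three", 3))    # {0: 'one', 2: 'three'}
print(parse_atom_property_list("[?] 2 2 ?", 3, int))   # {0: 2, 1: 2}
```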
If you have atom properties that you would like to have written to SDF files, you can use the functions
rdkit.Chem.rdmolfiles.CreateAtomStringPropertyList(),
rdkit.Chem.rdmolfiles.CreateAtomIntPropertyList(),
rdkit.Chem.rdmolfiles.CreateAtomDoublePropertyList(), or
rdkit.Chem.rdmolfiles.CreateAtomBoolPropertyList() :
>>> m = Chem.MolFromSmiles('CO')
>>> m.GetAtomWithIdx(0).SetDoubleProp('foo',3.14)
>>> Chem.CreateAtomDoublePropertyList(m,'foo')
>>> m.GetProp('atom.dprop.foo')
'3.1400000000000001 n/a'
>>> from io import StringIO
>>> sio = StringIO()
>>> w = Chem.SDWriter(sio)
>>> w.write(m)
>>> w=None
>>> print(sio.getvalue())
RDKit 2D
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2990 0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
M END
> <atom.dprop.foo> (1)
3.1400000000000001 n/a
$$$$
Representation
Stored as a vector of rdkit.Chem.rdchem.StereoGroup objects on a molecule. Each StereoGroup keeps
track of its type and the set of atoms that make it up.
Use cases
The initial target is to not lose data on a V3k mol -> RDKit -> V3k mol round trip. Manipulation and
depiction are future goals.
>>> m = Chem.MolFromSmiles('C[C@H](F)C[C@H](O)Cl |&1:1|')
>>> m.GetStereoGroups()[0].GetGroupType()
rdkit.Chem.rdchem.StereoGroupType.STEREO_AND
>>> [x.GetIdx() for x in m.GetStereoGroups()[0].GetAtoms()]
[1]
>>> from rdkit.Chem.EnumerateStereoisomers import EnumerateStereoisomers
>>> [Chem.MolToCXSmiles(x) for x in EnumerateStereoisomers(m)]
['C[C@@H](F)C[C@H](O)Cl |&1:1|', 'C[C@H](F)C[C@H](O)Cl |&1:1|']
Reactions also preserve StereoGroups. Product atoms are included in the StereoGroup as
long as the reaction doesn’t create or destroy chirality at the atom.
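[Table: Y/N substructure-match matrix for structures with enhanced stereochemistry; the structure drawings that labelled the rows and columns are not reproduced]
Y Y Y Y Y Y Y
N Y N N Y Y Y
N N Y N N Y Y
N N N Y N N N
N Y N N N Y Y
N N N N N Y Y
N N N N N N Y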
Substructure search using molecules with enhanced stereochemistry follows these rules (where substructure <
superstructure):
achiral < everything, because an achiral query means ignore chirality in the match
chiral < AND, because AND includes both the chiral molecule and another one
chiral < OR, because OR includes either the chiral molecule or another one
OR < AND, because AND includes both molecules that OR could actually mean.
one group of two atoms < two groups of one atom, because the latter is 4 different
RDKit Fingerprints
This is an RDKit-specific fingerprint that is inspired by (though it differs significantly from) public descriptions of the
Daylight fingerprint [7]. The fingerprinting algorithm identifies all subgraphs in the molecule within a particular range
of sizes, hashes each subgraph to generate a raw bit ID, mods that raw bit ID to fit in the assigned fingerprint size,
and then sets the corresponding bit. Options are available to generate count-based forms of the fingerprint or “non-
folded” forms (using a sparse representation).
The default scheme for hashing subgraphs is to hash the individual bonds based on:
the types of the two atoms. Atom types include the atomic number (mod 128), and whether or not the
atom is aromatic.
the degrees of the two atoms in the path.
the bond type (or AROMATIC if the bond is marked as aromatic)
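The hash, mod, set-bit sequence described above can be sketched without the RDKit. Everything specific here is an assumption made for illustration: CRC32 stands in for the real hash function, the bond descriptor strings are invented, and the function names are hypothetical.

```python
import zlib


def raw_bit_id(bond_descriptors):
    """Stand-in hash: CRC32 over the sorted bond descriptors of one subgraph."""
    key = "|".join(sorted(bond_descriptors)).encode("utf-8")
    return zlib.crc32(key)


def fold_bit(raw_id, fp_size=2048):
    """Mod the raw bit ID so that it fits in the assigned fingerprint size."""
    return raw_id % fp_size


def toy_fingerprint(subgraphs, fp_size=2048):
    """Return the set of bit positions switched on by the given subgraphs."""
    return {fold_bit(raw_bit_id(sg), fp_size) for sg in subgraphs}


# invented descriptors encoding atom type, degree and bond type for CCO
ethanol_subgraphs = [
    ["C1-C2:SINGLE"],
    ["C2-O1:SINGLE"],
    ["C1-C2:SINGLE", "C2-O1:SINGLE"],
]
bits = toy_fingerprint(ethanol_subgraphs, fp_size=64)
print(sorted(bits))
```

Sorting the descriptors before hashing makes the raw bit ID independent of the order in which a subgraph's bonds were enumerated.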
Fingerprint-specific options
minPath and maxPath control the size (in bonds) of the subgraphs/paths considered
nBitsPerHash: If this is greater than one, each subgraph will set more than one bit. The
additional bits will be generated by seeding a random number generator with the original raw bit
ID and generating the appropriate number of random numbers.
useHs: toggles whether or not Hs are included in the subgraphs/paths (assuming that there are
Hs in the molecule graph).
tgtDensity: if this is greater than zero, the fingerprint will be repeatedly folded in half until the
density of set bits is greater than or equal to this value or the fingerprint only contains minSize
bits. Note that this means that the resulting fingerprint will not necessarily be the size you
requested.
branchedPaths: if this is true (the default value), the algorithm will use subgraphs (i.e. features
can be branched). If false, only linear paths will be considered.
useBondOrder: if true (the default) bond types will be considered when hashing subgraphs,
otherwise this component of the hash will be ignored.
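Two of the options above, nBitsPerHash and tgtDensity, are easier to follow with a small model. The sketch below is plain Python under stated assumptions: Python's random module stands in for the RDKit's random number generator (so the bit positions will differ from real fingerprints), bits are held as a set of indices, and the function names are hypothetical.

```python
import random


def bits_for_subgraph(raw_bit_id, fp_size, n_bits_per_hash=1):
    """First bit comes from the raw ID; extra bits from an RNG seeded with it."""
    bits = [raw_bit_id % fp_size]
    rng = random.Random(raw_bit_id)  # seeded, so the extra bits are reproducible
    for _ in range(n_bits_per_hash - 1):
        bits.append(rng.randrange(fp_size))
    return bits


def fold_to_density(on_bits, size, tgt_density, min_size):
    """Fold in half until the density reaches tgt_density or size hits min_size."""
    on_bits = set(on_bits)
    while size > min_size and len(on_bits) / size < tgt_density:
        size //= 2
        # folding ORs the upper half onto the lower half
        on_bits = {b % size for b in on_bits}
    return on_bits, size


folded, new_size = fold_to_density({1, 5, 1030}, 2048, 0.01, 64)
print(sorted(folded), new_size)  # [1, 5, 6] 256
```

The example starts with 3 bits in 2048 and folds three times before 3/256 exceeds the target density of 0.01, which shows why the result may be smaller than the size requested.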
Pattern Fingerprints
These fingerprints were designed to be used in substructure screening. These are, as far as I know, unique to the
RDKit. The algorithm identifies features in the molecule by doing substructure searches using a small number (12 in
the 2019.03 release of the RDKit) of very generic SMARTS patterns - like [*]~[*]~[*](~[*])~[*] or
[R]~1[R]~[R]~[R]~1, and then hashing each occurrence of a pattern based on the atom and bond types
involved. The fact that a particular pattern matched the molecule at all is also stored by hashing the pattern ID and
size. If a particular feature contains either a query atom or a query bond (e.g. something generated from SMARTS),
the only information that is hashed is the fact that the generic pattern matched.
For the 2019.03 release, the atom types use just the atomic number of the atom and the bond types use the bond
type (or AROMATIC for aromatic bonds).
NOTE: Because it plays an important role in substructure screenout, the internals of this fingerprint (the generic
patterns used and/or the details of the hashing algorithm) may change from one release to the next.
Atom-Pair and Topological Torsion Fingerprints
These fingerprints were originally “intended” to be used in count-vectors and they seem to work better that way. The
default behavior of the explicit bit-vector forms of both fingerprints is to use a “count simulation” procedure where
multiple bits are set for a given feature if it occurs more than once. The default behavior is to use 4 fingerprint bits
for each feature (so a 2048 bit fingerprint actually stores information about the same number of features as a 512
bit fingerprint that isn’t using count simulation). The bins correspond to counts of 1, 2, 4, and 8. As an example of
how this works: if a feature occurs 5 times in a molecule, the bits corresponding to counts 1, 2, and 4 will be set.
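The count simulation scheme can be written down directly from the description above: each feature owns four bits, one per count threshold. The sketch assumes the four bits of a feature are laid out consecutively, which is an illustrative choice rather than a statement about the RDKit's internal layout; the names are hypothetical.

```python
COUNT_BINS = (1, 2, 4, 8)


def count_simulation_bits(feature_index, count, bins=COUNT_BINS):
    """Return the bit positions set for a feature that occurs `count` times."""
    base = feature_index * len(bins)  # assumed layout: consecutive bits per feature
    return [base + i for i, threshold in enumerate(bins) if count >= threshold]


# a feature seen 5 times sets the bits for counts 1, 2 and 4, but not 8
print(count_simulation_bits(0, 5))  # [0, 1, 2]
# a feature seen once sets only its first bit
print(count_simulation_bits(3, 1))  # [12]
```

With four bits per feature, a 2048-bit vector has room for 2048 / 4 = 512 distinct feature slots, matching the comparison made above.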
Layered Fingerprints
These are another “RDKit original” and were developed with the intention of using them as a substructure
fingerprint. Since the pattern fingerprint is far simpler and has proven to be quite effective as a substructure
fingerprint, the layered fingerprint hasn’t received much attention. It may still be interesting for something, so we
continue to include it.
The idea of the fingerprint is to generate features using the same subgraph (or path) enumeration algorithm used in
the RDKit fingerprint. After a subgraph has been generated, it is used to set multiple bits based on different atom
and bond type definitions.
Footnotes
[1] http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html
[2] http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html
[3] http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
[4] https://docs.chemaxon.com/display/docs/ChemAxon+Extended+SMILES+and+SMARTS+-+CXSMILES+and+CXSMARTS
[5] Thanks to James Davidson for this example.
[6] Thanks to JP Ebejer and Paul Finn for this example.
[7] http://www.daylight.com/dayhtml/doc/theory/theory.finger.html
[8] http://pubs.acs.org/doi/abs/10.1021/ci00046a002
[9] http://pubs.acs.org/doi/abs/10.1021/ci00054a008
[10] http://pubs.acs.org/doi/abs/10.1021/ci100050t
[11] https://doi.org/10.1002/(SICI)1097-0290(199824)61:1%3C47::AID-BIT9%3E3.0.CO;2-Z
License
This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 License. To view a copy of this
license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard
Street, 5th Floor, San Francisco, California, 94105, USA.
The intent of this license is similar to that of the RDKit itself. In simple words: “Do whatever you want with it, but
please give us some credit.”