1. RDKit: where did we come from and where are
we going?
Greg Landrum (@dr_greg_landrum)
12th International Conference on Chemical Structures
12 June, 2022
2. The Trustees of the CSA Trust are pleased to announce that
Greg Landrum has been awarded the 2022 Mike Lynch
Award, in recognition of his work on the development of
RDKit and his fostering of the community around it, a
transformative software resource for cheminformatics and
machine learning. https://csa-trust.org/2022/05/13/mike-lynch-award-2022-greg-landrum/
The purpose of the Award is to recognise and encourage outstanding
accomplishments in education, research and development activities that are
related to the systems and methods used to store, process and retrieve
information about chemical structures, reactions and properties.
The Mike Lynch Award will be presented at a prestigious, relevant conference
to be identified prior to each presentation and the awardee will be asked to
give a presentation at the conference. https://csa-trust.org/awards-and-grants/awards/
4. Acknowledgements
● Everyone who has contributed code, questions,
answers, bug reports, etc
● The people who manage RDKit packaging
● The organizers and sponsors of the RDKit
UGMs
● People who have funded RDKit development
(directly or indirectly)
● The others in our community who've been
pushing the idea and adoption of open source
5. An open source toolkit for cheminformatics
● Business-friendly BSD license
● Core data structures and algorithms in
C++
● Python 3.x wrapper generated using
Boost.Python
● Java and C# wrappers generated with
SWIG
● JavaScript wrappers
● CFFI wrapper for usage from other
languages
● 2D and 3D molecular operations
● Descriptor generation for machine
learning
● Molecular database cartridge for
PostgreSQL
● Cheminformatics nodes for KNIME
(distributed from the KNIME
community site:
http://www.knime.org/rdkit)
7. Releases, reproducibility, and citability
● 2 feature releases per year
● ~monthly patch releases with bug fixes
● Every release is assigned a DOI and archived on Zenodo
https://zenodo.org/record/6483170
10. Sustainability: the bus problem
RDKit maintainers:
- Greg
- Brian Kelley (Relay Therapeutics)
- Ricardo Rodriguez (Schrödinger)
- Paolo Tosco (Novartis)
Regular code contributors:
- David Cosgrove
- Peter Gedeck
- Gareth Jones
- Eisuke Kawashima
- Dan Nealschneider
- Sereina Riniker
- Roger Sayle
- Riccardo Vianello
14. The early days
● 2000-2006: initial development work at Rational Discovery
● 2006: code open sourced and released on sourceforge.net
15. Aside: some motivations for open-sourcing scientific code
● Recognition
● Helping the scientific community
● Feedback and help from others
● You get to keep using the code when you move on
to your next position
16. Some history
● 2000-2006: initial development work at Rational Discovery
● 2006: code open sourced and released on sourceforge.net
● 2007: First NIBR contribution (chemical reaction handling); Noel discovers the RDKit
● 2008: first POC of Java wrapper; Mac support added; SLN and Mol2 parsers
● 2009: Morgan fingerprints; switch to CMake; switch to VF2 for SSS
● 2010: PostgreSQL cartridge; first iteration of the KNIME nodes; $RDBASE/Contrib appears; SaltRemover and FunctionalGroups code
● 2011: New Java wrappers; more functionality moved to C++; InChI support; AvalonTools integration
● 2012: First UGM; speed improvements; MCS implementation; IPython integration; “RDKit Cookbook” appears
● 2013: Move to GitHub; Pandas integration; MMFF and Open3DAlign support; PDB support; RDKit blog started
17. Some history, contd.
● 2014: Python 3 support; conda integration; experimental Lucene integration; MCS implementation in C++
● 2015: new drawing code; improved canonicalization algorithm; ETKDG; reduced memory usage
● 2016: regular patch releases; easier builds; performance improvements; KNIME nodes move to GitHub
● 2017: modern C++; R-group decomposition; first GSoC participation; conda-forge packages
● 2018: CoordGen integration; molecular standardization
● 2019: Azure DevOps; substructure speedup; new molecule hashing code; Neo4J integration; new JS wrappers
● 2020: new CIP implementation; scaffold network; abbreviations; tautomer-insensitive substructure search
● 2021: rdkit-cffi; more drawing improvements; R-group decomposition improvements
● 2022: C++17; generics for searching; non-tetrahedral symmetry…
20. Longer term RDKit objectives
● Improved support for other classes of molecules
■ Polymers
■ Organometallics
● Ensuring that the PostgreSQL cartridge is a plausible
candidate for use in a corporate “data warehouse” [1]
● Ensuring all the pieces are in place to make it easy to
write a compound registration system
[1] or whatever such things are called these days
21. Future directions: the cartridge
Ensuring that the PostgreSQL cartridge is a plausible candidate
for use in a corporate “data warehouse”
- Integration of tautomer insensitive search
- Integration of the MolStandardize code
- Improvements to the chemical reaction handling
- Integration of the generics for searching
Further ideas
- Adding some 3D search capabilities
23. Aside: Goals of a compound registration system
We want to be able to answer these questions:
- Have we seen this compound before?
- Give me a key for this compound
- Give me the structure for this key
So what do we need to be able to do?
- Standardize molecules
- Generate hashes/keys for standardized molecules
- Store structures
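The three steps above can be sketched in a few lines of Python. This is only an illustration, not a production design: the in-memory `registry` dictionary and the helper names `standardize`, `make_key`, `register`, and `lookup` are hypothetical, the key is simply the canonical SMILES of the standardized molecule, and the choice of `Cleanup` followed by `ChargeParent` is one possible standardization policy among many.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

# Hypothetical in-memory store: key -> structure (as a mol block)
registry = {}


def standardize(mol):
    """One possible policy: default cleanup, then the neutral parent fragment."""
    cleaned = rdMolStandardize.Cleanup(mol)
    return rdMolStandardize.ChargeParent(cleaned)


def make_key(mol):
    """Use the canonical SMILES of the standardized molecule as the key."""
    return Chem.MolToSmiles(standardize(mol))


def register(smiles):
    """Return (key, is_new); store the structure the first time a key is seen."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse {smiles!r}")
    key = make_key(mol)
    is_new = key not in registry
    if is_new:
        registry[key] = Chem.MolToMolBlock(standardize(mol))
    return key, is_new


def lookup(key):
    """Give me the structure for this key."""
    return Chem.MolFromMolBlock(registry[key])


# Acetic acid and its sodium salt collapse to the same key under this policy
key_a, new_a = register("CC(=O)O")
key_b, new_b = register("CC(=O)[O-].[Na+]")
print(key_a, new_a)
print(key_b, new_b)
```

Whether a salt and its parent should count as the same compound is a policy decision; a real system would make that choice explicit.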
25. Using keys for registration
Idea: use a hash to combine:
- The molecular structure (via a fixed-H InChI)
- A stereo code
- A stereo comment
https://github.com/rdkit/UGM_2015/blob/8f562e70add17bab35f43823af0f03673f8a1f2d/Presentations/KeyToRegistration.GregLandrum.pdf
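A minimal sketch of this idea, assuming an RDKit build with InChI support. The function name `registration_key`, the separator, the use of SHA-1, and the example stereo codes ("RAC", "UNK") are all assumptions made for illustration; the linked presentation describes the actual scheme.

```python
import hashlib

from rdkit import Chem


def registration_key(mol, stereo_code, stereo_comment=""):
    """Hash a fixed-H InChI together with a stereo code and a stereo comment."""
    # "/FixedH" asks the InChI library to include the fixed-hydrogen layer
    inchi = Chem.MolToInchi(mol, options="/FixedH")
    payload = "|".join((inchi, stereo_code, stereo_comment))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Lactic acid drawn without stereochemistry
mol = Chem.MolFromSmiles("CC(O)C(=O)O")

# Same drawn structure, different stereo annotations -> different keys
key_racemate = registration_key(mol, "RAC")
key_unknown = registration_key(mol, "UNK", "fraction 1 from chiral column")
print(key_racemate)
print(key_unknown)
```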
26. Future directions: registration systems
Ensuring all the pieces are in place to make it easy to write a compound registration system
- Improvements to MolStandardize code
- Improvements to the molecular hashing code
- Support for other classes of molecules
27. Let’s talk about molecular identity
This isn’t just a topic for standard compound registration systems.
28. Molecular identity and computational questions
● Which molecules were used to generate this
result?
● Have I already done a calculation using this
molecule?
● Was this molecule part of my training set?
All of these require us to be able to answer
the question
“are these two molecules the same?”
Here be dragons…
30. Some things making molecular identity nontrivial
● Counterions, solvents
● Resonance forms
● Charges
● Tautomers
● Stereochemistry
Sometimes we care about these differences, sometimes we don’t. It depends on the context
in which we ask the question “are these two molecules the same?”
This is not a comprehensive list
31. Identity hashes for molecules
Idea: convert the molecule into some form which allows us to test whether or not it’s
identical to other molecules via a simple string (or numerical) comparison.
What “identical” means will be determined by the identity hash used.
Familiar examples:
- Canonical SMILES
- InChI
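As a small illustration of the two familiar examples, the sketch below uses canonical SMILES to ask "was this molecule part of my training set?" and InChI to compare two differently written inputs. The molecules and the `training_set` contents are made up for the example, and the InChI call assumes an RDKit build with InChI support.

```python
from rdkit import Chem


def canonical_smiles(smiles):
    """Identity hash #1: RDKit canonical SMILES."""
    return Chem.MolToSmiles(Chem.MolFromSmiles(smiles))


def inchi(smiles):
    """Identity hash #2: standard InChI."""
    return Chem.MolToInchi(Chem.MolFromSmiles(smiles))


# Store the hashes, not the raw input strings
training_set = {canonical_smiles(s) for s in ["c1ccccc1O", "CCO", "CC(=O)O"]}

# Phenol written a different way is still recognized
seen_before = canonical_smiles("Oc1ccccc1") in training_set
print(seen_before)

# Two spellings of ethanol give the same InChI
same_inchi = inchi("OCC") == inchi("CCO")
print(same_inchi)

# Two tautomers are different molecules as far as canonical SMILES is concerned
tautomers_match = canonical_smiles("Oc1ccccn1") == canonical_smiles("O=c1cccc[nH]1")
print(tautomers_match)
```

The last comparison shows why the choice of hash matters: what counts as "identical" is decided by the hash, not by the chemistry.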
32. Contextual identity
Instead of having a single key/hash for a molecule, store a collection of layers with different
levels of detail/types of information. When searching, choose the layers which are relevant
for the current use case
● Store molecules using some relatively lossless format (e.g. v3000 SDF)
● Use molecular hashes capturing different levels of information to establish whether or
not duplicates exist
Note: it’s possible to do a limited version of this via careful manipulation of InChI strings
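The InChI-based variant mentioned in the note can be sketched as follows. This is a simplified illustration that assumes an RDKit build with InChI support: the record layout, the helper names `inchi_without_stereo` and `make_record`, and the rule of dropping every layer starting with `b`, `t`, `m`, or `s` are assumptions for this example, and real InChI manipulation needs more care.

```python
from rdkit import Chem

# InChI stereo layers: /b (double bond), /t (tetrahedral), /m and /s (stereo type)
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def inchi_without_stereo(inchi):
    """Drop the stereo layers from an InChI string (simplified)."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join([head] + kept)


def make_record(smiles):
    """Store a V3000 mol block plus hashes at two levels of detail."""
    mol = Chem.MolFromSmiles(smiles)
    full = Chem.MolToInchi(mol)
    return {
        "structure": Chem.MolToV3KMolBlock(mol),
        "layers": {"full": full, "no_stereo": inchi_without_stereo(full)},
    }


# The two enantiomers of alanine
ala_a = make_record("C[C@H](N)C(=O)O")
ala_b = make_record("C[C@@H](N)C(=O)O")

# Choose the layer that matches the question being asked
print(ala_a["layers"]["full"] == ala_b["layers"]["full"])
print(ala_a["layers"]["no_stereo"] == ala_b["layers"]["no_stereo"])
```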
33. Some more identity hashes
https://www.nextmovesoftware.com/talks/OBoyle_MolHash_ACS_201908.pdf
Available in the RDKit since the 2019.09 release
34. Some of the basic identity hashes in rdMolHash
● Molecular formula
● Anonymous graph
● Element graph
● Murcko scaffold
● Tautomer
● Canonical SMILES
There are many others
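A short sketch of how these hashes can be computed with `rdMolHash.MolHash`. The mapping from the slide's labels to `HashFunction` members (in particular "Tautomer" to `HetAtomTautomer`) is an assumption for this example, and the two pyridone tautomers are just a convenient test pair.

```python
from rdkit import Chem
from rdkit.Chem import rdMolHash

HF = rdMolHash.HashFunction

# Assumed mapping from the labels in the list above to HashFunction members
HASH_FUNCTIONS = {
    "formula": HF.MolFormula,
    "anonymous graph": HF.AnonymousGraph,
    "element graph": HF.ElementGraph,
    "Murcko scaffold": HF.MurckoScaffold,
    "tautomer": HF.HetAtomTautomer,
    "canonical SMILES": HF.CanonicalSmiles,
}


def basic_hashes(smiles):
    """Compute each of the basic identity hashes for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    return {name: rdMolHash.MolHash(mol, fn) for name, fn in HASH_FUNCTIONS.items()}


# 2-hydroxypyridine and 2-pyridone: two tautomers of the same compound
hydroxy = basic_hashes("Oc1ccccn1")
pyridone = basic_hashes("O=c1cccc[nH]1")

for name in HASH_FUNCTIONS:
    print(f"{name:18s} same={hydroxy[name] == pyridone[name]}")
```

Each hash answers a different "are these the same?" question, so the useful output here is which hashes agree for the pair and which do not.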
35. Hashes for registration
The team at Schrödinger [1] have contributed a new RDKit module for calculating layered
hashes which are useful for compound identity testing and registration. This will be in the
2022.09 release.
Layers it currently supports:
- Formula
- Canonical SMILES: with and without stereo
- Tautomer hash: with and without stereo
- Sgroup data (for some help with polymers and things like atropisomers)
- “Escape layer” (free text allowing a structure to be different even if everything else says
it’s the same)
[1] Chris Von Bargen, Hussein Faara, Dan Nealschneider, Ricardo Rodriguez, Rachel Walker
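A minimal usage sketch of the layered hashes, assuming RDKit 2022.09 or later, where the module is available as `rdkit.Chem.RegistrationHash`. The alanine enantiomers are only an example input.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashLayer, HashScheme

# The two enantiomers of alanine
mol_a = Chem.MolFromSmiles("C[C@H](N)C(=O)O")
mol_b = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")

# Step 1: compute the individual layers for each molecule
layers_a = RegistrationHash.GetMolLayers(mol_a)
layers_b = RegistrationHash.GetMolLayers(mol_b)
for layer, value in layers_a.items():
    print(layer, repr(value))

# Step 2: combine the layers into a single hash.
# The default scheme uses all layers, so the enantiomers differ.
full_a = RegistrationHash.GetMolHash(layers_a)
full_b = RegistrationHash.GetMolHash(layers_b)

# A stereo-insensitive scheme only combines the layers that ignore stereo
flat_a = RegistrationHash.GetMolHash(layers_a, HashScheme.STEREO_INSENSITIVE_LAYERS)
flat_b = RegistrationHash.GetMolHash(layers_b, HashScheme.STEREO_INSENSITIVE_LAYERS)

print(full_a == full_b, flat_a == flat_b)
```

Storing the layers alongside the combined hash keeps the option of answering a different "same?" question later without recomputing anything from the structure.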
39. Handling atropisomers
Structures from: https://doi.org/10.1016/j.xphs.2021.10.011
The bold and hashed bonds are just drawing features and don’t survive translation
to things like CXSMILES or mol files. But we can use S groups to indicate the
stereochemistry
43. Handling enhanced stereochemistry
{<HashLayer.CANONICAL_SMILES: 1>:
'CC[C@@H](CO)NCCN[C@@H](CC)CO',
…,
<HashLayer.NO_STEREO_SMILES: 4>:
'CCC(CO)NCCNC(CC)CO',
…}
{<HashLayer.CANONICAL_SMILES: 1>:
'CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|',
…,
<HashLayer.NO_STEREO_SMILES: 4>:
'CCC(CO)NCCNC(CC)CO',
…}
We get the same hash if the molecule is drawn with
wedged bonds.
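Layer dictionaries like the two shown above can be generated along these lines, assuming RDKit 2022.09 or later. The SMILES strings are taken from the output above; the exact form of the CXSMILES extension in the canonical layer may vary between RDKit versions.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashLayer

# Absolute stereochemistry at both centers
abs_mol = Chem.MolFromSmiles("CC[C@@H](CO)NCCN[C@@H](CC)CO")
# Same drawing, but both centers are in an enhanced-stereo AND group
and_mol = Chem.MolFromSmiles("CC[C@@H](CO)NCCN[C@@H](CC)CO |&1:2,9|")

abs_layers = RegistrationHash.GetMolLayers(abs_mol)
and_layers = RegistrationHash.GetMolLayers(and_mol)

# The stereo-aware layer records the enhanced stereo group...
print(abs_layers[HashLayer.CANONICAL_SMILES])
print(and_layers[HashLayer.CANONICAL_SMILES])

# ...while the stereo-free layer is identical for both
print(abs_layers[HashLayer.NO_STEREO_SMILES])
print(and_layers[HashLayer.NO_STEREO_SMILES])
```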
44. Using the escape layer
Suppose I start with the racemic mixture, run it through a chiral column, and
collect the two fractions
I want to register the two fractions separately without determining the absolute
stereochemistry
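One way this could look in code, assuming RDKit 2022.09 or later. The choice of an enhanced-stereo OR group to represent "a single enantiomer, but we do not know which one" and the escape strings "fraction 1" and "fraction 2" are assumptions for this sketch.

```python
from rdkit import Chem
from rdkit.Chem import RegistrationHash
from rdkit.Chem.RegistrationHash import HashLayer

# One enantiomer of unknown absolute configuration: an OR group over both centers
fraction = Chem.MolFromSmiles("CC[C@@H](CO)NCCN[C@@H](CC)CO |o1:2,9|")

# Identical structures, distinguished only by the free-text escape layer
layers_1 = RegistrationHash.GetMolLayers(fraction, escape="fraction 1")
layers_2 = RegistrationHash.GetMolLayers(fraction, escape="fraction 2")

hash_1 = RegistrationHash.GetMolHash(layers_1)
hash_2 = RegistrationHash.GetMolHash(layers_2)
print(hash_1)
print(hash_2)

# Every layer except the escape layer is the same
differing = [layer for layer in layers_1 if layers_1[layer] != layers_2[layer]]
print(differing)
```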
46. Aside: using the escape layer for comp chem
{…
<HashLayer.ESCAPE: 2>: 'conformer 1',
…}
{…
<HashLayer.ESCAPE: 2>: 'conformer 2',
…}
Suppose I want to store multiple conformers/poses of the same molecule
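A rough sketch of that use case, assuming RDKit 2022.09 or later. The `store` dictionary, the butane example, and the "conformer N" labels are hypothetical; the point is only that each conformer gets its own key although the molecule is the same.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, RegistrationHash

# Butane with explicit hydrogens and two embedded conformers
mol = Chem.AddHs(Chem.MolFromSmiles("CCCC"))
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=2, randomSeed=42))

# Hypothetical store: registration hash -> 3D mol block for that conformer
store = {}
for i, conf_id in enumerate(conf_ids, start=1):
    layers = RegistrationHash.GetMolLayers(mol, escape=f"conformer {i}")
    key = RegistrationHash.GetMolHash(layers)
    store[key] = Chem.MolToMolBlock(mol, confId=conf_id)

print(len(store))
```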
47. Wrapping up: molecular identity
● For many computational tasks we want to be
able to figure out whether or not we have
seen/used a particular molecule
● The definition of “same” for molecules
depends on the context/question being asked
● Layered registration hashes make it easy (and
cheap) to store sets of molecules and answer
the context-dependent “are these the same?”
question