SlideShare ist ein Scribd-Unternehmen logo
1 von 35
Downloaden Sie, um offline zu lesen
RDkit Gems
Roger Sayle, PhD
NextMove Software LTD
cambridge, uk
introduction
• This presentation is intended as a short talktorial
describing real world code examples, recent
features and commonly used RDKit idioms.
• Although many refect personal/NextMove
Software experiences developing with the C++
APIs, most (though not all) observations apply
equally to the Java and Python APIs.
looping over molecule atoms
• Poor
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); atomIt++) …
• Better
for (RWMol::AtomIterator atomIt=mol->beginAtoms();
atomIt != mol->endAtoms(); ++atomIt) …
• The C++ post increment operator returns
the original value prior to the increment,
requiring a copy to made.
looping over aTOM's bonds
• Documented idiom:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::OEDGE_ITER beg,end;
boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr);
while(beg!=end){
const Bond * bond=(*molPtr)[*beg].get();
... do something with the Bond ...
++beg;
}
looping over aTOM's bonds
• Proposed replacement:
... molPtr is a const ROMol * ...
... atomPtr is a const Atom * ...
ROMol::OBOND_ITER_PAIR bondIt;
bondIt = molPtr->getAtomBonds(atomPtr);
while(bondIt.first != bondIt.second){
const Bond * bond=(*molPtr)[*bondIt.first].get();
... do something with the Bond ...
++bondIt;
}
Accepting string arguments
• Good practice to accept both C++
constant literals and STL std::strings.
• Poor
void foo(const char *ptr) { foo(std::string(ptr)); }
void foo(const std::string str) { … }
• Better
void foo(const std::string &str) { foo(str.c_str()); }
void foo(const char *ptr) { … }
isoelectronic valences
• Supporting PF6
-
in RDKit currently requires
adding 7 to the neutral valences of P.
Instead, P-1
should be isoelectronic with S.
• Poor
valence(atomicNo,charge) ~= valence(atomicNo,0)-charge
• Better
valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
representing reactions
• Two possible representations
– A new object containing vectors of reactant
and product (and agent?) molecules.
– Annotation of existing molecule with reaction
roles annotated on each atom.
• Both are representations are equivalent
and one may be converted to the other
and vice versa.
common ancestry/base class
• A significant advantage of avoiding new
object types is the common requirement in
cheminformatics of reading reactions,
molecules, queries, peptides, transforms,
and pharmacophores for the same file
format.
• OpenBabel and GGA's Indigo require you
to know the contents of a file/record (mol
vs. rxn) before reading it!?
Proposed rdkit interface
• Atom::setProp(“rxnRole”,(int)role)
– RXN_ROLE_NONE 0
– RXN_ROLE_REACTANT 1
– RXN_ROLE_AGENT 2
– RXN_ROLE_PRODUCT 3
• Atom::setProp(“rxnGroup”,(int)group)
• Atom::setProp(“molAtomMapNumber”,idx)
• RWMol::setProp(“isRxn”,isrxn?1:0)
and where this gets used...
• Code/GraphMol/FileParser/MolFileWriter.cpp:
• Function GetMolFileAtomLine contains
• Currently unused local variables
– rxnComponentType
– rxnComponentNumber
• This would allow support for CPSS-style
reactions [i.e. reactions in SD files]
inverted protein hierarchy
• Optimize data structures for common ops.
• Poor
for (modelIt mi=mol->getModels(); mi; ++mi)
for (chainIt ci=(*mi)->getChains(); ci; ++ci)
for (resIt ri=(*ci)->getResidues(); ri; ++ri)
for (atomIt ai=(*ri)->getAtoms(); ai; ++ai)
… /* Do something! */
• Better
for (atomIt ai=mol->getAtoms(); ai; ++ai)
… /* Do something! */
rdkit pdb file writer features
• By default
– Atoms are written as HETAM records
– Bonds are written as CONECT records
– Bond orders encoded by popular extension
– All atoms are uniquely named
– Molecule title written as COMPND record
– Formal charges are preserved.
– Default residue is “UNL” (not “LIG”).
example pdb format output
COMPND benzene
HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C
HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C
CONECT 1 2 2 6
CONECT 2 3
CONECT 3 4 4
CONECT 4 5
CONECT 5 6 6
END
unique atom names
• Counts unique per element type
• C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ
• This allows for up to 1036 indices.
beware openbabel's lig residue
• PDB residue LIG is reserved for 3-pyridin-
4-yl-2,4-dihydro-indeno[1,2-c]pyrazole.
• Use of residue UNL (for UNknown Ligand)
is much preferred.
PDB file readers
• Not all connectivity is explicit
• Two options for determining bonds
– Template methods
– Proximity methods
• Two options for determining bond orders
– Template methods
– 3D geometry methods
• See Sayle, “PDB Cruft to Content”, MUG 2001
template-based bond orders
• Need only specify the double bonds
• Most (11/20) amino acids only need C=O.
• Arg: C=O,CZ=NH2
• Asn/Asp: C=O,CG=OD1
• Gln/Glu: C=O,CD=OE1
• His: C=O,CG=CD2,ND1=CE1
• Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ
• Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
switching on short strings
// These are macros to allow their use in C++ constants
#define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C))
#define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D))
switch (rescode) {
case BCNAM('A','L','A'):
case BCNAM('C','Y','S'):
case BCNAM('G','L','Y'):
case BCNAM('I','L','E'):
case BCNAM('L','E','U'):
...
if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' '))
return true;
break;
proximity bonding
• The ideal distance between two bonded
atoms is the sum of their covalent radii.
• The ideal distance between two non-
bonded atoms is the sum of their van der
Waal's radii.
• Assume bonding between all atoms closer
than the sum of Rcov plus a delta (0.45Å).
• Impose minimum distance threshold (0.4Å)
compare the square
• A rate limiting step in many (poorly written)
proximity calculations is the calculation of
square roots.
• Poor
if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ...
• Better
if(dx*dx + dy*dy + dz*dz < radius*radius) …
c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
fail early, fail fast (like pharma)
• Poor
return dx*dx + dy*dy + dz*dz <= radius2
• Better
dist2 = dx*dx
if (dist > radius2)
return false
dist2 += dy*dy
if (dist > radius2)
return false
dist2 += dz*dz
return dist2 <= radius2
Proximity algorithms
• Brute Force [FreON] N2
• Z (Ordinate) Sort [OpenBabel] NlogN
• Binary Space Partition [CDK] NlogN..N2
• Fixed Dimension Grid [RasMol] ~N
• Fixed Spacing Grid [Coot] ~N
• Grid Hashing [RDKit/OEChem] ~N
proximity comparison
proximity comparison
proximity comparison
Brute force
z (ordinate) sort
binary space partition (bsp)
fixed dimension grid
fixed spacing grid
finite fixed spacing grid
infinite fixed spacing grid
grid hashing
future work/wish list
• Atom reordering in RDKit.
• C++ compressed (gzip) file support.
• Non-forward PDB supplier.
• In-place removeHs/addHs.

Weitere ähnliche Inhalte

Andere mochten auch

Real estate industry in chittagong
Real estate industry in chittagongReal estate industry in chittagong
Real estate industry in chittagongAlexander Decker
 
10分でわかるRandom forest
10分でわかるRandom forest10分でわかるRandom forest
10分でわかるRandom forestYasunori Ozaki
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論Taiji Suzuki
 
Частотный преобразователь
Частотный преобразовательЧастотный преобразователь
Частотный преобразовательkulibin
 
AQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismsAQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismssherinshaju
 
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonaJournalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonawww.patkane.global
 
Mobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneMobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneEdgewater
 
Content Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesContent Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesRebecca Feldman
 
22 of the best marketing quotes
22 of the best marketing quotes22 of the best marketing quotes
22 of the best marketing quotessherinshaju
 
PDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policePDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policeThe Star Newspaper
 

Andere mochten auch (18)

Real estate industry in chittagong
Real estate industry in chittagongReal estate industry in chittagong
Real estate industry in chittagong
 
Retailer 01-2013-preview
Retailer 01-2013-previewRetailer 01-2013-preview
Retailer 01-2013-preview
 
Mishimasyk141025
Mishimasyk141025Mishimasyk141025
Mishimasyk141025
 
R -> Python
R -> PythonR -> Python
R -> Python
 
Rdkitの紹介
Rdkitの紹介Rdkitの紹介
Rdkitの紹介
 
10分でわかるRandom forest
10分でわかるRandom forest10分でわかるRandom forest
10分でわかるRandom forest
 
機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論機械学習におけるオンライン確率的最適化の理論
機械学習におけるオンライン確率的最適化の理論
 
Частотный преобразователь
Частотный преобразовательЧастотный преобразователь
Частотный преобразователь
 
AQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organismsAQA Biology-Physical factors affecting organisms
AQA Biology-Physical factors affecting organisms
 
Tech
TechTech
Tech
 
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelonaJournalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
Journalism, Networks, Ontology: Pat kane presentation at Media140 barcelona
 
Cómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosasCómo hacer presentaciones exitosas
Cómo hacer presentaciones exitosas
 
Mobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows PhoneMobile for SharePoint with Windows Phone
Mobile for SharePoint with Windows Phone
 
Content Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored UpdatesContent Marketing: How to Attract Talent using Sponsored Updates
Content Marketing: How to Attract Talent using Sponsored Updates
 
Diana maria morales hernandez actividad1 mapa_c
Diana maria morales hernandez actividad1 mapa_cDiana maria morales hernandez actividad1 mapa_c
Diana maria morales hernandez actividad1 mapa_c
 
BIOSTER Technology Research Institute
BIOSTER Technology Research InstituteBIOSTER Technology Research Institute
BIOSTER Technology Research Institute
 
22 of the best marketing quotes
22 of the best marketing quotes22 of the best marketing quotes
22 of the best marketing quotes
 
PDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the policePDF of the 101 Things you need to know about the police
PDF of the 101 Things you need to know about the police
 

Ähnlich wie RDKit Gems

MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...moiz89
 
Nptel cad2-06 capcitances
Nptel cad2-06 capcitancesNptel cad2-06 capcitances
Nptel cad2-06 capcitanceschenna_kesava
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structureBITS
 
Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14cjfoss
 
Design and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationsDesign and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationscuashok07
 
Proposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsProposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsBlaise Mouttet
 
90981041 control-system-lab-manual
90981041 control-system-lab-manual90981041 control-system-lab-manual
90981041 control-system-lab-manualGopinath.B.L Naidu
 
MetroScientific Week 1.pptx
MetroScientific Week 1.pptxMetroScientific Week 1.pptx
MetroScientific Week 1.pptxBipin Saha
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Kshitij Singh
 
Benchmark Calculations of Atomic Data for Modelling Applications
 Benchmark Calculations of Atomic Data for Modelling Applications Benchmark Calculations of Atomic Data for Modelling Applications
Benchmark Calculations of Atomic Data for Modelling ApplicationsAstroAtom
 
PPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).pptPPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).ppttariqqureshi33
 

Ähnlich wie RDKit Gems (20)

En528032
En528032En528032
En528032
 
Lp seminar
Lp seminarLp seminar
Lp seminar
 
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
MICROPROCESSOR BASED SUN TRACKING SOLAR PANEL SYSTEM TO MAXIMIZE ENERGY GENER...
 
Nptel cad2-06 capcitances
Nptel cad2-06 capcitancesNptel cad2-06 capcitances
Nptel cad2-06 capcitances
 
Bits protein structure
Bits protein structureBits protein structure
Bits protein structure
 
Pot.ppt.pdf
Pot.ppt.pdfPot.ppt.pdf
Pot.ppt.pdf
 
Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14Quantum-Espresso_10_8_14
Quantum-Espresso_10_8_14
 
Design and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applicationsDesign and implementation of cyclo converter for high frequency applications
Design and implementation of cyclo converter for high frequency applications
 
Making a peaking filter by Julio Marqués
Making a peaking filter by Julio MarquésMaking a peaking filter by Julio Marqués
Making a peaking filter by Julio Marqués
 
Proposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and ApplicationsProposals for Memristor Crossbar Design and Applications
Proposals for Memristor Crossbar Design and Applications
 
90981041 control-system-lab-manual
90981041 control-system-lab-manual90981041 control-system-lab-manual
90981041 control-system-lab-manual
 
MetroScientific Week 1.pptx
MetroScientific Week 1.pptxMetroScientific Week 1.pptx
MetroScientific Week 1.pptx
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
Basic execution
Basic executionBasic execution
Basic execution
 
Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1Digital logic-formula-notes-final-1
Digital logic-formula-notes-final-1
 
Buck converter design
Buck converter designBuck converter design
Buck converter design
 
Benchmark Calculations of Atomic Data for Modelling Applications
 Benchmark Calculations of Atomic Data for Modelling Applications Benchmark Calculations of Atomic Data for Modelling Applications
Benchmark Calculations of Atomic Data for Modelling Applications
 
Wien2k getting started
Wien2k getting startedWien2k getting started
Wien2k getting started
 
PPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).pptPPT FINAL (1)-1 (1).ppt
PPT FINAL (1)-1 (1).ppt
 
Structure prediction
Structure predictionStructure prediction
Structure prediction
 

Mehr von NextMove Software

CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...NextMove Software
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...NextMove Software
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedNextMove Software
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESNextMove Software
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionNextMove Software
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...NextMove Software
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsNextMove Software
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...NextMove Software
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...NextMove Software
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKitNextMove Software
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...NextMove Software
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical RepresentationsNextMove Software
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsNextMove Software
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics DatabaseNextMove Software
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...NextMove Software
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesNextMove Software
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesNextMove Software
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...NextMove Software
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)NextMove Software
 

Mehr von NextMove Software (20)

DeepSMILES
DeepSMILESDeepSMILES
DeepSMILES
 
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...CINF 170: Regioselectivity: An application of expert systems and ontologies t...
CINF 170: Regioselectivity: An application of expert systems and ontologies t...
 
Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...Building a bridge between human-readable and machine-readable representations...
Building a bridge between human-readable and machine-readable representations...
 
CINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speedCINF 35: Structure searching for patent information: The need for speed
CINF 35: Structure searching for patent information: The need for speed
 
A de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILESA de facto standard or a free-for-all? A benchmark for reading SMILES
A de facto standard or a free-for-all? A benchmark for reading SMILES
 
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs RevolutionRecent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
Recent Advances in Chemical & Biological Search Systems: Evolution vs Revolution
 
Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...Can we agree on the structure represented by a SMILES string? A benchmark dat...
Can we agree on the structure represented by a SMILES string? A benchmark dat...
 
Comparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule ImplementationsComparing Cahn-Ingold-Prelog Rule Implementations
Comparing Cahn-Ingold-Prelog Rule Implementations
 
Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...Eugene Garfield: the father of chemical text mining and artificial intelligen...
Eugene Garfield: the father of chemical text mining and artificial intelligen...
 
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
Chemical similarity using multi-terabyte graph databases: 68 billion nodes an...
 
Recent improvements to the RDKit
Recent improvements to the RDKitRecent improvements to the RDKit
Recent improvements to the RDKit
 
Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...Pharmaceutical industry best practices in lessons learned: ELN implementation...
Pharmaceutical industry best practices in lessons learned: ELN implementation...
 
Digital Chemical Representations
Digital Chemical RepresentationsDigital Chemical Representations
Digital Chemical Representations
 
Challenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptionsChallenges and successes in machine interpretation of Markush descriptions
Challenges and successes in machine interpretation of Markush descriptions
 
PubChem as a Biologics Database
PubChem as a Biologics DatabasePubChem as a Biologics Database
PubChem as a Biologics Database
 
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
CINF 17: Comparing Cahn-Ingold-Prelog Rule Implementations: The need for an o...
 
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction DatabasesCINF 13: Pistachio - Search and Faceting of Large Reaction Databases
CINF 13: Pistachio - Search and Faceting of Large Reaction Databases
 
Building on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfilesBuilding on Sand: Standard InChIs on non-standard molfiles
Building on Sand: Standard InChIs on non-standard molfiles
 
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
Chemical Structure Representation of Inorganic Salts and Mixtures of Gases: A...
 
Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)Advanced grammars for state-of-the-art named entity recognition (NER)
Advanced grammars for state-of-the-art named entity recognition (NER)
 

Kürzlich hochgeladen

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Kürzlich hochgeladen (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 

RDKit Gems

  • 1. RDkit Gems Roger Sayle, PhD NextMove Software LTD cambridge, uk
  • 2. introduction • This presentation is intended as a short talktorial describing real world code examples, recent features and commonly used RDKit idioms. • Although many refect personal/NextMove Software experiences developing with the C++ APIs, most (though not all) observations apply equally to the Java and Python APIs.
  • 3. looping over molecule atoms • Poor for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); atomIt++) … • Better for (RWMol::AtomIterator atomIt=mol->beginAtoms(); atomIt != mol->endAtoms(); ++atomIt) … • The C++ post increment operator returns the original value prior to the increment, requiring a copy to made.
  • 4. looping over aTOM's bonds • Documented idiom: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OEDGE_ITER beg,end; boost::tie(beg,end) = molPtr->getAtomBonds(atomPtr); while(beg!=end){ const Bond * bond=(*molPtr)[*beg].get(); ... do something with the Bond ... ++beg; }
  • 5. looping over aTOM's bonds • Proposed replacement: ... molPtr is a const ROMol * ... ... atomPtr is a const Atom * ... ROMol::OBOND_ITER_PAIR bondIt; bondIt = molPtr->getAtomBonds(atomPtr); while(bondIt.first != bondIt.second){ const Bond * bond=(*molPtr)[*bondIt.first].get(); ... do something with the Bond ... ++bondIt; }
  • 6. Accepting string arguments • Good practice to accept both C++ constant literals and STL std::strings. • Poor void foo(const char *ptr) { foo(std::string(ptr)); } void foo(const std::string str) { … } • Better void foo(const std::string &str) { foo(str.c_str()); } void foo(const char *ptr) { … }
  • 7. isoelectronic valences • Supporting PF6 - in RDKit currently requires adding 7 to the neutral valences of P. Instead, P-1 should be isoelectronic with S. • Poor valence(atomicNo,charge) ~= valence(atomicNo,0)-charge • Better valence(atomicNo,charge) ~= valence(atomicNo-charge,0)
  • 8. representing reactions • Two possible representations – A new object containing vectors of reactant and product (and agent?) molecules. – Annotation of existing molecule with reaction roles annotated on each atom. • Both are representations are equivalent and one may be converted to the other and vice versa.
  • 9. common ancestry/base class • A significant advantage of avoiding new object types is the common requirement in cheminformatics of reading reactions, molecules, queries, peptides, transforms, and pharmacophores for the same file format. • OpenBabel and GGA's Indigo require you to know the contents of a file/record (mol vs. rxn) before reading it!?
  • 10. Proposed rdkit interface • Atom::setProp(“rxnRole”,(int)role) – RXN_ROLE_NONE 0 – RXN_ROLE_REACTANT 1 – RXN_ROLE_AGENT 2 – RXN_ROLE_PRODUCT 3 • Atom::setProp(“rxnGroup”,(int)group) • Atom::setProp(“molAtomMapNumber”,idx) • RWMol::setProp(“isRxn”,isrxn?1:0)
  • 11. and where this gets used... • Code/GraphMol/FileParser/MolFileWriter.cpp: • Function GetMolFileAtomLine contains • Currently unused local variables – rxnComponentType – rxnComponentNumber • This would allow support for CPSS-style reactions [i.e. reactions in SD files]
  • 12. inverted protein hierarchy • Optimize data structures for common ops. • Poor for (modelIt mi=mol->getModels(); mi; ++mi) for (chainIt ci=(*mi)->getChains(); ci; ++ci) for (resIt ri=(*ci)->getResidues(); ri; ++ri) for (atomIt ai=(*ri)->getAtoms(); ai; ++ai) … /* Do something! */ • Better for (atomIt ai=mol->getAtoms(); ai; ++ai) … /* Do something! */
  • 13. rdkit pdb file writer features • By default – Atoms are written as HETAM records – Bonds are written as CONECT records – Bond orders encoded by popular extension – All atoms are uniquely named – Molecule title written as COMPND record – Formal charges are preserved. – Default residue is “UNL” (not “LIG”).
  • 14. example pdb format output COMPND benzene HETATM 1 C1 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 2 C2 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 3 C3 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 4 C4 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 5 C5 UNL 1 0.000 0.000 0.000 1.00 0.00 C HETATM 6 C6 UNL 1 0.000 0.000 0.000 1.00 0.00 C CONECT 1 2 2 6 CONECT 2 3 CONECT 3 4 4 CONECT 4 5 CONECT 5 6 6 END
  • 15. unique atom names • Counts unique per element type • C1..C99,CA0..CA9,CB0..CZ9,CAA...CZZ • This allows for up to 1036 indices.
  • 16. beware openbabel's lig residue • PDB residue LIG is reserved for 3-pyridin- 4-yl-2,4-dihydro-indeno[1,2-c]pyrazole. • Use of residue UNL (for UNknown Ligand) is much preferred.
  • 17. PDB file readers • Not all connectivity is explicit • Two options for determining bonds – Template methods – Proximity methods • Two options for determining bond orders – Template methods – 3D geometry methods • See Sayle, “PDB Cruft to Content”, MUG 2001
  • 18. template-based bond orders • Need only specify the double bonds • Most (11/20) amino acids only need C=O. • Arg: C=O,CZ=NH2 • Asn/Asp: C=O,CG=OD1 • Gln/Glu: C=O,CD=OE1 • His: C=O,CG=CD2,ND1=CE1 • Phe/Tyr: C=O,CG=CD1,CD2=CE2,CE1=CZ • Trp: C=O,CG=CD1,CD2=CE2,CE3=CZ3,CZ2=CH2
  • 19. switching on short strings // These are macros to allow their use in C++ constants #define BCNAM(A,B,C) (((A)<<16) | ((B)<<8) | (C)) #define BCATM(A,B,C,D) (((A)<<24) | ((B)<<16) | ((C)<<8) | (D)) switch (rescode) { case BCNAM('A','L','A'): case BCNAM('C','Y','S'): case BCNAM('G','L','Y'): case BCNAM('I','L','E'): case BCNAM('L','E','U'): ... if (atm1==BCATM(' ','C',' ',' ') && atm2==BCATM(' ','O',' ',' ')) return true; break;
  • 20. proximity bonding • The ideal distance between two bonded atoms is the sum of their covalent radii. • The ideal distance between two non- bonded atoms is the sum of their van der Waal's radii. • Assume bonding between all atoms closer than the sum of Rcov plus a delta (0.45Å). • Impose minimum distance threshold (0.4Å)
  • 21. compare the square • A rate limiting step in many (poorly written) proximity calculations is the calculation of square roots. • Poor if(sqrt(dx*dx+dy*dy+dz*dz) < radius) ... • Better if(dx*dx + dy*dy + dz*dz < radius*radius) … c.f. http://gcc.gnu.org/ml/gcc-patches/2003-03/msg01683.html
  • 22. fail early, fail fast (like pharma) • Poor return dx*dx + dy*dy + dz*dz <= radius2 • Better dist2 = dx*dx if (dist > radius2) return false dist2 += dy*dy if (dist > radius2) return false dist2 += dz*dz return dist2 <= radius2
  • 23. Proximity algorithms • Brute Force [FreON] N2 • Z (Ordinate) Sort [OpenBabel] NlogN • Binary Space Partition [CDK] NlogN..N2 • Fixed Dimension Grid [RasMol] ~N • Fixed Spacing Grid [Coot] ~N • Grid Hashing [RDKit/OEChem] ~N
  • 35. future work/wish list • Atom reordering in RDKit. • C++ compressed (gzip) file support. • Non-forward PDB supplier. • In-place removeHs/addHs.