State and Future of the IUPAC InChI, NIH, 16-18 Aug 2017
Building on Sand: Standard InChIs on non-standard molfiles
John Mayfield, Roger Sayle
NextMove Software Ltd
MDL VALENCE (MDLBENCH1)
                    2012                             2017
Toolkit             Version   Accuracy  Precision    Version     Accuracy  Precision
CDK                 1.4.13    92.65%    95.11%       2.0         100.00%   100.00%
Open Babel          2.3.90    91.73%    93.34%       GitHub      100.00%   100.00%
MDL/BIOVIA Direct   8.0       90.30%    99.76%       2017        97.67%    97.73%
OEChem              1.9       97.20%    99.78%       20170613    97.20%    99.78%
ChemAxon            5.1       88.98%    92.99%       17.17       93.13%    97.33%
GGA/EPAM Indigo     1.1.4     70.80%    97.52%       1.3.0.r16   97.22%    97.22%
RDKit               2012.09   13.62%    22.74%       2017.03.03  67.30%    85.83%
Valence is defined either explicitly (safe) or implicitly as a default value
“The correct valence is specified by MDL/ISIS”
Roger Sayle, MDL Bench, Cheminformatics Toolkits: A Personal Perspective, RDKit UGM, Oct 2012
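The explicit/implicit distinction can be sketched as follows. This is a minimal illustration, not any toolkit's reader: it assumes the fixed-width V2000 atom line layout, where the valence field (vvv) occupies columns 49-51, a value of 0 means "no marking" (so the reader falls back to its own default), 1 to 14 is an explicit valence, and 15 means zero valence. The helper names `format_atom_line` and `read_valence` are hypothetical.

```python
def format_atom_line(symbol, valence_field=0, x=0.0, y=0.0, z=0.0):
    """Build a V2000 atom line with all optional fields zero except vvv."""
    return (
        f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}"   # coordinates, symbol
        f"{0:2d}{0:3d}{0:3d}{0:3d}{0:3d}"            # dd ccc sss hhh bbb
        f"{valence_field:3d}"                        # vvv
    )


def read_valence(atom_line):
    """Return None if the valence is implicit, else the explicit valence."""
    field = atom_line[48:51].strip()
    vvv = int(field) if field else 0
    if vvv == 0:
        return None      # implicit: each reader applies its own default
    if vvv == 15:
        return 0         # 15 encodes "zero valence"
    return vvv           # 1..14 are taken literally


if __name__ == "__main__":
    print(read_valence(format_atom_line("C")))       # implicit
    print(read_valence(format_atom_line("N", 3)))    # explicit
    print(read_valence(format_atom_line("Fe", 15)))  # explicit zero
```

Only the `None` case depends on the reader's default valence model, which is where the toolkits in the table can disagree.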
MDL VALENCE-MAGEDDON
BIOVIA 2017 changes the interpretation of MDL files
This changes the molecular formula (MF) of 213,097 records in PubChem Compound
MDL MASS DELTA (MDLBENCH2)
MDL files originally stored atomic mass delta
‣ InChI inherited this decision
‣ Resolved by M ISO in molfile
Toolkit         Version      B     Te     Sg
BIOVIA Direct   2017         11B   128Te  266Sg
CDK             2.0          11B   130Te  258Sg
ChemAxon        17.17        11B   130Te  0Sg
DataWarrior     4.6.0        11B   130Te  0Sg
InChI           1.0.5        11B   130Te  269Sg
Indigo          1.3.0b       11B   128Te  271Sg
OEChem          20170613     11B   130Te  263Sg
Open Babel      2.4.1        10B   127Te  271Sg
RDKit           2017.03.03   11B   130Te  271Sg
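Why a mass delta is ambiguous can be shown with a small sketch. A delta only becomes an isotope once a reference mass is chosen, and readers may choose differently. The two reference tables below are illustrative assumptions (the rounded standard atomic weight versus the most abundant isotope) and are not the tables used by any toolkit listed above. The delta values used in the benchmark are not given on the slide, so the inputs are hypothetical too. An absolute mass from an M ISO line removes the ambiguity.

```python
# Hypothetical reference tables; real toolkits may use other conventions.
ROUNDED_ATOMIC_WEIGHT = {"B": 11, "Te": 128}   # 10.81 -> 11, 127.60 -> 128
MOST_ABUNDANT_ISOTOPE = {"B": 11, "Te": 130}   # 11B, 130Te


def resolve_mass(symbol, delta, iso_mass, reference):
    """Return the isotope mass number, or None for natural abundance.

    delta     -- mass difference from the atom block (0 means unspecified)
    iso_mass  -- absolute mass from an M ISO line, or None if absent
    reference -- table mapping element symbol to a reference mass number
    """
    if iso_mass is not None:
        return iso_mass              # absolute value wins, no table needed
    if delta == 0:
        return None                  # no isotope specified
    return reference[symbol] + delta  # result depends on the table chosen


if __name__ == "__main__":
    for name, table in [("rounded weight", ROUNDED_ATOMIC_WEIGHT),
                        ("most abundant", MOST_ABUNDANT_ISOTOPE)]:
        print(name, resolve_mass("Te", 2, None, table))
```

With the same delta of +2 on tellurium, the two tables give different isotopes, while boron happens to agree. Elements with no standard atomic weight, such as Sg, have no obvious reference at all.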
stereo parity (MDLBENCH3)
                       0D           2D           3D
Parity value           0  1  2  3   0  1  2  3   0  1  2  3
ChemAxon 17.17         -  S  R  -   -  -  -  -   R  R  R  R
CDK 2.0                -  S  R  -   -  -  -  -   R  R  R  -
Open Babel 2.4.1       -  S  R  -   -  -  -  -   R  R  R  R
OEChem 20170613        -  S  R  -   -  S  R  -   R  R  R  R
InChI 1.0.5            -  -  -  -   -  -  -  -   R  R  R  R
RDKit 2017.03.03       -  -  -  -   -  -  -  -   -  -  -  -
BIOVIA Direct 2017     -  -  -  -   -  -  -  -   -  R  R  R
Indigo 1.3.0b          -  -  -  -   -  -  -  -   R  R  R  R
The table shows default behaviour, which can often be changed: Open Babel and CDK have options
to use the parity value for 2D input.
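One policy consistent with several rows of the table can be sketched as follows: trust the atom parity field only when the file carries no coordinates (0D), and otherwise perceive stereo from the coordinates. This is an illustrative assumption, not the documented behaviour of any toolkit listed. It assumes the V2000 layout, with the dimension code in columns 21-22 of the second header line and the parity field (sss) in columns 40-42 of the atom line. The function names are hypothetical, and treating a blank dimension code as 0D is also an assumption.

```python
PARITY_MEANING = {0: "not stereo", 1: "odd", 2: "even", 3: "either or unmarked"}


def dimension_code(header_line_2):
    """Return '2D' or '3D' from the header, or '0D' when neither is set."""
    code = header_line_2[20:22]
    return code if code in ("2D", "3D") else "0D"   # blank treated as 0D


def parity_field(atom_line):
    """Read the sss parity value (0..3) from a V2000 atom line."""
    field = atom_line[39:42].strip()
    return int(field) if field else 0


def stereo_source(dim, parity):
    """Decide where a reader following this policy takes stereo from."""
    if dim in ("2D", "3D"):
        return "coordinates"      # wedges (2D) or geometry (3D)
    if parity in (1, 2):
        return "parity"           # 0D input: parity is all there is
    return "none"                 # 0 or 3 gives no defined configuration


if __name__ == "__main__":
    header = "  " + "Example ".ljust(8) + "0816171200" + "2D"
    print(dimension_code(header), stereo_source(dimension_code(header), 1))
```

A reader that also honours parity for 2D input would return "parity" in the 2D branch whenever the value is 1 or 2.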
zero-order bonds
[Figure: the same Fe complex drawn five ways, labelled Plain, Coordination, Dashed, Charge Separated (N+, N+, Fe2-) and Omitted]
Bonding is required to describe configuration
Representation is part of the solution (and sometimes part of the problem); normalisation is still required
How can they be represented in a molfile?
ctab representation
…
CTfile Formats “Nov 2011 onwards”: V3000 only, many tools allow it in V2000
Alex Clark. Accurate Specification of Molecular Structures: The Case for Zero-Order Bonds and Explicit Hydrogen Counting. J. Chem. Inf. Model. 2011, 51, 3149–3157
(Syntax Extensions)
ctab representation
M STY 1 1 DAT
M SAL 1 2 12 29
M SDT 1 MRV_COORDINATE_BOND_TYPE
M SDD 1 0.0000 0.0000 DR ALL 0 0
M SED 1 31
ChemAxon specific information in MDL MOL files, http://docs.chemaxon.com
(Semantic Extensions)
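A reader that wants to recognise this annotation has to collect the data Sgroup lines and look at the field name. The sketch below parses the property lines shown above into a dictionary per Sgroup. It splits on whitespace for brevity, which is a simplification (real CTfile property lines are fixed-width, and field names may contain spaces), and the function name `parse_data_sgroups` is hypothetical. It makes no attempt to interpret what the data value means.

```python
def parse_data_sgroups(lines):
    """Collect type, atoms, field name and data for each Sgroup index."""
    sgroups = {}
    for line in lines:
        tok = line.split()
        if len(tok) < 3 or tok[0] != "M":
            continue
        tag = tok[1]
        if tag == "STY":
            # "M STY n idx type [idx type ...]"
            for i in range(3, len(tok) - 1, 2):
                sgroups.setdefault(int(tok[i]), {})["type"] = tok[i + 1]
        elif tag == "SAL":
            # "M SAL idx n atom atom ..."
            entry = sgroups.setdefault(int(tok[2]), {})
            entry.setdefault("atoms", []).extend(int(t) for t in tok[4:])
        elif tag == "SDT":
            # "M SDT idx fieldname ..."
            name = tok[3] if len(tok) > 3 else ""
            sgroups.setdefault(int(tok[2]), {})["field"] = name
        elif tag == "SED":
            # "M SED idx data"
            sgroups.setdefault(int(tok[2]), {})["data"] = " ".join(tok[3:])
    return sgroups


if __name__ == "__main__":
    example = [
        "M STY 1 1 DAT",
        "M SAL 1 2 12 29",
        "M SDT 1 MRV_COORDINATE_BOND_TYPE",
        "M SDD 1 0.0000 0.0000 DR ALL 0 0",
        "M SED 1 31",
    ]
    for idx, sg in parse_data_sgroups(example).items():
        if sg.get("field") == "MRV_COORDINATE_BOND_TYPE":
            print(idx, sg["atoms"], sg["data"])
```

A reader that ignores data Sgroups sees only the connection table, so it loses the annotation without any error.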
PubChem SD File Formatted Data V2.0.1
ftp://ftp.ncbi.nih.gov/pubchem/specifications
BondTypeID Meaning
---------- -----------------
5 Dative Bond
6 Complex Bond
7 Ionic Bond
255 Unspecified or Unknown Connectivity
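The four codes above can be captured in a small lookup so that downstream code can flag bonds that need special handling before normalisation. This is a sketch covering only the values listed on the slide. The enum and function names are hypothetical, and any other ID is simply reported as not listed here.

```python
from enum import IntEnum


class PubChemBondType(IntEnum):
    """Bond type IDs taken from the table above (others are not listed)."""
    DATIVE = 5
    COMPLEX = 6
    IONIC = 7
    UNKNOWN = 255


def describe_bond_type(bond_type_id):
    """Return the name of a listed bond type, or 'NOT_LISTED'."""
    try:
        return PubChemBondType(bond_type_id).name
    except ValueError:
        return "NOT_LISTED"


if __name__ == "__main__":
    for code in (5, 6, 7, 255, 1):
        print(code, describe_bond_type(code))
```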
summary
Systematic benchmarks highlight differences in interpretation
‣ Often simple to change, but can need agreement
‣ Chemistry is a moving target
The format has been extended in several different ways to handle zero-order bonds
‣ Can cause unexpected behaviour elsewhere
‣ Normalisation is still difficult
Acknowledgements
Noel O’Boyle and Shuzhe Wang
ENDS
sgroups
Annotation layer over part of a structure
Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
[Figure: example Sgroups with panels labelled Display Shortcut, Polymer, Mixture and Data; the values 25% and 75% appear in the drawing]
enhanced stereo 1
Enhanced stereo is for handling racemic mixtures and relative stereochemistry
[Figure: structures A, B, C, D and E; the stereo group labels &1 and &2 and the text "and enantiomer" appear in the drawings]
BIOVIA (NEMA-KEY)
  A,B,C,D  47CZTH5YZKMZ9K3MVCCVHSUF2378UH
  E        NULL
ChemAxon (CXSMILES)
  A,D      C[C@H](O)[C@@H](O)C=O |&1:1,3,r|
  B        C[C@@H](O)[C@H](O)C=O |&1:1,3,r|
  C        C[C@H](O)[C@@H](O)C=O |r|
  D        C[C@@H](O)[C@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
DataWarrior
  A,B,C,D  gNq`AjdmsURQAh@
  E        dgLF@@rnT|bTtARfcUSUQHPUDtZP@
enhanced stereo 1
Enhanced stereo is a shortcut for racemic mixtures and relative stereochemistry
[Figure: structures A, B, C, D and E; the labels n/a, &1 and &2 appear in the drawings]
BIOVIA (NEMA-KEY)
  A,B,D,E  NULL
ChemAxon (CXSMILES)
  A,D      C[C@H](O)[C@@H](O)C=O |&1:3,r|
  B        C[C@@H](O)[C@H](O)C=O |&1:1,r|
  D        C[C@@H](O)[C@@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
DataWarrior
  A,B,D    gNq`AjdmsURQA`@
  E        dgLF@@rnT|bTtARfcUSUQHPUDdZP@
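The CXSMILES rows above carry the enhanced stereo groups in the trailing `|...|` extension. The sketch below pulls out the `&n:` groups and notes whether a bare `r` token is present. It is a minimal illustration that handles only the tokens appearing on these slides, not a full CXSMILES parser. The function names are hypothetical, and the `r` token is only detected, not interpreted.

```python
import re

# "&<group>:<atom>,<atom>,..." with zero-based atom indices
_AND_GROUP = re.compile(r"&(\d+):(\d+(?:,\d+)*)")


def split_cxsmiles(text):
    """Split 'SMILES |ext|' into (smiles, ext); ext is '' when absent."""
    text = text.strip()
    if text.endswith("|") and " |" in text:
        smiles, ext = text[:-1].split(" |", 1)
        return smiles, ext
    return text, ""


def and_groups(ext):
    """Map each '&n' group number to its list of atom indices."""
    groups = {}
    for match in _AND_GROUP.finditer(ext):
        atoms = [int(i) for i in match.group(2).split(",")]
        groups.setdefault(int(match.group(1)), []).extend(atoms)
    return groups


def has_bare_r(ext):
    """True if a standalone 'r' token remains after removing the groups."""
    rest = _AND_GROUP.sub("", ext)
    return "r" in [tok for tok in rest.split(",") if tok]


if __name__ == "__main__":
    smiles, ext = split_cxsmiles("C[C@H](O)[C@@H](O)C=O |&1:1,3,r|")
    print(smiles, and_groups(ext), has_bare_r(ext))
```

Rows that share the same SMILES but differ only in the extension would be treated as identical by any tool that ignores the part after the first space.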

Building on Sand: Standard InChIs on non-standard molfiles

  • 1. Building on Sand: Standard InChIs on non-standard molfiles
       John Mayfield, Roger Sayle, NextMove Software Ltd
       State and Future of the IUPAC InChI, NIH, 16-18 Aug 2017
  • 2. MDL VALENCE (MDLBENCH1)
       Benchmark results in 2012 and in 2017:

       Toolkit            2012 version  Accuracy  Precision  2017 version  Accuracy  Precision
       -----------------  ------------  --------  ---------  ------------  --------  ---------
       CDK                1.4.13        92.65%    95.11%     2.0           100.00%   100.00%
       Open Babel         2.3.90        91.73%    93.34%     GitHub        100.00%   100.00%
       MDL/BIOVIA Direct  8.0           90.30%    99.76%     2017          97.67%    97.73%
       OEChem             1.9           97.20%    99.78%     20170613      97.20%    99.78%
       ChemAxon           5.1           88.98%    92.99%     17.17         93.13%    97.33%
       GGA/EPAM Indigo    1.1.4         70.80%    97.52%     1.3.0.r16     97.22%    97.22%
       RDKit              2012.09       13.62%    22.74%     2017.03.03    67.30%    85.83%

       Valence is defined either explicitly (safe) or implicitly as a default value.
       “The correct valence is specified by MDL/ISIS”
       Roger Sayle, MDL Bench, Cheminformatics Toolkits: A Personal Perspective, RDKit UGM, Oct 2012
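The explicit versus implicit distinction on this slide can be sketched in a few lines. This is a minimal illustration, not any toolkit's implementation: `DEFAULT_VALENCES` is a deliberately tiny, assumed table for neutral atoms, `implicit_hydrogens` is a hypothetical helper, and picking the smallest listed valence that fits is an assumption of the sketch. The one detail taken from the V2000 molfile layout is the atom-block valence field, where 0 means no marking (use the default), 1 to 14 give the valence directly, and 15 means zero valence.

```python
# Assumed, simplified default valences for neutral atoms (illustrative subset only).
DEFAULT_VALENCES = {"C": (4,), "N": (3, 5), "O": (2,), "S": (2, 4, 6)}


def implicit_hydrogens(symbol, bond_order_sum, valence_field=0):
    """Return the implicit hydrogen count for one atom.

    valence_field follows the V2000 atom block: 0 = no marking (implicit
    default), 1..14 = explicit valence, 15 = explicit zero valence.
    """
    if valence_field == 15:
        valence = 0                      # explicit: zero valence
    elif 1 <= valence_field <= 14:
        valence = valence_field          # explicit: stated in the file
    else:
        # Implicit: smallest default valence that accommodates the bonds.
        candidates = [v for v in DEFAULT_VALENCES.get(symbol, ()) if v >= bond_order_sum]
        if not candidates:
            return 0                     # unknown element or over-valent atom
        valence = candidates[0]
    return max(valence - bond_order_sum, 0)
```

With an explicit valence the result does not depend on the table at all, which is why the slide calls that route safe; the implicit route gives whatever the reader's own table says.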
  • 4. MDL MASS DELTA (MDLBENCH2)
       MDL files originally stored atomic mass delta
       ‣ InChI inherited this decision
       ‣ Resolved by M ISO in molfile

       Toolkit        Version     B    Te     Sg
       -------------  ----------  ---  -----  -----
       BIOVIA Direct  2017        11B  128Te  266Sg
       CDK            2.0         11B  130Te  258Sg
       ChemAxon       17.17       11B  130Te  0Sg
       DataWarrior    4.6.0       11B  130Te  0Sg
       InChI          1.0.5       11B  130Te  269Sg
       Indigo         1.3.0b      11B  128Te  271Sg
       OEChem         20170613    11B  130Te  263Sg
       Open Babel     2.4.1       10B  127Te  271Sg
       RDKit          2017.03.03  11B  130Te  271Sg
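One way to see why a stored mass delta is fragile: the absolute mass depends on which reference mass the reader adds the delta to. The sketch below assumes three candidate conventions (rounded atomic weight, truncated atomic weight, most abundant isotope); these are illustrative guesses that happen to reproduce the spread in the B and Te columns, not documented behaviour of any listed toolkit. The helper names are hypothetical. `parse_m_iso` reads an `M  ISO` property line, which stores absolute masses and so avoids the ambiguity; splitting on whitespace is a simplification of the fixed-width layout.

```python
import math

# Standard atomic weights and most abundant isotopes, for B and Te only.
ATOMIC_WEIGHT = {"B": 10.81, "Te": 127.60}
MOST_ABUNDANT = {"B": 11, "Te": 130}


def reference_mass(symbol, convention):
    """Reference mass under one of three assumed conventions."""
    if convention == "round":
        return round(ATOMIC_WEIGHT[symbol])
    if convention == "floor":
        return math.floor(ATOMIC_WEIGHT[symbol])
    if convention == "abundant":
        return MOST_ABUNDANT[symbol]
    raise ValueError(f"unknown convention: {convention}")


def absolute_mass(symbol, delta, convention):
    """Turn an atom-block mass delta into an absolute isotope mass."""
    return reference_mass(symbol, convention) + delta


def parse_m_iso(line):
    """Parse an 'M  ISO' line into {atom_index: absolute_mass}."""
    fields = line.split()
    if fields[:2] != ["M", "ISO"]:
        raise ValueError("not an M  ISO line")
    count = int(fields[2])
    values = [int(f) for f in fields[3:3 + 2 * count]]
    return dict(zip(values[0::2], values[1::2]))
```

For tellurium the three conventions give 128, 127 and 130, so the same delta yields three different isotopes, whereas `parse_m_iso("M  ISO  1   3  13")` always means mass 13 on atom 3.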
  • 5. stereo parity (MDLBENCH3)
       Configuration perceived for each parity value (0 to 3) with 0D, 2D and 3D input:

                                  0D          2D          3D
       Toolkit        Version     0  1  2  3  0  1  2  3  0  1  2  3
       -------------  ----------  ----------  ----------  ----------
       ChemAxon       17.17       -  S  R  -  -  -  -  -  R  R  R  R
       CDK            2.0         -  S  R  -  -  -  -  -  R  R  R  -
       Open Babel     2.4.1       -  S  R  -  -  -  -  -  R  R  R  R
       OEChem         20170613    -  S  R  -  -  S  R  -  R  R  R  R
       InChI          1.0.5       -  -  -  -  -  -  -  -  R  R  R  R
       RDKit          2017.03.03  -  -  -  -  -  -  -  -  -  -  -  -
       BIOVIA Direct  2017        -  -  -  -  -  -  -  -  -  R  R  R
       Indigo         1.3.0b      -  -  -  -  -  -  -  -  R  R  R  R

       The table shows default behaviour, which can often be tweaked; Open Babel and CDK have options to use the parity value for 2D input.
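For readers unfamiliar with where the parity value lives, here is a minimal sketch of reading it from a V2000 atom line, using the fixed-width columns of the atom block (coordinates in three 10-character fields, then symbol, mass difference, charge code, stereo parity, hydrogen count, stereo care and valence). The function names are hypothetical, the parser assumes a full-length line, and `dimensionality` is a simple coordinate heuristic assumed for this sketch rather than the rule any toolkit in the table applies.

```python
PARITY_MEANING = {0: "not stereo", 1: "odd", 2: "even", 3: "either or unmarked"}


def format_atom_line(x, y, z, symbol, mass_diff=0, charge_code=0, parity=0,
                     hcount=0, stereo_care=0, valence=0):
    """Build the leading fields of a V2000 atom line (for testing the parser)."""
    return (f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}{mass_diff:>2}"
            f"{charge_code:>3}{parity:>3}{hcount:>3}{stereo_care:>3}{valence:>3}")


def parse_atom_line(line):
    """Extract coordinates, symbol, mass difference, charge code, parity, valence."""
    return {
        "xyz": (float(line[0:10]), float(line[10:20]), float(line[20:30])),
        "symbol": line[31:34].strip(),
        "mass_diff": int(line[34:36]),
        "charge_code": int(line[36:39]),   # a code, not the formal charge itself
        "parity": int(line[39:42]),
        "valence": int(line[48:51]),
    }


def dimensionality(atoms):
    """Heuristic: all-zero coordinates -> 0D, any non-zero z -> 3D, else 2D."""
    coords = [a["xyz"] for a in atoms]
    if all(c == (0.0, 0.0, 0.0) for c in coords):
        return "0D"
    if any(c[2] != 0.0 for c in coords):
        return "3D"
    return "2D"
```

A reader that follows the first rows of the table would consult `parity` only when `dimensionality` returns "0D" and derive the configuration from coordinates otherwise.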
  • 6. zero-order bonds
       [Figure: the same iron complex drawn with five bonding representations: Plain, Coordination, Dashed, Charge Separated, Omitted]
       Bonding required to describe configuration
       Representation part of the solution (and sometimes part of the problem), normalisation still required
       How can they be represented in a molfile?
  • 8. ctab representation
       ChemAxon (Sgroup data lines in the molfile):
         M STY 1 1 DAT
         M SAL 1 2 12 29
         M SDT 1 MRV_COORDINATE_BOND_TYPE
         M SDD 1 0.0000 0.0000 DR ALL 0 0
         M SED 1 31
       Source: ChemAxon specific information in MDL MOL files, http://docs.chemaxon.com (Semantic Extensions)

       PubChem (bond type IDs):
         BondTypeID  Meaning
         ----------  -----------------
         5           Dative Bond
         6           Complex Bond
         7           Ionic Bond
         255         Unspecified or Unknown Connectivity
       Source: PubChem SD File Formatted Data V2.0.1, ftp://ftp.ncbi.nih.gov/pubchem/specifications
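As a rough sketch of how a reader might consume the PubChem bond type IDs listed on this slide, the snippet below maps them to their meanings. Only the four IDs shown on the slide are included. The three-column record layout (`atom1 atom2 BondTypeID`, whitespace separated) and the function name are assumptions made for illustration; check the PubChem specification for the actual field layout.

```python
# Bond type IDs as listed on the slide (PubChem SD File Formatted Data V2.0.1).
PUBCHEM_BOND_TYPES = {
    5: "Dative Bond",
    6: "Complex Bond",
    7: "Ionic Bond",
    255: "Unspecified or Unknown Connectivity",
}


def parse_nonstandard_bonds(text):
    """Parse assumed 'atom1 atom2 BondTypeID' records into labelled bonds."""
    bonds = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue                      # skip blank or malformed lines
        begin, end, type_id = (int(p) for p in parts)
        meaning = PUBCHEM_BOND_TYPES.get(type_id, f"other ({type_id})")
        bonds.append((begin, end, meaning))
    return bonds
```

Keeping unknown IDs as `other (n)` rather than raising makes it visible when a file uses a bond type the reader has not been taught.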
  • 9. summary
       Systematic benchmarks highlight differences in interpretations
       ‣ Often simple to change, but can need agreement
       ‣ Chemistry is a moving target
       The format has already been enhanced in several different ways to handle zero-order bonds
       ‣ Can cause unexpected behaviour elsewhere
       ‣ Normalisation still difficult
       Acknowledgements: Noel O’Boyle and Shuzhe Wang
  • 10. ENDS
  • 11. sgroups
       Annotation layer over part of a structure
       [Figure: Sgroup examples; panel labels: Display Shortcut, Polymer, Mixture, Data; values shown: 25%, 75%]
       References:
       ‣ Gushurst et al. The substance module: the representation, storage, and searching of complex structures. J. Chem. Inf. Comput. Sci. (1991)
       ‣ Blanke G. Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases. Online - StructurePendium Technologies GmbH
  • 12. enhanced stereo 1
       Enhanced stereo is for handling racemic mixtures and relative stereochemistry
       [Figure: five structures labelled A to E, annotated with stereo group labels (&1, &2) and "and enantiomer"]

       BIOVIA (NEMA-KEY)
         A,B,C,D  47CZTH5YZKMZ9K3MVCCVHSUF2378UH
         E        NULL
       ChemAxon (CXSMILES)
         A,D      C[C@H](O)[C@@H](O)C=O |&1:1,3,r|
         B        C[C@@H](O)[C@H](O)C=O |&1:1,3,r|
         C        C[C@H](O)[C@@H](O)C=O |r|
         D        C[C@@H](O)[C@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
       DataWarrior
         A,B,C,D  gNq`AjdmsURQAh@
         E        dgLF@@rnT|bTtARfcUSUQHPUDtZP@
  • 13. enhanced stereo 1
       Enhanced stereo is a shortcut for racemic mixtures and relative stereochemistry
       [Figure: five structures labelled A to E, annotated with stereo group labels (&1, &2); one is marked n/a]

       BIOVIA (NEMA-KEY)
         A,B,D,E  NULL
       ChemAxon (CXSMILES)
         A,D      C[C@H](O)[C@@H](O)C=O |&1:3,r|
         B        C[C@@H](O)[C@H](O)C=O |&1:1,r|
         D        C[C@@H](O)[C@@H](O)C=O.C[C@H](O)[C@@H](O)C=O |…|
       DataWarrior
         A,B,D    gNq`AjdmsURQA`@
         E        dgLF@@rnT|bTtARfcUSUQHPUDdZP@
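The CXSMILES strings on the enhanced stereo slides carry the stereo groups in the trailing `|...|` block. The sketch below splits such a string into its SMILES part, its AND/OR stereo groups, and any remaining flags such as `r`. It is a hypothetical helper covering only the simple comma-separated features seen on these slides, not a full CXSMILES parser; atom indices in the extension are zero-based.

```python
import re

# A stereo group starts as '&<n>:<atom>' (AND) or 'o<n>:<atom>' (OR).
GROUP_START = re.compile(r"^([&o])(\d+):(\d+)$")


def split_cxsmiles(cxsmiles):
    """Return (smiles, {group_label: [atom indices]}, [other flags])."""
    smiles, _, rest = cxsmiles.partition(" ")
    ext = rest.strip()
    ext = ext[1:-1] if len(ext) >= 2 and ext[0] == "|" and ext[-1] == "|" else ""

    groups, flags, current = {}, [], None
    for token in ext.split(","):
        token = token.strip()
        if not token:
            continue
        match = GROUP_START.match(token)
        if match:
            # New group: remember its label and record the first atom index.
            current = match.group(1) + match.group(2)
            groups.setdefault(current, []).append(int(match.group(3)))
        elif token.isdigit() and current is not None:
            groups[current].append(int(token))   # further atoms of the same group
        else:
            current = None
            flags.append(token)                  # e.g. 'r'
    return smiles, groups, flags
```

Applied to the slide's `C[C@H](O)[C@@H](O)C=O |&1:1,3,r|`, this yields group `&1` over atoms 1 and 3 plus the flag `r`, while the `|r|` variant yields no groups at all, which makes the difference between the encodings easy to compare programmatically.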