Canonicalized systematic nomenclature in chemoinformatics
And some new canonicalization tools from OpenEye
Jeremy J. Yang
Introduction

Morgan demo and study

Canonicalization in chemoinformatics facilitates rigorous, unambiguous
expression and handling of chemical data and knowledge. However, just as
chemistry encompasses multiple levels of abstraction and modelling, no
single canonicalization method is sufficient to solve all problems. This study
reviews some existing canonicalization methodology and describes new
methods implemented by the chemoinformatics library OEChem and other
OpenEye tools.

New: canonicalizing molfiles
Fig 1: Morgan demo. Extended connectivity values and atom orders. Uses
OEChem and Ogham. NCI Diversity set processed with no errors.

Definition of canonicalization
A canonicalization algorithm must determine a single representation among
many possible representations for an individual in its domain.

Benefits of canonicalization
•  testing equality of molecules
•  database search speed
•  rigorous informatics and thinking
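
The definition and the equality-testing benefit can be illustrated with a brute-force sketch: try every atom ordering and keep the smallest encoding. The function name and the (labels, bonds) tuple encoding are assumptions made for this illustration only, not OEChem code, and the N! cost is exactly what practical algorithms avoid.

```python
from itertools import permutations

def canonical_form(labels, bonds):
    """Return the smallest (labels, bonds) encoding over all atom orderings."""
    best = None
    for perm in permutations(range(len(labels))):   # perm[new_index] = old_index
        new_index = {old: new for new, old in enumerate(perm)}
        relabeled = tuple(labels[old] for old in perm)
        rebonded = tuple(sorted(
            tuple(sorted((new_index[a], new_index[b]))) for a, b in bonds))
        candidate = (relabeled, rebonded)
        if best is None or candidate < best:
            best = candidate
    return best

# Ethanol entered with two different atom numberings (heavy atoms only).
ethanol_a = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = (["O", "C", "C"], [(0, 1), (1, 2)])
# Dimethyl ether: same atoms, different connectivity.
ether = (["C", "O", "C"], [(0, 1), (1, 2)])

same = canonical_form(*ethanol_a) == canonical_form(*ethanol_b)
different = canonical_form(*ethanol_a) != canonical_form(*ether)
print(same, different)  # True True
```

Once every input is reduced to its canonical form, molecule equality becomes plain equality of the encodings.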

N! (graph isomorphism is hard) – Morgan to the rescue

The Morgan algorithm1 is the basis of most chemical canonicalization work
since, and deserves careful study. In 1965 Harry L. Morgan published the
algorithm already implemented at CAS for its compound registry system.
This work, based on generic graph theory, comprises a theoretical solution
to the problem of molecular canonicalization, and material validation of its
efficacy.
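
A minimal sketch of the extended connectivity stage of a Morgan-style algorithm follows. It is a simplification: it omits the atom numbering and tie-breaking traversal that follow this stage, and the adjacency-dict input and function name are assumptions for illustration. Each atom starts with its degree, values are repeatedly replaced by the sum over neighbors, and iteration stops when the number of distinct values no longer increases.

```python
def morgan_extended_connectivity(neighbors):
    """Iterate extended connectivity (EC) values until classes stop increasing."""
    ec = {atom: len(nbrs) for atom, nbrs in neighbors.items()}  # start: degree
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {atom: sum(ec[nbr] for nbr in nbrs)
                  for atom, nbrs in neighbors.items()}
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:   # no further discrimination: keep old values
            return ec
        ec, n_classes = new_ec, new_classes

# n-Pentane carbon skeleton: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ec = morgan_extended_connectivity(pentane)
print(ec)  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
```

Symmetry-equivalent atoms (0 and 4, 1 and 3) end with equal values, which is why a further tie-breaking step is needed before a unique atom order exists.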

More Morgan, and more
The Morgan algorithm was a huge step forward, but the basic algorithm has
some shortcomings, in performance and comprehensiveness, which have
been corrected by subsequent investigators. The resulting methods have
been implemented and widely used in large scale database systems. Some
key contributions:
•  Morgan, 1965 → note to Harry: “You da man!” → CAS
•  Wipke & Dyott, 1974 → stereo-enhanced Morgan → MDL
•  Jochum & Gasteiger, 1977 → Morgan refinement → CACTVS
•  Shelley & Munk, 1977 → Morgan refinement
•  Weininger, 1988 → CANSMI canonical line notation → Daylight
•  Bradshaw, 1998 → parent compounds → GSK, Daylight
•  Delany & Sayle, 1999 → tautomers → OpenEye
•  INChI, 2004 → global canonical line notation

This study: canonical molecular descriptions, not descriptors
The study of graph theory and canonicalization applied to chemistry is
extensive and diverse. Canonical descriptors which do not fully represent
the model can be of great utility in statistical analyses but are not the focus
of this nomenclature study.

Canonicalizing a connection table is not new and was discussed by Morgan1
and others. But generating canonical forms of current standard formats is not
widely done, for historical and practical reasons, despite the available
benefits. This is increasingly true now that longer strings are more easily
handled by existing computers. However, the advantages of more terse
canonical line notations remain.

OEChem provides sufficient control to accomplish this task. Proposed
algorithm:

•  Remove non-structural data
•  Suppress hydrogens
•  Canonical atom order
•  Canonical bond order
•  Canonical Kekule bonding based on (selected) aromaticity model

The OpenEye chemoinformatics toolkit OEChem12 employs an optimized
Morgan-like canonical algorithm to generate canonical SMILES. In addition,
the API provides a rich set of tools which can facilitate generation of
canonical representations of many types, for many chemical and
informational models, and for many standard file formats.

•  OEChem::OECanonicalOrderAtoms()
•  OEChem::OECanonicalOrderBonds()
•  OEChem aromaticity models: OE, Daylight, Tripos, MDL, MMFF
•  OEChem: many file formats and flavors, low-level writers
•  QuacPac13: tautomers application and toolkit
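
The proposed molfile steps can be sketched on a toy connection table without OEChem. This is an illustration under stated assumptions: the record layout (a dict with "name", "atoms", "bonds"), the function names and the text output are hypothetical, the atom ordering is brute force rather than a Morgan-like method, and the Kekule step is omitted. With OEChem, the ordering would be delegated to the toolkit functions listed above.

```python
from itertools import permutations

def canonicalize_record(record):
    """Apply the listed steps (except Kekule bonding) to a toy record."""
    atoms, bonds = record["atoms"], record["bonds"]  # name etc. is dropped
    # Suppress hydrogens: keep heavy atoms and store an H count on each.
    heavy = [i for i, el in enumerate(atoms) if el != "H"]
    hcount = {i: 0 for i in heavy}
    heavy_bonds = []
    for a, b, order in bonds:
        if atoms[a] == "H" and atoms[b] != "H":
            hcount[b] += 1
        elif atoms[b] == "H" and atoms[a] != "H":
            hcount[a] += 1
        elif atoms[a] != "H":
            heavy_bonds.append((a, b, order))
    # Canonical atom order: brute force over orderings (small inputs only).
    best = None
    for perm in permutations(heavy):
        new = {old: k for k, old in enumerate(perm)}
        atom_block = tuple((atoms[old], hcount[old]) for old in perm)
        # Canonical bond order: sort endpoints within a bond, then the bonds.
        bond_block = tuple(sorted(
            (min(new[a], new[b]), max(new[a], new[b]), order)
            for a, b, order in heavy_bonds))
        if best is None or (atom_block, bond_block) < best:
            best = (atom_block, bond_block)
    return best

def to_text(canon):
    """Write counts, atom lines and bond lines (1-based) as one string."""
    atom_block, bond_block = canon
    lines = [f"{len(atom_block)} {len(bond_block)}"]
    lines += [f"{el} {h}" for el, h in atom_block]
    lines += [f"{a + 1} {b + 1} {order}" for a, b, order in bond_block]
    return "\n".join(lines)

# Methanol from two hypothetical sources: different names and atom orders.
rec_a = {"name": "methanol", "atoms": ["C", "O", "H", "H", "H", "H"],
         "bonds": [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (1, 5, 1)]}
rec_b = {"name": "MeOH", "atoms": ["H", "O", "C", "H", "H", "H"],
         "bonds": [(0, 1, 1), (1, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1)]}

text_a = to_text(canonicalize_record(rec_a))
text_b = to_text(canonicalize_record(rec_b))
print(text_a == text_b)
```

Because both records reduce to the same string, the canonical files can be compared byte for byte.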

Fig 2: Morgan slow due to symmetry.

RESULTS: Using the test program canmol.py, the NCI Diversity set (1990
structures) was converted to canonical SDF files exactly equal to the SDF
files converted via SMILES (demo.eyesopen.com/cgi-bin/canmol). The same was
done with MOL2 format. This test validates the ability of OEChem to
canonicalize molfiles as strings.

Fig 3: Morgan fails

Aha! -- Chemo-taxonomy is a “stranded hierarchy”
•  subatomic → atoms → molecules
•  normal weight atoms → isotopes
•  Kekule molecule model → aromatic molecule models
•  non-stereo molecule → stereoisomers
•  single molecule → combinatorial libraries
•  single molecule → queries
•  small molecule → macromolecule + cofactors + ligands
•  single molecule → Markush structures
•  single molecule → tautomer set
•  single molecule → pKa states
•  single molecule → reactions
•  2D → 3D
There is a hierarchical relationship among some of these expansions, while
others are independent. For example, a combinatorial library may involve
stereoisomeric or non-stereo individuals. For every combination of
molecular representations, canonicalization could be advantageous for the
reasons described. Hence the task of canonicalization is a multi-faceted
one.

Dealing with reality: practical problems
1.  Existing formats (may often be):
•  ambiguous – poorly defined spec or poor compliance
•  un-rigorous – both syntax and semantics are important
•  non-comprehensive – only organic, covalent, size limits
2.  Stereoisomer canonicalization remains difficult
•  "relative stereo-centers"
3.  Differing valence assumptions and conventions
•  implicit-valence and Hcount formats prone to mishandling
4.  Information content and model differences in existing formats
•  cannot robustly convert if info must be inferred (e.g. bonds)
5.  Disagreement over correct chemistry
•  e.g., valences, aromaticity
6.  Local versus global canonicalization
•  Benefits of canonicalization are available locally or globally, but
global canonicalization requires cooperation.
•  Locality definition (time, place, software versions)

OpenEye canonicalization tools

New: canonical tautomers
Tautomers have the same formula (structural isomers), but may differ in
proton and electron location, and formal bond order. Special cases:
keto/enol, zwitterion, ring-chain. In the Delany/Sayle algorithm8,13,
hydrogen donors and acceptors are perceived, along with the number of free
hydrogens. Donor and acceptor atoms are ordered canonically. At this stage
all tautomerically equivalent inputs are represented identically. Hydrogen
locations are then exhaustively enumerated. A simple ruleset for enumeration
order can designate the first to be the canonical tautomer. Through
additional rules, the likelihood can be increased that the canonical tautomer
is a low-energy form. Applications: registration (exact search), substructure
searching, property prediction, similarity/clustering, protein-ligand analysis.
Failure to perceive tautomerism leads to different results for different
valence models which really represent the same chemical entity.
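
The enumerate-then-pick-first idea can be sketched as a toy. This is not the Delany/Sayle implementation: the site labels, function names and the preference rule are hypothetical, sorted site names stand in for a canonical atom order, each site holds at most one mobile hydrogen, and the bond-order reassignment and valence checks a real tool needs are omitted.

```python
from itertools import combinations

def tautomer_key(sites, h_on):
    """Tautomer-independent key: ordered sites plus the mobile-hydrogen count."""
    return (tuple(sorted(sites)), len(h_on))

def enumerate_tautomers(sites, n_h):
    """All placements of n_h mobile hydrogens, at most one per site."""
    return [frozenset(c) for c in combinations(sorted(sites), n_h)]

def canonical_tautomer(sites, h_on, preference=None):
    """Pick the first placement in enumeration order, or by a preference rule."""
    ordered_sites, n_h = tautomer_key(sites, h_on)
    candidates = enumerate_tautomers(ordered_sites, n_h)
    if preference is not None:
        # sorted() is stable, so ties keep the enumeration order.
        candidates = sorted(candidates, key=preference)
    return candidates[0]

def prefer_h_on_oxygen(placement):
    """Hypothetical rule: fewer hydrogens on non-oxygen sites ranks first."""
    return sum(1 for site in placement if not site.startswith("O"))

# 2-Hydroxypyridine / 2-pyridone: two sites, one mobile hydrogen.
sites = ["N1", "O7"]
from_lactim = canonical_tautomer(sites, {"O7"})   # input drawn with H on O
from_lactam = canonical_tautomer(sites, {"N1"})   # input drawn with H on N
with_rule = canonical_tautomer(sites, {"N1"}, preference=prefer_h_on_oxygen)
print(from_lactim == from_lactam, sorted(with_rule))
```

Both drawings share the same key, so they yield the same canonical placement, and the optional rule changes which placement is chosen without breaking that property.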

Fig 4: example: tautomers listed separately in ACD98. The latter is the OE
canonical form.
Results: The Maybridge 2003 database was analyzed by the OE program
tautomers13. Of 71367 molecules, 97 have tautomers (47 pairs and one
triplet). Note that additionally, 2381 were found to be non-unique molecules.
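
As a consistency check on the counts above, molecules can be grouped by a canonical key and the group sizes tallied. The helper name and the toy keys are assumptions for illustration. Only the figures 47, 1 and 97 come from the results.

```python
from collections import Counter

def tally_groups(keys):
    """Map group size -> number of groups, for groups with more than one member."""
    group_sizes = Counter(keys).values()
    return Counter(size for size in group_sizes if size > 1)

# Toy canonical keys: one pair ("a") and one triplet ("c").
toy = dict(tally_groups(["a", "a", "b", "c", "c", "c"]))

# Group sizes reported above: 47 pairs and one triplet.
reported = {2: 47, 3: 1}
molecules_with_tautomers = sum(size * count for size, count in reported.items())
print(toy, molecules_with_tautomers)  # {2: 1, 3: 1} 97
```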

Conclusion
Rigorous and effective chemoinformatics systems require concepts
and methods for canonicalization at multiple levels of chemical
abstraction and organization. The current state of the art presents
many theoretical and practical challenges. OpenEye tools can help.

References
1.  Morgan, H. L., "Generation of a unique machine description for chemical structures - A
technique developed at Chemical Abstracts Service", J. Chem. Doc. 1965, 5, 107.
2.  Wipke, W. T.; Dyott, T. M., "Stereochemically unique naming algorithm", J. Am.
Chem. Soc. 1974, 96(15), 4834-4842.
3.  Jochum, C.; Gasteiger, J., "Canonical Numbering and Constitutional Symmetry",
J. Chem. Inf. Comput. Sci. 1977, 17(2), 113-117.
4.  Shelley, C. A.; Munk, M. E., "Computer Perception of Topological Symmetry",
J. Chem. Inf. Comput. Sci. 1977, 17(2), 110-113.
5.  Shelley, C. A.; Munk, M. E., "An Approach to the Assignment of Canonical Connection
Tables and Topological Symmetry Perception", J. Chem. Inf. Comput. Sci. 1979, 19(4),
247-250.
6.  Weininger, D.; Weininger, A.; Weininger, J. L., "SMILES 2: Algorithm for
Generation of Unique SMILES Notation", J. Chem. Inf. Comput. Sci. 1989, 29(2),
97-101.
7.  Bradshaw, J., "A beginner's guide to responsible parenting or knowing your roots",
www.daylight.com/meetings/emug98/Bradshaw/, EuroMUG '98, Cambridge, UK, Oct
1998.
8.  Delany, J.; Sayle, R., "Canonicalization and Enumeration of Tautomers",
www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm, EuroMUG '99,
Cambridge, UK, Oct 1999.
9.  Sayle, R.; Skillman, G., "Hooked on Protonics",
www.eyesopen.com/about/events/presentations/acs02/sld001.htm, 224th ACS
National Meeting, Boston, Aug 2002.
10.  Bradshaw, J., "Introduction to Chemical Info Systems",
www.daylight.com/meetings/emug02/Bradshaw/Training/, EuroMUG '02, Cambridge, UK,
24-26 Sep 2002.
11.  "That INChI Feeling", www.reactivereports.com/40/40_3.html, Reactive Reports,
issue 40, Sep 2004.
12.  OEChem, OpenEye Scientific Software, 2002.
13.  QuacPac, OpenEye Scientific Software, 2004.

Fig 5: tautomer triplet from Maybridge 2003

New: canonical pKa states
The canonicalization of alternative pKa states is accomplished for many classes
of molecules by the OpenEye program pkatyper13. This problem resembles
tautomer canonicalization in many respects, and is an area of active research
at OpenEye.

3600 Cerrillos Road
Suite 1107
Santa Fe, New Mexico 87507

505.473.7385
info@eyesopen.com
www.eyesopen.com

Promiscuous patterns and perils in PubChem and the MLSCN
 

Canonicalized systematic nomenclature in cheminformatics

Canonicalized systematic nomenclature in chemoinformatics
And some new canonicalization tools from OpenEye
Jeremy J. Yang

Introduction
Canonicalization in chemoinformatics facilitates rigorous, unambiguous expression and handling of chemical data and knowledge. However, just as chemistry encompasses multiple levels of abstraction and modelling, no single canonicalization method is sufficient to solve all problems. This study reviews some existing canonicalization methodology and describes new methods implemented by the chemoinformatics library OEChem and other OpenEye tools.

New: canonicalizing molfiles

Morgan demo and study
Fig 1: Morgan demo. Extended connectivity values and atom orders. Uses OEChem and Ogham. NCI Diversity set processed with no errors.

Definition of canonicalization
A canonicalization algorithm must determine a single representation among many possible representations for an individual in its domain.

Benefits of canonicalization
•  testing equality of molecules
•  database search speed
•  rigorous informatics and thinking

N! (graph isomorphism is hard) – Morgan to the rescue
The Morgan algorithm [1] is the basis of most chemical canonicalization work since, and deserves careful study. In 1965 Harry L. Morgan published the algorithm already implemented at CAS for its compound registry system. This work, based on generic graph theory, comprises a theoretical solution to the problem of molecular canonicalization, and material validation of its efficacy.

More Morgan, and more
The Morgan algorithm was a huge step forward, but the basic algorithm has some shortcomings, in performance and comprehensiveness, which have been corrected by subsequent investigators. The resulting methods have been implemented and widely used in large scale database systems.
Some key contributions:
•  Morgan, 1965 → note to Harry: “You da man!” → CAS
•  Wipke & Dyott, 1974 → stereo-enhanced Morgan → MDL
•  Jochum & Gasteiger, 1977 → Morgan refinement → CACTVS
•  Shelley & Munk, 1977 → Morgan refinement
•  Weininger, 1988 → CANSMI canonical line notation → Daylight
•  Bradshaw, 1998 → parent compounds → GSK, Daylight
•  Delany & Sayle, 1999 → tautomers → OpenEye
•  INChI, 2004 → global canonical line notation

This study: canonical molecular descriptions, not descriptors
The study of graph theory and canonicalization applied to chemistry is extensive and diverse. Canonical descriptors which do not fully represent the model can be of great utility in statistical analyses but are not the focus of this nomenclature study.

Canonicalizing a connection table is not new and was discussed by Morgan [1] and others. But generating canonical forms of current standard formats is not widely done, for historical and practical reasons, despite the available benefits. This is increasingly true now that longer strings are more easily handled by existing computers. OEChem provides sufficient control to accomplish this task.

Proposed algorithm:
The OpenEye chemoinformatics toolkit OEChem [12] employs an optimized Morgan-like canonical algorithm to generate canonical SMILES. In addition, the API provides a rich set of tools which can facilitate generation of canonical representations of many types, for many chemical and informational models, and for many standard file formats.
•  Remove non-structural data
•  Suppress hydrogens
•  Canonical atom order
•  Canonical bond order
•  Canonical Kekule bonding based on (selected) aromaticity model
•  OEChem::OECanonicalOrderAtoms()
•  OEChem::OECanonicalOrderBonds()
•  OEChem aromaticity models: OE, Daylight, Tripos, MDL, MMFF
•  OEChem: many file formats and flavors, low-level writers
•  QuacPac [13]: tautomers application and toolkit

However, the advantages of more terse canonical line notations remain.
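
The extended-connectivity idea behind Morgan-style atom ordering can be illustrated with a small sketch in plain Python. This is not OEChem code and not the canmol.py test program: the function name `extended_connectivity` and the adjacency-dictionary molecule representation are assumptions made only for illustration. Starting from the heavy-atom degree, each atom's value is repeatedly replaced by the sum of its neighbours' values, and the process stops once another round no longer increases the number of distinct values.

```python
def extended_connectivity(adjacency):
    """Return Morgan-style extended connectivity (EC) values per atom.

    adjacency: dict mapping atom index -> list of neighbour indices,
    describing a hydrogen-suppressed, undirected molecular graph.
    (Hypothetical representation chosen for this sketch.)
    """
    # Round 0: plain connectivity, i.e. the heavy-atom degree.
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        # Each new value is the sum of the neighbours' current values.
        new_ec = {atom: sum(ec[n] for n in nbrs)
                  for atom, nbrs in adjacency.items()}
        new_classes = len(set(new_ec.values()))
        # Stop when a further round separates no additional atoms;
        # keep the values from the last round that did.
        if new_classes <= n_classes:
            return ec
        ec, n_classes = new_ec, new_classes


# 2-methylpentane, hydrogens suppressed: C0-C1(-C5)-C2-C3-C4
mol = {0: [1], 1: [0, 2, 5], 2: [1, 3], 3: [2, 4], 4: [3], 5: [1]}
ec_values = extended_connectivity(mol)
# ec_values == {0: 3, 1: 4, 2: 5, 3: 3, 4: 2, 5: 3}
```

The values depend only on the graph, not on the input numbering, which is what makes them usable as a starting point for a canonical order. Atoms that still share a value (here atoms 0, 3 and 5) have to be separated by further rules such as atom type and bond order before a unique numbering results; this sketch stops short of that step.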
Fig 2: Morgan slow due to symmetry.

RESULTS: Using test program canmol.py, 1990 NCI Diversity set converted to canonical SDF files, exactly equal to SDF files converted via SMILES (demo.eyesopen.com/cgi-bin/canmol). Also done with MOL2 format. This test validates the ability of OEChem to canonicalize molfiles as strings.

Fig 3: Morgan fails

Aha! -- Chemo-taxonomy is a “stranded hierarchy”
•  subatomic → atoms → molecules
•  normal weight atoms → isotopes
•  Kekule molecule model → aromatic molecule models
•  non-stereo molecule → stereoisomers
•  single molecule → combinatorial libraries
•  single molecule → queries
•  small molecule → macromolecule + cofactors + ligands
•  single molecule → Markush structures
•  single molecule → tautomer set
•  single molecule → pKa states
•  single molecule → reactions
•  2D → 3D

There is a hierarchical relationship among some of these expansions, while others are independent. For example, a combinatorial library may involve stereoisomeric or non-stereo individuals. For every combination of molecular representations, canonicalization could be advantageous for the reasons described. Hence the task of canonicalization is a multi-faceted one.

Dealing with reality: practical problems
1.  Existing formats (may often be):
•  ambiguous – poorly defined spec or poor compliance
•  un-rigorous – both syntax and semantics are important
•  non-comprehensive – only organic, covalent, size limits
2.  Stereoisomer canonicalization remains difficult
•  "relative stereo-centers"
3.  Differing valence assumptions and conventions
•  implicit-valence and Hcount formats prone to mishandling
4.  Information content and model differences in existing formats
•  cannot robustly convert if info must be inferred (e.g. bonds)
5.  Disagreement over correct chemistry
•  e.g., valences, aromaticity
6.  Local versus global canonicalization
•  Benefits of canonicalization are available locally or globally. Global canonicalization requires cooperation.
•  Locality definition (time, place, software versions)

OpenEye canonicalization tools

New: canonical tautomers
Tautomers have the same formula (structural isomers), but may differ in proton and electron location, and formal bond order. Special cases: keto/enol, zwitterion, ring-chain. In the Delany/Sayle algorithm [8,13], hydrogen donors and acceptors are perceived, along with the number of free hydrogens. Donor and acceptor atoms are ordered canonically. At this stage all tautomerically equivalent inputs are represented identically. Hydrogen locations are exhaustively enumerated. A simple ruleset for enumeration order can designate the first to be the canonical tautomer. Through additional rules, the likelihood can be increased that the canonical tautomer is a low-energy form.

Applications: registration (exact search), substructure searching, property prediction, similarity/clustering, protein-ligand analysis. Failure to perceive tautomerism leads to different results for different valence models which really represent the same chemical entity.

Fig 4: example: tautomers listed separately in ACD98. The latter is the OE canonical form.

Results: The Maybridge 2003 database was analyzed by the OE program tautomers [13]. Of 71367 molecules, 97 have tautomers (47 pairs and one triplet). Note that additionally, 2381 were found to be non-unique molecules.

Fig 5: tautomer triplet from Maybridge 2003

New: canonical pKa states
The canonicalization of alternative pKa states is accomplished for many classes of molecules by the OpenEye program pkatyper [13]. This problem resembles tautomer canonicalization in many respects, and is an area of active research at OpenEye.

Conclusion
Rigorous and effective chemoinformatics systems require concepts and methods for canonicalization at multiple levels of chemical abstraction and organization. The current state of the art presents many theoretical and practical challenges. OpenEye tools can help.

References
1.  Morgan, H. L., "Generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Services", J. Chem. Doc. 1965, 5, 107.
2.  Stereochemically unique naming algorithm, W. Todd Wipke, Thomas M. Dyott; J. Am. Chem. Soc.; 1974; 96(15); 4834-4842.
3.  Canonical Numbering and Constitutional Symmetry, Clemens Jochum and Johann Gasteiger, J. Chem. Inf. Comput. Sci.; 1977; 17(2); 113-117.
4.  Computer Perception of Topological Symmetry, Craig A. Shelley, Morton E. Munk; J. Chem. Inf. Comput. Sci.; 1977; 17(2); 110-113.
5.  An Approach to the Assignment of Canonical Connection Tables and Topological Symmetry Perception, Craig A. Shelley, Morton E. Munk, J. Chem. Inf. Comput. Sci.; 1979; 19(4); 247-250.
6.  David Weininger, Arthur Weininger and Joseph L. Weininger, "SMILES 2: Algorithm for Generation of Unique SMILES Notation", Journal of Chemical Information and Computer Science (JCICS), Vol. 29, No. 2, pp. 97-101, 1989.
7.  A beginner's guide to responsible parenting or knowing your roots, www.daylight.com/meetings/emug98/Bradshaw/, EuroMUG '98, Cambridge, UK, Oct 1998.
8.  Canonicalization and Enumeration of Tautomers, Jack Delany and Roger Sayle, www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm, EuroMUG '99, Cambridge, UK, Oct 1999.
9.  Hooked on Protonics, Roger Sayle and Geoff Skillman, www.eyesopen.com/about/events/presentations/acs02/sld001.htm, 224th ACS National Meeting, Boston, Aug 2002.
10.  Introduction to Chemical Info Systems, John Bradshaw, www.daylight.com/meetings/emug02/Bradshaw/Training/, EuroMUG '02, Cambridge, UK, 24-26 Sep 2002.
11.  That INChI Feeling, www.reactivereports.com/40/40_3.html, Reactive Reports, Sep 2004 (issue 40).
12.  OEChem, OpenEye Scientific Software, 2002.
13.  QuacPac, OpenEye Scientific Software, 2004.

3600 Cerrillos Road, Suite 1107, Santa Fe, New Mexico 87507
505.473.7385
info@eyesopen.com
www.eyesopen.com