Canonicalized systematic nomenclature in cheminformatics
And some new canonicalization tools from OpenEye
Jeremy J. Yang
Introduction
Morgan demo and study
Canonicalization in chemoinformatics facilitates rigorous, unambiguous
expression and handling of chemical data and knowledge. However, just as
chemistry encompasses multiple levels of abstraction and modelling, no
single canonicalization method is sufficient to solve all problems. This study
reviews some existing canonicalization methodology and describes new
methods implemented by chemoinformatics library OEChem and other
OpenEye tools.
New: canonicalizing molfiles
Fig 1: Morgan demo. Extended connectivity values and atom orders. Uses OEChem
and Ogham. NCI Diversity set processed with no errors.
Definition of canonicalization
A canonicalization algorithm must determine a single representation among
many possible representations for an individual in its domain.
Benefits of canonicalization
• testing equality of molecules
• database search speed
• rigorous informatics and thinking
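The definition and the equality-testing benefit can be illustrated with a
brute-force sketch. It is not an OpenEye method: it simply tries every atom
numbering of a toy connection table and keeps the smallest description, which
is why it only works for very small molecules. The function name
canonical_form and the (elements, bonds) input layout are assumptions made for
this illustration.

```python
from itertools import permutations

def canonical_form(elements, bonds):
    """Return the lexicographically smallest (atoms, bonds) description.

    elements: list of element symbols, indexed by atom number.
    bonds: iterable of (i, j, order) tuples using those atom numbers.
    """
    n = len(elements)
    best = None
    for perm in permutations(range(n)):  # perm[old] = new; N! candidates
        atoms = [None] * n
        for old, new in enumerate(perm):
            atoms[new] = elements[old]
        table = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in bonds
        )
        candidate = (tuple(atoms), tuple(table))
        if best is None or candidate < best:
            best = candidate
    return best

# Heavy atoms of acetaldehyde (C-C=O) entered with two different numberings.
mol_a = (["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])
mol_b = (["O", "C", "C"], [(0, 1, 2), (1, 2, 1)])
# Same skeleton with a single C-O bond: a different molecule.
mol_c = (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])

same = canonical_form(*mol_a) == canonical_form(*mol_b)
different = canonical_form(*mol_a) != canonical_form(*mol_c)
print(same, different)
```

Once every molecule has a single representation, testing equality reduces to
comparing two values, and the same value can serve as a database key.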
N! (graph isomorphism is hard) – Morgan to the rescue
The Morgan algorithm [1] is the basis of most chemical canonicalization work
since, and deserves careful study. In 1965 Harry L. Morgan published the
algorithm already implemented at CAS for its compound registry system.
This work, based on generic graph theory, comprises a theoretical solution
to the problem of molecular canonicalization, and material validation of its
efficacy.
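The core of the Morgan approach, the extended connectivity values, can be
sketched in a few lines. This is a simplified illustration and not the CAS or
OEChem implementation: each atom starts with its degree, values are repeatedly
replaced by the sum of the neighbors' values, and iteration stops when the
number of distinct values no longer increases. The function name and the
adjacency-dict input are assumptions for this example, and the subsequent
numbering and tie-breaking steps are not shown.

```python
def extended_connectivity(neighbors):
    """Morgan-style extended connectivity (EC) values.

    neighbors: dict mapping each atom index to a list of bonded atom indices
    (hydrogen-suppressed graph).
    """
    ec = {a: len(nbrs) for a, nbrs in neighbors.items()}  # start from degree
    classes = len(set(ec.values()))
    while True:
        # Each atom's new value is the sum of its neighbors' current values.
        nxt = {a: sum(ec[b] for b in nbrs) for a, nbrs in neighbors.items()}
        nxt_classes = len(set(nxt.values()))
        if nxt_classes <= classes:  # no further discrimination: stop
            return ec
        ec, classes = nxt, nxt_classes

# n-pentane carbon chain: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(extended_connectivity(pentane))  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
```

Symmetry-equivalent atoms (the two ends, and the two atoms next to them)
receive equal values, so the values alone do not fix a unique numbering.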
More Morgan, and more
The Morgan algorithm was a huge step forward, but the basic algorithm has
some shortcomings, in performance and comprehensiveness, which have
been corrected by subsequent investigators. The resulting methods have
been implemented and widely used in large scale database systems. Some
key contributions:
• Morgan, 1965 → note to Harry: “You da man!” → CAS
• Wipke & Dyott, 1974 → stereo-enhanced Morgan → MDL
• Jochum & Gasteiger, 1977 → Morgan refinement → CACTVS
• Shelley & Munk, 1977 → Morgan refinement
• Weininger, 1988 → CANSMI canonical line notation → Daylight
• Bradshaw, 1998 → parent compounds → GSK, Daylight
• Delany & Sayle, 1999 → tautomers → OpenEye
• INChI, 2004 → global canonical line notation
This study: canonical molecular descriptions, not descriptors
The study of graph theory and canonicalization applied to chemistry is
extensive and diverse. Canonical descriptors which do not fully represent
the model can be of great utility in statistical analyses but are not the focus
of this nomenclature study.
Canonicalizing a connection table is not new and was discussed by Morgan [1]
and others. But generating canonical forms of current standard formats is not
widely done, for historical and practical reasons, despite the available
benefits. Those benefits are increasingly within reach now that longer strings
are more easily handled by existing computers.
OEChem provides sufficient control to accomplish this task.
The OpenEye chemoinformatics toolkit OEChem [12] employs an optimized
Morgan-like canonical algorithm to generate canonical SMILES. In addition,
the API provides a rich set of tools which can facilitate generation of
canonical representations of many types, for many chemical and
informational models, and for many standard file formats.
• OEChem::OECanonicalOrderAtoms()
• OEChem::OECanonicalOrderBonds()
• OEChem aromaticity models: OE, Daylight, Tripos, MDL, MMFF
• OEChem: many file formats and flavors, low-level writers
• QuacPac [13]: tautomers application and toolkit
Proposed algorithm:
• Remove non-structural data
• Suppress hydrogens
• Canonical atom order
• Canonical bond order
• Canonical Kekule bonding based on (selected) aromaticity model
However, the advantages of more terse canonical line notations remain.
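The proposed steps can be sketched on a toy connection table. This is a
hedged illustration in plain Python rather than canmol.py or the OEChem API:
the function canonical_table and its argument layout are hypothetical, the
canonical ranks are assumed to be supplied by a toolkit routine (here they
are written by hand), hydrogen suppression is naive, and the Kekule step is
left out.

```python
def canonical_table(atoms, bonds, rank, data=None):
    """Apply the listed steps to a toy connection table.

    atoms: list of element symbols; bonds: list of (i, j, order);
    rank: dict of canonical rank per heavy-atom index (assumed to come
    from a toolkit; supplied by hand below);
    data: non-structural fields (title, tags), accepted but never written.
    """
    # Suppress hydrogens: keep heavy atoms only (naive, ignores stereo H).
    heavy = [i for i, sym in enumerate(atoms) if sym != "H"]
    # Canonical atom order: sort heavy atoms by canonical rank.
    ordered = sorted(heavy, key=lambda i: rank[i])
    new_index = {old: new for new, old in enumerate(ordered)}
    atom_lines = [atoms[i] for i in ordered]
    # Canonical bond order: renumber, smaller index first, then sort.
    bond_lines = sorted(
        (min(new_index[i], new_index[j]), max(new_index[i], new_index[j]), order)
        for i, j, order in bonds
        if i in new_index and j in new_index  # drops bonds to hydrogens
    )
    return atom_lines, bond_lines

# Acetaldehyde with one explicit H, from two hypothetical sources.
out_1 = canonical_table(
    ["C", "C", "O", "H"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)],
    rank={0: 0, 1: 1, 2: 2}, data={"title": "vendor A"},
)
out_2 = canonical_table(
    ["H", "O", "C", "C"], [(2, 1, 2), (2, 0, 1), (3, 2, 1)],
    rank={3: 0, 2: 1, 1: 2}, data={"title": "vendor B"},
)
print(out_1 == out_2)
```

Because titles and tags never reach the output and both atom and bond
records are emitted in rank order, the two inputs serialize to the same
records, which is the property a canonical molfile writer needs.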
Fig 2: Morgan slow due to symmetry.
RESULTS: Using the test program canmol.py, 1990 NCI Diversity set structures
were converted to canonical SDF files that were exactly equal to the SDF files
converted via SMILES (demo.eyesopen.com/cgi-bin/canmol). The same was done with
the MOL2 format. This test validates the ability of OEChem to canonicalize
molfiles as strings.
Fig 3: Morgan fails.
Aha! -- Chemo-taxonomy is a “stranded hierarchy”
• subatomic → atoms → molecules
• normal weight atoms → isotopes
• Kekule molecule model → aromatic molecule models
• non-stereo molecule → stereoisomers
• single molecule → combinatorial libraries
• single molecule → queries
• small molecule → macromolecule + cofactors + ligands
• single molecule → Markush structures
• single molecule → tautomer set
• single molecule → pKa states
• single molecule → reactions
• 2D → 3D
There is a hierarchical relationship among some of these expansions, while
others are independent. For example, a combinatorial library may involve
stereoisomeric or non-stereo individuals. For every combination of
molecular representations, canonicalization could be advantageous for the
reasons described. Hence the task of canonicalization is a multi-faceted
one.
Dealing with reality: practical problems
1. Existing formats (may often be):
• ambiguous – poorly defined spec or poor compliance
• un-rigorous – both syntax and semantics are important
• non-comprehensive – only organic, covalent, size limits
2. Stereoisomer canonicalization remains difficult
• "relative stereo-centers"
3. Differing valence assumptions and conventions
• implicit-valence and Hcount formats prone to mishandling
4. Information content and model differences in existing formats
• cannot robustly convert if info must be inferred (e.g. bonds)
5. Disagreement over correct chemistry
• e.g., valences, aromaticity
6. Local versus global canonicalization
• Benefits of canonicalization are available locally or globally, but
global canonicalization requires cooperation.
• Locality definition (time, place, software versions)
OpenEye canonicalization tools
New: canonical tautomers
Tautomers have the same formula (structural isomers), but may differ in
proton and electron location, and formal bond order. Special cases: keto/
enol, zwitterion, ring-chain. In the Delany/Sayle algorithm [8,13], hydrogen
donors and acceptors are perceived and the number of free hydrogens is
counted. Donor and acceptor atoms are ordered canonically. At this stage all
tautomerically equivalent inputs are represented identically. Hydrogen
locations are then exhaustively enumerated. A simple ruleset for enumeration
order can designate the first to be the canonical tautomer. Through
additional rules, the likelihood can be increased that the canonical tautomer
is a low-energy form. Applications: registration (exact search), substructure
searching, property prediction, similarity/clustering, protein-ligand analysis.
Failure to perceive tautomerism leads to different results for different
valence models which really represent the same chemical entity.
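The perceive, order, enumerate and pick-first sequence can be sketched on a
toy model. This is not the Delany/Sayle or QuacPac implementation: the
function names and the site-label dictionary are hypothetical, the labels are
assumed to be canonical already, each site is assumed to hold at most one
mobile hydrogen, and bond orders are not reassigned.

```python
from itertools import combinations

def perceive(molecule):
    """Return (ordered sites, number of mobile hydrogens) for a toy molecule.

    molecule: dict mapping a donor/acceptor site label to True if that site
    currently carries the mobile hydrogen. Labels are assumed canonical,
    so sorting them stands in for canonical atom ordering.
    """
    sites = tuple(sorted(molecule))
    mobile_h = sum(1 for has_h in molecule.values() if has_h)
    return sites, mobile_h

def enumerate_placements(sites, mobile_h):
    """Every way to place the mobile hydrogens, in a deterministic order."""
    return list(combinations(sites, mobile_h))

def canonical_tautomer(molecule):
    """Designate the first enumerated placement as the canonical tautomer."""
    sites, mobile_h = perceive(molecule)
    return enumerate_placements(sites, mobile_h)[0]

# 2-pyridone and 2-hydroxypyridine differ only in where the mobile H sits.
pyridone = {"N1": True, "O7": False}
hydroxypyridine = {"O7": True, "N1": False}

print(perceive(pyridone) == perceive(hydroxypyridine))
print(canonical_tautomer(pyridone), canonical_tautomer(hydroxypyridine))
```

Both inputs collapse to the same (sites, hydrogen count) description before
enumeration, so they necessarily yield the same canonical placement. A
preference rule would only change the order in which placements are listed.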
Fig 4: example: tautomers listed separately in ACD98. The latter is the OE
canonical form.
Results: The Maybridge 2003 database was analyzed by the OE program
tautomers [13]. Of 71367 molecules, 97 have tautomers (47 pairs and one
triplet). In addition, 2381 molecules were found to be non-unique.
Conclusion
Rigorous and effective chemoinformatics systems require concepts
and methods for canonicalization at multiple levels of chemical
abstraction and organization. The current state of the art presents
many theoretical and practical challenges. OpenEye tools can help.
References
1. Morgan, H. L., "Generation of a unique machine description for chemical
structures - A technique developed at Chemical Abstracts Service", J. Chem.
Doc. 1965, 5, 107.
2. Wipke, W. T.; Dyott, T. M., "Stereochemically unique naming algorithm", J.
Am. Chem. Soc. 1974, 96(15), 4834-4842.
3. Jochum, C.; Gasteiger, J., "Canonical Numbering and Constitutional
Symmetry", J. Chem. Inf. Comput. Sci. 1977, 17(2), 113-117.
4. Shelley, C. A.; Munk, M. E., "Computer Perception of Topological Symmetry",
J. Chem. Inf. Comput. Sci. 1977, 17(2), 110-113.
5. Shelley, C. A.; Munk, M. E., "An Approach to the Assignment of Canonical
Connection Tables and Topological Symmetry Perception", J. Chem. Inf. Comput.
Sci. 1979, 19(4), 247-250.
6. Weininger, D.; Weininger, A.; Weininger, J. L., "SMILES 2: Algorithm for
Generation of Unique SMILES Notation", J. Chem. Inf. Comput. Sci. 1989, 29(2),
97-101.
7. Bradshaw, J., "A beginner's guide to responsible parenting or knowing your
roots", www.daylight.com/meetings/emug98/Bradshaw/, EuroMUG '98, Cambridge,
UK, Oct 1998.
8. Delany, J.; Sayle, R., "Canonicalization and Enumeration of Tautomers",
www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm, EuroMUG '99,
Cambridge, UK, Oct 1999.
9. Sayle, R.; Skillman, G., "Hooked on Protonics",
www.eyesopen.com/about/events/presentations/acs02/sld001.htm, 224th ACS
National Meeting, Boston, Aug 2002.
10. Bradshaw, J., "Introduction to Chemical Info Systems",
www.daylight.com/meetings/emug02/Bradshaw/Training/, EuroMUG '02, Cambridge,
UK, 24-26 Sep 2002.
11. "That INChI Feeling", www.reactivereports.com/40/40_3.html, Reactive
Reports, issue 40, Sep 2004.
12. OEChem, OpenEye Scientific Software, 2002.
13. QuacPac, OpenEye Scientific Software, 2004.
Fig 5: tautomer triplet from Maybridge 2003
New: canonical pKa states
The canonicalization of alternative pKa states is accomplished for many classes
of molecules by the OpenEye program pkatyper [13]. This problem resembles
tautomer canonicalization in many respects, and is an area of active research
at OpenEye.
3600 Cerrillos Road
Suite 1107
Santa Fe, New Mexico 87507
505.473.7385
info@eyesopen.com
www.eyesopen.com