Without standards the world would be a place of utter chaos. With the ever-increasing complexity of modern life, standards specifications ensure interoperability between the multitude of systems that society has grown to depend upon. In the world of cheminformatics, a recent trend has been the promotion of authority-prescribed “de jure” standards (such as InChI and HELM), over the more traditional “de facto” standards (such as V2000 Mol files and SMILES strings). Voluntary “de facto” standards are selected by the community, reusing practical solutions that work well, and thereby become dominant, whilst less practical approaches are ignored, in a process akin to Darwinian selection. Obligatory “de jure” standards, however, are imposed on an industry rather than selected by it, often by bureaucrats and lawyers rather than experts. A recent example of this is ISO international standard 11238, entitled “Health informatics — Identification of Medicinal Products — Data elements and structures for the unique identification and exchange of regulated information on substances”. This standard covers file formats for exchanging chemical structures between government agencies including the US Food & Drug Administration (FDA). Amongst the implementation challenges required by this standard is the ability to handle MDL V2000 connection tables with normalized whitespace (i.e. stripped of carriage-returns, linefeeds and the multiple spaces used to align columns). To the author’s knowledge, no cheminformatics suite in the world could meet this requirement at the time the standard was ratified by several countries’ standards bodies in 2011. With luck, poor “de jure” standards will be ignored by legislators, rather than imposing unreasonable burdens on the communities they were designed to help.
EUGM 2014 - Roger Sayle (NextMove Software): Implementing ISO standard 11238 compliance with ChemAxon tools
1. Implementing ISO 11238 standard compliance with ChemAxon tools
Roger Sayle
NextMove Software, Cambridge, UK
ChemAxon User Group Meeting 2014, Budapest, Hungary, Wednesday 21st May 2014
2. What is ISO 11238?
• ISO standard 11238 entitled “Health Informatics –
Identification of medicinal products – Data elements
and structures for the unique identification and
exchange of regulated information on substances”.
• Defines a framework for uniquely identifying and
exchanging compounds of pharmaceutical interest.
• The framework serves a similar role to CAS registry
numbers, PubChem CID or InChI-Key, assigning
unique identifiers to substances.
3. Meet the (IDMP) family
• 11238 is one of a suite of 5 related standards, all for
“unique identification and exchange of …”
– 11238 “… regulated information on substances”.
– 11239 “… dose forms, units, administration, etc.”.
– 11240 “… units of measurement”.
– 11615 “… regulated medicinal product information”.
– 11616 “… regulated pharmaceutical product information”.
4. Why is 11238 important?
• EU regulation 520/2012 on “pharmacovigilance”
requires countries, regulatory authorities and
pharma to adopt the 5 IDMP standards (articles 25
and 26) by 1st July 2016 (article 40).
• Executive summary: It’s the law!
5. How it works
[Diagram: two services run by the coding authority]
• Code Assignment (Authority): Name/Identifier + Connection Table + Properties (Significant Text) → Unique Code
• Code Look-up (Authority): Unique Code → Name/Identifier + Connection Table + Properties (Significant Text)
6. Likely implementation
[Diagram: the code assignment and look-up workflow, annotated with likely concrete implementations]
• Code Assignment (Authority): Name/Identifier + Connection Table + Properties (Significant Text) → Unique Code
• Code Look-up (Authority): Unique Code → Name/Identifier + Connection Table + Properties (Significant Text)
• Annotations on the slide: FDA UNII; FDA SRS Search; XML; INN/USAN/CID; FDA/NCATS GInAS; MOL2000/SMILES/InChI; Protein/NA Sequence; ISO11238 Groups 1-4
7. Current status
• The standard has been ratified and its use has been
written into EU law (EU Reg. 520/2012).
• Framework requires use of non-semantic, random,
fixed length unique identifiers, that include an
internal integrity check.
• The standard also details constraints on uniqueness.
• Exact implementation details yet to be determined
(to appear in a future “Implementation Guide”).
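The identifier requirements above (non-semantic, random, fixed length, internal integrity check) can be illustrated with a minimal Java sketch. Everything concrete here is an assumption made for illustration: the class name `SubstanceCode`, the 36-character alphabet, the 9+1 length and the sum-modulo-36 check character are hypothetical and are not taken from the standard or from the FDA's UNII scheme.

```java
import java.security.SecureRandom;

class SubstanceCode {
    // Hypothetical alphabet and length; the real values await the Implementation Guide.
    static final String ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static final int BODY_LENGTH = 9;

    // Check character: sum of the character values, modulo the alphabet size.
    static char checkChar(String body) {
        int sum = 0;
        for (int i = 0; i < body.length(); i++) {
            int value = ALPHABET.indexOf(body.charAt(i));
            if (value < 0) {
                throw new IllegalArgumentException("Illegal character: " + body.charAt(i));
            }
            sum += value;
        }
        return ALPHABET.charAt(sum % ALPHABET.length());
    }

    // Random (non-semantic, non-sequential) body followed by one check character.
    static String generate(SecureRandom rng) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < BODY_LENGTH; i++) {
            body.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
        }
        return body.toString() + checkChar(body.toString());
    }

    // A code is valid if it has the fixed length, legal characters and a matching check.
    static boolean isValid(String code) {
        if (code == null || code.length() != BODY_LENGTH + 1) return false;
        for (int i = 0; i < code.length(); i++) {
            if (ALPHABET.indexOf(code.charAt(i)) < 0) return false;
        }
        return checkChar(code.substring(0, BODY_LENGTH)) == code.charAt(BODY_LENGTH);
    }

    public static void main(String[] args) {
        String code = generate(new SecureRandom());
        System.out.println(code + " valid=" + isValid(code));
    }
}
```

A plain sum catches any single mistyped character but not two characters swapped; a production scheme would likely use a stronger check.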
8. What will the future look like?
• ISO11238 compliant identifiers will be very similar to
the FDA’s UNII (UNique Ingredient Identifier).
• The fixed width non-semantic identifier requirement
rules out the use of plain SMILES, InChI, V2000 Mol
file and similar encodings.
• The random requirement rules out plain CAS registry
numbers, PubChem CIDs and ChEMBL IDs (which use
sequential or monotonic number assignment).
• Alternatively, InChI keys or similar hashes (with [CRC]
checks) of connection tables+text may be possible.
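As a rough illustration of the hash-based alternative in the last bullet, the Java sketch below derives a fixed-width code from a connection table plus significant text and appends a CRC. It is only a sketch under stated assumptions: SHA-256 and CRC-32 are stand-ins chosen because they ship with the JDK, the 16+8 character layout and the class name `HashedSubstanceCode` are hypothetical, and this is not the InChIKey algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

class HashedSubstanceCode {
    // 16 hex characters of hash followed by 8 hex characters of CRC-32 (hypothetical layout).
    static String code(String connectionTable, String significantText) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(connectionTable.getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator so ("ab","c") and ("a","bc") hash differently
            sha.update(significantText.getBytes(StandardCharsets.UTF_8));
            byte[] digest = sha.digest();
            StringBuilder body = new StringBuilder();
            for (int i = 0; i < 8; i++) {
                body.append(String.format("%02X", digest[i] & 0xff));
            }
            return body + String.format("%08X", crc(body.toString()));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }

    static long crc(String text) {
        CRC32 crc = new CRC32();
        crc.update(text.getBytes(StandardCharsets.US_ASCII));
        return crc.getValue();
    }

    // Recompute the CRC over the first 16 characters and compare with the last 8.
    static boolean isIntact(String code) {
        if (code == null || code.length() != 24) return false;
        return String.format("%08X", crc(code.substring(0, 16))).equals(code.substring(16));
    }

    public static void main(String[] args) {
        String code = code("C1=CC=CC=C1", "benzene");
        System.out.println(code + " intact=" + isIntact(code));
    }
}
```

Unlike a randomly assigned code, a hash is reproducible from its input, so whether it satisfies the "random" requirement is a matter of interpretation.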
9. What’s available now
• ISO charge for access to official standards documents
(which is why 5 IDMP standards is more profitable
than one), about 158 CHF ($177 USD) from ISO for
11238 [between $120 and $340 online].
• However, as with many ISO standards, late drafts of
ISO 11238 are freely available on the internet.
• Caution: Many of the technical examples (all XML)
were removed from the final standard and are due to
appear in the upcoming “Implementation Guide(s)”.
10. Example requirement
• §3.4 “Naming of substances” states “at least one
substance name or company code shall be associated
with each substance”.
• For the envisioned work flows this typically assumes
INN or USAN name has already been assigned.
• One way to guarantee the existence of a suitable
substance name for investigational compounds is to
use IUPAC naming software (such as ChemAxon’s)
during submission to the unique coding authority.
• Plug: ChemAxon s2n coverage is state-of-the-art.
11. The devil is in the details
• One of the interesting cheminformatics challenges
with working with the published ISO standard and
the examples from the draft annex is the typography.
• The document has been typeset by editors with
expertise outside the field of cheminformatics who
have inadvertently changed whitespace without
appreciating the impact this has on chemistry tools.
12. Final ISO11238 standard Annex A
• §A.2.3 SMILES uses the example “C1 = CC = CC = C1”
where the spurious spaces create problems for
SMILES readers.
• §A.2.4 InChI both strips the “InChI=” prefix and again
suffers from spaces “1/C6H6 /c1-2-4-6-5-3-1/h1-6H”.
– Interestingly this is an old InChI not a standard InChI.
• §A.2.2 Molfile fails to mention that V2000 mol files
use fixed width columns and blank lines, as a result
the example given in text [next slide] can’t easily be
read.
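For the SMILES and InChI examples the damage is simple enough to undo directly, because neither notation uses whitespace inside a structure string. A minimal Java sketch, assuming the input field holds only the structure (no trailing name or title); the class and method names are hypothetical:

```java
class WhitespaceCleanup {
    // Remove every whitespace character (spaces, tabs, line breaks).
    static String stripWhitespace(String text) {
        return text.replaceAll("\\s+", "");
    }

    // Strip whitespace and restore the "InChI=" prefix if it is missing.
    static String repairInchi(String text) {
        String compact = stripWhitespace(text);
        return compact.startsWith("InChI=") ? compact : "InChI=" + compact;
    }

    public static void main(String[] args) {
        System.out.println(stripWhitespace("C1 = CC = CC = C1"));
        System.out.println(repairInchi("1/C6H6 /c1-2-4-6-5-3-1/h1-6H"));
    }
}
```

The same trick does not work for V2000 mol files, where the spaces and line breaks carry meaning and have to be put back rather than removed.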
14. Benefit of the doubt?
• These unintentional typographical errors in the
normative text may perhaps be the result of poor
fonts, with the exception of “InChI=”.
• Alas the content of the original Annex B from the
draft indicates these issues were more widespread
and may arise from ignorance of cheminformatics
file formats.
15. §B.2.2 InChI in XML Example
<STRUCTURAL_REPRESENTATION_TYPE>INCHI</STRUCTURAL_REPRESENTATION_TYPE>
<STRUCTURAL_REPRESENTATION>1S/C2H5NO2.AL.CLH.2H2O.ZR/C3-1-
2(4)5;;;;;/H1,3H2,(H,4,5);;1H;2*1H2;/Q;+3;;;;+4/P-
2</STRUCTURAL_REPRESENTATION>
Problems annotated on the slide:
• Missing “InChI=”
• Standard and Non-Standard InChI?
• Converted to upper case
• Indentation
• Spurious spaces
• Line breaks
17. All is not lost!
• Back at the 2011 ChemAxon UGM here in Budapest,
Sorel Muresan from AstraZeneca Sweden gave a
presentation on how spelling correction improves
the recall of ChemAxon’s name-to-structure tools.
• The exact same CaffeineFix technology can be
applied to perform aggressive “spelling correction”
on SMILES strings, InChI and V2000 mol files.
• As with IUPAC-like systematic names, these can each
be specified by a formal grammar.
18. How the algorithm works
• The regular expression describing V2000 mol files is
compiled into a “finite state machine” with 1333
states.
• The only allowed “corrections” are the deletion of
new lines and the insertion of spaces or new lines,
but only where permitted in the grammar/FSM.
• Depth-first recursion is used to identify a minimal set
of edits to correct the input.
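The idea can be sketched in Java on a toy grammar. This is not the CaffeineFix implementation or the 1333-state mol file machine: the class `ToyFixedWidthRepair`, its hand-written state machine (records of two right-justified integer fields of width 3, ended by a new line) and the iterative-deepening edit budget are all assumptions made to keep the example small. The permitted corrections are the ones listed above: delete a new line, or insert a space or new line where the state machine allows it.

```java
class ToyFixedWidthRepair {
    // State = column * 2 + (digit already seen in this field ? 1 : 0); columns 0..5 are
    // the two fields, column 6 expects the new line. State 0 is both start and accept.
    static int next(int state, char c) {
        int col = state / 2;
        boolean seenDigit = (state % 2) == 1;
        if (col == 6) return c == '\n' ? 0 : -1;
        int pos = col % 3;                        // position inside the current field
        if (c == ' ') {
            // Padding may only precede the digits, and a field needs at least one digit.
            return (seenDigit || pos == 2) ? -1 : (col + 1) * 2;
        }
        if (c >= '0' && c <= '9') {
            return pos == 2 ? (col + 1) * 2 : (col + 1) * 2 + 1;
        }
        return -1;
    }

    // Depth-first search: consume input for free, or spend one unit of budget on an edit.
    static boolean search(String in, int i, int state, int budget, StringBuilder out) {
        if (i == in.length() && state == 0) return true;
        int mark = out.length();
        if (i < in.length()) {
            char c = in.charAt(i);
            int t = next(state, c);
            if (t >= 0) {
                out.append(c);
                if (search(in, i + 1, t, budget, out)) return true;
                out.setLength(mark);              // backtrack
            }
            // Correction 1: delete a new line from the input.
            if (budget > 0 && c == '\n' && search(in, i + 1, state, budget - 1, out)) return true;
        }
        if (budget > 0) {
            // Corrections 2 and 3: insert a space or a new line if the machine permits it.
            for (char inserted : new char[] {' ', '\n'}) {
                int t = next(state, inserted);
                if (t >= 0) {
                    out.append(inserted);
                    if (search(in, i, t, budget - 1, out)) return true;
                    out.setLength(mark);          // backtrack
                }
            }
        }
        return false;
    }

    // Try budgets 0, 1, 2, ... so the first repair found uses a minimal number of edits.
    static String fix(String in, int maxEdits) {
        for (int budget = 0; budget <= maxEdits; budget++) {
            StringBuilder out = new StringBuilder();
            if (search(in, 0, 0, budget, out)) return out.toString();
        }
        return null;                              // not repairable within maxEdits
    }

    public static void main(String[] args) {
        System.out.println("[" + fix("6 6", 8) + "]");
    }
}
```

Here `fix("6 6", 8)` restores the whitespace-normalized record to its fixed-width form with four insertions, and `fix` returns `null` when no sequence of permitted edits reaches an accepting state.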
20. ChemAxon toolkit implementation
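import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.struc.Molecule;

// Parse a mol file, falling back to whitespace repair if the first attempt fails.
// FixMolFile is the repair class from the forum post linked below, not a ChemAxon API class.
public static Molecule molFileToChemaxonMol(String molFileStr)
        throws MolFormatException {
    try {
        return MolImporter.importMol(molFileStr);
    } catch (MolFormatException e) {
        // Try to restore the fixed-width columns and line breaks.
        String fixed = FixMolFile.fixMolFile(molFileStr);
        if (fixed == null) {
            throw e;  // not repairable: report the original parse error
        }
        return MolImporter.importMol(fixed);
    }
}
// Java source code available at http://www.chemaxon.com/forum/ftopic1265.html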
21. Geek of the week
• A particularly tricky corner case concerns Accelrys’
Pipeline Pilot-style V2000 mol files which abbreviate
the columns in the atom block (to save space).
• In these files there’s potential ambiguity where the
first bond line is mistaken as a continuation of the
last (abbreviated) atom line.
• Our solution relies on the atom stereo care field
being zero in non-query mol files vs. the non-zero
values that appear in the first three fields of bonds.
22. Lest we forget
• A similar “spelling correction” variant that allows
uppercase characters to be mapped to lowercase,
and the prefix “InChI=” to magically appear at the
start of a string can also be used to fix ISO InChIs.
• Alas uppercasing an InChI (or any molecular formula)
is potentially lossy, e.g. “CsN” vs. “CSn”.
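The ambiguity can be made concrete with a small Java sketch that enumerates every way an uppercased formula can be read back as element symbols. The class name `FormulaReadings` and the deliberately tiny element table are assumptions for illustration; a real repair tool would need the full periodic table and a rule for choosing between readings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FormulaReadings {
    // Illustrative subset of element symbols, not the full periodic table.
    static final Set<String> ELEMENTS =
            new HashSet<>(Arrays.asList("C", "H", "N", "O", "S", "Cl", "Cs", "Sn"));

    // All correctly cased formulas that uppercase to the given string.
    static List<String> readings(String upper) {
        List<String> results = new ArrayList<>();
        expand(upper, 0, "", results);
        return results;
    }

    private static void expand(String s, int i, String prefix, List<String> results) {
        if (i == s.length()) {
            results.add(prefix);
            return;
        }
        char c = s.charAt(i);
        if (Character.isDigit(c)) {               // atom counts are copied unchanged
            expand(s, i + 1, prefix + c, results);
            return;
        }
        String one = String.valueOf(c);           // try a one-letter symbol
        if (ELEMENTS.contains(one)) {
            expand(s, i + 1, prefix + one, results);
        }
        if (i + 1 < s.length() && Character.isLetter(s.charAt(i + 1))) {
            // try a two-letter symbol, lowercasing its second letter
            String two = one + Character.toLowerCase(s.charAt(i + 1));
            if (ELEMENTS.contains(two)) {
                expand(s, i + 2, prefix + two, results);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readings("CSN"));
        System.out.println(readings("C17H21CLN4O"));
    }
}
```

With this table, "CSN" has three readings (including both "CsN" and "CSn"), whereas "C17H21CLN4O" has only one, which is why most uppercased InChIs can still be recovered.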
23. Before and after InChI example
1S/C17H21CLN4O/C1-22-12-3-2-4-13(22)8-11(7-
12)21-17(23)14-5-10(18)6-15-16(14)20-9-19-15/H5-
6,9,11-13H,2-4,7-8H2,1H3,(H,19,20)(H,21,23)
InChI=1S/C17H21ClN4O/c1-22-12-3-2-4-13(22)8-11(7-
12)21-17(23)14-5-10(18)6-15-16(14)20-9-19-15/h5-
6,9,11-13H,2-4,7-8H2,1H3,(H,19,20)(H,21,23)
24. How common are the ambiguities?
• 1.35 million standard InChIs from ChEMBL
• Uppercase the InChIs, fix them and check
whether the original InChI can be regenerated
• 99.5% roundtrip (6596 discrepancies)
26. Conclusions
• The Java source code for recovering V2000 mol files
and InChIs from the types of corruption seen in the
ISO 11238 draft has now been contributed to the
ChemAxon forum, allowing Marvin and JChem to
read the examples given in that document.
• Whether this functionality will be required to fully
support the final (pending) “Implementation Guide”
requirements remains to be seen (and voted upon).
• Attention to detail is important in standards writing.
27. Final words
• ISO 11238 IDs may become as popular as
Chemical Abstracts’ registry numbers.
28. Acknowledgements
• Daniel Lowe, NextMove Software, Cambridge, UK.
• Richard Bolton, GSK, Stevenage, UK.
• Evan Bolton, NCBI PubChem, Bethesda, MD, USA.
• Dac-Trung Nguyen, NIH NCATS, Rockville, MD, USA.
• Tyler Peryea, NIH NCATS, Rockville, MD, USA.
• Noel Southall, NIH NCATS, Rockville, MD, USA.
• Yulia Borodina, FDA, Silver Spring, MD, USA.
• Lawrence Callahan, FDA, Silver Spring, MD, USA.
• Andrew Marr, Marr Consultancy, Knebworth, UK.