ACS Salt Lake City 2009 CINF Talk (InChI Symposium)
1. InChI/InChIKey vs. NCI/CADD Structure Identifiers: A Comparison. Markus Sitzmann, Computer-Aided Drug Design Group (NCI/CADD), Laboratory of Medicinal Chemistry, NCI-Frederick, NIH, DHHS
2. The Adaption and Use of the IUPAC InChI/InChIKey. Chemical Structure Lookup Service: 74 million structure records, 46 million unique structures. NCI/CADD Identifiers: FICTS, FICuS, uuuuu. InChI/InChIKey: Std. InChI/InChIKey.
4. [Figure: variant drawings of one compound (charged form, tautomers, isotope, stereoisomers, salt, and “errors”), each labeled with a hashcode: A3DAE0788050DDE4, 3ECEF579D7DF025A, E92E4BA2869F3611, 8A7AD1EB498CC76A, 6C16DE2351F9FF50, 9850FD9F9E2B4E25, 8F7A1DE5A733F0E0, 60525E1AF41497B6, B2FDA68AEDA06DB9; structure drawings omitted]
7. NCI/CADD Structure Identifiers: Structure Normalization. An identifier can be sensitive or insensitive to each of five structure features: Fragments, Isotopes, Charges, Tautomers, and Stereochemistry. [Figure: example structure pairs for each feature; drawings omitted]
8. NCI/CADD Structure Identifiers: Structure Normalization. FICTS identifier: representation of the exact drawing. Sensitive to Fragments (F), Isotopes (I), Charges (C), Tautomers (T), and Stereochemistry (S). [Figure: example structure pairs marked = or ≠; drawings omitted]
9. NCI/CADD Structure Identifiers: Structure Normalization. FICuS identifier: comes closest to how a chemist perceives a compound. Sensitive to Fragments (F), Isotopes (I), Charges (C), and Stereochemistry (S); insensitive to Tautomers (u). [Figure: example structure pairs marked = or ≠; drawings omitted]
10. NCI/CADD Structure Identifier: Structure Normalization. uuuuu identifier: closely related forms of the same compound. Insensitive (u) to Fragments, Isotopes, Charges, Tautomers, and Stereochemistry; all example pairs are marked equal (=). [Figure: example structure pairs; drawings omitted]
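The three tags on the preceding slides follow one pattern: five positions for Fragments, Isotopes, Charges, Tautomers, and Stereochemistry, with a `u` wherever the identifier ignores that feature. A minimal sketch of that reading follows; the function name `decode_tag` and the dictionary layout are hypothetical and not part of CACTVS or the NCI/CADD tools.

```python
# Feature order taken from the slide headings; everything else is illustrative.
FEATURES = ("Fragments", "Isotopes", "Charges", "Tautomers", "Stereochemistry")


def decode_tag(tag):
    """Map a five-letter tag such as 'FICuS' to {feature: is_sensitive}."""
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} letters, got {tag!r}")
    sensitivity = {}
    for letter, feature in zip(tag, FEATURES):
        if letter == "u":
            sensitivity[feature] = False      # feature is ignored
        elif letter == feature[0]:
            sensitivity[feature] = True       # feature is distinguished
        else:
            raise ValueError(f"unexpected letter {letter!r} for {feature}")
    return sensitivity


if __name__ == "__main__":
    for tag in ("FICTS", "FICuS", "uuuuu"):
        print(tag, decode_tag(tag))
```

Under this reading, `decode_tag("FICuS")` reports every feature as sensitive except Tautomers.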
11. NCI/CADD Structure Identifier: Structure Normalization. Input structure, then correct structure (add hydrogen atoms, correct functional groups, correct metal atom bonds), then the per-feature steps: get largest fragment & uncharge (delete complex center, get largest organic fragment, delete radical center, uncharge structure); discard isotope labels; define canonical resonance form/protonation state; define canonical tautomer; normalize or discard stereo information. Resulting parent structures: uuuuu, uuuuS, uuuTu, uuuTS, FICuu, FICuS, FICTS, FICTu.
12. NCI/CADD Structure Identifier. Examples: 9850FD9F9E2B4E25-FICTS-01-57, 9850FD9F9E2B4E25-FICuS-01-78, 9850FD9F9E2B4E25-uuuuu-01-27. Format: <CACTVS hashcode (E_HASHISY)>-<tag>-<version>-<checksum>. [Structure drawing omitted]
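The four-part layout on this slide can be split mechanically. The sketch below is only an illustration: the field widths (16 hexadecimal digits, two-character version and checksum) are inferred from the three examples on the slide, the names `StructureIdentifier` and `parse_identifier` are hypothetical, and the checksum is not verified because the slide does not say how it is computed.

```python
import re
from typing import NamedTuple


class StructureIdentifier(NamedTuple):
    hashcode: str   # CACTVS hashcode (E_HASHISY)
    tag: str        # e.g. FICTS, FICuS, uuuuu
    version: str
    checksum: str


# Assumed widths, inferred from the slide's examples only.
_PATTERN = re.compile(
    r"([0-9A-F]{16})-([Fu][Iu][Cu][Tu][Su])-([0-9]{2})-([0-9A-Za-z]{2})"
)


def parse_identifier(text):
    """Split '<hashcode>-<tag>-<version>-<checksum>' into its four fields."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable identifier: {text!r}")
    return StructureIdentifier(*match.groups())


if __name__ == "__main__":
    print(parse_identifier("9850FD9F9E2B4E25-FICuS-01-78"))
```

Comparing the `hashcode` and `tag` fields of two parsed identifiers is enough to tell whether two records map to the same parent structure at a given sensitivity level.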
13. FICTS identifiers for the variant drawings (charged form, tautomers, isotope, salt, stereoisomers, “errors”): A3DAE0788050DDE4-FICTS, E5F83F10C5DB080A-FICTS, B2FDA68AEDA06DB9-FICTS, 9850FD9F9E2B4E25-FICTS, E5F83F10C5DB080A-FICTS, E92E4BA2869F3611-FICTS, 8A7AD1EB498CC76A-FICTS, 6C16DE2351F9FF50-FICTS, 9850FD9F9E2B4E25-FICTS. [Structure drawings omitted]
14. FICuS identifiers for the variant drawings (charged form, tautomers, isotope, salt, stereoisomers, “errors”): A3DAE0788050DDE4-FICuS, E5F83F10C5DB080A-FICuS, B2FDA68AEDA06DB9-FICuS, 9850FD9F9E2B4E25-FICuS, E5F83F10C5DB080A-FICuS, E92E4BA2869F3611-FICuS, 8A7AD1EB498CC76A-FICuS, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-FICuS. [Structure drawings omitted]
15. uuuuu identifiers for the variant drawings (charged form, tautomers, isotope, stereoisomers, salt, “errors”): 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-FICuS, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu, 9850FD9F9E2B4E25-uuuuu. [Structure drawings omitted]
16. Std. InChIKeys for the variant drawings (charged form, tautomers, isotope, stereoisomers, salt, “errors”): HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-CDYZYAPPSA-N, HNDVDQJCIGZPNO-RXMQYKEDSA-N, HNDVDQJCIGZPNO-YFKPBYRVSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, HNDVDQJCIGZPNO-UHFFFAOYSA-N, UHPNKBYGGMJTIM-UHFFFAOYSA-M, UHPNKBYGGMJTIM-UHFFFAOYSA-M. [Structure drawings omitted]
20. Structure Normalization: Tautomers. Guanine (UYTPUPDQBNUYGX-UHFFFAOYSA-N). FICTS identifiers of the tautomer drawings: A6199E68A788F2F5-FICTS, 959B273B619C709F-FICTS, 61248C4A7D045A47-FICTS, 675R4FCC50F45026-FICTS, 0B345B47F6625113-FICTS, 181CA9BCE3EF47F4-FICTS, 1AD375920BE60DAD-FICTS, 67196F0B20B1D934-FICTS, BCCDA7D0CDACF120-FICTS, CE8F480C11DBFC4F-FICTS, D46A1E6500B06AB6-FICTS, D979CF9770AC0BA5-FICTS, 56FFE8B5619FB01-FICTS, F802E527EC5C61BF-FICTS, EF060DA9D97091DE-FICTS. FICuS identifier: BCCDA7D0CDACF120-FICuS. [Structure drawings omitted]
22. Structure Normalization: Tautomerism & Stereochemistry. Methyl propenyl ketone. [Figure: Z isomer, E isomer, and the tautomers relating them; drawings omitted]
23. InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry. Methyl propenyl ketone: 76D03F08ACDF6C0C-FICuS. FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation. [Figure: Z isomer, E isomer, and tautomers; drawings omitted]
24. InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry. Methyl propenyl ketone: 76D03F08ACDF6C0C-FICuS. FICuS disregards stereochemistry on double bonds if the double bond is not located during tautomer generation. Std. InChI/InChIKey of the drawings: InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3+ (LABTWGUMFABVFG-ONEGZZNKSA-N); InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4,6H,1H2,2H3/b5-4- (LYGWZVOQSCPYDG-PLNGDYQASA-N); InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3/b4-3- (LABTWGUMFABVFG-ARJAWSKDSA-N); InChI=1S/C5H8O/c1-3-4-5(2)6/h3-4H,1-2H3 (LABTWGUMFABVFG-UHFFFAOYSA-N). [Figure: Z isomer, E isomer, and tautomers; drawings omitted]
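The four InChIKeys on this slide can be compared block by block: three of them share the first block and differ only in the second. A small sketch of that comparison follows. It assumes the usual 14-10-1 character layout of a Standard InChIKey; the helper names `split_inchikey` and `same_first_block` are hypothetical.

```python
def split_inchikey(key):
    """Return (first_block, second_block, final_character) of an InChIKey."""
    parts = key.strip().split("-")
    if len(parts) != 3 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"unexpected InChIKey layout: {key!r}")
    return tuple(parts)


def same_first_block(key_a, key_b):
    """True if both keys share the first block (same connectivity layer)."""
    return split_inchikey(key_a)[0] == split_inchikey(key_b)[0]


if __name__ == "__main__":
    e_form = "LABTWGUMFABVFG-ONEGZZNKSA-N"     # /b4-3+ on the slide
    z_form = "LABTWGUMFABVFG-ARJAWSKDSA-N"     # /b4-3- on the slide
    tautomer = "LYGWZVOQSCPYDG-PLNGDYQASA-N"   # /h3-4,6H,... on the slide
    print(same_first_block(e_form, z_form))    # stereo differs, skeleton shared
    print(same_first_block(e_form, tautomer))  # different first block
```

For this compound the Z and E drawings share a first block, while the tautomer with the hydrogen on oxygen does not, which is the contrast to the single FICuS identifier shown on the slide.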
25. InChI/InChIKey - NCI/CADD Identifier comparison: Tautomerism & Stereochemistry. Methyl propenyl ketone: FICTS “sees” four different structures: 821D8C17ACE5040E-FICTS, 6EB4AA2BAA11965F-FICTS, 1677645190718885-FICTS, 76D03F08ACDF6C0C-FICTS. [Figure: Z isomer, E isomer, and tautomers; drawings omitted]
26. Structure Normalization: Charges in Resonance Systems. Hashcodes shown: F3A27F03AE77A722, F3A27F03AE77A722, 62FADCB01F197FC9, 2E011EE4519F7920. Canonical resonance structure? Different protonation states: uncharge ≠ uncharge (problem!). [Structure drawings omitted]
28. Structure Normalization: Charges in Resonance Systems. Münchnones (no plausible unpolarized resonance structure can be drawn): IUYUGWCTOLFFCL-UHFFFAOYSA-N, F68AC07DE0D3379F-FICuS. Steps shown between the resonance drawings: 1.2 shift, 1.2 recombination, 1.3 shift, 1.3 recombination, separation (pentavalent N atom). [Structure drawings omitted]
31. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison. Original structure record set (74.2 million); FICuS compound set (46.7 million unique); Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique).
32. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison. Original structure record set (74.2 million); FICuS compound set (46.7 million unique); Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique). Comparison 1: conflicts?
33. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison. Original structure record set (74.2 million); FICuS compound set (46.7 million unique); Standard InChI/InChIKey set calculated by stdinchi-1 (73.8 million, 48.0 million unique). Comparison 1: conflicts? Comparison 2: Standard InChI/InChIKey calculated by CACTVS from FICuS compound structure: same InChI/InChIKey?
34. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison (1). No conflicts between Std. InChI/InChIKey and FICuS. Structure records (million records): FICuS linked to a single InChI/InChIKey: 62.3 (84.5%); both linked to a single structure record: 34.4 (46.9%); both linked to multiple structure records: 27.9 (38.0%); all structure records: 73.7.
35. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison (1). Conflicts between Std. InChI/InChIKey and FICuS. Structure records (million records): FICuS is linked to multiple InChI/InChIKeys or vice versa: 10.4; one FICuS is linked to multiple InChI/InChIKeys: 3.6; one InChI/InChIKey is linked to multiple FICuS: 6.8; all structure records: 73.7. Percentages shown on the slide: (4.6%), (9.3%), (84.5%).
36. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison (1). Conflicts between Std. InChI/InChIKey and FICuS. Structure records (million records): FICuS is linked to multiple InChI/InChIKeys or vice versa: 10.4; one FICuS is linked to multiple InChI/InChIKeys: 3.6; one InChI/InChIKey is linked to multiple FICuS: 6.8; all structure records: 73.7. Percentages shown on the slide: (4.6%), (9.3%), (84.5%). Counted by number of InChIKeys first block: 0.9 (1.2%) and 2.3 (3.1%).
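The conflict categories on these slides amount to checking whether the FICuS-to-InChIKey mapping is one-to-one. A minimal sketch of such a check is given below, assuming each structure record has already been reduced to a `(ficus, inchikey)` pair. The function name, the category labels, and the rule that a record whose FICuS maps to several keys is counted in that category first are all assumptions made for illustration, not the procedure used for the talk.

```python
from collections import defaultdict


def classify_records(records):
    """Count records by how their FICuS and InChIKey are linked.

    records: iterable of (ficus, inchikey) pairs, one per structure record.
    """
    records = list(records)
    keys_per_ficus = defaultdict(set)
    ficus_per_key = defaultdict(set)
    for ficus, inchikey in records:
        keys_per_ficus[ficus].add(inchikey)
        ficus_per_key[inchikey].add(ficus)

    counts = {"no_conflict": 0, "ficus_to_many_keys": 0, "key_to_many_ficus": 0}
    for ficus, inchikey in records:
        if len(keys_per_ficus[ficus]) > 1:
            counts["ficus_to_many_keys"] += 1   # one FICuS, several InChIKeys
        elif len(ficus_per_key[inchikey]) > 1:
            counts["key_to_many_ficus"] += 1    # one InChIKey, several FICuS
        else:
            counts["no_conflict"] += 1
    return counts


if __name__ == "__main__":
    # Toy identifiers, not real hashcodes or keys.
    toy = [("F1", "K1"), ("F1", "K1"), ("F2", "K2"),
           ("F2", "K3"), ("F3", "K4"), ("F4", "K4")]
    print(classify_records(toy))
```

Replacing the full InChIKey with its first block before calling the function would give the coarser "first block" counts mentioned on the slide.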
37. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison (2): same InChI/InChIKey? InChI changes. Compounds (unique structures, million records): FICuS 6.4 of 46.7 (13.7%); FICTS 3.8 of 48.0 (7.9%); uuuuu 11.9 of 41.6 (28.6%). Structure records (million records, 73.7 in all): FICuS 9.3 (12.7%); FICTS 4.6 (6.2%); uuuuu 21.9 (29.7%).
38. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison (2): same InChI/InChIKey? InChI changes. Compounds (unique structures, million records): FICuS 6.4 of 46.7 (13.7%); FICTS 3.8 of 48.0 (7.9%); uuuuu 11.9 of 41.6 (28.6%). Structure records (million records, 73.7 in all): FICuS 9.3 (12.7%); FICTS 4.6 (6.2%); uuuuu 21.9 (29.7%). Vs. InChIKey first block: 3.2 (7.6%) compounds and 6.3 (8.4%) structure records.
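The percentages on slides 37 and 38 can be re-derived from the counts shown next to them. The sketch below does this under an assumed pairing of counts and totals (the slide layout is not preserved in this transcript, so the pairing is an inference from the arithmetic); the helper name `share` is hypothetical. With the rounded counts shown, the FICuS structure-record share comes out at 12.6% rather than the slide's 12.7%, presumably because the slide used unrounded counts.

```python
def share(changed, total):
    """Percentage of `total` that `changed` represents, to one decimal place."""
    return round(100.0 * changed / total, 1)


# (changed, total) in million records; pairing inferred, not stated on the slide.
COMPOUNDS = {"FICuS": (6.4, 46.7), "FICTS": (3.8, 48.0), "uuuuu": (11.9, 41.6)}
RECORDS = {"FICuS": (9.3, 73.7), "FICTS": (4.6, 73.7), "uuuuu": (21.9, 73.7)}

if __name__ == "__main__":
    for label, table in (("compounds", COMPOUNDS), ("structure records", RECORDS)):
        for tag, (changed, total) in table.items():
            print(f"{label:17s} {tag}: {share(changed, total)}%")
```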
39. InChI/InChIKey - NCI/CADD Identifier comparison: Detailed Comparison. Classification: (formal) tautomer count > 1; (formal) tautomer count > 3; (formal) tautomer count > 10; full stereo; contains metal atoms; metal complexes; salt; has resonance charges; inorganic compound. Columns: occurrence in FICuS set; occurrence in FICuS subset (InChI changes). Values as shown on the slide: 14.5%, 18.5%, 28.9%, 16.9%, 34.5%, 52.1%, 18.6%, 52.1%, 33.9%, 56.4%, 25.4%, 5.5%, 25.7%, 0.8%, 0.2%, 1.0%, 0.2%, 0.1%.
40. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. FICuS: 12 different structure records are linked to this structure. Std. InChI/InChIKey (stdinchi-1): calculates 3 different strings/keys for these 12 structure records (all have the same connectivity layer/first block). All 3 of these StdInChI/InChIKeys differ from the StdInChI/InChIKey calculated after FICuS normalization (including connectivity layer/first block). [Structure drawing omitted]
41. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. [Figure: the structure with Z and E isomers; drawings omitted]
42. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. [Figure: Z and E isomers and a tautomer; drawings omitted]
43. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. Tautomer: tautomeric interconversion? [Figure: Z and E isomers and tautomers; drawings omitted]
44. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. Tautomer: tautomeric interconversion? Tautomeric interconversion? [Figure: Z and E isomers, tautomers, and S and R stereoisomers; drawings omitted]
45. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. Tautomer: tautomeric interconversion? Tautomeric interconversion? [Figure: Z and E isomers, tautomers, and S and R stereoisomers; drawings omitted]
46. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. How many structures? Structure records shown: ZINC04685909, ChemBlock A3422/0145215, ChemNavigator 47748165, NIST MS-Lib 1967005690, ChemNavigator 34903393, ChemNavigator 65635274. Tautomer: tautomeric interconversion? Tautomeric interconversion? [Figure: Z and E isomers, tautomers, and S and R stereoisomers; drawings omitted]
47. InChI/InChIKey - NCI/CADD Identifier comparison: ChemBlock A3422/0145215. How many structures? InChIKey A, InChIKey B, InChIKey C (same connectivity layer/block); FICuS parent structure. Tautomer: tautomeric interconversion? Tautomeric interconversion? [Figure: Z and E isomers, tautomers, and S and R stereoisomers; drawings omitted]
49. InChI/InChIKey - NCI/CADD Identifier comparison: Dithiazinine. [Figure: original structure and best representation; drawings omitted]
50. InChI/InChIKey - NCI/CADD Identifier comparison: Dithiazinine. [Figure: original structure, best representation, and the InChI and FICuS representations with Z/E double-bond labels; drawings omitted]
51. The Adaption and Use of the IUPAC InChI/InChIKey. Chemical Structure Lookup Service (http://cactus.nci.nih.gov/lookup): 74 million structure records, 46 million unique structures. NCI/CADD Identifiers: FICTS, FICuS, uuuuu. InChI/InChIKey: Std. InChI/InChIKey.
52. Web Service: Chemical Structure REST Service (beta). URL scheme: http://cactus.nci.nih.gov/chemical/structure/{identifier}/{method}. Examples: http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/smiles; http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names; http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/ficus; http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/stdinchi; http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/image; http://cactus.nci.nih.gov/chemical/structure/ethanol/stdinchikey; http://cactus.nci.nih.gov/chemical/structure/64-17-5/stdinchikey. Returns plain text or a GIF image; if the structure identifier is not resolvable: HTTP 404 status code.
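The URL scheme on this slide is simple enough to wrap in a few lines. The sketch below builds request URLs from an identifier and a method and, for text methods, maps the 404 response described on the slide to `None`. The function names `structure_url` and `resolve_text` are hypothetical, the service was labeled beta at the time of the talk so its current behavior may differ, and the `image` method returns binary data that `resolve_text` is not meant to handle.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def structure_url(identifier, method):
    """Build {BASE_URL}/{identifier}/{method}, percent-encoding unsafe characters."""
    return f"{BASE_URL}/{quote(identifier, safe='=')}/{quote(method, safe='')}"


def resolve_text(identifier, method):
    """Fetch a plain-text result; return None if the identifier is not resolvable."""
    try:
        with urlopen(structure_url(identifier, method)) as response:
            return response.read().decode("utf-8").strip()
    except HTTPError as error:
        if error.code == 404:      # slide: unresolvable identifiers give HTTP 404
            return None
        raise


if __name__ == "__main__":
    # Only prints the URLs; call resolve_text() to actually query the service.
    print(structure_url("ethanol", "stdinchikey"))
    print(structure_url("InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "smiles"))
```

Keeping URL construction separate from the network call makes the scheme easy to check without contacting the server.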
53. Acknowledgments. ChemNavigator: Scott Hutton, Tad Hurst. CADD Group, LMC, NCI: Marc Nicklaus, Igor V. Filippov. CACTVS, Xemistry GmbH: Wolf-Dietrich Ihlenfeldt. Thanks to all database providers. Our web site: http://cactus.nci.nih.gov