ICIC 2013 Conference Proceedings: Josef Eiblmaier, InfoChem

Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?
Josef Eiblmaier (InfoChem, Germany)
In the past decade, various systems for the automatic identification and extraction of chemistry-related information from unstructured sources have emerged. They have opened up new possibilities for organizing, querying, and analyzing chemical content to support the research and development process. Patent authorities and scientific publishers make available, on a large scale, not only full text and images but also ChemDraw CDX files for many sources. The chemical information contained in these CDX files is primarily intended for layout purposes in publications, but it is often erroneously considered to be readily available as input for building structure and reaction databases. Unfortunately, the automatic work-up of chemical structures and reactions from these CDX files faces serious obstacles; consequently, the information produced is often incorrect or incomplete and thus not properly available to information professionals via structure and reaction searching. This talk will present different approaches to extracting reactions and structures correctly from CDX files and will describe the main difficulties and drawbacks encountered.

  1. Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge? Josef Eiblmaier, Hans Kraut, Sascha Hausberg, Peter Loew. InfoChem GmbH © 2013. ICIC 2013, Vienna, October 13 – 16. Dr. Josef Eiblmaier
  2. Outline
     » ChemDraw files: Relevance and the Challenge
     » Approach
     » Projects
     » InfoChem ChemProspector
     » Wiley Smart Article
     » Thieme Science of Synthesis Update / Pharmaceutical Substances
     » Conclusion / Outlook
     (Image: © cora / PIXELIO, www.pixelio.de)
  3. Patents, Journal Articles and MRWs: a Buried Treasure? Chemical structures (images); chemical names/fragments (text); Markush structures (text, images, CDX); chemical structures (CDX files); reactions (CDX files)
  4. Manuscript → Article → Database … Diagram labels: Publishing; Manuscript submission; Manual indexing; Database production (e.g. SciFinder, Reaxys, SPRESI, eEROS, ...)
  5. CDX Scheme vs. Database Record
     » ChemDraw file: purpose is presentation / publishing (no search); unstructured; structures: no strict rules; general rules: none
     » Database: purpose is search / retrieval; structured; structures: strict rules; database rules: strict
     » Roles in a database record: Reactant, Product, Reagent, Solvent, Catalyst
     » Labels in the example scheme: LiOH; H2O, THF; Pd(OAc)2; Cl-CO2Et, Et3N; Acetone, H2O; SOCl2
     Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
  6. CDX Scheme Processing, what does that mean?
     » ICSchemeProcessor outputs: chemical structures (SD files); reactions (RD files); conditions (RD files)
     » Condition roles: Reagent, Solvent, Catalyst
     » Labels in the example scheme: LiOH; H2O, THF; Pd(OAc)2; Cl-CO2Et, Et3N; Acetone, H2O; SOCl2
     Source: Thieme Pharmaceutical Substances, Ticagrelor (in production)
  7. But: CDX files, often an optical illusion! Authors are very inventive in pursuit of a 'perfect' layout! Appearances are deceiving!
     » Usage of graphical symbols
       • Polymer supports
       • Heteroatoms
  8. Optical illusions 2
     » Unresolvable labels
       • Labels not defined
       • Element symbols used as R-group labels
       • Ambiguous fragment labels (e.g. molecular formula)
  9. Optical illusions 3 » Variable points of attachment
  10. Optical illusions 4 » Reaction arrows / forked arrows / brackets
  11. Approach (Image: © Gerd Altmann / PIXELIO, www.pixelio.de)
  12. Approach
      » The algorithmic approach:
        • A set of rules is applied in the software (generic, not project-specific). The software should recognize all cases that might occur!
        • Project- (title-) specific rules: drawing conventions must not change, otherwise further development is necessary
        • Manual post-correction is required (cost- and time-intensive)
        • The problem is infinite: unprecedented issues cannot be handled
      » The templating approach:
        • Software is developed to recognize a defined set of problems (PS)
        • All content must be manually pre-templated (cost-intensive) according to the capabilities of the software
      » The hybrid approach:
        • Depending on the source, the focus can be placed on either approach
  13. Templating
      » Templating: Guidelines for authors and typesetters
        • Syntax definitions for tables, R-groups etc.
        • Syntax rules for captions
        • Reaction arrangement, forked arrows
        • Rules for reaction conditions (reactants, catalysts, solvents, yields, temperature)
  14. Examples:
      » Algorithmic detection of features
      » Resolution of repeating groups
      » Enumeration of R-groups
      » Resolution of aliases/labels
        • source-specific alias databases
        • continuously extended
      » Table enumeration
        • compound enumeration
        • reaction factual data: caption/yield
      » Variable points of attachment
      » Forked arrows
  15. Projects
  16. Successful Application of CDX Processing: Chemistry Enrichment Workflow* (Wiley Smart Article). *Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, Berlin, October 14 – 17
  17. Templating*. Diagram labels: Author's CDX File; CDX-Templating Guidelines (Structures); Templating; CDX Template; ICSchemeProcessor; Enumerated structures. *Reinhard Neudert: Enhancing the User Experience for Wiley Chemistry Content, ICIC 2012, Berlin, October 14 – 17
  18. Workflow: Science of Synthesis Update. [Figure: reaction scheme with R4- and R5-substituted structures (compounds 39 and 40); structure drawings not recoverable from the transcript.] Workflow labels: ICSchemeProcessor; Scheme Error Report; Correct / extend process; CDX-Templating Guidelines (Reactions); Scheme correction not possible; Manual data entry
  19. Sample: Pharmaceutical Substances Update. Source: Thieme Pharmaceutical Substances, Abiraterone
  20. Conclusion
      » As much algorithmic processing as possible is desirable
        • generic: can be applied to other content as well
        • cheaper (human work is costly!)
      » 100% conversion (without human interaction) is never possible
      » Solutions are project- / source-specific
      » The relevance of automatic extraction will continue to increase
      » Authors / publishers play an essential role in a successful conversion
  21. Acknowledgements
      » Wiley: Michael Forster, Reinhard Neudert
      » Thieme: Guido Herrmann, Rolf Hoppe, Klaus Köberlein
      » InfoChem: Hans Kraut, Sascha Hausberg, Thomas Menke, Manuela Rauh, Fanny Irlinger, Huyen Ngyen, Dagmar Kunzmann
  22. Thank you! (Image: © Thomas Link / Flickr)
  23. Questions?
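
The slides name label classification, alias resolution and R-group enumeration as processing steps but do not show how they work. The following is a minimal Python sketch of those three ideas, not the ICSchemeProcessor implementation. Everything in it is an assumption made for illustration: the alias table, the element subset, the SMILES-like fragment strings, and the function names `classify_label`, `enumerate_rgroups` and `resolve` are all hypothetical.

```python
from itertools import product

# Hypothetical, source-specific alias table (label -> SMILES-like fragment).
# A real table would be far larger and continuously extended.
ALIASES = {"Me": "C", "Et": "CC", "Ph": "c1ccccc1", "OMe": "OC"}

# Illustrative subset of element symbols that authors also use as R-group labels.
ELEMENT_SYMBOLS = {"H", "B", "C", "N", "O", "F", "P", "S", "Y", "W"}


def classify_label(label, defined_rgroups):
    """Classify a text label found in a scheme (assumed categories)."""
    if label in defined_rgroups:
        # Defined as an R-group in the scheme; if it also looks like an
        # element symbol, a human should confirm the intended meaning.
        return "ambiguous" if label in ELEMENT_SYMBOLS else "rgroup"
    if label in ALIASES:
        return "alias"
    if label in ELEMENT_SYMBOLS:
        return "element"
    return "unresolved"


def enumerate_rgroups(rgroup_table):
    """Yield one substituent assignment per combination in an R-group table."""
    names = sorted(rgroup_table)
    for combo in product(*(rgroup_table[name] for name in names)):
        yield dict(zip(names, combo))


def resolve(assignment):
    """Map alias labels to fragments; collect unknown labels for manual review."""
    resolved, unresolved = {}, []
    for name, label in assignment.items():
        if label in ALIASES:
            resolved[name] = ALIASES[label]
        else:
            unresolved.append((name, label))
    return resolved, unresolved


if __name__ == "__main__":
    # "Bn" is deliberately missing from ALIASES to show the manual-review path.
    table = {"R4": ["Me", "Ph"], "R5": ["Et", "Bn"]}
    for assignment in enumerate_rgroups(table):
        resolved, unresolved = resolve(assignment)
        print(assignment, resolved, unresolved)
```

In this sketch, anything the alias table cannot resolve is returned separately rather than guessed, which mirrors the point made in the conclusion that full conversion without human interaction is not possible.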
