Extraction of structural information from ChemDraw CDX files: easy, or an underestimated, difficult challenge?
Josef Eiblmaier (InfoChem, Germany)
In the past decade, various systems for the automatic identification and extraction of chemistry-related information from unstructured sources have emerged. They have opened up new possibilities for organizing, querying, and analyzing chemical content to support the research and development process. Patent authorities and scientific publishers now make available, on a large scale, not only full text and images but also ChemDraw CDX files for many sources. The chemical information in these CDX files is intended primarily for the layout of publications, yet it is often erroneously assumed to be readily usable as input for building structure and reaction databases. In practice, the automatic work-up of chemical structures and reactions from CDX files faces serious obstacles. As a result, the extracted information is often incorrect or incomplete, and therefore not properly accessible to information professionals through structure and reaction searching. This talk will present different approaches to extracting reactions and structures correctly from CDX files and will describe the main difficulties and drawbacks encountered.