ChemSpider is a free-access online database of over 26 million chemical compounds sourced from over 400 different sources including government laboratories, chemical vendors, public resources and publications. ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. ChemSpider has aggregated over 3000 high quality NMR spectra and continues to expand as the community deposits additional data. The majority of the spectral data is licensed as Open Data, allowing it to be downloaded and reused. Members of the community can validate the data, but an automated validation was also undertaken using the NMR prediction and verification routines in ACD/Labs software. The dataset is a "real world" dataset containing the contributions of a number of laboratories around the world, supplying data of varying quality with signal-to-noise (S/N) issues, misreferencing, impurities, etc. This work reports on the batch analysis of the ChemSpider spectral data, including the identification of multiple errors in the spectra.
Validating the ChemSpider Open Spectral Database NMR Collection using ACD/Labs Verification Algorithms
Ryan Sasaki (1), Sergey Golotvin (2) and Antony Williams (3)
(1) Advanced Chemistry Development, Inc. (ACD/Labs)
(2) ACD Moscow Inc., Moscow, Russian Federation
(3) ChemSpider, Royal Society of Chemistry, 904 Tamaras Circle, Wake Forest, North Carolina 27587, USA
Introduction
ChemSpider is a free online database of over 26 million unique chemical compounds sourced from over 400 different sources including government laboratories, chemical vendors, and public resources. ChemSpider allows its users to deposit data including structures, properties, links to external resources, and various forms of spectral data. ChemSpider has aggregated over 2000 high quality NMR spectra and continues to expand as the community deposits additional data. The data are generally validated by the community, but a batch-wise verification of all 1D 1H and 13C NMR spectral data in the database was performed using ACD/Labs NMR verification software.

Sources of Spectral Data
Databases of structures with associated NMR assignments are available as commercial or open data. However, databases of NMR spectral curves are less common and generally limited to metabonomics data (for example, the BMRB (ref. 1) and DrugBank (ref. 2)). One component of the ChemSpider project is to gather, host, and make available a structure-searchable database of spectral data: 1D/2D NMR, IR, Raman, and MS. The majority of data are deposited by users of ChemSpider. Submission of spectra in the form of JCAMP-DX (for 1D spectra) or images/PDF (for 1D or 2D spectra) is supported. To deposit a spectrum, a user simply searches ChemSpider for the associated structure and uploads the JCAMP-DX or image form of the spectrum. Community-based curators validate and annotate the data as appropriate to ensure that only the highest quality data are available in the database. As the data collection grew, a batch-wise validation of the data quality was required, and ACD/Labs NMR verification software was used to perform the analysis.

ACD/Labs NMR Verification Routines
The ACD/Labs approach to 1H NMR verification consists of two steps:
1) The experimental spectrum with an attached chemical structure is automatically processed and analyzed. Analysis includes automated peak picking, integration, and multiplicity analysis (extraction of coupling constants and coupling patterns). In addition, all extraneous signals present in the spectrum are identified (e.g., solvent, reference, known admixtures).
2) Chemical shift, integration, and multiplicity information are predicted for the proposed chemical structure and compared with the related properties extracted from the experimental spectrum. A comparison is then made based on an auto-assignment procedure (ref. 3) that finds the best possible fit as the minimum of a special objective function.

A similar approach is taken for 13C NMR verification, but it compares the experimental and predicted chemical shift values and peak heights. In both cases the output of each verification procedure is a Match Factor metric (0-1) that expresses the level of consistency between the proposed structure and the experimental spectrum. For the purpose of the 1H NMR study, structure-spectrum pairs that generate a match factor >0.8 were considered consistent. For 13C NMR, a match factor of >0.75 was considered consistent.

Analysis of Data
The ACD/Labs automated 1H and 13C verification routines were run on the NMR spectra dataset from ChemSpider. The results of this procedure are shown in Figure 1.

[Figure 1: two pie charts, panels A and B, with the legend Consistent / Ambiguous / Inconsistent. Segment labels are 77%, 16% and 7% in panel A and 67%, 25% and 8% in panel B.]
Figure 1: (A) The ACD/Labs 1H verification methodology suggests that 77% of the 744 NMR spectra submitted to ChemSpider were consistent with the proposed chemical structure. (B) The ACD/Labs 13C verification methodology suggests that 67% of the 704 NMR spectra submitted to ChemSpider were consistent with the proposed chemical structure.

Identified Issues with the Data
Structures that were deemed inconsistent by the ACD/Labs system were manually reviewed. The most frequent reason for inconsistent 1H NMR verification results was the presence of multiple components in the spectrum, i.e., a mixture of isomers. Typically these were recognized by two signals in close proximity with partial integrals (for example 0.6H and 0.4H instead of 1H). Manual inspection of all inconsistent results revealed 22 such cases where mixtures were present.

Other encountered issues include spectra with low resolution, incorrect spectrometer frequency, unknown solvents, and of course a series of incorrectly proposed structures.

[Figure 2: 1H NMR spectrum]
Figure 2: Example of a 1H NMR spectrum with a mixture of components as evidenced by integral values.

Inconsistent results for the 13C NMR data were also evaluated. Close inspection revealed that the biggest culprit was poor S/N, which led to the absence of 13C peaks for quaternary carbons. As a result, the software was unable to find peaks corresponding to quaternary carbons in many proposed structures, and thus a significant number of inconsistent results were observed.

Conclusions
ChemSpider is an online structure database allowing the community to participate in the deposition of additional data. A growing NMR spectral curve data collection is available to download. In this way a major reference source of Open NMR data can be provided. The validation of the existing set of spectral data has been performed using ACD/Labs NMR Verification routines. The data validation work highlighted a number of errors in the data, which have now been resolved, and provided a thorough test of the algorithms on real-world data.

References
1) Biological Magnetic Resonance Bank: http://www.bmrb.wisc.edu/
2) DrugBank: http://www.drugbank.ca/
3) S.S. Golotvin, E. Vodopianov, B.A. Lefebvre, A.J. Williams, and T.D. Spitzer (GSK), Automated Structure Verification Based on 1H NMR Prediction, Magn. Reson. Chem., 44 (5) 524–538, 2006.

Tel: (416) 368-3435
Fax: (416) 368-5596
Toll Free: 1-800-304-3988
Email: info@acdlabs.com
www.acdlabs.com
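
Illustrative sketch: the consistency thresholds quoted for the Match Factor (>0.8 for 1H, >0.75 for 13C) can be applied to a batch of results roughly as follows. This is only a minimal sketch and not the ACD/Labs implementation. The function names and the sample match factors are hypothetical, and because the poster does not state where the boundary between ambiguous and inconsistent results lies, the code only separates consistent pairs from everything else.

```python
# Consistency thresholds quoted in the text (the match factor ranges from 0 to 1).
CONSISTENT_THRESHOLD = {"1H": 0.8, "13C": 0.75}


def is_consistent(nucleus: str, match_factor: float) -> bool:
    """Return True when a structure-spectrum pair exceeds the quoted threshold."""
    if not 0.0 <= match_factor <= 1.0:
        raise ValueError("match factor must lie between 0 and 1")
    return match_factor > CONSISTENT_THRESHOLD[nucleus]


def consistent_fraction(nucleus: str, match_factors: list[float]) -> float:
    """Fraction of a batch that is consistent with the proposed structures."""
    if not match_factors:
        raise ValueError("empty batch")
    hits = sum(is_consistent(nucleus, mf) for mf in match_factors)
    return hits / len(match_factors)


# Hypothetical match factors, invented for this example only.
sample_1h = [0.95, 0.81, 0.80, 0.42]
print(consistent_fraction("1H", sample_1h))  # two of the four values exceed 0.8
```

The comparison is strict (greater than), matching the ">" in the text, so a pair scoring exactly 0.8 for 1H is not counted as consistent.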