High-quality chemical databases are struggling to protect their data from the flow of wild machine-generated chemistry and lower-quality data. The period of primarily human curation prior to deposition in a database is gone, and quality-conscious databases need to rely heavily on automated validation checks. The cheminformatics team at the Royal Society of Chemistry is developing an automated chemical validation system to be the “quality gatekeeper” of databases at the point of deposition. ChemSpider is leading a community-wide standardization approach, starting with our support of the Open PHACTS semantic web project, an Innovative Medicines Initiative project. The Chemical Validation and Standardization Platform (CVSP) is being designed as an open, flexible platform that validates and standardizes chemical records. This presentation will review the existing beta version of the system and work in progress.
The RSC chemical validation and standardization platform, a potential path to quality-conscious databases
1. Chemistry Validation and Standardization Platform
Modularization and “Hadoop”ization
Kenneth Karapetyan, Colin Batchelor, Valery Tkachenko, Antony Williams
ACS New Orleans April 2013
2. Overview
• Motivation
• What we support
• Modularization
• Parallelization
• Examples
3. Motivation: validation
Open and free chemical validation system for:
• Structure validation
– Warn on query atoms, pseudo atoms, polymers, etc.
– Nonsensical stereo
• SDF field mapping for validating depositor-provided names, InChI, SMILES
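The SDF field mapping step can be illustrated with a minimal sketch. It assumes the standard SD file convention of `> <FIELD>` data headers; the `FIELD_MAP` table and the function names are hypothetical and are not taken from CVSP. The sketch only collects depositor-provided identifiers; comparing them against the structure would need a cheminformatics toolkit.

```python
import re

# Hypothetical mapping from depositor field names to identifier kinds;
# real deposits use many more spellings than these.
FIELD_MAP = {
    "SMILES": "smiles",
    "CANONICAL_SMILES": "smiles",
    "INCHI": "inchi",
    "STANDARD_INCHI": "inchi",
    "NAME": "name",
    "GENERIC_NAME": "name",
}

# An SD data header looks like "> <FIELD_NAME>".
_HEADER = re.compile(r"^>.*<([^>]+)>")


def parse_sd_fields(record: str) -> dict:
    """Return {field_name: value} for the data items of one SD record."""
    fields = {}
    name = None
    for line in record.splitlines():
        match = _HEADER.match(line)
        if match:
            name = match.group(1).strip()
            fields[name] = []
        elif line.strip() == "$$$$":
            break  # end of record
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line ends the data item
            else:
                fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}


def map_fields(fields: dict) -> dict:
    """Group depositor-provided values by identifier kind."""
    mapped = {}
    for key, value in fields.items():
        kind = FIELD_MAP.get(key.upper())
        if kind is not None:
            mapped.setdefault(kind, []).append(value)
    return mapped
```

Fields that are not in the mapping are simply ignored here; a real deposition form would presumably let the depositor choose the mapping.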
4. Motivation: standardization
Allows users to use the default CVSP standardization workflow (or FDA, Open PHACTS and so on)
Allows users to put together their own workflow using the modules provided:
• Apply default CVSP or user-defined SMIRKS rules
• Layout
• Neutralize
• Get canonical tautomer using ChemAxon’s algorithms
• Get biggest organic fragment
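The idea of assembling a workflow from modules can be sketched as a list of named steps applied in order. Everything here is an illustrative assumption: the step functions are toy string operations on SMILES, not CVSP modules, and the rule step uses plain text substitution as a stand-in for real SMIRKS transforms.

```python
from typing import Callable, List, Tuple

# A workflow step is a (name, function) pair; the function maps one
# SMILES string to another. All names here are hypothetical.
Step = Tuple[str, Callable[[str], str]]


def largest_fragment(smiles: str) -> str:
    """Toy stand-in: keep the longest dot-separated SMILES component."""
    return max(smiles.split("."), key=len)


def make_rule_step(rules: List[Tuple[str, str]]) -> Callable[[str], str]:
    """Build a step from (pattern, replacement) text rules.

    This is literal string replacement, not SMIRKS matching.
    """
    def apply_rules(smiles: str) -> str:
        for pattern, replacement in rules:
            smiles = smiles.replace(pattern, replacement)
        return smiles
    return apply_rules


def run_workflow(smiles: str, steps: List[Step]) -> Tuple[str, List[str]]:
    """Apply steps in order; return the result and the steps that changed it."""
    changed = []
    for name, step in steps:
        result = step(smiles)
        if result != smiles:
            changed.append(name)
        smiles = result
    return smiles, changed


user_workflow = [
    ("largest fragment", largest_fragment),
    ("user rules", make_rule_step([("N(=O)=O", "[N+](=O)[O-]")])),
]
print(run_workflow("O.CN(=O)=O", user_workflow))
```

Recording which steps changed a record is a design choice of this sketch; it makes it easy to report to the depositor what standardization actually did.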
5. What we support
• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs, SMILES
• Zipped files
• GZipped files
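Accepting plain, zipped and gzipped uploads through one entry point could look like the following sketch, which uses only the Python standard library. The function name `read_deposit`, the `"upload"` key and the UTF-8 assumption are illustrative and do not describe how CVSP handles uploads.

```python
import gzip
import io
import zipfile


def read_deposit(data: bytes) -> dict:
    """Return {member_name: text} for a plain, gzipped, or zipped upload.

    Assumes UTF-8 text; single-file uploads are stored under "upload".
    """
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return {"upload": gzip.decompress(data).decode("utf-8")}
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {
                info.filename: archive.read(info).decode("utf-8")
                for info in archive.infolist()
                if not info.is_dir()
            }
    return {"upload": data.decode("utf-8")}
```

Detecting the container from the content rather than the file extension avoids trusting depositor-supplied file names.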
12. “Hadoop”ization
Apache Hadoop is a framework for the distributed processing of large data sets across clusters of computers.
CVSP is written in C#. To run it on Linux machines we use Mono (a cross-platform .NET runtime environment).
Farm:
• 28 CPU cores
• 42 GB memory
• 2 TB disk space
Processor-intensive tasks
• Tautomerization
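Distributing processor-intensive work requires cutting a submission into independent units. The slides do not say how CVSP splits its input, so the following is only a sketch under the assumption that work is divided at SD record boundaries (the `$$$$` terminator); the function names and the batch size are hypothetical.

```python
from typing import Iterable, Iterator, List


def iter_sd_records(text: str) -> Iterator[str]:
    """Yield each record of an SD file, including its "$$$$" terminator."""
    current: List[str] = []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "$$$$":
            yield "".join(current)
            current = []
    if "".join(current).strip():
        yield "".join(current)  # trailing record without a terminator


def batches(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group records into lists of at most `size` for independent workers."""
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Because each record is validated and standardized independently, batches like these can be handed to separate workers without coordination.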
13. Processing flow (diagram)
• Input file
• Convert to SD format
• Deposit ID in database
• Upload to farm for processing on Hadoop
• Hadoop processing
• Upload results to database for user preview
• Download results
14. Hadoop queues
Three Hadoop queues (capacity queues) are used to prioritize CVSP submissions by size
• “Small” submission queue for submissions under 500 records
• Large submissions queue
• Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or ChemSpider standardization
All records have to be processed on Hadoop before the user can see the results (no partial preview)
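The routing rule described on this slide can be written as a small function. The queue names and the `internal` flag are hypothetical labels; only the 500-record threshold comes from the slide.

```python
SMALL_LIMIT = 500  # submissions under 500 records go to the small queue


def choose_queue(record_count: int, internal: bool = False) -> str:
    """Pick a queue name for a submission (names are illustrative)."""
    if internal:
        return "internal"
    if record_count < SMALL_LIMIT:
        return "small"
    return "large"


print(choose_queue(120))   # small
print(choose_queue(6516))  # large
```

Keeping small submissions in their own queue means a quick check of a few hundred records is not stuck behind a large deposition.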
17. DrugBank dataset (6516 records)
Errors
• 2 records with query (any) bond
• 2 records with R groups
• 3 polymers
• 18 porphyrins with metal coordinated inside with one of the metal-nitrogen bonds stereogenic
• Unusual valence: ~20
Warnings
• InChI not matching structure (100+)
• SMILES not matching structure (100+)