The RSC chemical validation and standardization platform, a potential path to quality-conscious databases

US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure

High-quality chemical databases struggle to protect their data from the flood of machine-generated chemistry and lower-quality data. The era in which records were curated primarily by humans before deposition is over, and quality-conscious databases must rely heavily on automated validation checks. The cheminformatics team at the Royal Society of Chemistry is developing an automated chemical validation system to act as the “quality gatekeeper” of databases at the point of deposition. ChemSpider is leading a community-wide standardization approach, starting with our support of the Open PHACTS semantic web project, an Innovative Medicines Initiative project. The Chemical Validation and Standardization Platform (CVSP) is being designed as an open, flexible platform that validates and standardizes chemical records. This presentation reviews the existing beta version of the system and the work in progress.

Chemistry Validation and
Standardization Platform
Modularization and
“Hadoop”ization
Kenneth Karapetyan, Colin Batchelor,
Valery Tkachenko, Antony Williams
ACS New Orleans April 2013

Overview
• Motivation
• What we support
• Modularization
• Parallelization
• Examples

Motivation: validation
Open and free chemical validation system for:
•Structure validation
– Warn on query atoms, pseudo atoms, polymers,
etc.
– Nonsensical stereo
•SDF field mapping for validating depositor-
provided names, InChI, SMILES
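
The structure checks listed above can be illustrated with a minimal Python sketch. This is not CVSP code (CVSP is written in C#). It assumes the input is an MDL V2000 molblock, in which the atom symbol occupies columns 32-34 of each atom line and bond type 8 means "any". The function names, the message wording and the demo record are hypothetical.

```python
QUERY_SYMBOLS = {"A", "Q", "*", "L"}  # V2000 query atoms and atom lists
RGROUP_SYMBOLS = {"R", "R#"}          # R-group placeholders


def validate_molblock(molblock):
    """Return human-readable issues found in a V2000 molblock."""
    lines = molblock.splitlines()
    counts = lines[3]  # counts line follows the three header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    issues = []
    for i, line in enumerate(lines[4:4 + n_atoms], start=1):
        symbol = line[31:34].strip()
        if symbol in QUERY_SYMBOLS:
            issues.append(f"atom {i}: query atom '{symbol}'")
        elif symbol in RGROUP_SYMBOLS:
            issues.append(f"atom {i}: R group")
    bond_start = 4 + n_atoms
    for j, line in enumerate(lines[bond_start:bond_start + n_bonds], start=1):
        if int(line[6:9]) == 8:
            issues.append(f"bond {j}: query (any) bond")
    return issues


def atom_line(symbol, x=0.0, y=0.0, z=0.0):
    """Build a fixed-width V2000 atom line (demo helper)."""
    return (f"{x:10.4f}{y:10.4f}{z:10.4f} {symbol:<3}"
            " 0  0  0  0  0  0  0  0  0  0  0  0")


def bond_line(first, second, bond_type):
    """Build a fixed-width V2000 bond line (demo helper)."""
    return f"{first:3d}{second:3d}{bond_type:3d}  0"


# Hypothetical three-atom record: carbon, a query atom and an R group.
DEMO = "\n".join([
    "demo", "  hypothetical", "",
    f"{3:3d}{2:3d}  0  0  0  0  0  0  0  0999 V2000",
    atom_line("C"), atom_line("A", 1.0), atom_line("R#", 2.0),
    bond_line(1, 2, 1), bond_line(2, 3, 8),
    "M  END",
])

for issue in validate_molblock(DEMO):
    print(issue)
```

Checks such as nonsensical stereo or polymer detection need a full cheminformatics toolkit and are outside the scope of this sketch.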

Motivation: standardization
Lets users run the default CVSP standardization workflow (or the FDA,
Open PHACTS and other workflows)
Lets users assemble their own workflow from the modules
provided:
•Apply default CVSP or user-defined SMIRKS rules
•Layout
•Neutralize
•Get canonical tautomer using ChemAxon’s algorithms
•Get biggest organic fragment
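
The idea of assembling a workflow from modules can be sketched in Python as a list of functions applied in order. This is an illustration only, not the CVSP implementation: it works on SMILES strings with a simple regular expression instead of a chemistry toolkit, treats "organic" as "contains carbon", and all names (`run_workflow`, `biggest_organic_fragment`, `DEFAULT_WORKFLOW`) are hypothetical.

```python
import re

# Matches one SMILES atom: a bracket atom such as [Na+] or [nH], or an
# organic-subset atom such as C, Cl or aromatic c.
ATOM_RE = re.compile(
    r"\[\d*([A-Z][a-z]?|[a-z]{1,2})[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])"
)


def heavy_atoms(fragment):
    """Return the element symbols of all non-hydrogen atoms in a fragment."""
    symbols = [(a or b).capitalize() for a, b in ATOM_RE.findall(fragment)]
    return [s for s in symbols if s != "H"]


def biggest_organic_fragment(smiles):
    """Keep the carbon-containing fragment with the most heavy atoms."""
    organic = [f for f in smiles.split(".") if "C" in heavy_atoms(f)]
    if not organic:
        return smiles  # nothing organic: leave the record unchanged
    return max(organic, key=lambda f: len(heavy_atoms(f)))


def run_workflow(smiles, modules):
    """Apply each module to the record, in the order given."""
    for module in modules:
        smiles = module(smiles)
    return smiles


# A toy workflow: trim whitespace, then keep the biggest organic fragment.
DEFAULT_WORKFLOW = [str.strip, biggest_organic_fragment]

print(run_workflow(" CC(=O)[O-].[Na+] ", DEFAULT_WORKFLOW))
```

A user-defined workflow is then just a different list of modules. Steps such as neutralization, SMIRKS rules or canonical tautomer generation would be further functions with the same signature, backed by a real toolkit.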

What we support
• SD files and mol files
• ChemDraw files (in-house code)
• Tab-delimited text files of names, InChIs,
SMILES

• Zipped files
• GZipped files
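
As a rough illustration of handling plain, zipped and gzipped uploads, the Python sketch below decompresses an upload based on its file name and splits SD content into records at the `$$$$` delimiter. It uses only the standard library; the function names are hypothetical and the choice of UTF-8 decoding is an assumption.

```python
import gzip
import io
import zipfile


def decode_upload(filename, data):
    """Yield (member_name, text) for a plain, gzipped or zipped upload."""
    lower = filename.lower()
    if lower.endswith(".gz"):
        yield filename[:-3], gzip.decompress(data).decode("utf-8")
    elif lower.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for name in archive.namelist():
                yield name, archive.read(name).decode("utf-8")
    else:
        yield filename, data.decode("utf-8")


def split_sd_records(text):
    """Split SD file text into records; '$$$$' ends each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append("\n".join(current))  # final record without delimiter
    return records
```

For example, `split_sd_records` applied to the text of each member returned by `decode_upload` gives one string per structure record, whichever container the depositor used.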

“Hadoop”ization
Apache Hadoop is a framework for the distributed processing of large data
sets across clusters of computers.

CVSP is written in C#. To run it on Linux machines we use Mono (cross-
platform .NET runtime environment)

Farm:
•28 CPU cores
•42G memory
•2T disk space

Processor intensive tasks
•Tautomerization

Input file → Deposit ID in database → Convert to SD format → Upload to farm for
processing on Hadoop → Hadoop processing → Upload results to database for user
preview → Download results

Hadoop queues
Three Hadoop queues (capacity queues) are used to prioritize small and large CVSP
submissions
•“Small” submission queue for submissions under 500 records
•Large submissions queue
•Internal queue
– For internal projects, e.g. tautomer analysis of ChemSpider or
ChemSpider standardization

All records must be processed on Hadoop before the user can see the results (no partial
preview)
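
The routing rule described above can be written as a short Python sketch. Only the 500-record threshold comes from the slide; the queue names and the `internal` flag are hypothetical.

```python
SMALL_SUBMISSION_LIMIT = 500  # from the slide: "under 500 records"


def choose_queue(record_count, internal=False):
    """Pick a queue name for a submission (names are illustrative)."""
    if internal:
        return "internal"  # e.g. tautomer analysis of ChemSpider
    if record_count < SMALL_SUBMISSION_LIMIT:
        return "small"
    return "large"


print(choose_queue(120))    # small
print(choose_queue(6516))   # large
```

Keeping small submissions in their own queue means a short job is not stuck behind a large deposition running for hours.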

Examples
DrugBank
•~6500 records, approximately 2 records per
second
PubMed
•~100 000 records, about 9 h
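
For comparison, the two figures above can be converted to the same units. The arithmetic below uses only the rounded numbers quoted on the slide, so the results are approximate.

```python
# DrugBank: ~6500 records at about 2 records per second.
drugbank_records = 6500
drugbank_rate = 2.0  # records per second
drugbank_minutes = drugbank_records / drugbank_rate / 60

# PubMed: ~100 000 records in about 9 hours.
pubmed_records = 100_000
pubmed_hours = 9
pubmed_rate = pubmed_records / (pubmed_hours * 3600)

print(f"DrugBank: about {drugbank_minutes:.0f} minutes in total")
print(f"PubMed: about {pubmed_rate:.1f} records per second")
```

On these numbers the DrugBank run takes a little under an hour, and the PubMed run averages roughly 3 records per second.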

Rate-limiting step?
Canonical tautomerization
This molecule took
45 min to
canonicalize.

DrugBank dataset (6516 records)
Errors
•2 records with query(any) bond
•2 records with R groups
•3 polymers
•18 porphyrins with metal coordinated inside with one of the
metal-nitrogen bonds stereogenic
•Unusual valence: ~20

Warnings
•InChI not matching structure (100+)
•SMILES not matching structure (100+)
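
One way to make an "InChI not matching structure" warning more informative is to report which layers differ between the depositor-provided InChI and the InChI computed from the structure. The Python sketch below only compares two strings; generating an InChI from a structure requires the InChI software and is not shown. The function names are hypothetical, and the parser is simplified: it assumes each layer prefix occurs once, which does not hold for every InChI (for example, those with isotopic layers). The sample string is the InChI quoted in this deck for DrugBank ID DB00755, and the "computed" variant is made up for the demonstration.

```python
def inchi_layers(inchi):
    """Split an InChI string into a {layer_prefix: content} dictionary."""
    prefix, _, body = inchi.partition("=")
    if prefix != "InChI" or not body:
        raise ValueError("not an InChI string")
    parts = body.split("/")
    layers = {"version": parts[0]}
    rest = parts[1:]
    # The formula layer has no prefix letter; other layers start lowercase.
    if rest and rest[0] and not rest[0][0].islower():
        layers["formula"] = rest[0]
        rest = rest[1:]
    for part in rest:
        if part:
            layers[part[0]] = part[1:]
    return layers


def differing_layers(inchi_a, inchi_b):
    """Return the sorted names of layers whose content differs."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


DEPOSITED = (
    "InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-"
    "20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)"
    "/b9-6+,12-11+,15-8+,16-14+"
)
# Hypothetical computed InChI with one double-bond descriptor flipped.
COMPUTED = DEPOSITED.replace("/b9-6+,", "/b9-6-,")

print(differing_layers(DEPOSITED, COMPUTED))
```

A difference confined to the `b` layer, as in this made-up pair, points to a double-bond stereo discrepancy rather than a different connectivity.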

DrugBank ID: DB00755
InChI=1S/C20H28O2/c1-15(8-6-9-16(2)14-19(21)22)11-12-18-17(3)10-7-13-20(18,4)5/h6,8-9,11-12,14H,7,10,13H2,1-5H3,(H,21,22)/b9-6+,12-11+,15-8+,16-14+

DrugBank ID: DB00614

Stereo issues

J. Brecher, Pure Appl. Chem., 2008, doi:10.1351/pac200880020277

DB08128 DB06287

Please try CVSP at

http://cv.beta.rsc-us.org

Thank you

E-mail: karapetyank@rsc.org, batchelorc@rsc.org

6. CVSP: modularization

7. Reusable workflows

8. SMIRKS-based rules
