ChemSpider as an integration hub for interlinked chemistry data

Antony Williams
SETAC
November 18th 2013

How Much Data Online?
• How much data regarding environmental
toxicology and chemistry is online?
• How can it all be mapped together?

A Grand Challenge….
• Let’s map together all historical chemistry
data and build systems to integrate new data
• Let’s integrate chemistry, toxicology and
biology data and add in disease data too
• Let’s model the data and see if we can
extract new relationships – quantitative and
qualitative
• Let’s make it all available on the web

What about this….
• We’re going to map the world
• We’re going to take photos of as many
places as we can and link them together
• We’ll let people annotate and curate the map
• Then let’s make it available free on the web
• We’ll make it available for decision making
• Put it on Mobile Devices, Give it Away

The World of Online Chemistry
• Property databases
• Compound aggregators
• Screening assay results
• Scientific publications
• Encyclopedic articles (Wikipedia)
• Metabolic pathway databases
• ADME/Tox data – eTOX for example
• Blogs/Wikis and Open Notebook Science

How to Map Data Together
• Download the structure representations
and map together at the structure level
• Integrate and mesh chemical names,
chemical properties, analytical data
• Carry URL links and retain external links to
original data sets (assume no link decay)
• It sounds easy….
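Mapping at the structure level can be sketched as grouping records from several sources under one shared structure key. This is a minimal sketch, assuming every source supplies a standard InChIKey per record; the field names, source names and URLs are hypothetical placeholders, not ChemSpider's actual schema.

```python
from collections import defaultdict

# Hypothetical records from three sources; field names and URLs are placeholders.
records = [
    {"source": "source_a", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
     "name": "ethanol", "url": "https://example.org/a/1"},
    {"source": "source_b", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
     "name": "ethyl alcohol", "url": "https://example.org/b/77"},
    {"source": "source_c", "inchikey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
     "name": "water", "url": "https://example.org/c/5"},
]

def merge_by_structure(records):
    """Group records by structure key, pooling names and keeping source links."""
    hub = defaultdict(lambda: {"names": set(), "links": []})
    for rec in records:
        entry = hub[rec["inchikey"]]
        entry["names"].add(rec["name"])
        # Retain the link back to the original data set instead of copying its data.
        entry["links"].append((rec["source"], rec["url"]))
    return dict(hub)

hub = merge_by_structure(records)
for key, entry in hub.items():
    print(key, sorted(entry["names"]), len(entry["links"]))
```

The hard part in practice is that sources disagree on the structure behind a name, so the keys do not always line up this neatly.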

ChemSpider
• Build a HUB connecting as many data
sources as possible
• NOT to harvest all data from each data source
• Today we have >29 million unique chemicals
from >500 data sources
• Focus on improving data quality
• Allow users to enhance, curate and annotate

Identifiers are very useful! But
what about when they are “closed”?

Mappings and Inconsistencies
Imatinib mesylate as recorded in ChemSpider, DrugBank and PubChem

InChI Strings Hash to InChIKeys
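The idea of hashing a variable-length InChI string into a fixed-length, searchable key can be illustrated with a toy function. This is only a sketch of the principle: the official InChIKey uses a truncated SHA-256 hash with its own base-26 encoding and flag characters, none of which is reproduced here. The function `toy_key`, its letter encoding and its simplified layer split are assumptions made for illustration.

```python
import hashlib
import string

def letters(text, length):
    """Encode a SHA-256 digest of text as `length` uppercase letters (toy encoding)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return "".join(string.ascii_uppercase[b % 26] for b in digest[:length])

def toy_key(inchi):
    """Return a two-block key: skeleton hash, then hash of the remaining layers."""
    body = inchi.split("/", 1)[1]  # drop the "InChI=1S" prefix
    layers = body.split("/")
    # Formula, connections (c) and hydrogens (h) are treated here as the skeleton.
    skeleton = [l for i, l in enumerate(layers) if i == 0 or l[0] in "ch"]
    rest = [l for i, l in enumerate(layers) if i != 0 and l[0] not in "ch"]
    return letters("/".join(skeleton), 14) + "-" + letters("/".join(rest), 8)

# Alanine without and with stereo layers.
flat = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
stereo = flat + "/t2-/m0/s1"
k_flat, k_stereo = toy_key(flat), toy_key(stereo)
print(k_flat)
print(k_stereo)
```

Both keys share the same first block because the skeleton is identical; only the second block changes when stereo layers are added. That property is what makes a separate skeleton search possible.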

Vancomycin – Search the
Internet

Vancomycin

Search Molecular
SKELETON

Search Full Molecule

Full Skeleton Search: 529 Hits

Full Molecule Search: 294 Hits
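The difference between the two searches can be sketched as matching either the whole key or only its first block. This is a minimal sketch, assuming documents are indexed by standard InChIKey, where the first 14-character block encodes the skeleton; the keys and document names below are hypothetical placeholders.

```python
# Hypothetical index of web documents by the InChIKey they mention.
index = {
    "doc1": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "doc2": "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, different stereo
    "doc3": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "doc4": "DDDDDDDDDDDDDD-UHFFFAOYSA-N",  # unrelated compound
}

def search_full(index, key):
    """Documents whose key matches the query exactly."""
    return sorted(doc for doc, k in index.items() if k == key)

def search_skeleton(index, key):
    """Documents whose first key block matches the query's first block."""
    block = key.split("-")[0]
    return sorted(doc for doc, k in index.items() if k.split("-")[0] == block)

query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
full_hits = search_full(index, query)
skeleton_hits = search_skeleton(index, query)
print(len(skeleton_hits), len(full_hits))  # prints: 3 2
```

A skeleton search always returns at least as many hits as a full-molecule search, which is consistent with the 529 versus 294 hits reported for vancomycin.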

Historical Data for reference
• As evidence that InChI is proliferating and
data is improving:
• Three years ago there were only 104 hits
on the complete InChI online
• Only 4 were correct

What you might not know about
Chemistry Databases on the Internet

PHYSPROP Database

• The freely downloadable
database under the EPI
Suite prediction software
• Very basic filters suggest
data quality issues
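The slides do not say which filters were applied, so the following is only a sketch of what very basic filters on a property database might look like. The field names (`smiles`, `mp_c`, `bp_c`) and the sample values are hypothetical.

```python
def basic_checks(rec):
    """Return a list of issue labels for one property record (hypothetical fields)."""
    issues = []
    if not rec.get("smiles"):
        issues.append("missing structure")
    mp, bp = rec.get("mp_c"), rec.get("bp_c")
    if mp is not None and bp is not None and mp > bp:
        issues.append("melting point above boiling point")
    for field in ("mp_c", "bp_c"):
        value = rec.get(field)
        if value is not None and value < -273.15:
            issues.append(f"{field} below absolute zero")
    return issues

# Made-up records for illustration only.
records = [
    {"id": 1, "smiles": "CCO", "mp_c": -100.0, "bp_c": 80.0},
    {"id": 2, "smiles": "", "mp_c": 150.0, "bp_c": 90.0},
    {"id": 3, "smiles": "C", "mp_c": -300.0, "bp_c": None},
]
report = {rec["id"]: basic_checks(rec) for rec in records}
for rec_id, issues in report.items():
    print(rec_id, issues)
```

Checks like these cannot prove a value is right, but they surface records that deserve a second look.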

The Stereochemistry challenge.
12500 chemicals with “missed” stereo
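One way to look for "missed" stereo is to compare what the name implies with what the structure key encodes. The slides do not describe how the 12,500 records were found, so this is a sketch under assumptions: it relies on the fact that a standard InChIKey with no stereo or isotopic layers has the second block `UHFFFAOYSA`, and it uses a deliberately crude regular expression for stereo descriptors in names. The names and keys are hypothetical placeholders.

```python
import re

# Crude hints that a name specifies stereochemistry: (R), (S), (E), (Z), (+), (-), cis-, trans-
STEREO_HINT = re.compile(r"\((?:[RSEZ]|\+|-)\)|\b(?:cis|trans)-")
# Second block of a standard InChIKey when no stereo or isotopic layers are present.
NO_STEREO_BLOCK = "UHFFFAOYSA"

def missed_stereo(name, inchikey):
    """True if the name implies stereo but the key carries none."""
    second_block = inchikey.split("-")[1]
    return bool(STEREO_HINT.search(name)) and second_block == NO_STEREO_BLOCK

# Hypothetical name/key pairs.
candidates = [
    ("(S)-example-amine", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("trans-example-ene", "BBBBBBBBBBBBBB-CCCCCCCCSA-N"),
    ("example-ol", "DDDDDDDDDDDDDD-UHFFFAOYSA-N"),
]
flagged = [name for name, key in candidates if missed_stereo(name, key)]
print(flagged)
```

Only the first record is flagged: its name carries an (S) descriptor while its key has no stereo layer.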

But ChemSpider is curated, right?

Originally 15 compounds “called” Yohimbine
54 Skeletons for Yohimbine

Crowdsourced Curation
• Crowd-sourced curation: identify/tag errors,
edit names, synonyms, identify records to
deprecate

Chemical name dictionaries for:
• Text-mining (publications, patents)
• Used to index PubMed and link Google Patents

• Linking to other databases – think Biology!
• When structures are not available, names provide the link

• Searching the web
• Names link to structures, which link to InChIs
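Dictionary-based linking of names to structures can be sketched as a lookup of known synonyms in free text. This is a toy sketch, not the indexing pipeline used for PubMed or Google Patents; the keys are placeholders rather than real InChIKeys, and `tag_names` is a hypothetical helper.

```python
import re

# Toy name dictionary: synonym -> structure key (placeholder keys, not real InChIKeys).
name_to_key = {
    "vincristine": "KEY-0001",
    "leurocristine": "KEY-0001",
    "vancomycin": "KEY-0002",
}

def tag_names(text, dictionary):
    """Return (matched name, key) pairs for dictionary names found in text."""
    # Longest names first so longer synonyms win over shorter ones.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, names)) + r")\b", re.IGNORECASE
    )
    return [(m.group(1), dictionary[m.group(1).lower()])
            for m in pattern.finditer(text)]

text = "Leurocristine (vincristine) and vancomycin are both mentioned here."
hits = tag_names(text, name_to_key)
for name, key in hits:
    print(name, "->", key)
```

Two different names resolve to the same key, which is how a curated dictionary lets name-based sources such as patents and articles link back to one structure.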

I want to know about “Vincristine”

Vincristine: Identifiers to link

Vincristine: Vendors and Sources
Linked by Structure

Vincristine: Patents
Linked by Name

Vincristine: Articles
Linked by Name

What needs to happen?
• Standards
• Standardization of structures
• More sharing of data – downloadable data
collections for mapping, meshing and integration
• InChI adoption

• Collaboration
• Stop reinventing the wheel
• Share data, share efforts and speed the
process

What if we could capture it all?
Digitally Enhancing the RSC Archive

Start with data in publications

Text Mining
The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea
prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml)
were charged into a glass reaction vessel equipped with a
mechanical stirrer, thermometer and reflux condenser.
The reaction mixture was heated at reflux with stirring, for a
period of about one-half hour.
After this time the benzene and unreacted thionyl chloride
were stripped from the reaction mixture under reduced
pressure to yield the desired product
N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea
as a solid residue.
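Pulling reagents and amounts out of a passage like this can be sketched with a regular expression. Real chemical text mining relies on dedicated name recognition tools; the pattern below is a simplified assumption that only handles one- or two-word lowercase reagent names followed by a parenthesised amount, and the passage is shortened from the example above.

```python
import re

# Shortened version of the example passage (long compound name replaced by "urea").
passage = (
    "The urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) "
    "were charged into a glass reaction vessel."
)

# A reagent name must follow ", " or "and ", and be followed by "(<number> <unit>)".
REAGENT = re.compile(
    r"(?:^|(?<=, )|(?<=and ))"
    r"([a-z]+(?: [a-z]+)?)\s*"
    r"\(\s*(\d+(?:\.\d+)?)\s*(ml|g|mg)\s*\)"
)

def extract_reagents(text):
    """Return (name, amount, unit) tuples found in text."""
    return [(m.group(1), float(m.group(2)), m.group(3))
            for m in REAGENT.finditer(text)]

reagents = extract_reagents(passage)
for name, amount, unit in reagents:
    print(f"{name}: {amount} {unit}")
```

The extracted tuples are the kind of structured record that can then be linked to compounds through a name dictionary.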

Turn “Figures” Into Data

FIGURE

EXTRACTED DATA

Conclusions
• There are some amazing online resources for
environmental toxicology and chemistry already!
• ChemSpider has an important role in quality
data and linking resources
• Crowdsourced deposition, validation and
curation works
• Standards are an important part of data linking
• MORE collaboration and data sharing can
benefit us all

Thank you
Email: williamsa@rsc.org
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
