Chemistry Online: The Vision and Challenges Associated with Building the ChemSpider Resource for Chemists

Chemistry Online – The Vision and
Challenges Associated With Building
the ChemSpider Resource for Chemists

Antony Williams
Merck, October 2012

It is so difficult to navigate…
 IP?
 What’s the structure?
 Are they in our file?
 What’s similar?
 What’s the target?
 Pharmacology data?
 Known Pathways?
 Competitors?
 Working On Now?
 Connections to disease?
 Expressed in right cell type?

The World of Online Chemistry
 Property databases
 Compound aggregators
 Screening assay results
 Scientific publications
 Encyclopedic articles (Wikipedia)
 Metabolic pathway databases
 ADME/Tox data – eTOX for example
 Blogs/Wikis and Open Notebook Science
 Contributing Open Source code to projects

Collaborative Knowledge Management

We Want to Answer Questions
 Questions a chemist might ask…
 What is the melting point of n-heptanol?
 What is the chemical structure of Xanax?
 Chemically, what is phenolphthalein?
 What are the stereocenters of cholesterol?
 Where can I find publications about xylene?
 What are the different trade names for Ketoconazole?
 What is the NMR spectrum of Aspirin?
 What are the safety handling issues for Thymol Blue?

Available Information…
 Linked to vendors, safety data, toxicity, metabolism

Crowdsourced “Annotations”
 Users can add
 Descriptions/Syntheses/Commentaries
 Links to PubMed articles
 Links to articles via DOIs
 Add spectral data
 Add Crystallographic Information Files
 Add photos
 Add MP3 files
 Add Videos

Chemistry Data online is messy
 We have inherited errors
 All public compound databases, including ours,
have errors
 “Incorrect” structures – assertions, timelines etc
 “Incorrect” names associated with structures
 Properties
 Links
 Publications
 ENORMOUS CHALLENGE

What could create change?
 Harvard Business Review (2010)

“One change would make a substantial
difference [to drug R&D]: the creation of
agreed-upon standards for digitally
representing drug assets.”

Consider drug structures ONLY…

MeSH
 A lipid cofactor that is required for normal blood
clotting. Several forms of vitamin K have been
identified: VITAMIN K 1 (phytomenadione)
derived from plants, VITAMIN K 2 (menaquinone)
from bacteria, and synthetic naphthoquinone
provitamins, VITAMIN K 3 (menadione). Vitamin K 3
provitamins, after being alkylated in vivo, exhibit the
antifibrinolytic activity of vitamin K. Green leafy
vegetables, liver, cheese, butter, and egg yolk are
good sources of vitamin K

What is the Structure of Vitamin K1?

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”
 Variants of systematic names on PubChem

 2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
 2-methyl-3-[(E)-3,7,11,15-tetramethyl
 2-methyl-3-(3,7,11,15-tetramethyl
 2-methyl-3-[(E)-3,7,11,15-tetramethyl
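
The variants above differ only in how much stereochemistry they spell out. As a rough illustration of why so many names can coexist for one drug, the sketch below enumerates every way the three stereo elements visible in those descriptors (one E/Z double bond, stereocentres at positions 7 and 11) can be labelled or left unspecified. The function and constant names are made up for this example and are not part of any database API.

```python
from itertools import product

# Stereo elements read off the descriptors in the names above:
# one double bond (E/Z) and two stereocentres (positions 7 and 11).
# None means "left unspecified in the name".
DOUBLE_BOND = ("E", "Z", None)
CENTRE_7 = ("7R", "7S", None)
CENTRE_11 = ("11R", "11S", None)


def descriptor_prefix(bond, c7, c11):
    """Build the bracketed stereo prefix, e.g. '(E,7R,11R)', or '' if nothing is specified."""
    parts = [p for p in (bond, c7, c11) if p is not None]
    return "(" + ",".join(parts) + ")" if parts else ""


def all_labelings():
    """Every combination of specified/unspecified descriptors for the three stereo elements."""
    return [descriptor_prefix(*combo) for combo in product(DOUBLE_BOND, CENTRE_7, CENTRE_11)]


labelings = all_labelings()
# A prefix with all three descriptors present contains exactly two commas.
fully_specified = [p for p in labelings if p.count(",") == 2]
print(len(labelings), "possible labelings;", len(fully_specified), "fully specified")
```

Each distinct labelling can end up as a separate record in an aggregated database, which is one way a single drug name comes to point at several structures.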

Chemistry on The Internet Is Messy

NPC Browser http://tripod.nih.gov/npc/

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem
C&E News (from ACS)

People Use Trusted Resources…

Crowdsourced Curation
 Crowd-sourced curation: identify/tag errors, edit
names, synonyms, identify records to deprecate

What is the outcome of this???
 IF we can get the community to help clean up the
internet of chemistry then we have:
 High quality online reference resources
 Freely available reference data
 Ongoing iterative curation – how many
chemical structures are “reworked”
 And what is the value of “curated chemical
dictionaries???”

Successful Semantic Markup
Depends on Dictionaries
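
To make the dependence concrete, here is a minimal sketch of dictionary-driven markup: names found in free text are tagged with the record they map to. The dictionary contents, the `record:NNNN` identifiers and the bracket output format are all placeholders invented for this example; they are not ChemSpider identifiers or its markup format.

```python
import re

# Hypothetical curated dictionary: synonym (lower case) -> record identifier.
# The identifiers are placeholders, not real database IDs.
DICTIONARY = {
    "vincristine": "record:0001",
    "leurocristine": "record:0001",
    "ketoconazole": "record:0002",
}


def build_pattern(names):
    """Compile one case-insensitive whole-word pattern; longest names are tried first."""
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b", re.IGNORECASE)


def mark_up(text, dictionary=DICTIONARY):
    """Wrap each recognised name as [name|identifier]."""
    pattern = build_pattern(dictionary)

    def tag(match):
        name = match.group(1)
        return f"[{name}|{dictionary[name.lower()]}]"

    return pattern.sub(tag, text)


print(mark_up("Vincristine (also called leurocristine) was dosed with ketoconazole."))
```

The markup is only as good as the dictionary: a wrong synonym in the dictionary would silently link the text to the wrong compound, which is why curated names matter.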

Dictionaries Enhance Publications

I want to know about “Vincristine”

Vincristine: Patents
Linked by Name

Vincristine: Articles
Linked by Name

What are the names for this
compound just in patents????

Crowdsourcing Works
 >130 people have deposited data and
participated in data curation

 Curators at different levels check each other’s work

 More curators and depositors encouraged!
28 million chemicals is a long list…

ChemSpider for Analytical Sciences
 ChemSpider is being developed with the intention of:

 Being the world’s richest resource of freely
accessible curated analytical data
 Serving as a platform for structure verification and
dereplication
 Providing access to supporting prediction
algorithms

Spectral Uploading
 Locate the structure of interest and deposit
spectrum

 Supported formats: JCAMP, PDF
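
For readers unfamiliar with JCAMP, a JCAMP-DX file is plain text built from `##LABEL=value` records. The sketch below reads just those header records from a small made-up sample. It is a simplified reader written for illustration, not the parser ChemSpider uses, and it ignores data rows, `$$` comments and compressed data forms.

```python
def parse_jcamp_header(text):
    """Collect '##LABEL=value' records from JCAMP-DX text into a dict (labels upper-cased)."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("##") or "=" not in line:
            continue  # data rows and blank lines are skipped in this sketch
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label == "END":
            break  # ##END= closes the block
        header[label] = value.strip()
    return header


# Made-up sample file, just large enough to exercise the parser.
SAMPLE = """##TITLE=example spectrum
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##XYDATA=(X++(Y..Y))
4000 0.98 0.97 0.95
##END=
"""

header = parse_jcamp_header(SAMPLE)
print(header["TITLE"], "|", header["DATA TYPE"])
```

Because the header is machine-readable, a JCAMP upload can carry its own title, data type and units, which a PDF of the same spectrum cannot.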

Spectral Uploading
 Various types of NMR spectra supported

Multiple Spectra for One Structure

ChemSpider ID 24528095 C13 NMR

Available Spectra
http://www.chemspider.com/spectra.aspx

How do these data get curated?
 Every spectrum can be commented on

 Incorrect spectra have been annotated and
curated by users…

 But curation through gaming is also possible…

www.SpectralGame.com
http://www.jcheminf.com/content/1/1/9

In progress…
 Storage and display of ASSIGNED spectra

Web Services Open Up Collaboration
 Agilent, Bruker, Waters and Thermo all use our
web-based services for compound lookup

 Many academic sites integrating directly –
metabonomics, name lookup, semantic markup

Where do data come from?
 ChemSpider users deposit data
 Some contributions from NIST
 Chemical vendors are starting to provide data.
Synthonix are one of our major contributors
(www.synthonix.com)

Commercial Database Access
 Recently deposited to ChemSpider
 EPA/NIST IR Database >5000 spectra

 Presently under development
 NIST MS database >200,000 MS spectra

Where next with Analytical Support?
 PharmaSea project for the identification of natural
products – dereplication approaches
 Use mass spectrometry searches of natural
product slices to identify
 Pre-fragment compounds and develop
searches
 Dereplication using NMR data
 NMR features
 Predicted spectra and “Verification approaches”
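
As a toy picture of what a mass-based dereplication lookup involves, the sketch below computes monoisotopic masses from molecular formulae and returns the candidates that fall within a ppm tolerance of an observed mass. The candidate list, the 5 ppm default and the function names are assumptions for this example. It compares neutral masses only, whereas a real search would account for adducts and charge and would run against the full database.

```python
import re

# Monoisotopic masses (u) of the most abundant isotope of each element.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Toy candidate list; a real search would query a full compound database.
CANDIDATES = {
    "caffeine": "C8H10N4O2",
    "aspirin": "C9H8O4",
    "thymol": "C10H14O",
}


def monoisotopic_mass(formula):
    """Sum element masses for a simple formula such as 'C8H10N4O2' (no brackets or charges)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def search_by_mass(observed, tolerance_ppm=5.0, candidates=CANDIDATES):
    """Return candidate names whose neutral monoisotopic mass lies within the ppm tolerance."""
    hits = []
    for name, formula in candidates.items():
        mass = monoisotopic_mass(formula)
        if abs(observed - mass) / mass * 1e6 <= tolerance_ppm:
            hits.append(name)
    return hits


print(search_by_mass(194.0804))
```

A hit on a known compound means the sample need not be characterised from scratch, which is the point of dereplication.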

NMRShiftDB: http://www.ebi.ac.uk/nmrshiftdb/

NMRShiftDB Data Review

• High quality NMR shift set of ca. 100,000 shifts
• Derived prediction algorithms give very similar
performance statistics to commercial algorithms

Crowdsourcing Chemical Synthesis
 How much data generated in a lab, that COULD
go public, is lost forever?
 Public Domain reference databases of value?
 Properties
 Spectra
 CIFs
 Images
 Syntheses

An Adventure into the World of Small
but Significant Contributions…

Micropublishing with Peer Review
(a chemical synthesis blog?)

MOBILE Structure Database Lookup

Open PHACTS Project
 Develop a set of robust standards…
 Implement the standards in a semantic integration hub
 Deliver services to support drug discovery programs in
pharma and public domain
 22 partners, 8 pharmaceutical companies, 3 biotechs
 36-month project – goes live next month

Guiding principle is open access, open usage, open source
- Key to standards adoption -

The Future
Internet Data

 Small organic molecules
 Undefined materials
 Organometallics
 Nanomaterials
 Polymers
 Minerals
 Particle bound
 Links to Biologicals

 Commercial Software
 Pre-competitive Data
 Open Science
 Open Data
 Publishers
 Educators
 Open Databases
 Chemical Vendors

The Future of Chemistry on the Web?
 Public compound databases federate & build
a linked environment of validated data!
 Data validation needs are not ignored
 Publishers layer on information to make
publications discoverable
 Public-Private databases can be linked
 Open Data proliferate
 The “Semantic Web” in action

Can Merck Contribute to this Project?
 Do you have any data that you can release into the
public domain?
 Measured property data
 How many “common” spectra are thrown away?
 How many syntheses are published and locked
behind paywalls?
(www.chemspider.com/reactions)
 Can your scientists contribute annotations and
curations if they use ChemSpider?
 Is the challenge of Legal Clearance too big?

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
