Today ChemSpider (www.chemspider.com) is one of the chemistry community’s primary online resources. Now hosting over 28 million unique chemical compounds linked to over 400 data sources, ChemSpider offers its users a structure-centric platform that facilitates access to publications and patents, experimental and predicted property data, spectral data and many other forms of data and information that can benefit a chemist. ChemSpider is a crowdsourcing platform that lets the community contribute data directly to the database through the deposition and sharing of structure data, properties, spectra and reaction syntheses. The crowdsourcing also allows for the annotation and curation of existing data, so that the community can assist in the much-needed curation and validation of chemistry data on the internet. This work is imperative in order to provide the chemistry underpinnings to semantic web projects such as Open PHACTS (www.openphacts.org), from which Merck is sure to benefit when it is released to the community. This presentation will provide an overview of the ChemSpider platform and will also examine the challenges of dealing with heterogeneous data quality when attempting to provide a rich resource of data for the community. If you use the internet to research chemistry-based data, this presentation will be an essential guide to sourcing high-quality data.
3. It is so difficult to navigate…
IP?
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Competitors?
Working On Now?
Connections to disease?
Expressed in right cell type?
5. The World of Online Chemistry
Property databases
Compound aggregators
Screening assay results
Scientific publications
Encyclopedic articles (Wikipedia)
Metabolic pathway databases
ADME/Tox data – eTOX for example
Blogs/Wikis and Open Notebook Science
Contributing Open Source code to projects
11. We Want to Answer Questions
Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
19. Chemistry Data online is messy
We have inherited errors
All public compound databases, including ours,
have errors
“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE
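One of the error classes above, “incorrect” names associated with structures, can be illustrated with a small consistency check. The following is a minimal sketch, not ChemSpider’s actual curation logic: the record layout, the function name find_name_conflicts and the placeholder records are all assumptions made for illustration.

```python
from collections import defaultdict


def find_name_conflicts(records):
    """Return names that are linked to more than one structure identifier.

    Each record is a (source, name, structure_id) tuple. This layout is a
    hypothetical one chosen for the sketch, not a real database schema.
    """
    structures_by_name = defaultdict(set)
    for _source, name, structure_id in records:
        # Normalise case and surrounding whitespace so trivial variants
        # of the same name are compared together.
        structures_by_name[name.strip().lower()].add(structure_id)
    return {
        name: sorted(ids)
        for name, ids in structures_by_name.items()
        if len(ids) > 1
    }


# Placeholder records, not real database entries.
records = [
    ("source_a", "Compound X", "STRUCT-001"),
    ("source_b", "compound x", "STRUCT-002"),
    ("source_c", "Compound Y", "STRUCT-003"),
    ("source_a", "Compound Y ", "STRUCT-003"),
]

conflicts = find_name_conflicts(records)
print(conflicts)
```

A name flagged this way is only a candidate for review: a curator still has to decide which structure, if any, the name really belongs to.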
20. What could create change?
Harvard Business Review (2010)
“One change would make a substantial
difference [to drug R&D]: the creation of
agreed-upon standards for digitally
representing drug assets.”
Consider drug structures ONLY…
22. MeSH
A lipid cofactor that is required for normal blood
clotting. Several forms of vitamin K have been
identified: VITAMIN K 1 (phytomenadione)
derived from plants, VITAMIN K 2 (menaquinone)
from bacteria, and synthetic naphthoquinone
provitamins, VITAMIN K 3 (menadione). Vitamin K 3
provitamins, after being alkylated in vivo, exhibit the
antifibrinolytic activity of vitamin K. Green leafy
vegetables, liver, cheese, butter, and egg yolk are
good sources of vitamin K
57. What is the outcome of this???
IF we can get the community to help clean up the
internet of chemistry then we have:
High quality online reference resources
Freely available reference data
Ongoing iterative curation – how many
chemical structures are “reworked”?
And what is the value of “curated chemical
dictionaries”???
67. Crowdsourcing Works
>130 people have deposited data and
participated in data curation
Curators at different levels check each other
More curators and depositors encouraged!
28 million chemicals is a long list…
68. ChemSpider for Analytical Sciences
ChemSpider is being developed with the intention of
Being the world’s richest resource of freely
accessible curated analytical data
Serving as a platform for structure verification and
dereplication
Providing access to supporting prediction
algorithms
80. How do these data get curated?
Every spectrum can be commented on
Incorrect spectra have been annotated and
curated by users…
But curation through gaming is also possible…
97. Web Services Open Up Collaboration
Agilent, Bruker, Waters and Thermo all use our
web-based services for compound lookup
Many academic sites integrating directly –
metabonomics, name lookup, semantic markup
98. Where do data come from?
ChemSpider users deposit data
Some contributions from NIST
Chemical vendors are starting to provide data.
Synthonix are one of our major contributors
(www.synthonix.com)
99. Commercial Database Access
Recently deposited to ChemSpider
EPA/NIST IR Database >5000 spectra
Presently under development
NIST MS database >200,000 MS spectra
100. Where next with Analytical Support?
PharmaSea project for the identification of natural
products – dereplication approaches
Use mass spectrometry searches of natural
product slices to identify
Pre-fragment compounds and develop
searches
Dereplication using NMR data
NMR features
Predicted spectra and “Verification approaches”
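The idea of using mass spectrometry searches for dereplication can be sketched as matching an observed mass against the theoretical monoisotopic masses of candidate formulae. This is a simplified illustration only, not the PharmaSea or ChemSpider method: the function names, the 5 ppm default tolerance and the candidate list are assumptions, and the masses are treated as neutral-molecule masses with no adduct or charge correction.

```python
import re

# Monoisotopic masses of the most abundant isotopes (in daltons).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.007825,
    "N": 14.003074,
    "O": 15.994915,
    "S": 31.972071,
}


def monoisotopic_mass(formula):
    """Compute the monoisotopic mass of a simple formula such as C9H8O4."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def match_by_mass(observed_mass, candidates, tolerance_ppm=5.0):
    """Return candidates whose theoretical mass is within the ppm tolerance."""
    hits = []
    for name, formula in candidates:
        theoretical = monoisotopic_mass(formula)
        error_ppm = (observed_mass - theoretical) / theoretical * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            hits.append((name, formula, round(error_ppm, 2)))
    return hits


# Illustrative candidate list; a real search would query a compound database.
candidates = [
    ("aspirin", "C9H8O4"),
    ("cholesterol", "C27H46O"),
    ("phenolphthalein", "C20H14O4"),
]

hits = match_by_mass(180.0423, candidates)
print(hits)
```

Mass alone rarely identifies a compound uniquely, since isomers share a formula, which is why the slide pairs it with fragment searches and NMR data.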
104. NMRShiftDB Data Review
• High quality NMR shift set of ca. 100,000 shifts
• Derived prediction algorithms give very similar
performance statistics to commercial algorithms
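Performance statistics for shift prediction are commonly summarised as error measures between predicted and experimental shifts. The sketch below computes two such measures, mean absolute error and root-mean-square deviation. Which statistics were actually used in the NMRShiftDB review is not stated here, so this choice is an assumption, and the numbers are placeholder values rather than real shift data.

```python
import math


def shift_error_stats(experimental, predicted):
    """Return (mean absolute error, root-mean-square deviation) in ppm."""
    if not experimental or len(experimental) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    errors = [p - e for e, p in zip(experimental, predicted)]
    mae = sum(abs(x) for x in errors) / len(errors)
    rmsd = math.sqrt(sum(x * x for x in errors) / len(errors))
    return mae, rmsd


# Placeholder shift values in ppm, not taken from NMRShiftDB.
experimental = [10.0, 20.0, 30.0, 40.0]
predicted = [11.0, 19.0, 32.0, 40.0]

mae, rmsd = shift_error_stats(experimental, predicted)
print(f"MAE = {mae:.3f} ppm, RMSD = {rmsd:.3f} ppm")
```

Comparing two prediction algorithms then amounts to running the same calculation on the same experimental set for each algorithm.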
106. Crowdsourcing Chemical Synthesis
How much data generated in a lab, that COULD
go public, is lost forever?
Public Domain reference databases of value?
Properties
Spectra
CIFs
Images
Syntheses
107. An Adventure into the World of Small
but significant contribution..
113. It is so difficult to navigate…
IP?
What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Competitors?
Working On Now?
Connections to disease?
Expressed in right cell type?
114. Open PHACTS Project
Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in
pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36-month project – goes live next month
Guiding principle is open access, open usage, open source
- Key to standards adoption -
115. The Future
Internet Data
Small organic molecules
Undefined materials
Organometallics
Nanomaterials
Polymers
Minerals
Particle bound
Links to Biologicals
Commercial Software
Pre-competitive Data
Open Science
Open Data
Publishers
Educators
Open Databases
Chemical Vendors
116. The Future of Chemistry on the Web?
Public compound databases federate & build
a linked environment of validated data!
Data validation needs are not ignored
Publishers layer on information to make
publications discoverable
Public-Private databases can be linked
Open Data proliferate
The “Semantic Web” in action
117. Can Merck Contribute to this Project?
Do you have any data that you can release into the
public domain?
Measured property data
How many “common” spectra are thrown away?
How many syntheses are published and locked
behind paywalls?
(www.chemspider.com/reactions)
Can your scientists contribute annotations and
curations if they use ChemSpider?
Is the challenge of Legal Clearance too big?