Chemical databases have been around for decades, but in recent years we have observed a qualitative change from rather small in-house built proprietary databases to large-scale, open and increasingly complex chemistry knowledgebases. This tectonic shift has imposed new requirements for database design and system architecture as well as the implementation of completely new components and workflows which did not exist in chemical databases before. Probably the most profound change is being caused by the linked nature of modern resources - individual databases are becoming nodes and hubs of a huge and truly distributed web of knowledge. This change has important aspects such as data and format standards, interoperability, provenance, security, quality control and metainformation standards.
ChemSpider at the Royal Society of Chemistry was the first public chemical database to incorporate rigorous quality control, introducing both community curation and automated quality checks at the scale of tens of millions of records. Yet we have come to realize that this approach may now be incomplete in a quickly changing world of linked data. In this presentation we will talk about the challenges of building modern public and private chemical databases, the lessons we have learned from our past and present experience, and solutions to some common problems.
Building linked data large-scale chemistry platform - challenges, lessons and solutions
1. Building linked-data, large-scale chemistry platform: challenges, lessons and solutions
Valery Tkachenko, Alexey Pshenichnov, Aileen Day,
Colin Batchelor, Peter Corbett
Royal Society of Chemistry
ACS Spring 2016
San Diego, CA
March 13th 2016
3. • 45 million chemicals and growing
• Data sourced from >500 different sources
• Crowdsourced curation and annotation
• Ongoing deposition of data from our
journals and our collaborators
• A structure-centric hub for web searching
13. ChemSpider - Summary
• Simple, flattish data model
• InChI as a primary identifier
• Linked by synonyms
• Linked by “ExtId”
• Standard searches (identity, substructure,
similarity)
• Very little semantics
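The flat model summarised above can be pictured with a minimal sketch. It is an illustration only, not ChemSpider's actual schema: the class names, field names and the source labels `SourceA`/`SourceB` are assumptions made for this example. One record exists per structure, keyed by its InChI, and depositions from different sources attach synonyms and external IDs to that record.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One flat record per structure, keyed by InChI (hypothetical layout)."""
    inchi: str
    synonyms: set = field(default_factory=set)
    ext_ids: dict = field(default_factory=dict)  # data source -> external ID


class FlatCompoundStore:
    """Toy in-memory store: InChI is the primary key, synonyms are a lookup index."""

    def __init__(self):
        self._by_inchi = {}
        self._by_synonym = {}

    def deposit(self, inchi, synonym=None, source=None, ext_id=None):
        # Depositions with the same InChI merge into a single record.
        record = self._by_inchi.setdefault(inchi, CompoundRecord(inchi))
        if synonym:
            record.synonyms.add(synonym)
            self._by_synonym.setdefault(synonym.lower(), set()).add(inchi)
        if source and ext_id:
            record.ext_ids[source] = ext_id
        return record

    def find_by_synonym(self, name):
        # Case-insensitive name search; returns every record linked to the name.
        keys = sorted(self._by_synonym.get(name.lower(), ()))
        return [self._by_inchi[key] for key in keys]


store = FlatCompoundStore()
store.deposit("InChI=1S/H2O/h1H2", synonym="water", source="SourceA", ext_id="A-1")
store.deposit("InChI=1S/H2O/h1H2", synonym="Oxidane", source="SourceB", ext_id="B-7")
```

Both depositions land on the same record because they share an InChI, which is the sense in which the model is structure-centric and carries very little semantics beyond identity.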
14. Open PHACTS Mission:
Integrate Multiple Research Biomedical Data Resources Into A Single Open & Sustainable Access Point
OpenPHACTS: 2011-2014
15. info@openphactsfoundation.org @Open_PHACTS
Open PHACTS Practical Semantics
OpenPHACTS
GlaxoSmithKline – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
Esteve
Almirall
OpenLink
Scibite
The Open PHACTS Foundation
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
Pfizer
16. Why is it so hard to….
• Competitors?
• What’s the structure?
• Are they in our file?
• What’s similar?
• What’s the target?
• Pharmacology data?
• Known pathways?
• Working on now?
• Connections to disease?
• Expressed in right cell type?
• IP?
18. OpenPHACTS Discovery Platform
[Architecture diagram; labels recoverable from the slide]
• Data sources (labelled Db, VoID, Nanopub): public content, commercial content, public ontologies, user annotations
• Core platform: data cache (Virtuoso triple store), semantic workflow engine, Identity Resolution Service, Identifier Management Service, indexing
• Domain-specific services: chemistry registration, normalisation & Q/C
• Access: Linked Data API (RDF/XML, TTL, JSON), apps
• Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
21 October 2014 Scientific Lenses – A. J. G. Gray 19
19. Gleevec®: Imatinib Mesylate
DrugBank, ChemSpider, PubChem
Imatinib; Imatinib Mesylate
YLMAHDNUQAMNNX-UHFFFAOYSA-N
20. Structure Lens
• skos:exactMatch (InChI)
• Strict (analysing) to relaxed (browsing)
• “I need to compute an analysis, give me details of the active compound in Gleevec.”
21. Lens Effects: Ibuprofen
Commercial ibuprofen is a racemic mixture containing the same proportion of two chiral forms. Both chiral forms are equally active. Typically, the user will wish to retrieve info for any stereoisomer.
CHEMBL427526, CHEMBL521, CHEMBL175
22. Default Lens
23. Stereoisomer Lens
24. Mapping Generation
ops:OPS437281
✔
ops:OPS380297
has_stereoundefined_parent
[ci:CHEMINF_000456]
ops:OPS380297
is_stereoisomer_of
[ci:CHEMINF_000461]
Other relationships
• has part
• is tautomer of
• uncharged counterpart
• isotope
…
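One way to picture how a lens changes what counts as "the same compound" is the sketch below. It is an illustration under stated assumptions, not the Open PHACTS implementation: the compound identifiers, the linkset and the choice of which relationships each lens enables are all hypothetical. Only the relationship names and CHEMINF codes come from the slide.

```python
# Relationship types named on the slide.
EXACT = "skos:exactMatch"
STEREO_PARENT = "has_stereoundefined_parent"  # ci:CHEMINF_000456
STEREOISOMER = "is_stereoisomer_of"           # ci:CHEMINF_000461

# Assumption: a lens is the set of relationship types treated as equivalence.
LENSES = {
    "default": {EXACT},
    "stereoisomer": {EXACT, STEREO_PARENT, STEREOISOMER},
}

# Hypothetical linkset of (subject, relationship, object) statements.
LINKS = [
    ("cmpd:ibuprofen", EXACT, "src:ibuprofen"),
    ("cmpd:S-ibuprofen", STEREO_PARENT, "cmpd:ibuprofen"),
    ("cmpd:R-ibuprofen", STEREO_PARENT, "cmpd:ibuprofen"),
    ("cmpd:S-ibuprofen", STEREOISOMER, "cmpd:R-ibuprofen"),
]


def expand(identifier, lens, links=LINKS):
    """Return every identifier reachable through links the lens allows.

    Links are followed in both directions because, for retrieval purposes,
    the lens treats them as equivalences.
    """
    allowed = LENSES[lens]
    seen = {identifier}
    frontier = [identifier]
    while frontier:
        current = frontier.pop()
        for subject, relation, obj in links:
            if relation not in allowed:
                continue
            if subject == current:
                neighbour = obj
            elif obj == current:
                neighbour = subject
            else:
                continue
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return seen
```

Under the default lens a query for one stereoisomer returns only that record, while the stereoisomer lens also pulls in the stereo-undefined parent and the other enantiomer, which matches the ibuprofen use case on the preceding slides.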
28. OpenPHACTS - Summary
• Principal difference – inter-domain links
• More complex, but still structure-centric
data model
• Ontological relationships introduced
• Chemical Lenses – new type of search
38. Data quality issue and CVSP
– Robochemistry
– Proliferation of errors in public and
private databases
• ChemSpider
• PubChem
• DrugBank
• KEGG
• ChEBI/ChEMBL
– Automated quality control system
45. Chemistry Data Platform - Summary
• Simplified models within domain
• Each domain is described with its own models
with embedded semantics
• No proper domain-specific identifiers
• Extensive quality control – CVSP (DOI
10.1186/s13321-015-0072-8)
Remember this, some of these questions are easier to answer than others
Open PHACTS was developed to support the key questions of drug discovery
Business questions have been at the heart of Open PHACTS and have driven the development of the platform
Mx/psa, how calculated who did it?
Mash up. With your data too,
- top layer join together but need them all
commercial
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 5 billion triples – 14 datasets & growing
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
Import data into cache
API calls populate SPARQL queries
Integration approach
Data kept in original model
Data cached in central triple store
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
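The last two steps of the integration approach can be sketched as follows. This is a minimal illustration, not the platform's code: the mapping table stands in for the IMS, and every URI, the predicate and the function names are hypothetical. The query is written against the original data model and simply widened with a SPARQL 1.1 `VALUES` clause listing the equivalent source URIs.

```python
# Hypothetical mapping table standing in for the identity mapping service (IMS).
EQUIVALENT_URIS = {
    "http://example.org/compound/1": [
        "http://example.org/sourceA/cmpd/1",
        "http://example.org/sourceB/mol/42",
    ],
}


def expand_uri(uri):
    """Return the input URI followed by the URIs the mapping table treats as equivalent."""
    return [uri] + EQUIVALENT_URIS.get(uri, [])


def build_query(uri, predicate):
    """Translate an API call (one URI, one property) into a SPARQL query string.

    The VALUES clause lets a single query cover the URIs used by each
    original dataset.
    """
    values = " ".join(f"<{u}>" for u in expand_uri(uri))
    return (
        "SELECT ?source ?value WHERE {\n"
        f"  VALUES ?source {{ {values} }}\n"
        f"  ?source <{predicate}> ?value .\n"
        "}"
    )


query = build_query(
    "http://example.org/compound/1",
    "http://example.org/vocab/molecularWeight",
)
print(query)
```

Keeping the expansion separate from the query template means the data can stay in its original model while the mapping table alone decides which source URIs a request touches.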
Example drug: Gleevec, a cancer drug for leukemia
Lookup in three popular public chemical databases
Different results
Interested in physicochemical properties of Gleevec
Validate structure: Source data is messy!
Identify common problems:
Charge imbalance
Stereochemistry
Compute physicochemical properties
Identify related properties based on structure
17 relationship types
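As an illustration of the charge-imbalance check mentioned above, the sketch below sums the formal charges written in a SMILES string and flags records whose net charge is not zero. It is a crude proxy built for this example, not a CVSP rule: the function names are hypothetical, and a production pipeline would parse structures with a cheminformatics toolkit rather than a regular expression.

```python
import re

# A bracket atom such as [Na+], [O-], [NH4+], [Fe+3] or [Cu++].
BRACKET_ATOM = re.compile(r"\[([^\[\]]+)\]")
# The charge part inside a bracket atom: a run of signs, optionally a number.
CHARGE = re.compile(r"(\++|-+)(\d*)")


def net_charge(smiles):
    """Sum the formal charges written explicitly in a SMILES string."""
    total = 0
    for atom in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(atom)
        if match is None:
            continue  # bracket atom without a charge, e.g. [C@@H]
        signs, digits = match.groups()
        sign = 1 if signs[0] == "+" else -1
        magnitude = int(digits) if digits else len(signs)
        total += sign * magnitude
    return total


def is_charge_balanced(smiles):
    """Flag a record as balanced when its explicit charges sum to zero."""
    return net_charge(smiles) == 0
```

A salt written with both ions, such as `[Na+].[Cl-]`, passes this check, whereas a lone acetate anion `CC(=O)[O-]` would be flagged for review as a possibly missing counterion.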