9. Data Integration in a Big Data Context: Open PHACTS Case Study
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
10. Big Data
@gray_alasdair Big Data Integration 11
Volume, Velocity, Variety, Veracity
11. Open PHACTS Use Case
“Let me compare MW, logP
and PSA for launched
inhibitors of human &
mouse oxidoreductases”
Chemical Properties (Chemspider)
Launched drugs (Drugbank)
Human => Mouse (Homologene)
Protein Families (Enzyme)
Bioactivity Data (ChEMBL)
… other info (Uniprot/Entrez etc.)
12. Open PHACTS Mission: Integrate Multiple Research Biomedical Data Resources Into A Single Open & Free Access Point
16. OPS Discovery Platform
Diagram: apps make method calls to the Domain API using standard web technologies; the Drug Discovery Platform, a production-quality integration platform, returns interactive responses.
17. App Ecosystem
An “App Store”?
Explorer, Explorer2, ChemBioNavigator, Target Dossier, Pharmatrek, Helium, MOE, Collector, Cytophacts, Utopia, Garfield, SciBite, KNIME, Mol. Data Sheets, PipelinePilot, scinav.it, Taverna
https://www.openphacts.org/2/sci/apps.html
21. API Hits
Chart: monthly API hits (millions), January 2013 to June 2015, annotated with the public launch of the 1.2 API and the subsequent 1.3, 1.4, and 1.5 API releases.
22. OPS Discovery Platform
Architecture diagram: a Core Platform comprising a Data Cache (Virtuoso triple store), a Semantic Workflow Engine, a Linked Data API (RDF/XML, TTL, JSON), Domain Specific Services, an Identity Resolution Service, Chemistry Registration with normalisation & Q/C, an Identifier Management Service, and Indexing. The cache loads public content, commercial data, public ontologies, and user annotations, each with its VoID descriptions and nanopublication databases; apps sit on top. Example identifiers handled by the platform: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”.
24. Data Licensing (Or Lack Of!)
John Wilbanks consulted for us: a framework built around standard, well-understood Creative Commons licences and how they interoperate. Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and data publishers
One size won't fit all requirements
28. Identity Mapping
Three identifiers for one entity: P12047, X31045, GB:29384
Andy Law's Third Law
“The number of unique identifiers
assigned to an individual is never less
than the number of Institutions
involved in the study”
http://bioinformatics.roslin.ac.uk/lawslaws/
30. Gleevec®: Imatinib Mesylate
Looked up in ChemSpider, DrugBank, and PubChem: Imatinib Mesylate
InChIKey: YLMAHDNUQAMNNX-UHFFFAOYSA-N
Are these records the same?
It depends upon your task!
31. Structure Lens
“I need to perform an analysis, give me details of the active compound in Gleevec.”
Links followed: skos:exactMatch (InChI) only. A strict lens, suited to analysing rather than browsing.
32. Name Lens
“Which targets are known to interact with Gleevec?”
Links followed: skos:exactMatch (InChI) plus skos:closeMatch (Drug Name). A relaxed lens, suited to browsing rather than analysing.
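The two lenses can be sketched as a filter over identity links. This is an illustrative sketch, not Open PHACTS code: the link data, identifiers, and lens names below are all invented.

```python
# Identity links as (source, target, predicate) triples; a "lens"
# decides which predicates count as identity for the task at hand.
LINKS = [
    ("cs:4532", "db:0619", "skos:exactMatch"),   # justified by shared InChI
    ("cs:4532", "pc:1236", "skos:closeMatch"),   # justified by drug name only
]

LENSES = {
    "structure": {"skos:exactMatch"},                 # strict: for analysing
    "name": {"skos:exactMatch", "skos:closeMatch"},   # relaxed: for browsing
}

def expand(uri, lens):
    """Return every identifier equivalent to `uri` under the chosen lens,
    following links transitively in both directions."""
    allowed = LENSES[lens]
    result, changed = {uri}, True
    while changed:
        changed = False
        for s, t, p in LINKS:
            if p in allowed and ({s, t} & result) and not ({s, t} <= result):
                result |= {s, t}
                changed = True
    return result
```

Under the structure lens, expanding the ChemSpider record reaches only the InChI-matched record; the name lens pulls in all three.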
37. Open PHACTS Approach
1. Know your audience
Web developers
2. Understand your use cases
Prioritised business questions
3. Identify access pathways
Identify data
Identify connections
Implement API
38. Questions
Alasdair J G Gray
A.J.G.Gray@hw.ac.uk
alasdairjggray.co.uk
@gray_alasdair
Open PHACTS
contact@openphacts.org
openphacts.org
@open_phacts
45. Some tips!
Delay your judgement
Be open to naive and crazy ideas
Openness & enthusiasm
Use associative thinking
Piggyback on ideas of others
46. Selection of ideas
• Summarize 3 key ideas
• How to select?
– Keep the goal in mind!
– Think in opportunities
– What are you enthusiastic about?
– Personal engagement
– What is needed in the short term?
– Most promising
47. Selection of ideas
• 5 Votes
• Put your name & e-mail on the sheet if you want to be involved in
working out the idea!
48. THANK YOU FOR YOUR TIME
Contact me @ Femke.Ongenae@intec.ugent.be
Editor's notes
A trend is emerging towards AAL services that are truly personalized. Modern AAL services need to be adapted to the needs and preferences of care receivers and they need to accurately take into account context specificities. Moreover, modern AAL services need to be designed in a way such that they offer added value to the care process.
In order to achieve true personalization and to evaluate the design of the services, large data sets based on real-life context and profiles are needed.
Living lab environments, such as the Care Living Labs (Zorgproeftuinen) in Flanders, Karolinska Living Lab in Sweden and CASALA (Centre for Affective Solutions for Ambient Living Awareness) in Ireland, have been set up in recent years to enable the collection of such real-life context and profile data. The valorization and dissemination of context-aware and personalized AAL services could be significantly stimulated by allowing various parties to re-use these data sets in a user-friendly manner.
However, these data-sets are not readily available for further research or the development of novel services as several issues remain to be discussed with regard to a smart data sharing culture for AAL services, such as:
How to express which types of data are available from which living lab environment?
How do we achieve structured, exchangeable data?
How to maintain and express the quality and reliability of the data?
How can these different data sets easily be aligned?
How can these different data sets easily be shared and accessed, without too much effort?
How can these different data sets easily be shared and accessed, without legal constraints?
How to process and synthesize the data so that it is useful and usable by various stakeholders?
Can a payment model be set up for usage of the data and thus support the operation of the living lab?
What about the ethics and privacy related to these data sets?
Who or what should be the frontrunner in realizing this idea? How will this be organized?
What can we learn from other domains where sharing of big data sets has been made possible?
Deriving value from the data
Volume: More data than you can process – relative term; complexity of processing
Velocity: Data constantly being generated
Variety: Multiple sources, formats, models
Veracity: Accuracy of the data
Open PHACTS: has not dealt with Velocity, although it is a challenge for us
1 of 83 business driver questions
Took a team of 5 experienced researchers 6 hours to manually gather the answer
At the start of the project it could not be answered by a computer system
Six months in, the prototype answered it in 30 seconds
Now sub-second
Pharma are all accessing, processing, storing & re-processing external research data: a big waste of resources
No competitive advantage
OPS: 29 partners including many major pharma
83 questions ranked and top 20 taken as target
18 of top 20
A platform for integrated pharmacology data
Relied upon by pharma companies
Public domain, commercial, and private data sources
Provides domain specific API
Making it easy to build multiple drug discovery applications: examples developed in the project
Not just in-house apps
Actively being used for different purposes
Public launch April 2013
Averaging 20 million hits a month from the start of 2015
38 million in the last 30 days
Heavy usage from pharma, academia, and biotech
500+ registered users
Import data into cache
Integration approach
Data kept in original model but cached centrally
API call translated to SPARQL query
Query expressed in terms of original data
Queries expanded by IMS to cover URIs of original datasets
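The translation described above (a domain API call becomes a SPARQL query whose identifiers the IMS expands to cover all source datasets) can be sketched as template filling. The query shape, URIs, and the stand-in IMS below are invented for illustration.

```python
# Hypothetical template: fetch all properties of a compound, covering
# every equivalent URI the Identity Mapping Service (IMS) reports.
SPARQL_TEMPLATE = """SELECT ?property ?value WHERE {{
  VALUES ?compound {{ {uris} }}
  ?compound ?property ?value .
}}"""

def api_to_sparql(compound_uri, ims_expand):
    """Translate an API call into SPARQL whose VALUES clause lists every
    URI the IMS reports as equivalent to the requested one."""
    uris = " ".join(f"<{u}>" for u in sorted(ims_expand(compound_uri)))
    return SPARQL_TEMPLATE.format(uris=uris)

# Toy IMS: pretends each ChemSpider URI has one DrugBank equivalent.
def toy_ims(uri):
    return {uri, uri.replace("chemspider", "drugbank")}
```

The caller asks about one URI; the query that reaches the cache matches the data however each original publisher identified it.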
Data provided by many publishers
Originally in many formats: relational, SD files and RDF
Worked closely with publishers
Data licensing was a major issue
Over 3 billion triples – 12 datasets
Hosted on beefy hardware; data in memory (aim)
Extensive memcaching
Pose complex queries to extract data
Interactions needed to satisfy use cases
Gradually added additional types of data and interactions
No standard units
Even in curated sources!
Feedback issues to data providers
Validation & Standardization Platform
Developed by Royal Society of Chemistry
http://bit.ly/NZF5VB
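A unit-normalisation step of the kind these notes describe might look like the sketch below; the conversion table and error policy are illustrative, not the RSC platform's actual rules.

```python
# Map reported concentration units onto a single canonical unit (nM).
TO_NM = {"nM": 1.0, "uM": 1e3, "mM": 1e6, "M": 1e9}

def normalise(value, unit):
    """Return (value_in_nM, "nM"); raise on an unknown unit so the
    record can be flagged and fed back to the data provider."""
    if unit not in TO_NM:
        raise ValueError(f"unrecognised unit {unit!r}: flag for curation")
    return value * TO_NM[unit], "nM"
```

Raising rather than guessing matches the note above: issues are fed back to data providers instead of being silently papered over.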
Example drug: Gleevec, a cancer drug for leukemia
Looked up in three popular public chemical databases: different results
Chemistry is complicated, often simplified for convenience
Data is messy!
Are these records the same? It depends on what you are doing with the data!
Each captures a subtly different view of the world
Chemistry is complicated, often simplified for convenience
Data is messy!
Interested in physicochemical properties of Gleevec
Interested in biomedical and pharmacological properties
sameAs != sameAs: it depends on your point of view
Links relate individual data instances: source, target, predicate, reason.
Links are grouped into Linksets which have VoID header providing provenance and justification for the link.
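A linkset with its VoID-style header can be sketched as plain data; the field names and values here are invented stand-ins for the actual VoID terms.

```python
# One linkset: links sharing a predicate and a justification, with
# provenance carried once in the header rather than on each link.
linkset = {
    "header": {
        "subjectsTarget": "chemspider",
        "objectsTarget": "drugbank",
        "linkPredicate": "skos:exactMatch",
        "justification": "shared InChI",
        "authoredBy": "example-curator",
    },
    "links": [("cs:4532", "db:0619")],  # (source, target) pairs
}

def links_with_provenance(ls):
    """Yield each link as (source, target, predicate, justification):
    the four parts a link relates, per the notes above."""
    h = ls["header"]
    for s, t in ls["links"]:
        yield s, t, h["linkPredicate"], h["justification"]
```

Keeping predicate and justification in the header means every link in the set inherits them, which is what lets a lens accept or reject whole linksets at once.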
Open for anybody
API grouped into theme areas
Two phase interaction:
Resolve thing to identifier
Retrieve data about the identifier
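The two-phase interaction can be mocked end to end; both lookup tables stand in for the platform's real resolution and data services, and the property values are illustrative only.

```python
# Phase 1: resolve a free-text name to a canonical identifier.
# Phase 2: retrieve data about that identifier.
NAME_INDEX = {"gleevec": "cs:4532", "imatinib mesylate": "cs:4532"}
DATA_STORE = {"cs:4532": {"mw": 589.7, "logP": 3.0}}  # toy record

def resolve(name):
    """Phase 1: name -> identifier (case-insensitive lookup)."""
    return NAME_INDEX[name.lower()]

def retrieve(identifier):
    """Phase 2: identifier -> data record."""
    return DATA_STORE[identifier]

record = retrieve(resolve("Gleevec"))
```

Splitting resolution from retrieval means any synonym reaches the same record, and the second call can be cached by identifier.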
Sustainability
API -> queries
3 steps; we’ll do the first two now. The others are for after the workshop, for interested participants
Use the paper to write on, use the post-its, one idea per post-it
Make it easy for the moderator to group things