Of the four V's of big data – Volume, Velocity, Variety and Veracity – the most challenging for the health sector is Variety. Health data comes from many sources, formats and standards – how can we bring these together to reap the benefits of big data technologies?
Big Data Europe is tackling this challenge head-on, building a big data infrastructure flexible enough to address all seven Societal Challenges identified by Horizon 2020. Here we demonstrate our pilot implementation of Open PHACTS, which integrates life science data for drug discovery.
12 May 2017
Big Data Europe at eHealth Week 2017: Linking Big Data in Health
1. LINKING BIG DATA IN HEALTH
Open PHACTS in the Big Data Europe infrastructure
Kiera McNeice, Open PHACTS Foundation, 12 May 2017
2. What is ‘Big Data’?
“extremely large data sets that may be analyzed
computationally”
“data sets that are so large or complex that
traditional data processing application software is
inadequate to deal with them”
4. Big Data Europe Objectives
“Big Data Europe will undertake the foundational
work for enabling European companies to build
innovative multilingual products and services based
on semantically interoperable, large-scale, multi-
lingual data assets and knowledge, available
under a variety of licenses and business models.”
5. Actual Big Data Europe Objectives
Build foundational Big Data infrastructure that:
o Is open source
o Makes it simple to get started with Big Data
o Supports a variety of use cases
o Embraces emerging Big Data technologies
o Enables simple integration with custom components
11. Applications: The 7 Societal Challenges
Life Sciences and Health
Food and Agriculture
Energy
Transport
Climate
Social Sciences
Security
17. SC2: Food and Agriculture
Partners:
FAO, the largest autonomous agency within the
United Nations system and one of the main
players in the agricultural information community.
Big Data Focus area: Large-scale distributed agricultural data integration
Selected Key Data assets: INFOODS, AQUASTAT, Green Learning Network (GLN),
Agricultural Bibliography Network (ABN), AGROVOC, AquaMaps, FishBase
Semantic Web Company (SWC) is a technology provider headquartered in
Vienna (Austria). SWC supports organizations from all industrial sectors
worldwide to improve their information management. Their core product is to
extract meaning from big data by making use of linked data technologies.
Agroknow is a company that captures, organizes and adds value to the
rich information available in agricultural and food sciences, in order to
make it universally accessible, useful and meaningful.
18. SC2: Food and Agriculture
Pilot focus area:
Viticulture
(from the Latin word for vine)
is the science, production,
and study of grapes.
It deals with the series of
events that occur in the vineyard.
19. SC2: Food and Agriculture
Pilot 2: Support
advanced crop data
discovery, processing,
combining and
visualization from
distributed and
heterogeneous data
repositories
Reasons:
Vine and Wine sector: emerging market in EU
Sustainability and biodiversity challenges: local varieties
are being lost
Exploitation of new grapevine varieties and clones in terms
of climate change adaptation
Quality and health status of viticultural products
Contribution to human health (antioxidants, prevention of
heart diseases etc.)
Wide variety of heterogeneous (and big) data from
various information sources
21. SC3: Energy
Partners:
Big Data Focus area: Real-time turbine monitoring stream processing and analytics
Selected Key Data assets: European Energy Exchange Data, smart meter sensor data, gas/fuels
market/price data, consumption statistics, stratigraphic model data (geology, geophysics)
A public entity supervised by the Ministry of
Environment, Energy and Climate Change in Greece,
founded in September 1987, active in the fields of
Renewable Energy Sources (RES), Rational Use of
Energy (RUE) and Energy Saving (ES).
NCSR "Demokritos", the largest multidisciplinary research
centre of Greece, hosts significant scientific research,
technological development and educational activities,
coordinated by eight Institutes.
23. SC3: Energy
Pilot 3: Operation,
maintenance and
production
forecasting for wind
turbines on real-time
sensor data.
Reasons:
Current technology cannot deal with the full
amount of valuable data available
Economic benefit of predicting output and
preventing damage (if the imminent failure of one
part can be predicted, damage to other parts can
be prevented)
Large continuous stream of sensor data, perfect to
test our platform
25. SC4: Transport
Partners:
Big Data Focus area: Streaming sensor network & geo-spatial data integration
Selected Key Data assets: GTFS data, OSM/LinkedGeoData, MobilityMaps, Transport sensor
data, ROSATTE Road safety attributes, European Road Data Infrastructure - EuroRoadS
The Fraunhofer Society is a German research organization
with 67 institutes spread throughout Germany, each
focusing on different fields of applied science.
The Centre for Research and Technology-Hellas (CERTH),
founded in 2000, is one of the leading research
centres in Greece. CERTH includes the Hellenic Institute of
Transport (HIT): Land, Sea and Air Transportation as well
as Sustainable Mobility services.
ERTICO - ITS Europe is a partnership of around 100
companies and institutions involved in the production of
Intelligent Transport Systems (ITS).
27. SC4: Transport
Pilot 4: Multisource
data collection for
the provision of
accurate info-
mobility and
advanced transport
planning service in
Thessaloniki, Greece
Reasons:
Congestion is a major problem in Europe, especially in
urban areas.
Utilising real-time probe data for the provision of accurate
info-mobility services and advanced transport planning
leads to better decisions
The use of mobility data coming from multiple sources
presents significant challenges, both because the datasets
differ in content and in spatio-temporal terms and because
the data should be collected and processed in real time.
29. SC5: Climate
Partners:
Big Data Focus area: Enormous simulation time. Extremely complicated computing model.
Selected Key Data assets: European Grid Infrastructure (EGI). Access to several data centres
hosted at CNRS-Lyon, NCSR-D Athens, INFN-Milan, NIKhEF-Amsterdam.
A public entity supervised by the Ministry of Environment,
Energy and Climate Change in Greece, founded in
September 1987, active in the fields of Renewable Energy
Sources (RES), Rational Use of Energy (RUE) and Energy
Saving (ES).
NCSR "Demokritos", the largest multidisciplinary research
centre of Greece, hosts significant scientific research,
technological development and educational activities,
coordinated by eight Institutes.
31. SC5: Climate
Pilot 5: Downscaling and retrieval of (raw) climate
data via user-defined parameters
(e.g. geographical areas, time period,
physical variables, computational grids,
time steps)
Reasons:
The provision of climate model data satisfies an
important objective: assessing the potential
impacts of climate change on well-being for
adaptation, prevention and mitigation measures,
and supporting other policy-making decisions.
Growing awareness of climate change has led to
the availability of huge datasets
Downscaling is a computationally intensive process
33. SC6: Social Sciences
Partners:
Big Data Focus area: Statistical and research data linking & integration
Selected Key Data assets: Federated social sciences data catalogs, statistical data from public
data portals and statistical offices (e.g. Eurostat, UNESCO, World Bank)
CESSDA provides large scale, integrated and sustainable
data services to the social sciences. CESSDA is organised as a
limited company under Norwegian law owned and financed
by the individual EU member states’ ministry of research or a
delegated institution.
NCSR "Demokritos", the largest multidisciplinary research
centre of Greece, hosts significant scientific research,
technological development and educational activities,
coordinated by eight Institutes.
35. SC6: Social Sciences
Pilot 6: Citizens'
budget at the
municipal level
Reasons:
Budget: the most important document of public
policy
Budget execution affects everyday lives
Citizens are more involved at city level
Having a platform that integrates heterogeneous
budget data (many municipalities have their own
data formats) and generates infographics would
benefit citizens, the research community and
policy makers
37. SC7: Security
Partners:
Big Data Focus area: Image data analysis
Selected Key Data assets: Earth Observation data (e.g. Very High Resolution Satellite Imagery acquired
from commercial providers and governmental systems) and collateral data for supporting CFSP/CSDP
missions and operations
NCSR "Demokritos", the largest multidisciplinary research centre of
Greece, hosts significant scientific research, technological development
and educational activities, coordinated by eight Institutes.
The Centre supports the decision making of the European Union in the field of
the Common Foreign and Security Policy (CFSP), by providing products and
services resulting from the exploitation of relevant space assets and collateral
data, including satellite imagery and aerial imagery, and related services.
38. SC7: Security
Pilot focus:
Getting insight into man-made surface changes
triggered by automatic detection, news, or social
media information
39. SC7: Security
Pilot 7: Ingestion of
remote sensing
images and social
sensing data to
detect and verify
man-made changes
on the Earth’s surface
for security
applications
Reasons:
Evacuation route planning
Monitoring of critical infrastructures
Border security
Satellite image data is HUGE and
computationally intensive to compare
Smart ‘focus’ algorithms are needed to
prioritize the analysis jobs
46. Focus on researcher needs
Data sources: ChEMBL, DrugBank, Gene Ontology, WikiPathways, UniProt, ChemSpider,
UMLS, ConceptWiki, ChEBI, TrialTrove, GVKBio, GeneGo, TR Integrity, DisGeNet,
neXtProt, ChEMBL Target Class, ENZYME, FDA adverse events, SureChEMBL
Example researcher questions:
“Find me compounds that inhibit targets in NFkB pathway assayed in only functional assays with a potency <1 μM”
“What is the selectivity profile of known p38 inhibitors?”
“Let me compare MW, logP and PSA for known oxidoreductase inhibitors”
47. Ranked research questions
Number | Sum | Nr of 1 | Question
15 | 12 | 9 | All oxidoreductase inhibitors active <100nM in both human and mouse
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a
49. Challenges: Identifiers
Andy Law’s third law:
The number of unique identifiers assigned to an individual is never less
than the number of institutions involved in the study
P12047
X31045
GB:29384
http://bioinformatics.roslin.ac.uk/lawslaws/
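One common way to cope with several identifiers for the same entity is to record pairwise equivalences and group them into clusters. The sketch below illustrates this with a small union-find structure. It is an illustration only: the `IdentifierIndex` class is hypothetical, and treating the three identifiers from the slide as equivalent is an assumption made for the example, not a statement about real database records or about how Open PHACTS implements identifier mapping.

```python
class IdentifierIndex:
    """Groups identifiers that have been asserted to denote the same entity."""

    def __init__(self):
        self._parent = {}

    def _find(self, ident):
        # Unseen identifiers start out as the root of their own cluster.
        self._parent.setdefault(ident, ident)
        root = ident
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[ident] != root:
            self._parent[ident], ident = root, self._parent[ident]
        return root

    def link(self, a, b):
        """Record that identifiers a and b denote the same entity."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self._parent[root_b] = root_a

    def same_entity(self, a, b):
        return self._find(a) == self._find(b)

    def aliases(self, ident):
        """Return every known identifier in the same cluster as ident."""
        root = self._find(ident)
        return sorted(i for i in list(self._parent) if self._find(i) == root)


index = IdentifierIndex()
index.link("P12047", "X31045")    # assumed equivalence, for illustration
index.link("X31045", "GB:29384")  # assumed equivalence, for illustration
```

With those two links recorded, `index.aliases("P12047")` returns all three identifiers, even though the first and third were never linked directly.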
50. Challenges: Similarity
Q: Are these records the same?
DrugBank ChemSpider PubChem
A: It depends on your task!
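One way to make "it depends on your task" concrete is to annotate every cross-database link with the reason it was asserted, and let each task declare which reasons it accepts. The sketch below assumes exactly that. The record names, link reasons, and task profiles are all hypothetical and are not taken from Open PHACTS or from the databases named on the slide.

```python
# Hypothetical links between records, each annotated with why it was asserted.
LINKS = [
    ("drugbank:A", "chemspider:B", "same_structure"),
    ("chemspider:B", "pubchem:C", "same_parent_ignoring_salt"),
    ("drugbank:A", "pubchem:D", "same_connectivity_ignoring_stereo"),
]

# Hypothetical task profiles: which link reasons count as "the same" per task.
TASK_PROFILES = {
    "exact_structure_lookup": {"same_structure"},
    "pharmacology_overview": {
        "same_structure",
        "same_parent_ignoring_salt",
        "same_connectivity_ignoring_stereo",
    },
}


def equivalent_records(record, task):
    """Return records reachable from `record` via links the task accepts."""
    accepted = TASK_PROFILES[task]
    seen = {record}
    frontier = [record]
    while frontier:
        current = frontier.pop()
        for left, right, reason in LINKS:
            if reason not in accepted:
                continue
            # Links are symmetric, so follow them in both directions.
            for src, dst in ((left, right), (right, left)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return sorted(seen - {record})
```

Under the strict profile, `drugbank:A` matches only `chemspider:B`. Under the broader profile it also matches both PubChem records, so the answer to "are these the same?" changes with the task.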
56. Open PHACTS architecture
[Architecture diagram; labels recovered from the slide]
Apps
Linked Data API (RDF/XML, TTL, JSON)
Core Platform
Semantic Workflow Engine
Domain Specific Services
Identity Resolution Service
Identifier Management Service
Chemistry Registration, Normalisation & Q/C
Indexing
Data Cache (Virtuoso Triple Store)
Data sources (Db and Nanopub, each with VoID): Public Content, Commercial, Public Ontologies, User Annotations
Example identifiers: P12374, EC2.43.4, CS4532, “Adenosine receptor 2a”
62. Example workflow
Q10: For a given compound, summarise all similar
compounds and their activities
CC1=C(C(C(=C(N1)C)C(=O)OC)C2=CC=CC=C2[N+](=O)[O-])C(=O)OC
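The slide does not say how similarity is computed. As a rough sketch of the shape of such a workflow, the code below ranks compounds by the Tanimoto coefficient over sets of structural features and attaches their recorded activities. Everything here is an assumption for illustration: the function names, the feature-set representation (a real system would derive fingerprints from the SMILES with a cheminformatics toolkit), and the data layout.

```python
from collections import defaultdict


def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) coefficient between two feature sets."""
    if not features_a and not features_b:
        return 0.0
    shared = len(features_a & features_b)
    return shared / (len(features_a) + len(features_b) - shared)


def summarise_similar(query, library, activities, threshold=0.5):
    """List compounds at least `threshold` similar to `query`, with activities.

    query:      set of structural features of the query compound
    library:    {compound_id: set of structural features}
    activities: iterable of (compound_id, target, value_nM) tuples
    """
    by_compound = defaultdict(list)
    for compound_id, target, value_nm in activities:
        by_compound[compound_id].append((target, value_nm))

    hits = []
    for compound_id, features in library.items():
        score = tanimoto(query, features)
        if score >= threshold:
            hits.append(
                (compound_id, round(score, 2), sorted(by_compound[compound_id]))
            )
    # Most similar first; ties broken by compound id for stable output.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))
```

Each result is a `(compound_id, similarity, activities)` tuple, so a summary table can be produced by iterating over the returned list.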
65. Benefits of Open PHACTS
Efficiency: Queries that once took days can now be done in less
than an hour
Novelty: Semantically integrated databases allow for
completely new ways of analysing the data
Cost: Sharing cost and effort in a precompetitive project saved
“millions”
“Integration of different databases is difficult, costly, and
time consuming, and probably would not have been done at
this level of quality without Open PHACTS.”
67. …so why rebuild it with BDE?
Integration into a wider platform
Flexibility, scalability, extensibility
Local installation of the entire Open PHACTS
infrastructure!
68. Requirements
Hardware:
150GB of disk space (ideal: 250GB)
16GB of RAM (ideal: 128GB)
4 CPU cores (ideal: 8 cores)
Prerequisites:
Recent x64 Linux (Ubuntu 14.04 LTS, CentOS 7)
Docker and Docker Compose
Fast Internet connection
https://github.com/openphacts/ops-docker
https://data.openphacts.org/
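Before attempting a local installation, it can help to check a host against the minimum figures above. The following is a minimal sketch, not part of the ops-docker repository: the function names are hypothetical, the RAM probe relies on `os.sysconf` and so assumes Linux, and looking for a `docker-compose` executable on the PATH is an assumption about how Docker Compose is installed.

```python
import os
import shutil

GIB = 1024 ** 3

# Minimum figures taken from the requirements slide.
MIN_DISK_GB = 150
MIN_RAM_GB = 16
MIN_CORES = 4


def unmet_requirements(disk_gb, ram_gb, cores, have_docker, have_compose):
    """Return human-readable problems; an empty list means the minimum is met."""
    problems = []
    if disk_gb < MIN_DISK_GB:
        problems.append(f"disk: {disk_gb:.0f} GB free, need {MIN_DISK_GB} GB")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"RAM: {ram_gb:.0f} GB, need {MIN_RAM_GB} GB")
    if cores < MIN_CORES:
        problems.append(f"CPU: {cores} cores, need {MIN_CORES}")
    if not have_docker:
        problems.append("docker not found on PATH")
    if not have_compose:
        problems.append("docker-compose not found on PATH")
    return problems


def probe_host(path="/"):
    """Collect the host's figures (Linux only: RAM is read via os.sysconf)."""
    return {
        "disk_gb": shutil.disk_usage(path).free / GIB,
        "ram_gb": os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / GIB,
        "cores": os.cpu_count() or 0,
        "have_docker": shutil.which("docker") is not None,
        "have_compose": shutil.which("docker-compose") is not None,
    }
```

Calling `unmet_requirements(**probe_host())` on the target machine lists anything that falls short. Treat the RAM result as a rough guide, since a machine sold as 16GB typically reports slightly less usable memory.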
70. Successes of Open PHACTS
Integrated a large variety of data sources using
semantic web linking (RDF triples)
Project focussed on solving real, practical use cases
(and succeeded!)
Re-building within the BDE Docker infrastructure
allows for greater flexibility, local installation
71. What’s next?
Refresh of all data sources
Identify new data sources
o What’s your big data problem in health?
BDE SC1 (Health) Workshop in autumn
o Planned for eHealthTallinn 2017, 16-18 October
http://sm.ee/en/ehealthtallinn-2017