1) Dr. Matthieu-P. Schapranow presented on Analyze Genomes, a federated in-memory database system for life sciences.
2) The system aims to provide real-time analysis of big medical data while keeping sensitive data local, as required by data privacy and data locality restrictions.
3) It incorporates local compute resources by installing worker nodes to process sensitive data locally and store results in local database instances, while being managed as part of a larger federated database system.
Analyze Genomes: A Federated In-Memory Database System For Life Sciences
1. Analyze Genomes:
A Federated In-Memory Database System For Life Sciences
Dr. Matthieu-P. Schapranow
HPI Future SOC Lab Day, Potsdam, Germany
Nov 4, 2015
Generously supported by
2. Important things first: Where do you find additional information?
■ Online: Visit we.analyzegenomes.com for latest research results, tools, and news
■ Offline: Read more about it, e.g. High-Performance In-Memory Genome Data Analysis: How In-Memory Database Technology Accelerates Personalized Medicine, In-Memory Data Management Research, Springer, ISBN: 978-3-319-03034-0, 2014
■ In Person: Join us for “Festival of Genomics” Jan 19-21, 2016 in London, UK
Schapranow/Perscheid, FSOC Lab Day, Nov 4, 2015
A Federated In-Memory Database System For Life Sciences
3. The Setting: Actors in Oncology
■ Patients
□ Individual anamnesis, family history, and background
□ Require fast access to individualized therapy
■ Clinicians
□ Identify root and extent of disease using laboratory tests
□ Evaluate therapy alternatives, adapt existing therapy
■ Researchers
□ Conduct laboratory work, e.g. analyze patient samples
□ Create new research findings and come up with treatment alternatives
4. IT Challenges: Distributed Heterogeneous Data Sources
■ Human genome/biological data: 600GB per full genome, 15PB+ in databases of leading institutes
■ Prescription data: 1.5B records from 10,000 doctors and 10M patients (100 GB)
■ Clinical trials: currently more than 30k recruiting on ClinicalTrials.gov
■ Human proteome: 160M data points (2.4GB) per sample, >3TB raw proteome data in ProteomicsDB
■ PubMed database: >24M articles
■ Hospital information systems: often more than 50GB
■ Medical sensor data: scan of a single organ in 1s creates 10GB of raw data
■ Cancer patient records: >160k records at NCT
5. Software Requirements in Life Sciences
■ Requirements
□ Real-time data analysis
□ Maintained software
■ Restrictions
□ Data privacy
□ Data locality
□ Volume of “big medical data”
■ Solution?
□ Federated In-Memory Database System vs. Cloud Computing
6. Where Do All Those Clouds Go?
Gartner's 2014 Hype Cycle for Emerging Technologies
7. Multiple Cloud Service Providers
Schapranow, BIRTE/VLDB 2015, Aug 31, 2015
[Figure: Site A and Site B, each with a Local System, Local Storage, and a Local Synchronization Service; Cloud Provider Site A and Cloud Provider Site B, each with a Cloud Synchronization Service and Shared Cloud Storage.]
8. Federated In-Memory Database (FIMDB)
Incorporating Local Compute Resources
[Figure labels: Site A, Site B, Cloud Service Provider; Federated In-Memory Database Instances FIMDB A.1 to A.5, FIMDB B.1 to B.3, and FIMDB C.1; Federated In-Memory Database Instance, Algorithms, and Applications Managed by Service Provider; Master Data Managed by Service Provider; Sensitive Data reside at Site.]
■ Aim: Provide managed Analyze Genomes services while sensitive data remains local
■ Process steps
□ Connect existing resources to join the federated database landscape
□ Install workers on local nodes to process sensitive data and store results in local DB instances
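The process steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Analyze Genomes implementation: `sqlite3` stands in for a site's local in-memory database instance, and the names `LocalWorker` and `process` as well as the GC-content computation are hypothetical placeholders for a real analysis.

```python
import sqlite3


class LocalWorker:
    """Hypothetical worker installed at a site; raw data never leaves it."""

    def __init__(self, site: str):
        self.site = site
        # Stand-in for the site's local in-memory database instance.
        self.local_db = sqlite3.connect(":memory:")
        self.local_db.execute(
            "CREATE TABLE results (sample_id TEXT PRIMARY KEY, gc_content REAL)"
        )

    def process(self, sample_id: str, sequence: str) -> dict:
        """Analyze a sensitive sample locally and store the result locally."""
        if not sequence:
            raise ValueError("sequence must not be empty")
        # Placeholder analysis: fraction of G and C bases in the sequence.
        gc = sum(base in "GC" for base in sequence) / len(sequence)
        with self.local_db:  # commits the insert on success
            self.local_db.execute(
                "INSERT INTO results VALUES (?, ?)", (sample_id, gc)
            )
        # Only status metadata is reported back, not the data or the result.
        return {"site": self.site, "sample_id": sample_id, "status": "done"}


worker = LocalWorker("Site A")
status = worker.process("sample-1", "GATTACAGGC")
stored = worker.local_db.execute("SELECT gc_content FROM results").fetchone()[0]
```

The design point of the sketch is the return value of `process`: the caller learns that the task finished, while the sequence and the computed result stay in the site's own database.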
9. Analyze Genomes: Real-time Analysis of Big Medical Data
[Figure: Analyze Genomes architecture]
■ Applications shown: Drug Response Analysis, Pathway Topology Analysis, Medical Knowledge Cockpit, Oncolyzer, Clinical Trial Recruitment, Cohort Analysis, ...
■ In-Memory Database Extensions for Life Sciences: Data Exchange, App Store; Access Control, Data Protection; Fair Use; Statistical Tools; Real-time Analysis; App-spanning User Profiles
■ Combined and Linked Data: Genome Data, Cellular Pathways, Genome Metadata, Research Publications, Pipeline and Analysis Models, Drugs and Interactions
■ Indexed Sources
10. Use Case:
Identification of Best Treatment Option for Cancer Patient
■ Patient: 48 years, female, non-smoker, smoke-free environment
■ Diagnosis: Non-Small Cell Lung Cancer (NSCLC), stage IV
1. Surgery to remove tumor
2. Tumor sample is sent to laboratory to extract DNA
3. DNA is sequenced resulting in up to 750 GB of raw data per sample
4. Processing of raw data to perform analysis
5. Identification of relevant driver mutations using international medical knowledge
6. Informed decision making
11. From Raw Genome Data to Analysis
■ Sequencing: Acquire digital DNA data
■ Alignment: Reconstruction of the complete genome from snippets
■ Variant Calling: Identification of genetic variants
■ Data Annotation: Linking genetic variants with research findings
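A toy sketch can make the last three steps concrete. It is not the pipeline used by Analyze Genomes: the reference string, the reads, the annotation entry, and all function names are made up for illustration, and the alignment simply picks the offset with the fewest mismatches.

```python
REFERENCE = "ACGTACGTTAGC"  # toy reference genome (assumption)
READS = ["ACGTAC", "GTTTGC"]  # toy sequencing output; the second read carries a variant
KNOWN_FINDINGS = {(9, "T"): "hypothetical finding X"}  # made-up annotation source


def align(read, reference):
    """Alignment: return the reference offset with the fewest mismatches."""
    offsets = range(len(reference) - len(read) + 1)
    return min(
        offsets,
        key=lambda o: sum(b != reference[o + i] for i, b in enumerate(read)),
    )


def call_variants(reads, reference):
    """Variant calling: collect (position, observed base) pairs that differ."""
    variants = set()
    for read in reads:
        offset = align(read, reference)
        for i, base in enumerate(read):
            if base != reference[offset + i]:
                variants.add((offset + i, base))
    return sorted(variants)


def annotate(variants, findings):
    """Data annotation: link each variant to a known finding, if any."""
    return [
        (pos, base, findings.get((pos, base), "no known finding"))
        for pos, base in variants
    ]


annotated = annotate(call_variants(READS, REFERENCE), KNOWN_FINDINGS)
# annotated == [(9, "T", "hypothetical finding X")]
```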
12. Standardized Modeling of
Genome Data Analysis Pipelines
■ Graphical modeling of analysis pipelines
□ Supports reproducible research
□ BPMN-2.0-compliant
■ Extension of modeling notation by
□ Modular structure
□ Degree of parallelization
□ Parameters/variables
■ Pipelines stored in IMDB and executed through
our worker framework
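As a rough illustration of what such a pipeline model carries, the sketch below stores the three extensions named above (modular steps, degree of parallelization, parameters) in plain Python data classes. It is an assumption-laden stand-in: the class names, fields, and step names are hypothetical, and it does not produce or parse BPMN 2.0.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library since Python 3.9


@dataclass
class Step:
    """One modular pipeline step (hypothetical model, not BPMN 2.0)."""
    name: str
    parallelism: int = 1                             # degree of parallelization
    parameters: dict = field(default_factory=dict)   # parameters/variables
    depends_on: list = field(default_factory=list)   # names of prerequisite steps


@dataclass
class Pipeline:
    name: str
    steps: list

    def execution_order(self):
        """Return step names in an order that respects the dependencies."""
        graph = {s.name: set(s.depends_on) for s in self.steps}
        return list(TopologicalSorter(graph).static_order())


pipeline = Pipeline(
    name="toy_variant_pipeline",
    steps=[
        Step("annotation", depends_on=["variant_calling"]),
        Step("alignment", parallelism=4, parameters={"reference": "toy_reference"}),
        Step("variant_calling", parallelism=2, depends_on=["alignment"]),
    ],
)
order = pipeline.execution_order()
# order == ["alignment", "variant_calling", "annotation"]
```

Keeping the model as data rather than code is what makes a stored pipeline reproducible: the same definition can be loaded and executed again with the same parameters.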
13. Execution of
Genome Data Analysis Pipelines
■ Dedicated scheduler for optimized pipeline execution
□ Assigns tasks to workers
□ Recovery of pipeline status
■ Scheduler uses IMDB logs for workload estimation
■ Different scheduling algorithms available, e.g.
□ High Throughput
□ Priority First
□ User-/Group-based
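The difference between two of the listed policies can be sketched as follows. This is a simplified model under stated assumptions, not the scheduler of the worker framework: task names, priorities, and runtime estimates are invented, lower priority numbers are assumed to be more urgent, and the estimates stand in for values that would be derived from database logs.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int             # lower number = more urgent (assumption)
    estimated_seconds: float  # stand-in for an estimate derived from logs


def priority_first(task):
    """Most urgent tasks first; ties broken by shorter estimate."""
    return (task.priority, task.estimated_seconds)


def high_throughput(task):
    """Shortest estimated task first, so many tasks finish early."""
    return (task.estimated_seconds, task.priority)


def schedule(tasks, workers, policy):
    """Assign tasks in policy order to whichever worker becomes free first."""
    free_at = [(0.0, w) for w in workers]  # (time the worker is free, worker)
    heapq.heapify(free_at)
    plan = []
    for task in sorted(tasks, key=policy):
        start, worker = heapq.heappop(free_at)
        plan.append((task.name, worker, start))
        heapq.heappush(free_at, (start + task.estimated_seconds, worker))
    return plan


tasks = [
    Task("align_patient_1", priority=1, estimated_seconds=60.0),
    Task("annotate_cohort", priority=2, estimated_seconds=5.0),
    Task("call_variants_patient_2", priority=1, estimated_seconds=30.0),
]
workers = ["worker-1", "worker-2"]
plan_priority = schedule(tasks, workers, priority_first)
plan_throughput = schedule(tasks, workers, high_throughput)
```

With these inputs, the priority-first plan starts both priority-1 tasks immediately and delays `annotate_cohort` until 30 s, while the high-throughput plan starts the 5 s task at once and delays the 60 s alignment until 5 s.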
[Figure: IMDB, Scheduler, and four Workers, connected by Pipeline Tasks, Pipeline Subtasks, Events, and Data.]
14. Real-time Analysis of
Genetic Variants
■ Genome Browser enables detailed exploration of genome loci and their associations
■ Ranks variants according to known diseases
■ Integrates latest international medical
knowledge, annotations, and literature
■ Provides links back to primary data sources,
e.g. EBI, NCBI, dbSNP, and UCSC
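The slide does not say how variants are ranked. As one possible reading, the sketch below orders variants by the number of known disease links, most links first. The scoring rule, the function name, and all identifiers in the sample data are assumptions made for illustration.

```python
def rank_variants(variants, disease_links):
    """Sort variant IDs by number of linked diseases (descending), then by ID."""
    return sorted(
        variants,
        key=lambda v: (-len(disease_links.get(v, ())), v),
    )


# Made-up variant IDs and disease links, for illustration only.
disease_links = {
    "variant_a": ["disease_1"],
    "variant_b": ["disease_1", "disease_2", "disease_3"],
}
ranked = rank_variants(["variant_a", "variant_c", "variant_b"], disease_links)
# ranked == ["variant_b", "variant_a", "variant_c"]
```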
15. Medical Knowledge Cockpit
■ Uses patient specifics to provide more adequate results
■ Immediate exploration of relevant information, e.g.
□ Gene descriptions
□ Molecular impact and related pathways
□ Scientific publications
□ Suitable clinical trials
■ Translates manual searching for hours or days into finding
16. Drug Response Analysis
■ Incorporate knowledge about historic cases to optimize
treatment of current cases
■ Enables real-time exploration of Xenograft experiments
■ Configurable medical model to predict drug response
17. Join us for upcoming projects!
■ Global Medical Knowledge (Master’s project)
■ Detect cardiovascular diseases and evaluate treatment options (DHZB)
■ Use health insurance data to improve health care research (AOK)
■ Pharmacogenetics (Bayer)
■ Generously supported by
Interdisciplinary Design Thinking Teams
You?
18. What to Take Home?
Test it Yourself: AnalyzeGenomes.com
■ For patients
□ Identify relevant clinical trials and medical experts
□ Become an informed patient
■ For clinicians
□ Identify pharmacokinetic correlations
□ Scan for similar patient cases, e.g. to evaluate therapy efficiency
■ For researchers
□ Enable real-time analysis of medical data, e.g. assess pathways to identify impact of detected variants
□ Combined mining in structured and unstructured data, e.g. publications, diagnosis, and EMR data
19. Keep in contact with us!
Hasso Plattner Institute
Enterprise Platform & Integration Concepts (EPIC)
August-Bebel-Str. 88
14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow
Program Manager E-Health
schapranow@hpi.de
Cindy Perscheid
Research Assistant
cindy.perscheid@hpi.de