What Are The Drone Anti-jamming Systems Technology?
A Platform for Integrated Genome Data Analysis
1. A Platform for Integrated
Genome Data Analysis
All About Data Congress, UMC Groningen
January 30th, 2014
Dr. Matthieu Schapranow
Hasso Plattner Institute
2. Hasso Plattner Institute
Key Facts
2
■ Founded as a public-private partnership
in 1998 in Potsdam near Berlin, Germany
■ Institute belongs to the
University of Potsdam
■ Ranked 1st in CHE since 2009
■ 500 B.Sc. and M.Sc. students
■ 10 professors, 150 PhD students
■ Course of study: IT Systems Engineering
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
3. Hasso Plattner Institute
Programs
3
■ Full university curriculum
■ Bachelor (6 semesters)
■ Master (4 semesters)
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
4. Hasso Plattner Institute
Enterprise Platform and Integration Concepts Group
4
Prof. Dr. h.c. Hasso Plattner
■ Research focuses on the technical aspects of enterprise
software and design of complex applications
□ In-Memory Data Management for
Enterprise Applications
□ Enterprise Application Programming Model
□ Scientific Data Management
□ Human-Centered Software Design and Engineering
■ Industry cooperations, e.g. SAP, Siemens, Audi, and EADS
■ Research cooperations, e.g. Stanford, MIT, and Berkeley
Partner of Stanford Center
for Design Research
Partner of MIT in Supply
Chain Innovation and CSAIL
Partner at UC Berkeley
RAD / AMP Lab
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
5. Our Motivation
Personalized Medicine
5
■ Motivation: Can we analyze the entire data of a patient, incl.
Electronic Health Records (EHR) and genome data, during a
doctor’s visit?
■ Genome data analysis may sum up to weeks,
i.e. biopsy, biological preparation, sequencing,
alignment, variant calling, full analysis, and evaluation
■ Issue: Complex and time-consuming data processing tasks
■ In-memory technology accelerates genome data processing
□ Highly parallel alignment / variant calling
□ Real-time analysis of individual patient or cohort data
□ Combined search in structured / unstructured data
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
6. Our Challenge
Distributed Big Data Sources
6
Human genome/biological data
600GB per full genome
15PB+ in databases of leading institutes
Hospital information systems
Often more than 50GB
Cancer patient records
>160k records at NCT
Prescription data
1.5B records from 10,000 doctors
and 10M Patients (100 GB)
Human proteome
160M data points (2.4GB) per sample
>3TB raw proteome data in ProteomicsDB
PubMed database
>23M articles
Medical sensor data
Scan of a single organ in 1s
creates 10GB of raw data
Clinical trials
Currently more than 30k
recruiting on ClinicalTrials.gov
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
7. Our Approach
In-Memory Technology
●
●
●
●
Read Event
Read Event
Repositories
Repositories
Combined
column
and row store
Insert only
for time travel
+
+
+
++
A
up to 2.000
up to 2.000
requests
requests
per second
per second
Minimal
projections
Any attribute
as index
Bulk load
Discovery Service
Discovery Service
Multi-core/
parallelization
SAP HANA
SAP HANA
P
Active/passive
data store
Dynamic
multithreading
within nodes
No aggregate
tables
+++
up to 8.000 read
up to 8.000 read
event notifications
event notifications
per second
per second
On-the-fly
extensibility
Map
reduce
P
P
t
A
A
Lightweight
Compression
Partitioning
SQL
Analytics on
historical
data
Single and
multi-tenancy
Object to
relational
mapping
Group Key
SQL interface
on columns &
rows
Reduction of
layers
x
x
7
Verification
Verification
Services
Services
T
Text Retrieval
and Extraction
No disk
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
9. Our Vision
Personalized Medicine
9
Desirability
■ Leveraging directed customer services
■ Portfolio of integrated services for
clinicians, researchers, and patients
■ Include latest research results, e.g. most
effective therapies
Viability
Feasibility
■ Enable personalized medicine also in faroff regions and developing countries to
exceed customer base
■ HiSeq 2500 enables highcoverage whole genome
sequencing in ≈1d
■ Share data via the Internet to get
feedback from word-wide experts (costsaving)
■ IMDB enables allele frequency
determination of 12B records
within <1s
■ Combine research data (publications,
annotations, genome data) from
international databases in a single
knowledge base
■ Identification of relevant
annotations out of 80M <1s
■ Data preparation as a service
reduces TCO
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
10. High-Performance In-Memory Genome Project
Integration of Genomic Data
10
■ Preprocessing of DNA
(alignment, variant
calling) can be
modeled and is
executed as integrated
process
■ Results are stored in
in-memory database
■ Real-time analysis of
genome data enables
completely new way of
research and therapies
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
11. High-Performance In-Memory Genome Project
Architectural Overview
11
Information and Feedback within the Window of Opportunity
Patients
Doctors
Insurers
Researchers
Real-Time Capturing
and Data Analysis
In-Memory Database
*omics
Electronic
Service Line
Medical
Items
Records
All Relevant Medical Information
...
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
12. High-Performance In-Memory Genome Project
Selected Research Topics
12
Improving Analyses:
■ Medical Knowledge Cockpit, Oncolyzer
■ Genome Browser enables deep dive into the genome
■ Cohort Analysis, e.g. clustering of patient cohorts
■ Combined Search, e.g. in clinical trials and side-effect databases
■ Pathways Topology Analysis, e.g. to identify cause/effect
Improving Data Preparations:
■ Graphical modeling of Genome Data Processing (GDP) pipelines
■ Scheduling and execution of multiple GPD pipelines in parallel
■ App store for medical knowledge (bring algorithms to data)
■ Exchange of sensitive data, e.g. history-based access control
■ Billing processes for intellectual property and services
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
13. High-Performance In-Memory Genome Project
HANA Oncolyzer
13
■ Research initiative for
exchanging relevant tumor
data to improve treatment
■ Awarded with the 2012
Innovation Award of the
German Capitol Region
■ In-Memory Technology as
key-enabler for real-time
analysis of tumor data in
seconds instead of hours
■ Information available at your fingertips: In-Memory Technology on
mobile devices (iPad)
■ Interdisciplinary cooperation between medical doctors, researchers,
and software engineers
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
14. HANA Oncolyzer
Patient Search and Details
14
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
15. HANA Oncolyzer
Analysis of Patient Cohorts
15
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
16. High-Performance In-Memory Genome Project
Medical Knowledge Cockpit
■ Search for affected genes in
distributed and heterogeneous data
sources
16
■ Immediate exploration of relevant
information, such as
□ Gene descriptions,
□ Molecular impact and related
pathways,
Unified access to
structured and
unstructured data
sources
Automatic
clinical trial
matching using
HANA text analysis
features
□ Scientific publications, and
□ Suitable clinical trials.
■ No manual searching for hours or
days – In-memory technology
translates it into interactive finding
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
17. High-Performance In-Memory Genome Project
Search in Structured and Unstructured Medical Data
■ Extended text analysis feature by
medical terminology
17
□ Genes (122,975 + 186,771
synonyms)
□ Medical terms and categories
(98,886 diseases + 48,561
synonyms, 47 categories)
Unified access to
structured and
unstructured data
sources
Clinical trial
matching using
text analysis features
□ Pharmaceutical ingredients
(7,099 + 5,561 synonyms)
■ Imported clinicaltrials.gov database
(145k trials/30,138 recruiting)
■ Extracted, e.g., 320k genes, 161k
ingredients, 30k periods
■ Select all studies based on multiple
filters in <500ms
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
18. High-Performance In-Memory Genome Project
Analysis of Patient Cohorts
■ In a patient cohort, a subset does
not respond to therapy – why?
18
■ Clustering using various statistical
algorithms, such as k-means or
hierarchical clustering
■ Calculation of all locus combinations
in which at least 5% of all TCGA
participants have mutations: 200ms
for top 20 combinations
Fast clustering
directly inside HANA
■ Individual clusters are calculated in
parallel directly within the database
■ K-means algorithm: 50ms (PAL) vs.
500ms (R)
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
19. High-Performance In-Memory Genome Project
Genome Browser
■ Genome Browser: Comparison of
multiple genomes with reference
19
■ Combined knowledge base: latest
relevant annotations and literature,
e.g. NCBI, dbSNP, and UCSC
■ Detailed exploration of genome
locations and existing associations
Unified access
to multiple formerly
disjoint data sources
Matching of
variants with relevant
international annotations
■ Ranked variants, e.g. accordingly to
known diseases
■ Links to more detailed sources
enable fast identification of relevant
information while eliminating longlasting searches.
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
20. High-Performance In-Memory Genome Project
Pathway Analysis
■ Search in pathways is limited to “is
a certain element contained” today
20
■ Integrated >1,5k pathways from
international sources, e.g. KEGG,
HumanCyc, and WikiPathways, into
HANA
Unified access
to multiple formerly
disjoint data sources
■ Implemented graph-based topology
exploration and ranking based on
patient specifics
■ Enables interactive identification of
possible dysfunctions affecting the
course of a therapy before its start
Pathway analysis
of genetic variations with
Graph Engine
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
21. High-Performance In-Memory Genome Project
Drug Response Testing
■ Drug response depends on
individual genetic variants of tumors
21
■ Challenge: Identification of relevant
genetic variants and their impact on
drug response is a ongoing research
activity, e.g. Xenograft models
■ Exploration of experiment results is
time-consuming and Excel-driven
Interactive analysis
of correlations between drugs
and genetic variants
■ In-memory technology enables
interactive exploration of
experiment data to leverage new
scientific insights
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
22. What to take home?
Test it: http://we.AnalyzeGenomes.com
22
For researchers
■ Enable real-time analysis of genome data
■ Automatic scan of pathways to identify cellular
impact of mutations
■ Free-text search in publications, diagnosis, and EMR
data (structured and unstructured data)
For clinicians
■ Preventive diagnostics to identify risk patients
■ Indicate pharmacokinetic correlations
■ Scan for comparable patient cases
For patients
■ Identify relevant clinical trials / experts
■ Start most appropriate therapy early based on all
evidences and latest knowledge
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014
23. Thank you for your interest!
Keep in contact with us.
23
Dr. Matthieu-P. Schapranow
schapranow@hpi.uni-potsdam.de
http://we.analyzegenomes.com/
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
Dr. Matthieu-P. Schapranow
August-Bebel-Str. 88
14482 Potsdam, Germany
A Platform for Integrated Genome Data Analysis, All About Data Congress, Schapranow, Jan 30, 2014