Enabling Real-Time Genome Data Research with In-Memory Database Technology (Illumina GIA Meeting)
1. Enabling Real-Time Genome Data Research with In-Memory Database Technology
May 30, 2013
Dr. Matthieu Schapranow
Hasso Plattner Institute
Dr. Anja Bog
SAP Labs LLC
2. Numbers You Should Know
Comparison of Costs
Enabling Real-Time Genome Data Research, GIA Meeting, Dr. Schapranow and Dr. Bog, May 30, 2013
[Figure: Comparison of Costs for Main Memory and Genome Analysis. Series: costs per megabyte RAM and costs per megabase sequencing. Y-axis: costs in USD, logarithmic scale from 0.01 to 10,000. X-axis: January 2001 to January 2012.]
3. In-Memory Technology
A Toolbox for Big Data Analysis
■ Any attribute as index
■ Insert only for time travel
■ Combined column and row store
■ No aggregate tables
■ Minimal projections
■ Partitioning
■ Analytics on historical data
■ Single and multi-tenancy
■ SQL interface on columns & rows
■ Reduction of layers
■ Lightweight compression
■ Multi-core/parallelization
■ On-the-fly extensibility
■ Active/passive data store
■ Bulk load
■ Text retrieval and extraction
■ Object to relational mapping
■ Dynamic multi-threading within nodes
■ Map reduce
■ No disk
■ Group key
[Figure: Discovery Service, Read Event Repositories, and Verification Services on SAP HANA; up to 8,000 read event notifications per second and up to 2,000 requests per second.]
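The "insert only for time travel" item in the toolbox can be illustrated with a small sketch. It assumes a simplified model in which every update appends a new version with a timestamp and reads may ask for the state "as of" an earlier time. The class name `InsertOnlyTable`, its methods, and the sample key are hypothetical and are not taken from the slides or from any SAP HANA API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class InsertOnlyTable:
    """Append-only store: an update adds a new version instead of overwriting."""

    # Each entry is (key, value, valid_from); entries are never modified.
    versions: List[Tuple[str, Any, int]] = field(default_factory=list)

    def put(self, key: str, value: Any, timestamp: int) -> None:
        self.versions.append((key, value, timestamp))

    def get(self, key: str, as_of: Optional[int] = None) -> Any:
        """Return the newest value for key written at or before as_of."""
        result, best_ts = None, None
        for k, value, ts in self.versions:
            if k != key:
                continue
            if as_of is not None and ts > as_of:
                continue  # this version did not exist yet at time as_of
            if best_ts is None or ts >= best_ts:
                result, best_ts = value, ts
        return result


# Hypothetical usage: the old value stays queryable after the update.
table = InsertOnlyTable()
table.put("sample-1/status", "sequenced", timestamp=1)
table.put("sample-1/status", "aligned", timestamp=5)
current = table.get("sample-1/status")
earlier = table.get("sample-1/status", as_of=3)
print(current, earlier)
```

Because nothing is overwritten, analytics on historical data reduce to filtering by timestamp, at the cost of storing every version.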
4. High-Performance In-Memory Genome Project
Challenges of Genome Data Analysis
Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
Bound To | CPU Performance | Memory Capacity
Duration | Hours – Days | Weeks
HPI | Minutes | Real-time
In-Memory Technology | Multi-Core | Partitioning & Compression
5. High-Performance In-Memory Genome Project
Challenges of Genome Data Analysis
Analysis of Genomic Data | Alignment and Variant Calling | Analysis of Annotations in World-wide DBs
Bound To | CPU Performance | Memory Capacity
Duration | Hours – Days | Weeks
HPI & SAP | Minutes – Hours | Interactively
In-Memory Technology | Multi-Core | Partitioning & Compression
6. High-Performance In-Memory Genome Project
Selected Research Topics
Improving Analyses:
■ Clustering of patient cohorts, e.g. k-means clustering
■ Combined search, e.g. in clinical trials and side-effect databases
■ Ad-hoc analysis of genetic pathways, e.g. to identify cause/effect
Improving Data Preparations:
■ Graphical modeling of Genome Data Processing (GDP) pipelines
■ Scheduling and execution of multiple GDP pipelines in parallel
■ App store for medical knowledge (bring algorithms to data)
■ Exchange of sensitive data, e.g. history-based access control
■ Billing processes for intellectual property and services
7. Genomics Analysis
Loaded part of 1,000 genomes pre-phase 1 dataset
■ Chromosome 1 of 629 individuals from the 1,000 genomes project
■ 12 billion entries in largest database table
■ 293 GB of data (compressed in HANA)
Results
■ Report SNPs failing quality control
UCSC 102.47 sec | SAP HANA 1.25 sec – 82x faster
■ Compute the alternative allele frequency for each variant/region
VCFtools 259 sec | SAP HANA 0.43 sec – 600x faster
■ Compute the total number of missing genotypes per individual
VCFtools 548 sec | SAP HANA 2 sec – 270x faster
Supported by Dr. Carlos Bustamante lab
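The two VCFtools-style statistics in the results above (alternative allele frequency per variant and missing genotypes per individual) can be sketched on a toy genotype matrix. This is only an illustration under stated assumptions: genotypes are diploid and coded as the count of alternative alleles (0, 1, or 2), `None` marks a missing call, and all variant and individual names are hypothetical. It is not the SAP HANA query used for the benchmark.

```python
from typing import Dict, List, Optional

# Toy genotype matrix (hypothetical data): one row per variant, one column per
# individual. Values count alternative alleles (0, 1, or 2); None is missing.
GENOTYPES: Dict[str, List[Optional[int]]] = {
    "variant_a": [0, 1, 2, None],
    "variant_b": [1, 1, None, None],
    "variant_c": [0, 0, 0, 0],
}
INDIVIDUALS = ["ind_1", "ind_2", "ind_3", "ind_4"]


def alt_allele_frequency(calls: List[Optional[int]]) -> Optional[float]:
    """Alternative alleles divided by all called alleles (2 per diploid call)."""
    called = [c for c in calls if c is not None]
    if not called:
        return None  # undefined when every genotype is missing
    return sum(called) / (2 * len(called))


def missing_per_individual(
    genotypes: Dict[str, List[Optional[int]]], individuals: List[str]
) -> Dict[str, int]:
    """Count missing genotypes for each individual across all variants."""
    counts = {name: 0 for name in individuals}
    for calls in genotypes.values():
        for name, call in zip(individuals, calls):
            if call is None:
                counts[name] += 1
    return counts


frequencies = {v: alt_allele_frequency(c) for v, c in GENOTYPES.items()}
missing = missing_per_individual(GENOTYPES, INDIVIDUALS)
print(frequencies)
print(missing)
```

Both statistics are full scans with a simple aggregation, one grouped by variant and one grouped by individual, which is the kind of workload a column store can parallelize well.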
8. Working With Big Data
Loaded entire 1,000 genomes pre-phase 1 dataset
■ Queries on all chromosomes for all 629 individuals
■ 136 billion entries in largest database table
■ ≈1.2 TB (compressed in HANA)
[Figure: query results using R connectivity, reporting all variations in BRCA1 and BRCA2. Chart labels: Chromosome, Absolute frequency, Number of alleles.]
Supported by Dr. Carlos Bustamante lab
9. High-Performance In-Memory Genome Project
Analysis of Patient Cohorts
■ Columnar storage optimizes space requirements while enhancing calculation performance
■ Single k-means clustering: R 470 ms vs. HANA 30 ms (15:1)
■ >60k clusters are calculated in <2 s on a 1,000-core cluster
■ → Interactive exploration of clusters becomes possible
Why is a therapy only working in 80% of the patient cases?
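For readers unfamiliar with k-means, the clustering step mentioned above can be sketched in plain Python. This is a minimal version of Lloyd's algorithm, assuming the first k points seed the centroids so that the run is deterministic. The patient features (age, biomarker level) and their values are hypothetical, and the sketch says nothing about how the R or HANA implementations compared in the slide work internally.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: List[Point], k: int, iterations: int = 20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if not members:
                new_centroids.append(centroids[c])  # leave empty clusters in place
                continue
            dims = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dims))
            )
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignment


# Hypothetical cohort: (age, biomarker level) for six patients.
PATIENTS = [(30.0, 1.0), (62.0, 8.0), (32.0, 1.5), (60.0, 7.5), (29.0, 0.5), (65.0, 9.0)]
centroids, labels = k_means(PATIENTS, k=2)
print(labels)
```

In practice the features would be scaled before clustering, since an unscaled attribute with a large range (here, age) dominates the distance.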
10. High-Performance In-Memory Genome Project
Integration of Genetic Pathways
■ Storing and accessing graph data within the in-memory database (Active Information Store)
■ 263 KEGG pathways with 6,481 genetic components, 32,784 vertices, and 90,682 edges
■ Rank all pathways by evaluation of node connections: IMDB <350 ms
■ >5.5k rankings can be calculated in <2 s on a 1,000-core cluster
What are the known effects of a somatic mutation?
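The slide does not say how "evaluation of node connections" is scored, so the following is only one possible reading. It assumes each pathway is an undirected edge list and ranks the pathways that contain a given gene by how many edges touch that gene. The scoring rule, the function names, and the toy pathways are all assumptions for illustration and are not real KEGG data.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]

# Hypothetical toy pathways as undirected edge lists (not real KEGG data).
PATHWAYS: Dict[str, List[Edge]] = {
    "pathway_1": [("gene_a", "gene_b"), ("gene_a", "gene_c"), ("gene_c", "gene_d")],
    "pathway_2": [("gene_a", "gene_e")],
    "pathway_3": [("gene_f", "gene_g")],
}


def degree_in_pathway(edges: List[Edge], gene: str) -> int:
    """Number of edges in the pathway that touch the given gene."""
    return sum(1 for source, target in edges if gene in (source, target))


def rank_pathways(pathways: Dict[str, List[Edge]], gene: str) -> List[Tuple[str, int]]:
    """Rank pathways containing the gene by its connection count, highest first."""
    scores = [(name, degree_in_pathway(edges, gene)) for name, edges in pathways.items()]
    hits = [(name, score) for name, score in scores if score > 0]
    # Sort by descending score; ties are broken by pathway name.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


ranking = rank_pathways(PATHWAYS, "gene_a")
print(ranking)
```

Each pathway is scored independently, so a ranking over all pathways parallelizes naturally across cores.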
11. High-Performance In-Memory Genome Project
Combined Search in Structured and Unstructured Data
■ In-memory technology enables entity extraction, e.g. age, genes, and drugs
■ Integrated 30k free-text documents from clinicaltrials.gov
■ Relational search on entities enables interactive comparison
■ Results are rated by relevant search criteria
What clinical trials are relevant for an individual patient?
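The idea of extracting entities from free text and then searching them relationally can be sketched as follows. The sketch assumes a simple dictionary lookup for genes and drugs and a regular expression for age ranges. The vocabularies, the trial texts, the drug names, and the function names are hypothetical. The slides do not describe the actual extraction method, which is presumably far more robust.

```python
import re
from typing import Dict, List

# Hypothetical vocabularies and documents; a real system would use curated
# dictionaries and the actual trial descriptions.
GENES = {"BRCA1", "BRCA2"}
DRUGS = {"drug_x", "drug_y"}
TRIALS = {
    "trial_1": "Patients aged 18 to 65 with a BRCA1 mutation receive drug_x.",
    "trial_2": "Adults aged 40 to 80 with a BRCA2 mutation receive drug_y.",
}
AGE_RANGE = re.compile(r"aged (\d+) to (\d+)")


def extract_entities(text: str) -> Dict[str, object]:
    """Turn free text into structured fields: age range, genes, drugs."""
    tokens = set(re.findall(r"[A-Za-z0-9_]+", text))
    match = AGE_RANGE.search(text)
    return {
        "min_age": int(match.group(1)) if match else None,
        "max_age": int(match.group(2)) if match else None,
        "genes": sorted(tokens & GENES),
        "drugs": sorted(tokens & DRUGS),
    }


def matching_trials(entities_by_trial: Dict[str, dict], age: int, gene: str) -> List[str]:
    """Relational-style filter over the extracted entities."""
    result = []
    for trial, entities in entities_by_trial.items():
        if entities["min_age"] is None:
            continue  # no age range found in the text
        if not entities["min_age"] <= age <= entities["max_age"]:
            continue
        if gene in entities["genes"]:
            result.append(trial)
    return sorted(result)


ENTITIES = {trial: extract_entities(text) for trial, text in TRIALS.items()}
relevant = matching_trials(ENTITIES, age=50, gene="BRCA1")
print(relevant)
```

Once the entities are stored as columns, the patient-specific question becomes an ordinary filter, which is what makes interactive comparison feasible.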
12. High-Performance In-Memory Genome Project
Architectural Overview
[Architecture diagram. Labels: Cohort Analysis, Pathway Finder, Paper Search, Clinical Trial Finder, Pipeline Editor, ...]
Extensions: App Store, Access Control, Billing, ...
In-Memory Database: Pipeline Data, Genome Data, Pathways, Genome Metadata, Papers, Pipeline Models, Analytical Tools, ...
13. The Future:
Combined Information Requirements
Enable clinicians to:
■ Make evidence-based therapy decisions at the patient’s bedside
■ Exchange the latest patient data with international experts
Enable researchers to:
■ Investigate genomes of patient cohorts to derive new knowledge
■ Analyze results in real time
Enable patients to:
■ Identify risk factors long before they turn into diseases
■ Identify experts and similar patient cases to bring up alternatives for individual therapies
14. Thank you for your interest!
Keep in contact with us.
SAP Labs LLC
Dr. Anja Bog
3410 Hillview Avenue
94304 Palo Alto, CA
anja.bog@sap.com
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
Dr. Matthieu-P. Schapranow
August-Bebel-Str. 88
14482 Potsdam, Germany
schapranow@hpi.uni-potsdam.de
http://j.mp/schapranow