Enabling Real-Time Genome Data Research
with In-Memory Database Technology
May 30, 2013
Dr. Matthieu Schapranow
Hasso Plattner Institute
Dr. Anja Bog
SAP Labs LLC
Numbers You Should Know
Comparison of Costs
Enabling Real-Time Genome Data Research, GIA Meeting, Dr. Schapranow and Dr. Bog, May 30, 2013
[Figure: Comparison of Costs for Main Memory and Genome Analysis. Y-axis: costs in USD, logarithmic scale from 0.01 to 10,000. X-axis: January 2001 to January 2012. Series: costs per megabyte of RAM; costs per megabase of sequencing.]
In-Memory Technology
A Toolbox for Big Data Analysis
■  Any attribute as index
■  Insert only for time travel
■  Combined column and row store
■  No aggregate tables
■  Minimal projections
■  Partitioning
■  Analytics on historical data
■  Single and multi-tenancy
■  SQL interface on columns & rows
■  Reduction of layers
■  Lightweight compression
■  Multi-core/parallelization
■  On-the-fly extensibility
■  Active/passive data store
■  Bulk load
■  Text retrieval and extraction
■  Object to relational mapping
■  Dynamic multi-threading within nodes
■  Map reduce
■  No disk
■  Group key
[Embedded diagram: Discovery Service, Read Event Repositories, and Verification Services on SAP HANA; up to 8,000 read event notifications per second and up to 2,000 requests per second.]
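Two of the toolbox items above, lightweight compression and insert-only storage, can be illustrated with a small Python sketch. This is not how SAP HANA is implemented; the class `DictColumn` and its method names are hypothetical and chosen only for this example. It shows the general idea of dictionary encoding: each distinct value is stored once, rows hold small integer codes, and a scan compares integers instead of full values.

```python
class DictColumn:
    """Toy dictionary-encoded, append-only column (illustrative only)."""

    def __init__(self):
        self.dictionary = []   # distinct values, in order of first appearance
        self.value_ids = {}    # value -> position in self.dictionary
        self.codes = []        # one small integer per row instead of the full value

    def append(self, value):
        """Insert-only: rows are appended, never updated in place."""
        if value not in self.value_ids:
            self.value_ids[value] = len(self.dictionary)
            self.dictionary.append(value)
        self.codes.append(self.value_ids[value])

    def get(self, row):
        """Decode the value stored at a row position."""
        return self.dictionary[self.codes[row]]

    def scan_equal(self, value):
        """Return row positions equal to `value` by comparing integer codes."""
        code = self.value_ids.get(value)
        if code is None:
            return []
        return [row for row, c in enumerate(self.codes) if c == code]


chromosome = DictColumn()
for value in ["chr1", "chr1", "chr2", "chr1", "chr17"]:
    chromosome.append(value)

rows_chr1 = chromosome.scan_equal("chr1")
```

Columns with few distinct values, such as a chromosome name repeated across billions of rows, are the case where this kind of encoding saves the most space.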
High-Performance In-Memory Genome Project
Challenges of Genome Data Analysis
Analysis of Genomic Data
                      Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
Bound To              CPU Performance                 Memory Capacity
Duration              Hours – Days                    Weeks
HPI                   Minutes                         Real-time
In-Memory Technology  Multi-Core                      Partitioning & Compression
High-Performance In-Memory Genome Project
Challenges of Genome Data Analysis
Analysis of Genomic Data
                      Alignment and Variant Calling   Analysis of Annotations in World-wide DBs
Bound To              CPU Performance                 Memory Capacity
Duration              Hours – Days                    Weeks
HPI & SAP             Minutes – Hours                 Interactively
In-Memory Technology  Multi-Core                      Partitioning & Compression
High-Performance In-Memory Genome Project
Selected Research Topics
Improving Analyses:
■  Clustering of patient cohorts, e.g. k-means clustering
■  Combined search, e.g. in clinical trials and side-effect databases
■  Ad-hoc analysis of genetic pathways, e.g. to identify cause/effect
Improving Data Preparations:
■  Graphical modeling of Genome Data Processing (GDP) pipelines
■  Scheduling and execution of multiple GDP pipelines in parallel
■  App store for medical knowledge (bring algorithms to data)
■  Exchange of sensitive data, e.g. history-based access control
■  Billing processes for intellectual property and services
Genomics Analysis
Loaded part of the 1000 Genomes Project pre-phase 1 dataset
■  Chromosome 1 of 629 individuals from the 1000 Genomes Project
■  12 billion entries in largest database table
■  293 GB of data (compressed in HANA)
Results
■  Report SNPs failing quality control
UCSC 102.47 sec | SAP HANA 1.25 sec – 82x faster
■  Compute the alternative allele frequency for each variant/region
VCFtools 259 sec | SAP HANA 0.43 sec – 600x faster
■  Compute the total number of missing genotypes per individual
VCFtools 548 sec | SAP HANA 2 sec – 270x faster
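To make the last two benchmark queries concrete, here is a minimal Python sketch of what they compute. It is neither the VCFtools nor the SAP HANA implementation; the toy genotype table, the individual names, and the function names are all invented for illustration. Genotypes use the VCF convention of `0` for the reference allele, `1` for the alternative allele, and `.` for a missing call.

```python
# Toy genotype calls: one row per variant, one column per individual.
individuals = ["ind1", "ind2", "ind3"]
genotypes = {
    "var1": ["0/1", "1/1", "./."],
    "var2": ["0/0", "0/1", "0/0"],
    "var3": ["./.", "./.", "1/1"],
}


def alt_allele_frequency(calls):
    """Fraction of non-missing alleles that are the alternative allele."""
    alleles = [a for call in calls for a in call.split("/") if a != "."]
    if not alleles:
        return None  # no called alleles, frequency undefined
    return sum(a == "1" for a in alleles) / len(alleles)


def missing_per_individual(genotypes, individuals):
    """Count variants where an individual's genotype is entirely missing."""
    counts = {name: 0 for name in individuals}
    for calls in genotypes.values():
        for name, call in zip(individuals, calls):
            if all(a == "." for a in call.split("/")):
                counts[name] += 1
    return counts


frequencies = {v: alt_allele_frequency(c) for v, c in genotypes.items()}
missing = missing_per_individual(genotypes, individuals)
```

Both queries are full scans with a simple aggregate, one grouped by variant and one grouped by individual, which is the access pattern a column store is designed for.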
Supported by the lab of Dr. Carlos Bustamante
Working With Big Data
Loaded entire 1000 Genomes Project pre-phase 1 dataset
■  Queries on all chromosomes for all 629 individuals
■  136 billion entries in largest database table
■  ≈1.2 TB (compressed in HANA)
Query results using R connectivity: report all variations in BRCA1 and BRCA2
[Figure: plot labeled "Number of Alleles" with axes "Chromosome" and "Absolute frequency".]
Supported by the lab of Dr. Carlos Bustamante
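A query such as "report all variations in BRCA1 and BRCA2" is, at its core, a range lookup on chromosome and position. The slide ran it against SAP HANA through R; the sketch below instead uses plain Python on a toy table so that it is self-contained. The region coordinates are placeholders made up for the example and are not real gene boundaries, and `variants_in_region` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right

# Placeholder regions: (chromosome, start, end). These coordinates are made up
# for the example and are NOT real gene boundaries.
regions = {
    "BRCA1": ("17", 1000, 2000),
    "BRCA2": ("13", 5000, 6000),
}

# Toy variant table (chromosome, position, ref, alt), sorted by chromosome and position.
variants = sorted([
    ("13", 4990, "A", "G"),
    ("13", 5000, "C", "T"),
    ("13", 5500, "G", "A"),
    ("17", 1500, "T", "C"),
    ("17", 2001, "A", "T"),
])
keys = [(chrom, pos) for chrom, pos, _, _ in variants]


def variants_in_region(chrom, start, end):
    """Return variants with start <= position <= end on the given chromosome."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return variants[lo:hi]


report = {gene: variants_in_region(*region) for gene, region in regions.items()}
```

Because the table is sorted, each lookup is two binary searches plus a slice, independent of how many variants lie outside the region.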
High-Performance In-Memory Genome Project
Analysis of Patient Cohorts
■  Columnar storage optimizes space requirements while enhancing calculation performance
■  Single k-means clustering: R 470 ms vs. HANA 30 ms (about 15:1)
■  >60k clusters are calculated in <2 s on a 1,000-core cluster
■  → Interactive exploration of clusters becomes possible
Why is a therapy only working in 80% of the patient cases?
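For readers unfamiliar with k-means, the following is a minimal pure-Python sketch of the standard Lloyd's algorithm. It is not the R or HANA code that was benchmarked on the slide; the function name `kmeans`, the two-dimensional toy points, and the fixed starting centroids are assumptions made for illustration. In a cohort setting each point would be a vector of patient features.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on tuples of floats; returns (centroids, labels)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid unchanged
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, labels = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each iteration scans every point once per centroid, so many independent clusterings with different parameters can run in parallel, which is the workload the 1,000-core figure above refers to.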
High-Performance In-Memory Genome Project
Integration of Genetic Pathways
■  Storing and accessing graph data within the in-memory database (Active Information Store)
■  263 KEGG pathways with 6,481 genetic components, 32,784 vertices, and 90,682 edges
■  Rank all pathways by evaluation of node connections: IMDB <350 ms
■  >5.5k rankings can be calculated in <2 s on a 1,000-core cluster
What are the known effects of a somatic mutation?
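The slide does not say how node connections are evaluated, so the sketch below shows only one plausible scoring, chosen as an assumption for illustration: a pathway scores one point for every edge that touches a gene of interest, for example a mutated gene. The pathway names, gene names, and both function names are hypothetical.

```python
# Toy pathways as directed edge lists between genes; names are hypothetical.
pathways = {
    "pathway_a": [("G1", "G2"), ("G2", "G3"), ("G3", "G4")],
    "pathway_b": [("G2", "G5"), ("G5", "G6")],
    "pathway_c": [("G7", "G8")],
}


def connection_score(edges, genes_of_interest):
    """Count edges that touch at least one gene of interest."""
    return sum(1 for src, dst in edges
               if src in genes_of_interest or dst in genes_of_interest)


def rank_pathways(pathways, genes_of_interest):
    """Return (name, score) pairs, highest score first, ties broken by name."""
    scored = [(name, connection_score(edges, genes_of_interest))
              for name, edges in pathways.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


ranking = rank_pathways(pathways, {"G2", "G3"})
```

Since every pathway is scored independently, the ranking parallelizes naturally across pathways and across different sets of genes of interest.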
High-Performance In-Memory Genome Project
Combined Search in Structured and Unstructured Data
■  In-memory technology enables entity extraction, e.g. age, genes, and drugs
■  Integrated 30k free-text documents from clinicaltrials.gov
■  Relational search on entities enables interactive comparison
■  Results are rated by relevant search criteria
Which clinical trials are relevant for an individual patient?
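The combination of entity extraction and relational search can be sketched in a few lines of Python. This is a simplified stand-in, not the text analysis engine used in the project: the trial texts and identifiers are invented, the gene and drug vocabularies are tiny hand-written sets, and `extract_entities` and `find_trials` are hypothetical names. Relevance rating of the results is left out.

```python
import re

# Toy free-text trial descriptions; identifiers and contents are invented.
trials = {
    "trial-1": "Patients aged 18 to 65 with BRAF mutation receive vemurafenib.",
    "trial-2": "Adults aged 40 to 80 with EGFR mutation receive erlotinib.",
    "trial-3": "Patients aged 30 to 70 with BRAF or KRAS mutation receive placebo.",
}

# Small hand-written vocabularies stand in for a real entity dictionary.
GENES = {"BRAF", "EGFR", "KRAS"}
DRUGS = {"vemurafenib", "erlotinib"}
AGE_RANGE = re.compile(r"aged (\d+) to (\d+)")


def extract_entities(text):
    """Turn free text into a structured record of ages, genes, and drugs."""
    words = set(re.findall(r"[A-Za-z0-9]+", text))
    match = AGE_RANGE.search(text)
    return {
        "min_age": int(match.group(1)) if match else None,
        "max_age": int(match.group(2)) if match else None,
        "genes": words & GENES,
        "drugs": {w for w in words if w.lower() in DRUGS},
    }


def find_trials(records, gene, age):
    """Relational-style filter on the extracted entities."""
    return sorted(
        trial_id for trial_id, rec in records.items()
        if gene in rec["genes"]
        and rec["min_age"] is not None
        and rec["min_age"] <= age <= rec["max_age"]
    )


records = {trial_id: extract_entities(text) for trial_id, text in trials.items()}
matches = find_trials(records, gene="BRAF", age=68)
```

Once the entities are stored as structured fields, the search for a 68-year-old patient with a BRAF mutation becomes an ordinary filter rather than a full-text scan.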
High-Performance In-Memory Genome Project
Architectural Overview
[Figure: architecture diagram. Labels shown:]
■  Applications: Cohort Analysis, Pathway Finder, Paper Search, Clinical Trial Finder, Pipeline Editor, ...
■  Extensions: App Store, Access Control, Billing
■  In-Memory Database: Pipeline Data, Genome Data, Pathways, Genome Metadata, Papers, Pipeline Models, Analytical Tools
The Future:
Combined Information Requirements
Enable clinicians to:
■  Make evidence-based therapy decisions at the patient’s bedside
■  Exchange latest patient data with international experts
Enable researchers to:
■  Investigate genomes of patient cohorts to derive new knowledge
■  Analyze results in real time
Enable patients to:
■  Identify risk factors long before they turn into diseases
■  Identify experts and similar patient cases to bring up alternatives for individual therapies
Thank you for your interest!
Keep in contact with us.
SAP Labs LLC
Dr. Anja Bog
3410 Hillview Avenue
94304 Palo Alto, CA
anja.bog@sap.com
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
Dr. Matthieu-P. Schapranow
August-Bebel-Str. 88
14482 Potsdam, Germany
schapranow@hpi.uni-potsdam.de
http://j.mp/schapranow
