SlideShare ist ein Scribd-Unternehmen logo
1 von 45
Downloaden Sie, um offline zu lesen
Hadoop as a Platform
for Genomics
@AllenDay, Chief Scientist
Sungwook Yoon, Data Scientist
Data Science @MapR
®
© 2014 MapR Technologies 2
DNA Sequencing, pre-2004
years
CPU
transistors/mm2
HDD
GB/mm2
DNA
bp/$, pre-2004
®
© 2014 MapR Technologies 3
DNA Sequencing, 2004 Disruption
years
CPU
transistors/mm2
HDD
GB/mm2DNA
bp/$, post-2004
DNA
bp/$, pre-2004
®
© 2014 MapR Technologies 4
DNA Sequencing, 2004 Disruption
years
CPU
transistors/mm2
HDD
GB/mm2DNA
bp/$, post-2004
DNA
bp/$, pre-2004
Similar disruption occurred for
Internet traffic in mid-1990s
®
© 2014 MapR Technologies 5
Effect: Many DNA-Based Apps Coming…
•  2014: US$ 2B, mostly
research, mostly
chemical costs
•  2020: US$ 20B,
mostly clinical, mostly
analytics costs
Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning
0
5
10
15
20
25
2014 2020
Clinical
Non-Clinical
®
© 2014 MapR Technologies 6© 2014 MapR Technologies
®
1. What Kind of Analytics Apps?
2. How do they Work?
®
© 2014 MapR Technologies 7
Target Audience
•  Fluency in computing, math
•  Basic knowledge of genetics, DNA
…so expect some encapsulated complexity
http://xkcd.com/803/
®
© 2014 MapR Technologies 8
Clinical Sequencing Business Process Workflow
PhysicianPatient
Clinic
blood/saliva
Clinical Lab
Analytics
extract
®
© 2014 MapR Technologies 9
Step 1: Identify all the Single Nucleotide Polymorphisms
•  Currently ~12MM known SNPs
•  Each person has a unique Genotype
–  Typically 3-5MM SNPs
–  Relative to a reference human
–  diff this.human other.human,
essentially
•  Inherited from parents
•  Inexpensive to find as sequencing costs
have plummeted
http://learn.genetics.utah.edu/content/pharma/snips/
®
© 2014 MapR Technologies 10
Step 2: Characterize all the SNPs (ML, AI)
Other data &
algorithms
JOIN
®
© 2014 MapR Technologies 11
Innovation
Opportunities
Pop.
Freq
Drug A
Response
Drug B
Response
10% Good Good
30% Poor Fair
30% Excellent Poor
30% Good, but
Toxic
Fair
“Nil nocere” – do no harm
Step 3: Use Genotype to Customize Therapy
®
© 2014 MapR Technologies 12
Jan 30: Obama Unveils “Precision Medicine” Initiative
“Most medical
treatments have
been designed for
the ‘average
patient’ …
treatments can be
very successful
for some patients
but not for
others.”
http://www.msnbc.com/msnbc/obama-seeks-215-million-personalized-medicine
®
© 2014 MapR Technologies 13
Application: Forensic Analysis
http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/
http://snapshot.parabon-nanolabs.com/
http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
®
© 2014 MapR Technologies 14
http://steamcommunity.com/app/203160/discussions/0/846956188647169800/
http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law
Moore’s Law #Dataviz:
Lara Croft 230=>40,000 Polygons (1996-2014)
®
© 2014 MapR Technologies 15© 2014 MapR Technologies
®
1. What Kind of Analytics Apps?
2. How do they Work?
®
© 2014 MapR Technologies 16
Genome Sequencing in a Nutshell
Reference HumanPatient
Reference Genome
¢
¢
¢
¢
¢
¢
¢
De novo sequencing + assemblyResequencing
Patient Genotype
®
© 2014 MapR Technologies 17
Population-Scale Genome Biobanking
®
© 2014 MapR Technologies 18
GATK: Typical Tool for DNA=>Genotype Conversion
Advantages
•  No consensus alternative… yet
•  Works!
•  Already deployed and being used to save lives
Disadvantages
•  Map-Reduce but not Hadoop (and no plans to support)
•  Compute context cannot span multiple nodes
•  Inefficient use of shared memory (even within one node)
•  Inefficient asymmetric joins. No leverage of context, data locality
®
© 2014 MapR Technologies 19
GATK: flat after
chromosome split
®
© 2014 MapR Technologies 20
Big Picture
N DNA
Input
Records
All SNPs
Catalog still growing;
Genotype space huge ≫ 8E37
Personal input is
fixed N records and
trivial to cut into
P partitions
GG
A good implementation: scales O(N) ~ F(N,P)
But GATK is SLOW: scales O(N) ~ F(Genotypes)
GATK parallelization metrics / DEAD END attempts:
https://github.com/allenday/sequencing-utils
®
© 2014 MapR Technologies 21
Bigger Picture: Human Suffering
•  Widely disliked. Reduction of suffering is good business.
Even Bigger
•  Is it morally wrong to allow others to suffer?
•  If you agree, and there’s a way to reduce suffering,
then…
•  We can argue there is a moral imperative to build the
most efficient, dependable, inexpensive solution possible
®
© 2014 MapR Technologies 22© 2014 MapR Technologies
®
From Feasible to Easy & Efficient
®
© 2014 MapR Technologies 23
Two Phases of Genome Data Analysis
•  Batch Sequence Processing
–  Align the reads to correct location
–  Make correct Variants detection through statistical modeling
•  Genome / Phenome Data Analysis
–  Find relevant Genotypes for Phenotypes
–  Find relevant Phenotypes for Genotypes
®
© 2014 MapR Technologies 24
Genome Processing Requirements
Big Storage Big Memory Algorithms
Sorting
Group By
Clustering
Sparse
Matrix
Distributed
Processing
Which Free SW Has This Solution?
2TB per
person
Affordable
Hardware
Forward
Backward
®
© 2014 MapR Technologies 25
Genome Processing Needs More Than Hadoop
•  Strong In Memory Computation
•  Strong Sparse Matrix Computation
Which Free SW Has This Solution?
®
© 2014 MapR Technologies 26
Still One More
Genome Data
Format Definition
(A 1 Z)
(B 1 Z)
(C 1 Z)
A 1 Z B 1 Z C 1 Z
A B C 1 1 1 Z Z Z
Record 1
Record 2
Record 3
RowBased
ColBased
Sorting
Group
MLLib
®
© 2014 MapR Technologies 27
Compute Engines
Data Workflow
Adam Pipeline
FastQ BAM ADAM
ADAM-
VCF
VCF
AvocadoADAM ADAMAligner
Super Fast
•  In-memory
•  Scalable compute
context
Pipeline in Genomics
Data Workflow, a sequence of
data transformation from DNA
sequence read to Variant Calls
®
© 2014 MapR Technologies 28
Scale with Machines
From ADAM Tech Report
®
© 2014 MapR Technologies 29
That’s A lot but it just is a start
•  Why do we want sequencing?
– To catch criminals ??
•  Police State??
•  Deeper wider genome study may reveal
– Future medicine
– Cure for diseases
– Maybe … find Heroes??
®
© 2014 MapR Technologies 30
Variants Accumulate – Need a Scalable Variant Store
ADAM
ADAM-
VCF
®
© 2014 MapR Technologies 31
Genome × Phenome Analysis
For given population,
given SNP 𝛿, and
given phenotype ϕ:
Count the number
of occurrences as the
value of the matrix
𝛿5
ϕ5 ϕ3 ϕ1
𝛿3
𝛿1
SPARSE Billion + Phenotypes
SPARSEBillion+Genotypes
®
© 2014 MapR Technologies 32
Interpreting Genome × Phenome Matrix Factorization
Result
•  Row Vectors of X represents
–  Archetype set of phenotypes
•  Column vectors of Y represents
–  Archetype set of genotypes
𝛿5
ϕ5 ϕ3 ϕ1
𝛿3
𝛿1
Principal
Column
Vector
Archetype
Genotypes
Archetype
Phenotypes
Principal
Row
Vector
Sparse Matrix
Package is Actively
Developed in Spark
Community
®
© 2014 MapR Technologies 33
Toward Heroes : Genome × Phenome Tensor
•  Aggregating over individuals with matrix could ignore the
correlations among genotypes and phenotypes
•  Maintain individual identity
Variants
Phenotypes
Variants
Phenotypes
®
© 2014 MapR Technologies 34
Tensor Factorization (Parafac)
Genome
Variants
Phenome ≈
Principal
Variants1
Principal
Phenotypes1
®
© 2014 MapR Technologies 35© 2014 MapR Technologies
®
From Imaginable to Possible
®
© 2014 MapR Technologies 36
Genome needs Hadoop
Variant
Calling
DNA
Sequencer
Reads
Reference
Genome
Genotype/
Phenotype/
Individual
Matrix
Cure &
Prevent
Disease
Medical
Records
Patient
®
© 2014 MapR Technologies 37
Scalable Variant Store – Data Mining
Model P ~ F(G)
Fortunately, this has already been done…
Genotypes Med Record Phenotypes, e.g.
disease risk, drug response
®
© 2014 MapR Technologies 38
Largest Biometric Database in the World
PEOPLE
1.2B
PEOPLE
®
© 2014 MapR Technologies 39
Why Create Aadhaar?
•  India: 1.2 billion residents
–  640,000 villages, ~60% lives under $2/day
–  ~75% literacy, <3% pay income tax, <20% have bank accounts
–  ~800 million mobile, ~200-300 million migrant workers
•  Govt. spends about $25-40 billion on direct subsidies
–  Residents have no standard identity document
–  Most programs plagued with ghost and multiple identities causing
leakage of 30-40%
Standardize identity => Stop leakage
®
© 2014 MapR Technologies 40
Aadhaar Biometric Capture & Index
Raw
Digital
Fingerprint
®
© 2014 MapR Technologies 41
Aadhaar Biometric ID Creation
F(x): unique features
G(x): uncommon features
H(x): other features
•  900MM people loaded in 4
years
•  In production
–  1MM registrations/day
–  200+ trillion lookups/day
•  All built on MapR-DB (HBase)
®
© 2014 MapR Technologies 42
How Does this Relate to Genomics?
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
•  Aadhaar: 1B humans, 5MB minutia
•  Genome: 7B humans, ~3M variants
®
© 2014 MapR Technologies 43
How Does this Relate to Genomics?
F-1(x): common features
F(x): unique features
G(x): uncommon features
H(x): other features
Same data shape and size
•  Aadhaar: 1B humans, 5MB minutia
•  Genome: 6B humans, ~3M variants
•  Genome: variant × phenotype
•  Common variant => effect-causing
gene F-1(x) !
Same data set operations
®
© 2014 MapR Technologies 44
Genotype/
Phenotype/
Individual
Matrix
≈
individuals
fingerprint minutiae
Find genetic basis
of fingerprints
medicalrecords
genetic variants
Find genetic basis
of disease
© 2014 MapR Technologies, confidential
®
Thanks!
Questions?
@allenday, @mapr
aday@mapr.com, syoon@mapr.com
linkedin.com/in/allenday

Más contenido relacionado

Was ist angesagt?

MapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data PlatformMapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data PlatformMapR Technologies
 
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...MapR Technologies
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
Best Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformBest Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformMapR Technologies
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentDataWorks Summit
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07Ted Dunning
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR Technologies
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series DatabaseDataWorks Summit
 
Big data today and tomorrow
Big data today and tomorrowBig data today and tomorrow
Big data today and tomorrowmagda3695
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Carol McDonald
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systemsinside-BigData.com
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputinginside-BigData.com
 

Was ist angesagt? (20)

MapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data PlatformMapR Streams and MapR Converged Data Platform
MapR Streams and MapR Converged Data Platform
 
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
MapR 5.2 Product Update
MapR 5.2 Product UpdateMapR 5.2 Product Update
MapR 5.2 Product Update
 
Best Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data PlatformBest Practices for Protecting Sensitive Data Across the Big Data Platform
Best Practices for Protecting Sensitive Data Across the Big Data Platform
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Realistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure DevelopmentRealistic Synthetic Generation Allows Secure Development
Realistic Synthetic Generation Allows Secure Development
 
Boston hug-2012-07
Boston hug-2012-07Boston hug-2012-07
Boston hug-2012-07
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
MapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community EditionMapR 5.2: Getting More Value from the MapR Converged Community Edition
MapR 5.2: Getting More Value from the MapR Converged Community Edition
 
Real World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in ProductionReal World Use Cases: Hadoop and NoSQL in Production
Real World Use Cases: Hadoop and NoSQL in Production
 
Dealing with an Upside Down Internet With High Performance Time Series Database
Dealing with an Upside Down Internet  With High Performance Time Series DatabaseDealing with an Upside Down Internet  With High Performance Time Series Database
Dealing with an Upside Down Internet With High Performance Time Series Database
 
Big data today and tomorrow
Big data today and tomorrowBig data today and tomorrow
Big data today and tomorrow
 
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
Applying Machine learning to IOT: End to End Distributed Distributed Pipeline...
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale SystemsDesigning HPC, Deep Learning, and Cloud Middleware for Exascale Systems
Designing HPC, Deep Learning, and Cloud Middleware for Exascale Systems
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
High Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for SupercomputingHigh Availability HPC ~ Microservice Architectures for Supercomputing
High Availability HPC ~ Microservice Architectures for Supercomputing
 

Andere mochten auch

Hadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND OperationsHadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND OperationsMapR Technologies
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
Tamr presentation
Tamr presentationTamr presentation
Tamr presentationAdam Hasler
 
Powering the "As it Happens" Business
Powering the "As it Happens" BusinessPowering the "As it Happens" Business
Powering the "As it Happens" BusinessMapR Technologies
 
Tamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael StonebrakerTamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael StonebrakerTamr_Inc
 
Tamr | Making enterprise elephants dance @ boston data festival
Tamr | Making enterprise elephants dance @ boston data festival Tamr | Making enterprise elephants dance @ boston data festival
Tamr | Making enterprise elephants dance @ boston data festival Tamr_Inc
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 

Andere mochten auch (9)

Hadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND OperationsHadoop: Revolutionizing Analytics AND Operations
Hadoop: Revolutionizing Analytics AND Operations
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Tamr presentation
Tamr presentationTamr presentation
Tamr presentation
 
Powering the "As it Happens" Business
Powering the "As it Happens" BusinessPowering the "As it Happens" Business
Powering the "As it Happens" Business
 
Tamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael StonebrakerTamr | Strata hadoop 2014 Michael Stonebraker
Tamr | Strata hadoop 2014 Michael Stonebraker
 
10c introduction
10c introduction10c introduction
10c introduction
 
Tamr | Making enterprise elephants dance @ boston data festival
Tamr | Making enterprise elephants dance @ boston data festival Tamr | Making enterprise elephants dance @ boston data festival
Tamr | Making enterprise elephants dance @ boston data festival
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 

Ähnlich wie Hadoop as a Platform for Genomics

Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Allen Day, PhD
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleJulius Remigio, CBIP
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIAllen Day, PhD
 
MapR Edge : Act Locally Learn Globally
MapR Edge : Act Locally Learn GloballyMapR Edge : Act Locally Learn Globally
MapR Edge : Act Locally Learn Globallyridhav
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Allen Day, PhD
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersAllen Day, PhD
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital TransformationMapR Technologies
 
Map r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetupMap r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetupAlan Iovine
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19Ahmed Elsayed
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataMathieu Dumoulin
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksJustin Brandenburg
 

Ähnlich wie Hadoop as a Platform for Genomics (20)

Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San Jose
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
 
MapR Edge : Act Locally Learn Globally
MapR Edge : Act Locally Learn GloballyMapR Edge : Act Locally Learn Globally
MapR Edge : Act Locally Learn Globally
 
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]
 
Genomics Crash Course for Data Engineers
Genomics Crash Course for Data EngineersGenomics Crash Course for Data Engineers
Genomics Crash Course for Data Engineers
 
The Keys to Digital Transformation
The Keys to Digital TransformationThe Keys to Digital Transformation
The Keys to Digital Transformation
 
Expect More from Hadoop
Expect More from Hadoop Expect More from Hadoop
Expect More from Hadoop
 
Map r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetupMap r chicago_advanalytics_oct_meetup
Map r chicago_advanalytics_oct_meetup
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
Analyzing Big data in R and Scala using Apache Spark 17-7-19
Analyzing Big data in R and Scala using Apache Spark  17-7-19Analyzing Big data in R and Scala using Apache Spark  17-7-19
Analyzing Big data in R and Scala using Apache Spark 17-7-19
 
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor DataState of the Art Robot Predictive Maintenance with Real-time Sensor Data
State of the Art Robot Predictive Maintenance with Real-time Sensor Data
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
Predictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural NetworksPredictive Maintenance Using Recurrent Neural Networks
Predictive Maintenance Using Recurrent Neural Networks
 

Mehr von MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainMapR Technologies
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0MapR Technologies
 

Mehr von MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 
Evolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and RainEvolving Beyond the Data Lake: A Story of Wind and Rain
Evolving Beyond the Data Lake: A Story of Wind and Rain
 
Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0Open Source Innovations in the MapR Ecosystem Pack 2.0
Open Source Innovations in the MapR Ecosystem Pack 2.0
 

Último

2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdfThe Good Food Institute
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInThousandEyes
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2DianaGray10
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024Brian Pichman
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1DianaGray10
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.IPLOOK Networks
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingMAGNIntelligence
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosErol GIRAUDY
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updateadam112203
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)IES VE
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxSatishbabu Gunukula
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch TuesdayIvanti
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationKnoldus Inc.
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...DianaGray10
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applicationsnooralam814309
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIVijayananda Mohire
 

Último (20)

2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf2024.03.12 Cost drivers of cultivated meat production.pdf
2024.03.12 Cost drivers of cultivated meat production.pdf
 
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedInOutage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
Outage Analysis: March 5th/6th 2024 Meta, Comcast, and LinkedIn
 
UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2UiPath Studio Web workshop series - Day 2
UiPath Studio Web workshop series - Day 2
 
AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024AI Workshops at Computers In Libraries 2024
AI Workshops at Computers In Libraries 2024
 
UiPath Studio Web workshop series - Day 1
UiPath Studio Web workshop series  - Day 1UiPath Studio Web workshop series  - Day 1
UiPath Studio Web workshop series - Day 1
 
Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.Introduction - IPLOOK NETWORKS CO., LTD.
Introduction - IPLOOK NETWORKS CO., LTD.
 
IT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced ComputingIT Service Management (ITSM) Best Practices for Advanced Computing
IT Service Management (ITSM) Best Practices for Advanced Computing
 
Scenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenariosScenario Library et REX Discover industry- and role- based scenarios
Scenario Library et REX Discover industry- and role- based scenarios
 
Patch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 updatePatch notes explaining DISARM Version 1.4 update
Patch notes explaining DISARM Version 1.4 update
 
The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)The Importance of Indoor Air Quality (English)
The Importance of Indoor Air Quality (English)
 
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through TokenizationStobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
Stobox 4: Revolutionizing Investment in Real-World Assets Through Tokenization
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Oracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptxOracle Database 23c Security New Features.pptx
Oracle Database 23c Security New Features.pptx
 
March Patch Tuesday
March Patch TuesdayMarch Patch Tuesday
March Patch Tuesday
 
Introduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its applicationIntroduction to RAG (Retrieval Augmented Generation) and its application
Introduction to RAG (Retrieval Augmented Generation) and its application
 
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie WorldTrustArc Webinar - How to Live in a Post Third-Party Cookie World
TrustArc Webinar - How to Live in a Post Third-Party Cookie World
 
Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...Explore the UiPath Community and ways you can benefit on your journey to auto...
Explore the UiPath Community and ways you can benefit on your journey to auto...
 
The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
Graphene Quantum Dots-Based Composites for Biomedical Applications
Graphene Quantum Dots-Based Composites for  Biomedical ApplicationsGraphene Quantum Dots-Based Composites for  Biomedical Applications
Graphene Quantum Dots-Based Composites for Biomedical Applications
 
My key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAIMy key hands-on projects in Quantum, and QAI
My key hands-on projects in Quantum, and QAI
 

Hadoop as a Platform for Genomics

  • 1. Hadoop as a Platform for Genomics @AllenDay, Chief Scientist Sungwook Yoon, Data Scientist Data Science @MapR
  • 2. ® © 2014 MapR Technologies 2 DNA Sequencing, pre-2004 years CPU transistors/mm2 HDD GB/mm2 DNA bp/$, pre-2004
  • 3. ® © 2014 MapR Technologies 3 DNA Sequencing, 2004 Disruption years CPU transistors/mm2 HDD GB/mm2DNA bp/$, post-2004 DNA bp/$, pre-2004
  • 4. ® © 2014 MapR Technologies 4 DNA Sequencing, 2004 Disruption years CPU transistors/mm2 HDD GB/mm2DNA bp/$, post-2004 DNA bp/$, pre-2004 Similar disruption occurred for Internet traffic in mid-1990s
  • 5. ® © 2014 MapR Technologies 5 Effect: Many DNA-Based Apps Coming… •  2014: US$ 2B, mostly research, mostly chemical costs •  2020: US$ 20B, mostly clinical, mostly analytics costs Macquarie Capital, 2014. Genomics 2.0: It’s just the beginning 0 5 10 15 20 25 2014 2020 Clinical Non-Clinical
  • 6. ® © 2014 MapR Technologies 6© 2014 MapR Technologies ® 1. What Kind of Analytics Apps? 2. How do they Work?
  • 7. ® © 2014 MapR Technologies 7 Target Audience •  Fluency in computing, math •  Basic knowledge of genetics, DNA …so expect some encapsulated complexity http://xkcd.com/803/
  • 8. ® © 2014 MapR Technologies 8 Clinical Sequencing Business Process Workflow PhysicianPatient Clinic blood/saliva Clinical Lab Analytics extract
  • 9. ® © 2014 MapR Technologies 9 Step 1: Identify all the Single Nucleotide Polymorphisms •  Currently ~12MM known SNPs •  Each person has a unique Genotype –  Typically 3-5MM SNPs –  Relative to a reference human –  diff this.human other.human, essentially •  Inherited from parents •  Inexpensive to find as sequencing costs have plummeted http://learn.genetics.utah.edu/content/pharma/snips/
  • 10. ® © 2014 MapR Technologies 10 Step 2: Characterize all the SNPs (ML, AI) Other data & algorithms JOIN
  • 11. ® © 2014 MapR Technologies 11 Innovation Opportunities Pop. Freq Drug A Response Drug B Response 10% Good Good 30% Poor Fair 30% Excellent Poor 30% Good, but Toxic Fair “Nil nocere” – do no harm Step 3: Use Genotype to Customize Therapy
  • 12. ® © 2014 MapR Technologies 12 Jan 30: Obama Unveils “Precision Medicine” Initiative “Most medical treatments have been designed for the ‘average patient’ … treatments can be very successful for some patients but not for others.” http://www.msnbc.com/msnbc/obama-seeks-215-million-personalized-medicine
  • 13. ® © 2014 MapR Technologies 13 Application: Forensic Analysis http://cgi.uconn.edu/stranger-visions-forensic-art-exhibit/ http://snapshot.parabon-nanolabs.com/ http://www.nature.com/news/mugshots-built-from-dna-data-1.14899
  • 14. ® © 2014 MapR Technologies 14 http://steamcommunity.com/app/203160/discussions/0/846956188647169800/ http://www.vox.com/2015/2/1/7955921/lara-croft-moores-law Moore’s Law #Dataviz: Lara Croft 230=>40,000 Polygons (1996-2014)
  • 15. ® © 2014 MapR Technologies 15© 2014 MapR Technologies ® 1. What Kind of Analytics Apps? 2. How do they Work?
  • 16. ® © 2014 MapR Technologies 16 Genome Sequencing in a Nutshell Reference HumanPatient Reference Genome ¢ ¢ ¢ ¢ ¢ ¢ ¢ De novo sequencing + assemblyResequencing Patient Genotype
  • 17. ® © 2014 MapR Technologies 17 Population-Scale Genome Biobanking
  • 18. ® © 2014 MapR Technologies 18 GATK: Typical Tool for DNA=>Genotype Conversion Advantages •  No consensus alternative… yet •  Works! •  Already deployed and being used to save lives Disadvantages •  Map-Reduce but not Hadoop (and no plans to support) •  Compute context cannot span multiple nodes •  Inefficient use of shared memory (even within one node) •  Inefficient asymmetric joins. No leverage of context, data locality
  • 19. ® © 2014 MapR Technologies 19 GATK: flat after chromosome split
  • 20. ® © 2014 MapR Technologies 20 Big Picture N DNA Input Records All SNPs Catalog still growing; Genotype space huge ≫ 8E37 Personal input is fixed N records and trivial to cut into P partitions GG A good implementation: scales O(N) ~ F(N,P) But GATK is SLOW: scales O(N) ~ F(Genotypes) GATK parallelization metrics / DEAD END attempts: https://github.com/allenday/sequencing-utils
  • 21. ® © 2014 MapR Technologies 21 Bigger Picture: Human Suffering •  Widely disliked. Reduction of suffering is good business. Even Bigger •  Is it morally wrong to allow others to suffer? •  If you agree, and there’s a way to reduce suffering, then… •  We can argue there is a moral imperative to build the most efficient, dependable, inexpensive solution possible
  • 22. ® © 2014 MapR Technologies 22© 2014 MapR Technologies ® From Feasible to Easy & Efficient
  • 23. ® © 2014 MapR Technologies 23 Two Phases of Genome Data Analysis •  Batch Sequence Processing –  Align the reads to correct location –  Make correct Variants detection through statistical modeling •  Genome / Phenome Data Analysis –  Find relevant Genotypes for Phenotypes –  Find relevant Phenotypes for Genotypes
  • 24. ® © 2014 MapR Technologies 24 Genome Processing Requirements Big Storage Big Memory Algorithms Sorting Group By Clustering Sparse Matrix Distributed Processing Which Free SW Has This Solution? 2TB per person Affordable Hardware Forward Backward
  • 25. ® © 2014 MapR Technologies 25 Genome Processing Needs More Than Hadoop •  Strong In Memory Computation •  Strong Sparse Matrix Computation Which Free SW Has This Solution?
  • 26. ® © 2014 MapR Technologies 26 Still One More Genome Data Format Definition (A 1 Z) (B 1 Z) (C 1 Z) A 1 Z B 1 Z C 1 Z A B C 1 1 1 Z Z Z Record 1 Record 2 Record 3 RowBased ColBased Sorting Group MLLib
  • 27. ® © 2014 MapR Technologies 27 Compute Engines Data Workflow Adam Pipeline FastQ BAM ADAM ADAM- VCF VCF AvocadoADAM ADAMAligner Super Fast •  In-memory •  Scalable compute context Pipeline in Genomics Data Workflow, a sequence of data transformation from DNA sequence read to Variant Calls
  • 28. ® © 2014 MapR Technologies 28 Scale with Machines From ADAM Tech Report
  • 29. ® © 2014 MapR Technologies 29 That’s A lot but it just is a start •  Why do we want sequencing? – To catch criminals ?? •  Police State?? •  Deeper wider genome study may reveal – Future medicine – Cure for diseases – Maybe … find Heroes??
  • 30. ® © 2014 MapR Technologies 30 Variants Accumulate – Need a Scalable Variant Store ADAM ADAM- VCF
  • 31. ® © 2014 MapR Technologies 31 Genome × Phenome Analysis For given population, given SNP 𝛿, and given phenotype ϕ: Count the number of occurrences as the value of the matrix 𝛿5 ϕ5 ϕ3 ϕ1 𝛿3 𝛿1 SPARSE Billion + Phenotypes SPARSEBillion+Genotypes
  • 32. ® © 2014 MapR Technologies 32 Interpreting Genome × Phenome Matrix Factorization Result •  Row Vectors of X represents –  Archetype set of phenotypes •  Column vectors of Y represents –  Archetype set of genotypes 𝛿5 ϕ5 ϕ3 ϕ1 𝛿3 𝛿1 Principal Column Vector Archetype Genotypes Archetype Phenotypes Principal Row Vector Sparse Matrix Package is Actively Developed in Spark Community
  • 33. ® © 2014 MapR Technologies 33 Toward Heroes : Genome × Phenome Tensor •  Aggregating over individuals with matrix could ignore the correlations among genotypes and phenotypes •  Maintain individual identity Variants Phenotypes Variants Phenotypes
  • 34. ® © 2014 MapR Technologies 34 Tensor Factorization (Parafac) Genome Variants Phenome ≈ Principal Variants1 Principal Phenotypes1
  • 35. ® © 2014 MapR Technologies 35© 2014 MapR Technologies ® From Imaginable to Possible
  • 36. ® © 2014 MapR Technologies 36 Genome needs Hadoop Variant Calling DNA Sequencer Reads Reference Genome Genotype/ Phenotype/ Individual Matrix Cure & Prevent Disease Medical Records Patient
  • 37. ® © 2014 MapR Technologies 37 Scalable Variant Store – Data Mining Model P ~ F(G) Fortunately, this has already been done… Genotypes Med Record Phenotypes, e.g. disease risk, drug response
  • 38. ® © 2014 MapR Technologies 38 Largest Biometric Database in the World PEOPLE 1.2B PEOPLE
  • 39. ® © 2014 MapR Technologies 39 Why Create Aadhaar? •  India: 1.2 billion residents –  640,000 villages, ~60% lives under $2/day –  ~75% literacy, <3% pay income tax, <20% have bank accounts –  ~800 million mobile, ~200-300 million migrant workers •  Govt. spends about $25-40 billion on direct subsidies –  Residents have no standard identity document –  Most programs plagued with ghost and multiple identities causing leakage of 30-40% Standardize identity => Stop leakage
  • 40. ® © 2014 MapR Technologies 40 Aadhaar Biometric Capture & Index Raw Digital Fingerprint
  • 41. ® © 2014 MapR Technologies 41 Aadhaar Biometric ID Creation F(x): unique features G(x): uncommon features H(x): other features •  900MM people loaded in 4 years •  In production –  1MM registrations/day –  200+ trillion lookups/day •  All built on MapR-DB (HBase)
  • 42. ® © 2014 MapR Technologies 42 How Does this Relate to Genomics? F(x): unique features G(x): uncommon features H(x): other features Same data shape and size •  Aadhaar: 1B humans, 5MB minutia •  Genome: 7B humans, ~3M variants
  • 43. ® © 2014 MapR Technologies 43 How Does this Relate to Genomics? F-1(x): common features F(x): unique features G(x): uncommon features H(x): other features Same data shape and size •  Aadhaar: 1B humans, 5MB minutia •  Genome: 6B humans, ~3M variants •  Genome: variant × phenotype •  Common variant => effect-causing gene F-1(x) ! Same data set operations
  • 44. ® © 2014 MapR Technologies 44 Genotype/ Phenotype/ Individual Matrix ≈ individuals fingerprint minutiae Find genetic basis of fingerprints medicalrecords genetic variants Find genetic basis of disease
  • 45. © 2014 MapR Technologies, confidential ® Thanks! Questions? @allenday, @mapr aday@mapr.com, syoon@mapr.com linkedin.com/in/allenday