SlideShare ist ein Scribd-Unternehmen logo
1 von 54
© 2014 MapR Technologies 1
© 2014 MapR Technologies 2
Biomedical & Advertising Tech Overarching Themes*
*Obligatory movie references… shout-out to my hometown LA
Eugenics & Determinism Free will vs. Determinism Media Tech & Privacy
© 2014 MapR Technologies 3
Biomedical Research Goal:
Therapeutics => Diagnostics => Prognostics
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires hi-res mechanical knowledge
– Reverse engineer how genetic variation leads to (un)desired traits
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics
© 2014 MapR Technologies 4Star Wars III: Revenge of the Sith
© 2014 MapR Technologies 5Star Wars V: The Empire Strikes Back
© 2014 MapR Technologies 6
Genetic Basis of Facial Features
self-reported values of {sex, ancestry}
+ observer scores [race, sex]}
+ 3D facial scan
+ genome scan
______________________________
Allelic model of 20 genes that
determine facial characteristics
Claes, et al. 2014. Modeling 3D Facial Shape from DNA
© 2014 MapR Technologies 7
Genetic Basis of Facial Features
Claes, et al. 2014. Modeling 3D Facial Shape from DNA
© 2014 MapR Technologies 8
So Get Ready…
www.theness.com
© 2014 MapR Technologies 9© 2014 MapR Technologies
Genomics Crash Course for Data Engineers
© 2014 MapR Technologies 10
Me, Us
• Allen Day, Principal Data Scientist, MapR
5yr Hadoop Dev, R project contributor
PhD, Human Genetics, UCLA Medicine
• MapR
Distributes open source components for Hadoop
Adds major technology for performance, HA, industry standard API’s
• See Also
– “allenday” most places (twitter, github, etc.)
– @mapR
© 2014 MapR Technologies 11
Clinical Sequencing Business Process Workflow
PhysicianPatient
Clinic
blood/saliva
Clinical Lab
Analytics
extract
© 2014 MapR Technologies 12
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in
your brain. When methylfolate levels are low,
so are your neurotransmitters. Low production
of neurotransmitters may cause conditions of
addictive behavior, depression, anxiety,
ADHD, mania, irritability, insomnia, learning
disorders and others.
Everyone should get tested. Why? Because 1
in 2 people are affected and if one knows they
have a MTHFR polymorphism, they know they
have to be very proactive in taking care of
themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-
Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-
Health.htm
© 2014 MapR Technologies 13
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in
your brain. When methylfolate levels are low,
so are your neurotransmitters. Low production
of neurotransmitters may cause conditions of
addictive behavior, depression, anxiety,
ADHD, mania, irritability, insomnia, learning
disorders and others.
Everyone should get tested. Why? Because 1
in 2 people are affected and if one knows they
have a MTHFR polymorphism, they know they
have to be very proactive in taking care of
themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-
Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-
Health.htm
© 2014 MapR Technologies 14
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in
your brain. When methylfolate levels are low,
so are your neurotransmitters. Low production
of neurotransmitters may cause conditions of
addictive behavior, depression, anxiety,
ADHD, mania, irritability, insomnia, learning
disorders and others.
Everyone should get tested. Why? Because 1
in 2 people are affected and if one knows they
have a MTHFR polymorphism, they know they
have to be very proactive in taking care of
themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-
Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-
Health.htm
© 2014 MapR Technologies 15
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in
your brain. When methylfolate levels are low,
so are your neurotransmitters. Low production
of neurotransmitters may cause conditions of
addictive behavior, depression, anxiety,
ADHD, mania, irritability, insomnia, learning
disorders and others.
Everyone should get tested. Why? Because 1
in 2 people are affected and if one knows they
have a MTHFR polymorphism, they know they
have to be very proactive in taking care of
themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-
Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-
Health.htm
© 2014 MapR Technologies 16
Clinical Sequencing Business Process Workflow
PhysicianPatient
Clinic
blood/saliva
Clinical Lab
Analytics
extract
© 2014 MapR Technologies 17
Clinical Genomics, Information Systems Perspective
Compressed Structured
Base4 Data
Uncompressed Unstructured
Base2 Data
extract
Base4=>Base2
Converter
[[ DE-STRUCTURES ]]
“BI” Reporting and
Visualization tools
PhysicianPatient
AnalystStakeholder
© 2014 MapR Technologies 18
Clinical Genomics, Information Systems Perspective
PhysicianPatient
AnalystStakeholder
ETL
Reporting and Viz
Data Store
Analytics
© 2014 MapR Technologies 19
Sequencing “Even Moore’s Law”
Stein. 2010. The case for cloud computing in genome informatics
© 2014 MapR Technologies 20
The Evolving Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= 1º analytics
“current high ROI use cases”
<= 2º analytics
“next-gen high ROI use cases”
© 2014 MapR Technologies 21
Clinical Genomics, Information Systems Perspective
PhysicianPatient
AnalystStakeholder
ETL
Reporting and Viz
Data Store
Analytics
1º analytics
2º analytics
Not much in this presentation,
see also:
http://slidesha.re/1sC2BOX
© 2014 MapR Technologies 22
Sequence Analysis, Quick Partial Details
[…] G A C T A G A fragment1
A C A G T T T A C A fragment2
A G A T A - - A G A fragment3
A A C A G C T T A C A […] fragment4
C T A T A G A T A A fragment5
[…] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA
[…] G A C T A C A G A T A A C A G A T T A C A […] patient__DNA
© 2014 MapR Technologies 23
What is the (Probable) Color of Each Column?
© 2014 MapR Technologies 24
Which Columns are (probably) Not White?
Strategy 1: examine foreach column, foreach row O(rows*cols)
+ O(1 col) memory
© 2014 MapR Technologies 25
Which Columns are (probably) Not White?
Strategy 2: examine foreach row. keep running tallies O(rows)
+ O(rows*cols) memory
© 2014 MapR Technologies 26
Which Columns are (probably) Not White?
Strategy 3: rotate matrix. examine foreach column O(rows log rows)
+ O(cols)
+ O(1 col) memory
© 2014 MapR Technologies 27
Comparison of Strategies
Strategy 1
• Low mem req
• Random access
pattern, many ops
Strategy 3
• Low mem req
• Sequential access
pattern
• Requires Sort
Strategy 2
• High mem req
• Sequential access
pattern
O(rows*cols)
+ O(1 col) memory
O(rows)
+ O(rows*cols) memory
O(rows log rows)
+ O(cols)
+ O(1 col) memory
© 2014 MapR Technologies 28
Comparison of Strategies
Strategy 1
• Low mem req
• Random access
pattern, many ops
Strategy 3
• Low mem req
• Sequential access
pattern
• Requires Sort
Strategy 2
• High mem req
• Sequential access
pattern
O(rows*cols)
+ O(1 col) memory
O(rows)
+ O(rows*cols) memory
O(rows log rows) ÷ shards
+ O(cols) ÷ shards
+ O(1 col) memory
As # of rows & columns increases
Strategy 3 becomes more attractive
© 2014 MapR Technologies 29
1º Sequence Analysis (ETL), MapReduce style
.fastq .bam .vcf
short read
alignment
genotype
calling
MAP
MAP
REDUCE, rotate matrix 90º
(O(mn)) / 1 (O(mn) + O(n log n)) / s
© 2014 MapR Technologies 30
Crossbow (MapReduce Strategy, implemented)
Langmead, et al. 2009. Searching for SNPs with cloud computing
© 2014 MapR Technologies 31
Ion Flux (MapReduce Strategy, implemented for Enterprise)
• Sequencing workflow in MapReduce (Hadoop, Cascading,
Amazon Elastic M/R)
• Integrated with Ion Torrent as a plugin to stream sequence to the
cloud
• Emphasis on scalability and latency
– assay->clinical report turnaround in < 24h
• Compare to fast-follower stack ILMN MiSeq+BaseSpace
http://aws.amazon.com/solutions/case-studies/ion-flux/
http://ionflux.com
© 2014 MapR Technologies 32© 2014 MapR Technologies
Non-Genomics Digression, 1 of 2
Data Warehouse ETL Offload
© 2014 MapR Technologies 33
The Problem
• Major telecom vendor
• Key step in billing pipeline handled by data warehouse (EDW)
• EDW at maximum capacity
• Multiple rounds of software optimization already done
• Revenue limiting (= career limiting) bottleneck
© 2014 MapR Technologies 34
Three Options
1. No more revenue growth
2. Increase EDW size
– Expensive
– Known to not scale well
3. Find a more scalable solution
© 2014 MapR Technologies 35
ETL
CDR
billing
records
Billing
reports
Data Warehouse
Customer
bills
Original Flow – ELTL
© 2014 MapR Technologies 36
Simplified Analysis – EDW Strategy
• 70% of EDW consumed by ELTL processing
– Caused by 10% of code (CDR transformations)
• 200% EDW capacity adds capital cost is ~X
• Indirect costs non-trivial (floor space, power)
• 150% performance increase (poor division of labor)
© 2014 MapR Technologies 37
ETL
CDR
billing
records
Billing
reports
Data Warehouse
Customer
billing
With ETL Offload
© 2014 MapR Technologies 38
Simplified Analysis – MapR Strategy
• Hardware + MapR cost ~1/20X
• ETL replacement development costs ~1/20X
• 300% performance increase
© 2014 MapR Technologies 39
Price Performance
• EDW strategy
– 1.5x performance
– Cost is X
• MapR Strategy
– 3x performance
– Cost is 1/10X
• 20x cost/performance advantage for MapR strategy
© 2014 MapR Technologies 40
Platform Advantages
• Standard Hadoop eco-system components allow efficient
CDR parsing and ETL
• MapR platform provides high availability, disaster
recovery
• MapR NFS interface allows direct load of transformed
data
© 2014 MapR Technologies 41© 2014 MapR Technologies
Non-Genomics Digression, 2 of 2
© 2014 MapR Technologies 42© 2014 MapR Technologies
<Recommendation System. Redacted>
© 2014 MapR Technologies 50© 2014 MapR Technologies
Hybrid Use-Cases
© 2014 MapR Technologies 51
MapR Data Platform Advantage, Telecommunications
CO-OCCURRENCE
(MAHOUT)
SOLR INDEXING
ETL
BILLING
REPORTS
WEB TIERDATA
WAREHOUSE
CDR
BILLING
RECORDS
CUSTOMER
BILLING
USER HISTORY QUERY /
CONTEXT RECOMENDATIONS
COMPLETE HISTORY
(all users)
ITEM META-DATA INDEX SHARDS
© 2014 MapR Technologies 52
MapR Data Platform Advantage, Clinical Genomics
Epidemiological,
Actuarial Analyses
Denormalization for
Search, Viz, Research
ETL
Clinical
Reporting
WEB TIERClinical
Reporting
Systems
CLINICAL
TREATMENT
OF PATIENTS
RESEARCHERS
National Pop.
Database
INDEX SHARDSPrognostic
Capability
© 2014 MapR Technologies 53© 2014 MapR Technologies
Bonus Round: 2º Analytics
© 2014 MapR Technologies 54
Clinical Genomics, Information Systems Perspective
PhysicianPatient
AnalystStakeholder
ETL
Reporting and Viz
Data Store
Analytics
2º analytics
Not much in this presentation,
see also:
http://slidesha.re/1sC2BOX
© 2014 MapR Technologies 55
Matrices A (U*Q) and B (U*V)
Query Term = Clicked Term
Users
Query Terms
Users
Clicked Videos
© 2014 MapR Technologies 56
Relate Q to V
Users
Query Terms
© 2014 MapR Technologies 57
Relate Q to V
Users
Query Terms
© 2014 MapR Technologies 58
Relate Q to V: it’s a Cross-Recommender
QueryTerms
Videos
© 2014 MapR Technologies 59
Users
Query Terms
© 2014 MapR Technologies 60
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive
Integrative Bionetwork Disease Model Building
NPR. 2011. The Search For Analysts To Make Sense Of
'Big Data’
http://www.npr.org/2011/11/30/142893065
© 2014 MapR Technologies 61
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive
Integrative Bionetwork Disease Model Building
• Identify network structures
• Label them
• Observe
stimulus=>response
space mapping
• Purposefully target
• PROFIT ! ! ! !

Weitere ähnliche Inhalte

Andere mochten auch

Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics AnalyticsIntel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics AnalyticsIntelHealthcare
 
Data analytics challenges in genomics
Data analytics challenges in genomicsData analytics challenges in genomics
Data analytics challenges in genomicsmikaelhuss
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't SpecialAllen Day, PhD
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
 
Crowdfunding: an Easy and Creative Way of Funding
Crowdfunding: an Easy and Creative Way of FundingCrowdfunding: an Easy and Creative Way of Funding
Crowdfunding: an Easy and Creative Way of Fundingjustverycurious
 
7 #designgames The Innovation Games: methods to help teams develop breakthrou...
7 #designgames The Innovation Games: methods to help teams develop breakthrou...7 #designgames The Innovation Games: methods to help teams develop breakthrou...
7 #designgames The Innovation Games: methods to help teams develop breakthrou...John Knight
 
How to pitch an american VC by Blake Armstrong, Partner at TheFamily
How to pitch an american VC by Blake Armstrong, Partner at TheFamilyHow to pitch an american VC by Blake Armstrong, Partner at TheFamily
How to pitch an american VC by Blake Armstrong, Partner at TheFamilyTheFamily
 
Earth images from space 2014 (2014年 太空拍的地球照片)
Earth images from space 2014 (2014年 太空拍的地球照片)Earth images from space 2014 (2014年 太空拍的地球照片)
Earth images from space 2014 (2014年 太空拍的地球照片)Chung Yen Chang
 

Andere mochten auch (12)

Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics AnalyticsIntel - Challenges and Opportunities in Cloud-Based Genomics Analytics
Intel - Challenges and Opportunities in Cloud-Based Genomics Analytics
 
Presentación 2018-2019
Presentación 2018-2019Presentación 2018-2019
Presentación 2018-2019
 
Data analytics challenges in genomics
Data analytics challenges in genomicsData analytics challenges in genomics
Data analytics challenges in genomics
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
CAD CAM CAE
CAD CAM CAECAD CAM CAE
CAD CAM CAE
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Crowdfunding: an Easy and Creative Way of Funding
Crowdfunding: an Easy and Creative Way of FundingCrowdfunding: an Easy and Creative Way of Funding
Crowdfunding: an Easy and Creative Way of Funding
 
7 #designgames The Innovation Games: methods to help teams develop breakthrou...
7 #designgames The Innovation Games: methods to help teams develop breakthrou...7 #designgames The Innovation Games: methods to help teams develop breakthrou...
7 #designgames The Innovation Games: methods to help teams develop breakthrou...
 
How to pitch an american VC by Blake Armstrong, Partner at TheFamily
How to pitch an american VC by Blake Armstrong, Partner at TheFamilyHow to pitch an american VC by Blake Armstrong, Partner at TheFamily
How to pitch an american VC by Blake Armstrong, Partner at TheFamily
 
Cad cam cae
Cad cam caeCad cam cae
Cad cam cae
 
How Scientists Engage the Public
How Scientists Engage the PublicHow Scientists Engage the Public
How Scientists Engage the Public
 
Earth images from space 2014 (2014年 太空拍的地球照片)
Earth images from space 2014 (2014年 太空拍的地球照片)Earth images from space 2014 (2014年 太空拍的地球照片)
Earth images from space 2014 (2014年 太空拍的地球照片)
 

Ähnlich wie Genomics Crash Course for Data Engineers

2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Allen Day, PhD
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIAllen Day, PhD
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Allen Day, PhD
 
Renaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and GenomicsRenaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and GenomicsAllen Day, PhD
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsMapR Technologies
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleJulius Remigio, CBIP
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Insights from Building the Future of Drug Discovery with Apache Spark with Lu...
Insights from Building the Future of Drug Discovery with Apache Spark with Lu...Insights from Building the Future of Drug Discovery with Apache Spark with Lu...
Insights from Building the Future of Drug Discovery with Apache Spark with Lu...Databricks
 
Deep Learning for AI (3)
Deep Learning for AI (3)Deep Learning for AI (3)
Deep Learning for AI (3)Dongheon Lee
 
Machine Learning: Past, Present and Future - by Tom Dietterich
Machine Learning: Past, Present and Future - by Tom DietterichMachine Learning: Past, Present and Future - by Tom Dietterich
Machine Learning: Past, Present and Future - by Tom DietterichBigML, Inc
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...
SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...
SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...Kees van Bochove
 
[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...PAPIs.io
 
Hadoop recognition of biomedical named entity using conditional random fields...
Hadoop recognition of biomedical named entity using conditional random fields...Hadoop recognition of biomedical named entity using conditional random fields...
Hadoop recognition of biomedical named entity using conditional random fields...LeMeniz Infotech
 
COMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENTCOMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENTArunpandiyan59
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Bigfinite
 

Ähnlich wie Genomics Crash Course for Data Engineers (20)

2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...
 
Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]Human Genetics & Big Data [sans Ethics]
Human Genetics & Big Data [sans Ethics]
 
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBIHadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
Hadoop and Genomics - What you need to know - Cambridge - Sanger Center and EBI
 
Hadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San JoseHadoop as a Platform for Genomics - Strata 2015, San Jose
Hadoop as a Platform for Genomics - Strata 2015, San Jose
 
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
Hadoop and Genomics - What You Need to Know - London - Viadex RCC - 2015.03.17
 
Renaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and GenomicsRenaissance in Medicine - Strata - NoSQL and Genomics
Renaissance in Medicine - Strata - NoSQL and Genomics
 
Hadoop as a Platform for Genomics
Hadoop as a Platform for GenomicsHadoop as a Platform for Genomics
Hadoop as a Platform for Genomics
 
Genome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data StyleGenome Analysis Pipelines, Big Data Style
Genome Analysis Pipelines, Big Data Style
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Insights from Building the Future of Drug Discovery with Apache Spark with Lu...
Insights from Building the Future of Drug Discovery with Apache Spark with Lu...Insights from Building the Future of Drug Discovery with Apache Spark with Lu...
Insights from Building the Future of Drug Discovery with Apache Spark with Lu...
 
Deep Learning for AI (3)
Deep Learning for AI (3)Deep Learning for AI (3)
Deep Learning for AI (3)
 
Machine Learning: Past, Present and Future - by Tom Dietterich
Machine Learning: Past, Present and Future - by Tom DietterichMachine Learning: Past, Present and Future - by Tom Dietterich
Machine Learning: Past, Present and Future - by Tom Dietterich
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...
SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...
SCOPE Summit - Applying the OMOP data model & OHDSI software to national Euro...
 
[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...[Keynote] predictive technologies and the prediction of technology - Bob Will...
[Keynote] predictive technologies and the prediction of technology - Bob Will...
 
Hadoop recognition of biomedical named entity using conditional random fields...
Hadoop recognition of biomedical named entity using conditional random fields...Hadoop recognition of biomedical named entity using conditional random fields...
Hadoop recognition of biomedical named entity using conditional random fields...
 
Parkinson disease classification v2.0
Parkinson disease classification v2.0Parkinson disease classification v2.0
Parkinson disease classification v2.0
 
COMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENTCOMPUTERS IN PHARMACEUTICAL DEVELOPMENT
COMPUTERS IN PHARMACEUTICAL DEVELOPMENT
 
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
Maximize Your Understanding of Operational Realities in Manufacturing with Pr...
 
From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)From data lakes to actionable data (adventures in data curation)
From data lakes to actionable data (adventures in data curation)
 

Mehr von Allen Day, PhD

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...Allen Day, PhD
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...Allen Day, PhD
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser UniversityAllen Day, PhD
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - WageningenAllen Day, PhD
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - AmsterdamAllen Day, PhD
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / PhoenixAllen Day, PhD
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen ChinaAllen Day, PhD
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedAllen Day, PhD
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big DataAllen Day, PhD
 
2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data AnalyticsAllen Day, PhD
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design PatternsAllen Day, PhD
 
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design PatternsAllen Day, PhD
 

Mehr von Allen Day, PhD (15)

Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...Deep learning in medicine: An introduction and applications to next-generatio...
Deep learning in medicine: An introduction and applications to next-generatio...
 
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
20170428 - Look to Precision Agriculture to Bootstrap Precision Medicine - Cu...
 
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
20170426 - Deep Learning Applications in Genomics - Vancouver - Simon Fraser ...
 
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University20170424 - Big Data in Biology - Vancouver - Simon Fraser University
20170424 - Big Data in Biology - Vancouver - Simon Fraser University
 
20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen20170406 Genomics@Google - KeyGene - Wageningen
20170406 Genomics@Google - KeyGene - Wageningen
 
20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam20170402 Crop Innovation and Business - Amsterdam
20170402 Crop Innovation and Business - Amsterdam
 
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
20170315 Cloud Accelerated Genomics - Tel Aviv / Phoenix
 
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
2014.06.16 - BGI - Genomics BigData Workloads - Shenzhen China
 
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseR + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San Jose
 
Building Data Science Teams, Abbreviated
Building Data Science Teams, AbbreviatedBuilding Data Science Teams, Abbreviated
Building Data Science Teams, Abbreviated
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
20131212 - Sydney - Garvan Institute - Human Genetics and Big Data
 
2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics2013.12.12 - Sydney - Big Data Analytics
2013.12.12 - Sydney - Big Data Analytics
 
20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns20131011 - Los Gatos - Netflix - Big Data Design Patterns
20131011 - Los Gatos - Netflix - Big Data Design Patterns
 
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
20131111 - Santa Monica - BigDataCamp - Big Data Design Patterns
 

Kürzlich hochgeladen

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataTecnoIncentive
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...KarteekMane1
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 

Kürzlich hochgeladen (20)

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
wepik-insightful-infographics-a-data-visualization-overview-20240401133220kwr...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 

Genomics Crash Course for Data Engineers

  • 1. © 2014 MapR Technologies 1
  • 2. © 2014 MapR Technologies 2 Biomedical & Advertising Tech Overarching Themes* *Obligatory movie references… shout-out to my hometown LA Eugenics & Determinism Free will vs. Determinism Media Tech & Privacy
  • 3. © 2014 MapR Technologies 3 Biomedical Research Goal: Therapeutics => Diagnostics => Prognostics • Therapeutics => traditional medicine • Diagnostics => personalized medicine – NextGen public health – Requires hi-res mechanical knowledge – Reverse engineer how genetic variation leads to (un)desired traits • Prognostics => GATTACA (dys/eu)topia – Managed populations / NextGen eugenics
  • 4. © 2014 MapR Technologies 4Star Wars III: Revenge of the Sith
  • 5. © 2014 MapR Technologies 5Star Wars V: The Empire Strikes Back
  • 6. © 2014 MapR Technologies 6 Genetic Basis of Facial Features self-reported values of {sex, ancestry} + observer scores [race, sex]} + 3D facial scan + genome scan ______________________________ Allelic model of 20 genes that determine facial characteristics Claes, et al. 2014. Modeling 3D Facial Shape from DNA
  • 7. © 2014 MapR Technologies 7 Genetic Basis of Facial Features Claes, et al. 2014. Modeling 3D Facial Shape from DNA
  • 8. © 2014 MapR Technologies 8 So Get Ready… www.theness.com
  • 9. © 2014 MapR Technologies 9© 2014 MapR Technologies Genomics Crash Course for Data Engineers
  • 10. © 2014 MapR Technologies 10 Me, Us • Allen Day, Principal Data Scientist, MapR 5yr Hadoop Dev, R project contributor PhD, Human Genetics, UCLA Medicine • MapR Distributes open source components for Hadoop Adds major technology for performance, HA, industry standard API’s • See Also – “allenday” most places (twitter, github, etc.) – @mapR
  • 11. © 2014 MapR Technologies 11 Clinical Sequencing Business Process Workflow PhysicianPatient Clinic blood/saliva Clinical Lab Analytics extract
  • 12. © 2014 MapR Technologies 12 One Bad MTHFR MTHFR C677T Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The- Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid- Health.htm
  • 13. © 2014 MapR Technologies 13 One Bad MTHFR MTHFR C677T Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The- Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid- Health.htm
  • 14. © 2014 MapR Technologies 14 One Bad MTHFR MTHFR C677T Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The- Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid- Health.htm
  • 15. © 2014 MapR Technologies 15 One Bad MTHFR MTHFR C677T Methylfolate helps make neurotransmitters in your brain. When methylfolate levels are low, so are your neurotransmitters. Low production of neurotransmitters may cause conditions of addictive behavior, depression, anxiety, ADHD, mania, irritability, insomnia, learning disorders and others. Everyone should get tested. Why? Because 1 in 2 people are affected and if one knows they have a MTHFR polymorphism, they know they have to be very proactive in taking care of themselves. http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The- Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid- Health.htm
  • 16. © 2014 MapR Technologies 16 Clinical Sequencing Business Process Workflow PhysicianPatient Clinic blood/saliva Clinical Lab Analytics extract
  • 17. © 2014 MapR Technologies 17 Clinical Genomics, Information Systems Perspective Compressed Structured Base4 Data Uncompressed Unstructured Base2 Data extract Base4=>Base2 Converter [[ DE-STRUCTURES ]] “BI” Reporting and Visualization tools PhysicianPatient AnalystStakeholder
  • 18. © 2014 MapR Technologies 18 Clinical Genomics, Information Systems Perspective PhysicianPatient AnalystStakeholder ETL Reporting and Viz Data Store Analytics
  • 19. © 2014 MapR Technologies 19 Sequencing “Even Moore’s Law” Stein. 2010. The case for cloud computing in genome informatics
  • 20. © 2014 MapR Technologies 20 The Evolving Genomics Workload Sboner, et al, 2011. The real cost of sequencing: higher than you think! <= 1º analytics “current high ROI use cases” <= 2º analytics “next-gen high ROI use cases”
  • 21. © 2014 MapR Technologies 21 Clinical Genomics, Information Systems Perspective PhysicianPatient AnalystStakeholder ETL Reporting and Viz Data Store Analytics 1º analytics 2º analytics Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  • 22. © 2014 MapR Technologies 22 Sequence Analysis, Quick Partial Details […] G A C T A G A fragment1 A C A G T T T A C A fragment2 A G A T A - - A G A fragment3 A A C A G C T T A C A […] fragment4 C T A T A G A T A A fragment5 […] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA […] G A C T A C A G A T A A C A G A T T A C A […] patient__DNA
  • 23. © 2014 MapR Technologies 23 What is the (Probable) Color of Each Column?
  • 24. © 2014 MapR Technologies 24 Which Columns are (probably) Not White? Strategy 1: examine foreach column, foreach row O(rows*cols) + O(1 col) memory
  • 25. © 2014 MapR Technologies 25 Which Columns are (probably) Not White? Strategy 2: examine foreach row. keep running tallies O(rows) + O(rows*cols) memory
  • 26. © 2014 MapR Technologies 26 Which Columns are (probably) Not White? Strategy 3: rotate matrix. examine foreach column O(rows log rows) + O(cols) + O(1 col) memory
  • 27. © 2014 MapR Technologies 27 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) + O(cols) + O(1 col) memory
  • 28. © 2014 MapR Technologies 28 Comparison of Strategies Strategy 1 • Low mem req • Random access pattern, many ops Strategy 3 • Low mem req • Sequential access pattern • Requires Sort Strategy 2 • High mem req • Sequential access pattern O(rows*cols) + O(1 col) memory O(rows) + O(rows*cols) memory O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory As # of rows & columns increases Strategy 3 becomes more attractive
  • 29. © 2014 MapR Technologies 29 1º Sequence Analysis (ETL), MapReduce style .fastq .bam .vcf short read alignment genotype calling MAP MAP REDUCE, rotate matrix 90º (O(mn)) / 1 (O(mn) + O(n log n)) / s
  • 30. © 2014 MapR Technologies 30 Crossbow (MapReduce Strategy, implemented) Langmead, et al. 2009. Searching for SNPs with cloud computing
  • 31. © 2014 MapR Technologies 31 Ion Flux (MapReduce Strategy, implemented for Enterprise) • Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic M/R) • Integrated with Ion Torrent as a plugin to stream sequence to the cloud • Emphasis on scalability and latency – assay->clinical report turnaround in < 24h • Compare to fast-follower stack ILMN MiSeq+BaseSpace http://aws.amazon.com/solutions/case-studies/ion-flux/ http://ionflux.com
  • 32. © 2014 MapR Technologies 32© 2014 MapR Technologies Non-Genomics Digression, 1 of 2 Data Warehouse ETL Offload
  • 33. © 2014 MapR Technologies 33 The Problem • Major telecom vendor • Key step in billing pipeline handled by data warehouse (EDW) • EDW at maximum capacity • Multiple rounds of software optimization already done • Revenue limiting (= career limiting) bottleneck
  • 34. © 2014 MapR Technologies 34 Three Options 1. No more revenue growth 2. Increase EDW size – Expensive – Known to not scale well 3. Find a more scalable solution
  • 35. © 2014 MapR Technologies 35 ETL CDR billing records Billing reports Data Warehouse Customer bills Original Flow – ELTL
  • 36. © 2014 MapR Technologies 36 Simplified Analysis – EDW Strategy • 70% of EDW consumed by ELTL processing – Caused by 10% of code (CDR transformations) • 200% EDW capacity adds capital cost is ~X • Indirect costs non-trivial (floor space, power) • 150% performance increase (poor division of labor)
  • 37. © 2014 MapR Technologies 37 ETL CDR billing records Billing reports Data Warehouse Customer billing With ETL Offload
  • 38. © 2014 MapR Technologies 38 Simplified Analysis – MapR Strategy • Hardware + MapR cost ~1/20X • ETL replacement development costs ~1/20X • 300% performance increase
  • 39. © 2014 MapR Technologies 39 Price Performance • EDW strategy – 1.5x performance – Cost is X • MapR Strategy – 3x performance – Cost is 1/10X • 20x cost/performance advantage for MapR strategy
  • 40. © 2014 MapR Technologies 40 Platform Advantages • Standard Hadoop eco-system components allow efficient CDR parsing and ETL • MapR platform provides high availability, disaster recovery • MapR NFS interface allows direct load of transformed data
  • 41. © 2014 MapR Technologies 41© 2014 MapR Technologies Non-Genomics Digression, 2 of 2
  • 42. © 2014 MapR Technologies 42© 2014 MapR Technologies <Recommendation System. Redacted>
  • 43. © 2014 MapR Technologies 50© 2014 MapR Technologies Hybrid Use-Cases
  • 44. © 2014 MapR Technologies 51 MapR Data Platform Advantage, Telecommunications CO-OCCURRENCE (MAHOUT) SOLR INDEXING ETL BILLING REPORTS WEB TIERDATA WAREHOUSE CDR BILLING RECORDS CUSTOMER BILLING USER HISTORY QUERY / CONTEXT RECOMENDATIONS COMPLETE HISTORY (all users) ITEM META-DATA INDEX SHARDS
  • 45. © 2014 MapR Technologies 52 MapR Data Platform Advantage, Clinical Genomics Epidemiological, Actuarial Analyses Denormalization for Search, Viz, Research ETL Clinical Reporting WEB TIERClinical Reporting Systems CLINICAL TREATMENT OF PATIENTS RESEARCHERS National Pop. Database INDEX SHARDSPrognostic Capability
  • 46. © 2014 MapR Technologies 53© 2014 MapR Technologies Bonus Round: 2º Analytics
  • 47. © 2014 MapR Technologies 54 Clinical Genomics, Information Systems Perspective PhysicianPatient AnalystStakeholder ETL Reporting and Viz Data Store Analytics 2º analytics Not much in this presentation, see also: http://slidesha.re/1sC2BOX
  • 48. © 2014 MapR Technologies 55 Matrices A (U*Q) and B (U*V) Query Term = Clicked Term Users Query Terms Users Clicked Videos
  • 49. © 2014 MapR Technologies 56 Relate Q to V Users Query Terms
  • 50. © 2014 MapR Technologies 57 Relate Q to V Users Query Terms
  • 51. © 2014 MapR Technologies 58 Relate Q to V: it’s a Cross-Recommender QueryTerms Videos
  • 52. © 2014 MapR Technologies 59 Users Query Terms
  • 53. © 2014 MapR Technologies 60 If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data’ http://www.npr.org/2011/11/30/142893065
  • 54. © 2014 MapR Technologies 61 If they were unlabeled, would you know which is which? Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building • Identify network structures • Label them • Observe stimulus=>response space mapping • Purposefully target • PROFIT ! ! ! !