Genomics Crash Course for Data Engineers
- 2. © 2014 MapR Technologies 2
Biomedical & Advertising Tech Overarching Themes*
*Obligatory movie references… shout-out to my hometown LA
• Eugenics & Determinism
• Free will vs. Determinism
• Media Tech & Privacy
- 3. © 2014 MapR Technologies 3
Biomedical Research Goal:
Therapeutics => Diagnostics => Prognostics
• Therapeutics => traditional medicine
• Diagnostics => personalized medicine
– NextGen public health
– Requires hi-res mechanical knowledge
– Reverse engineer how genetic variation leads to (un)desired traits
• Prognostics => GATTACA (dys/eu)topia
– Managed populations / NextGen eugenics
- 4. © 2014 MapR Technologies 4
Star Wars III: Revenge of the Sith
- 5. © 2014 MapR Technologies 5
Star Wars V: The Empire Strikes Back
- 6. © 2014 MapR Technologies 6
Genetic Basis of Facial Features
self-reported values of {sex, ancestry}
+ observer scores {race, sex}
+ 3D facial scan
+ genome scan
______________________________
Allelic model of 20 genes that
determine facial characteristics
Claes, et al. 2014. Modeling 3D Facial Shape from DNA
- 7. © 2014 MapR Technologies 7
Genetic Basis of Facial Features
Claes, et al. 2014. Modeling 3D Facial Shape from DNA
- 8. © 2014 MapR Technologies 8
So Get Ready…
www.theness.com
- 9. © 2014 MapR Technologies 9
Genomics Crash Course for Data Engineers
- 10. © 2014 MapR Technologies 10
Me, Us
• Allen Day, Principal Data Scientist, MapR
5yr Hadoop Dev, R project contributor
PhD, Human Genetics, UCLA Medicine
• MapR
Distributes open source components for Hadoop
Adds major technology for performance, HA, industry-standard APIs
• See Also
– “allenday” most places (twitter, github, etc.)
– @mapR
- 11. © 2014 MapR Technologies 11
Clinical Sequencing Business Process Workflow
Physician
Patient
Clinic
blood/saliva
Clinical Lab
Analytics
extract
- 12. © 2014 MapR Technologies 12
One Bad MTHFR
MTHFR C677T
Methylfolate helps make neurotransmitters in
your brain. When methylfolate levels are low,
so are your neurotransmitters. Low production
of neurotransmitters may cause conditions of
addictive behavior, depression, anxiety,
ADHD, mania, irritability, insomnia, learning
disorders and others.
Everyone should get tested. Why? Because 1
in 2 people are affected and if one knows they
have a MTHFR polymorphism, they know they
have to be very proactive in taking care of
themselves.
http://thyroid.about.com/od/MTHFR-Gene-Mutations-and-Polymorphisms/fl/The-Link-Between-MTHFR-Gene-Mutations-and-Disease-Including-Thyroid-Health.htm
- 17. © 2014 MapR Technologies 17
Clinical Genomics, Information Systems Perspective
Compressed Structured Base4 Data
Uncompressed Unstructured Base2 Data
extract
Base4=>Base2 Converter [[ DE-STRUCTURES ]]
“BI” Reporting and Visualization tools
Physician
Patient
Analyst
Stakeholder
- 18. © 2014 MapR Technologies 18
Clinical Genomics, Information Systems Perspective
Physician
Patient
Analyst
Stakeholder
ETL
Reporting and Viz
Data Store
Analytics
- 19. © 2014 MapR Technologies 19
Sequencing “Even Moore’s Law”
Stein. 2010. The case for cloud computing in genome informatics
- 20. © 2014 MapR Technologies 20
The Evolving Genomics Workload
Sboner, et al, 2011. The real cost of sequencing: higher than you think!
<= 1º analytics
“current high ROI use cases”
<= 2º analytics
“next-gen high ROI use cases”
- 21. © 2014 MapR Technologies 21
Clinical Genomics, Information Systems Perspective
Physician
Patient
Analyst
Stakeholder
ETL
Reporting and Viz
Data Store
Analytics
1º analytics
2º analytics
Not much in this presentation,
see also:
http://slidesha.re/1sC2BOX
- 22. © 2014 MapR Technologies 22
Sequence Analysis, Quick Partial Details
[…] G A C T A G A fragment1
A C A G T T T A C A fragment2
A G A T A - - A G A fragment3
A A C A G C T T A C A […] fragment4
C T A T A G A T A A fragment5
[…] G A T T A C A G A T T A C A G A T T A C A […] referenceDNA
[…] G A C T A C A G A T A A C A G A T T A C A […] patient__DNA
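A minimal sketch of the idea on this slide: stack aligned fragments over the reference, take the most common base in each column, and report where the result differs from the reference. The reference string is the window shown above. The read offsets and read sequences are hypothetical, chosen so that the majority call reproduces the patient sequence. Real genotype callers weigh base and mapping quality rather than taking a simple majority.

```python
from collections import Counter

# Reference window from the slide.
REFERENCE = "GATTACAGATTACAGATTACA"

# Hypothetical aligned reads: (offset into the reference, bases).
# The third read carries a simulated sequencing error at position 2.
READS = [
    (0, "GACTACAGAT"),
    (1, "ACTACAGATA"),
    (0, "GATTACAGAT"),
    (6, "AGATAACAGA"),
    (9, "TAACAGATTA"),
    (11, "ACAGATTACA"),
]


def pileup(reads, length):
    """Count the bases observed at every reference position."""
    columns = [Counter() for _ in range(length)]
    for offset, bases in reads:
        for i, base in enumerate(bases):
            columns[offset + i][base] += 1
    return columns


def call_consensus(columns, reference):
    """Majority base per column; fall back to the reference if uncovered."""
    called = []
    for pos, counts in enumerate(columns):
        if counts:
            called.append(counts.most_common(1)[0][0])
        else:
            called.append(reference[pos])
    return "".join(called)


def variants(consensus, reference):
    """List (position, reference base, called base) where the two differ."""
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, consensus))
        if ref != alt
    ]


columns = pileup(READS, len(REFERENCE))
consensus = call_consensus(columns, REFERENCE)
snps = variants(consensus, REFERENCE)
print(consensus)
print(snps)
```

With these toy reads the consensus matches the patient row above, and the two reported differences are the T=>C and T=>A substitutions relative to the reference.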
- 23. © 2014 MapR Technologies 23
What is the (Probable) Color of Each Column?
- 24. © 2014 MapR Technologies 24
Which Columns are (probably) Not White?
Strategy 1: examine foreach column, foreach row O(rows*cols)
+ O(1 col) memory
- 25. © 2014 MapR Technologies 25
Which Columns are (probably) Not White?
Strategy 2: examine foreach row. keep running tallies O(rows)
+ O(rows*cols) memory
- 26. © 2014 MapR Technologies 26
Which Columns are (probably) Not White?
Strategy 3: rotate matrix. examine foreach column O(rows log rows)
+ O(cols)
+ O(1 col) memory
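The three strategies can be sketched side by side on a toy matrix. Everything here is an illustrative assumption: rows are modeled as sparse `{column: color}` dictionaries, "probably not white" is taken to mean that the majority color in a column is not white, and the in-memory `sorted` call stands in for what would be an external or distributed sort at scale.

```python
from collections import Counter
from itertools import groupby

# Hypothetical toy data: each row covers only a few columns.
ROWS = [
    {0: "white", 1: "white", 2: "red"},
    {1: "white", 2: "red", 3: "white"},
    {2: "red", 3: "white", 4: "blue"},
    {3: "white", 4: "blue", 5: "white"},
    {2: "white", 4: "blue", 5: "white"},
]
N_COLS = 6


def majority(counts):
    """Most frequent color in a Counter."""
    return counts.most_common(1)[0][0]


def strategy1(rows, n_cols):
    """For each column, rescan every row. Only one column's tally is live."""
    result = []
    for col in range(n_cols):
        counts = Counter(row[col] for row in rows if col in row)
        if counts and majority(counts) != "white":
            result.append(col)
    return result


def strategy2(rows, n_cols):
    """Single pass over rows, keeping running tallies for all columns."""
    tallies = [Counter() for _ in range(n_cols)]
    for row in rows:
        for col, color in row.items():
            tallies[col][color] += 1
    return [
        col
        for col, counts in enumerate(tallies)
        if counts and majority(counts) != "white"
    ]


def strategy3(rows):
    """Rotate: emit (column, color) cells, sort by column, stream the groups."""
    cells = sorted((col, color) for row in rows for col, color in row.items())
    result = []
    for col, group in groupby(cells, key=lambda cell: cell[0]):
        counts = Counter(color for _, color in group)
        if majority(counts) != "white":
            result.append(col)
    return result


print(strategy1(ROWS, N_COLS), strategy2(ROWS, N_COLS), strategy3(ROWS))
```

All three return the same columns. They differ in access pattern and in what has to be held in memory at once, which is the trade-off the slides compare.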
- 27. © 2014 MapR Technologies 27
Comparison of Strategies
Strategy 1
• Low mem req
• Random access pattern, many ops
• O(rows*cols) + O(1 col) memory
Strategy 2
• High mem req
• Sequential access pattern
• O(rows) + O(rows*cols) memory
Strategy 3
• Low mem req
• Sequential access pattern
• Requires Sort
• O(rows log rows) + O(cols) + O(1 col) memory
- 28. © 2014 MapR Technologies 28
Comparison of Strategies
Strategy 1
• Low mem req
• Random access pattern, many ops
• O(rows*cols) + O(1 col) memory
Strategy 2
• High mem req
• Sequential access pattern
• O(rows) + O(rows*cols) memory
Strategy 3
• Low mem req
• Sequential access pattern
• Requires Sort
• O(rows log rows) ÷ shards + O(cols) ÷ shards + O(1 col) memory
As # of rows & columns increases, Strategy 3 becomes more attractive
- 29. © 2014 MapR Technologies 29
1º Sequence Analysis (ETL), MapReduce style
.fastq => short read alignment => .bam => genotype calling => .vcf
MAP
MAP
REDUCE, rotate matrix 90º
(O(mn)) / 1
(O(mn) + O(n log n)) / s
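A single-process sketch of the map / shuffle / reduce shape described on this slide, assuming reads are already aligned. The function names, the toy reads, and the modulo partitioning are all hypothetical. A real Hadoop job would let the framework do the shuffle and sort, and would more likely partition by genomic range.

```python
from collections import Counter, defaultdict

# Hypothetical aligned reads: (offset, bases).
READS = [(0, "GAC"), (1, "ACT"), (0, "GAT"), (1, "ACT")]


def map_read(offset, bases):
    """MAP: turn one read (a row) into (position, base) records."""
    for i, base in enumerate(bases):
        yield offset + i, base


def shuffle(pairs, n_shards):
    """Shuffle: route records to shards by position, grouping by key.

    This is the 90-degree rotation: row-oriented reads become
    column-oriented groups.
    """
    shards = [defaultdict(list) for _ in range(n_shards)]
    for pos, base in pairs:
        shards[pos % n_shards][pos].append(base)
    return shards


def reduce_position(pos, bases):
    """REDUCE: call the majority base for one position (one column)."""
    return pos, Counter(bases).most_common(1)[0][0]


def run(reads, n_shards=3):
    pairs = (pair for offset, bases in reads for pair in map_read(offset, bases))
    calls = {}
    for shard in shuffle(pairs, n_shards):
        # Each shard is independent and could run on a separate worker.
        for pos in sorted(shard):
            key, base = reduce_position(pos, shard[pos])
            calls[key] = base
    return calls


print(run(READS))
```

Because each reducer sees only the positions routed to it, the per-position work divides across the s shards, which is where the `/ s` term on the slide comes from.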
- 30. © 2014 MapR Technologies 30
Crossbow (MapReduce Strategy, implemented)
Langmead, et al. 2009. Searching for SNPs with cloud computing
- 31. © 2014 MapR Technologies 31
Ion Flux (MapReduce Strategy, implemented for Enterprise)
• Sequencing workflow in MapReduce (Hadoop, Cascading, Amazon Elastic MapReduce)
• Integrated with Ion Torrent as a plugin to stream sequence to the cloud
• Emphasis on scalability and latency
– assay->clinical report turnaround in < 24h
• Compare to fast-follower stack Illumina (ILMN) MiSeq + BaseSpace
http://aws.amazon.com/solutions/case-studies/ion-flux/
http://ionflux.com
- 32. © 2014 MapR Technologies 32
Non-Genomics Digression, 1 of 2
Data Warehouse ETL Offload
- 33. © 2014 MapR Technologies 33
The Problem
• Major telecom vendor
• Key step in billing pipeline handled by data warehouse (EDW)
• EDW at maximum capacity
• Multiple rounds of software optimization already done
• Revenue-limiting (= career-limiting) bottleneck
- 34. © 2014 MapR Technologies 34
Three Options
1. No more revenue growth
2. Increase EDW size
– Expensive
– Known to not scale well
3. Find a more scalable solution
- 35. © 2014 MapR Technologies 35
Original Flow – ELTL
CDR billing records
ETL
Data Warehouse
Billing reports
Customer bills
- 36. © 2014 MapR Technologies 36
Simplified Analysis – EDW Strategy
• 70% of EDW consumed by ELTL processing
– Caused by 10% of code (CDR transformations)
• 200% EDW capacity: added capital cost is ~X
• Indirect costs non-trivial (floor space, power)
• Performance reaches only 150% (1.5x), due to poor division of labor
- 37. © 2014 MapR Technologies 37
With ETL Offload
CDR billing records
ETL
Data Warehouse
Billing reports
Customer billing
- 38. © 2014 MapR Technologies 38
Simplified Analysis – MapR Strategy
• Hardware + MapR cost ~1/20X
• ETL replacement development costs ~1/20X
• Performance reaches 300% (3x)
- 39. © 2014 MapR Technologies 39
Price Performance
• EDW strategy
– 1.5x performance
– Cost is X
• MapR Strategy
– 3x performance
– Cost is 1/10X
• 20x cost/performance advantage for MapR strategy
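The 20x figure follows from the numbers on the two "Simplified Analysis" slides. A small worked check, assuming the MapR cost of 1/10 X is the sum of the two ~1/20 X items (hardware plus MapR, and ETL replacement development), with X normalized to 1:

```python
from fractions import Fraction


def price_performance(performance, cost):
    """Performance delivered per unit of cost."""
    return performance / cost


# EDW strategy: 1.5x performance at cost X (X normalized to 1).
edw = price_performance(Fraction(3, 2), Fraction(1))

# MapR strategy: 3x performance at (1/20 + 1/20) X = 1/10 X.
mapr_cost = Fraction(1, 20) + Fraction(1, 20)
mapr = price_performance(Fraction(3), mapr_cost)

advantage = mapr / edw
print(edw, mapr, advantage)
```

The ratio works out to (3 / 0.1) / (1.5 / 1) = 30 / 1.5 = 20.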
- 40. © 2014 MapR Technologies 40
Platform Advantages
• Standard Hadoop ecosystem components allow efficient CDR parsing and ETL
• MapR platform provides high availability, disaster recovery
• MapR NFS interface allows direct load of transformed data
- 41. © 2014 MapR Technologies 41
Non-Genomics Digression, 2 of 2
- 42. © 2014 MapR Technologies 42
<Recommendation System. Redacted>
- 43. © 2014 MapR Technologies 50
Hybrid Use-Cases
- 44. © 2014 MapR Technologies 51
MapR Data Platform Advantage, Telecommunications
CO-OCCURRENCE (MAHOUT)
SOLR INDEXING
ETL
BILLING REPORTS
WEB TIER
DATA WAREHOUSE
CDR BILLING RECORDS
CUSTOMER BILLING
USER HISTORY QUERY /
CONTEXT RECOMMENDATIONS
COMPLETE HISTORY (all users)
ITEM META-DATA INDEX SHARDS
- 45. © 2014 MapR Technologies 52
MapR Data Platform Advantage, Clinical Genomics
Epidemiological, Actuarial Analyses
Denormalization for Search, Viz, Research
ETL
Clinical Reporting
WEB TIER
Clinical Reporting Systems
CLINICAL TREATMENT OF PATIENTS
RESEARCHERS
National Pop. Database
INDEX SHARDS
Prognostic Capability
- 46. © 2014 MapR Technologies 53
Bonus Round: 2º Analytics
- 47. © 2014 MapR Technologies 54
Clinical Genomics, Information Systems Perspective
Physician
Patient
Analyst
Stakeholder
ETL
Reporting and Viz
Data Store
Analytics
2º analytics
Not much in this presentation,
see also:
http://slidesha.re/1sC2BOX
- 48. © 2014 MapR Technologies 55
Matrices A (U*Q) and B (U*V)
Query Term = Clicked Term
Users
Query Terms
Users
Clicked Videos
- 49. © 2014 MapR Technologies 56
Relate Q to V
Users
Query Terms
- 50. © 2014 MapR Technologies 57
Relate Q to V
Users
Query Terms
- 51. © 2014 MapR Technologies 58
Relate Q to V: it’s a Cross-Recommender
QueryTerms
Videos
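One common way to relate Q to V, given A (users × query terms) and B (users × videos), is the cross-co-occurrence product AᵀB: entry (q, v) counts the users who both issued query term q and clicked video v. The sketch below assumes that formulation. The term names, video names, and matrix values are hypothetical, and it uses raw counts where a production recommender would typically apply a significance filter.

```python
# Hypothetical labels and binary interaction matrices (one row per user).
QUERY_TERMS = ["cat", "guitar"]
VIDEOS = ["kitten_video", "guitar_lesson", "news_clip"]

A = [  # users x query terms
    [1, 0],
    [1, 0],
    [0, 1],
    [1, 1],
]
B = [  # users x clicked videos
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
]


def cross_cooccurrence(a, b):
    """Compute A^T B without external libraries."""
    n_q, n_v = len(a[0]), len(b[0])
    c = [[0] * n_v for _ in range(n_q)]
    for a_row, b_row in zip(a, b):  # one user at a time
        for q, used in enumerate(a_row):
            if used:
                for v, clicked in enumerate(b_row):
                    c[q][v] += used * clicked
    return c


def recommend(term, c):
    """Return the video that co-occurs most often with a query term."""
    row = c[QUERY_TERMS.index(term)]
    best = max(range(len(row)), key=row.__getitem__)
    return VIDEOS[best]


C = cross_cooccurrence(A, B)
print(C)
print(recommend("cat", C), recommend("guitar", C))
```

The result is a query-terms × videos matrix, matching the axes on this slide: users have been summed out, leaving a direct link from what people search for to what they click.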
- 53. © 2014 MapR Technologies 60
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
NPR. 2011. The Search For Analysts To Make Sense Of 'Big Data'
http://www.npr.org/2011/11/30/142893065
- 54. © 2014 MapR Technologies 61
If they were unlabeled, would you know which is which?
Friend. 2010. The Need for Precompetitive Integrative Bionetwork Disease Model Building
• Identify network structures
• Label them
• Observe stimulus=>response space mapping
• Purposefully target
• PROFIT ! ! ! !