Next-Generation
Genomics
Using Spark and ADAM
Timothy Danford
Tamr Inc.
AMPLab
Next
Generation?
We come in
peace.
What even is
genomics?
Organism Cell Genome
One chromosome
per person
defines a
reference
chromosome
and
location
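The reference-plus-location idea on this slide can be sketched in a few lines: once everyone shares a coordinate system, one person's chromosome can be described as a short list of diffs against the reference. This is only an illustrative sketch. The `Variant` type, the `apply_variants` helper, and the toy ten-base "reference" are hypothetical names and data, not part of any genomics library.

```python
from typing import List, NamedTuple

class Variant(NamedTuple):
    position: int  # 1-based position on the reference
    ref: str       # base(s) the reference has at that position
    alt: str       # base(s) observed in the individual instead

def apply_variants(reference: str, variants: List[Variant]) -> str:
    """Rebuild an individual's sequence from the reference plus its diffs."""
    out = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for v in sorted(variants):
        start = v.position - 1
        if start < cursor:
            raise ValueError("overlapping variants are not supported here")
        if reference[start:start + len(v.ref)] != v.ref:
            raise ValueError(f"reference mismatch at position {v.position}")
        out.append(reference[cursor:start])  # unchanged stretch
        out.append(v.alt)                    # the individual's version
        cursor = start + len(v.ref)
    out.append(reference[cursor:])
    return "".join(out)

reference = "ACGTACGTAC"  # toy stand-in for a reference chromosome
individual = apply_variants(reference, [
    Variant(position=3, ref="G", alt="T"),   # substitution
    Variant(position=6, ref="CG", alt="C"),  # one-base deletion
])
```

Storing only the two variants, rather than the whole sequence, is the "diffs to a common reference" view in miniature.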
“… decoding the Book of Life”
Ortelius, 1570
Google, 2005
Lambert et al. “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer's disease” (2013)
Down the
Long Slide,
To
Happiness
Endlessly
We often treat
‘bioinformatics’ as a
black box
Vials into Files
What’s
In
The Box?
My God, It’s Full of
Pipelines
A Tale of Three File Formats
BAM Files: Do You Read
Me?
Compressed text files & custom index formats
User-defined attributes
Multi-record structure
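As a rough illustration of the "user-defined attributes" point: in SAM, the text form of BAM, each alignment line has eleven mandatory tab-separated fields followed by any number of optional `TAG:TYPE:VALUE` attributes. The minimal parser below is a sketch under stated assumptions. The example read, its coordinates, and the `XY` tag are made up, and only integer (`i`) tags are converted, with every other tag type left as a string.

```python
from typing import Dict, NamedTuple, Union

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str
    qual: str
    tags: Dict[str, Union[int, str]]

def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    tags: Dict[str, Union[int, str]] = {}
    for item in fields[11:]:
        # Optional fields look like TAG:TYPE:VALUE; the value may contain ':'.
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10], tags=tags)

# Hypothetical read; "XY" stands in for a tool-specific, user-defined tag.
example = "\t".join([
    "read1", "0", "chr1", "100", "60", "8M", "*", "0", "0",
    "ACGTACGT", "IIIIIIII", "NM:i:0", "XY:Z:custom-value"])
record = parse_sam_line(example)
```

Because any tool may add its own tags, two files in the "same" format can carry quite different information, which is part of what makes these formats awkward to build on.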
“Not wishing to be outdone
by Amazon, Sanger
Institute develops drone
deliver system for BAM
files.”
Open the Pod Bay Doors,
Pal
I Had a Dream
It Would End This Way
What to do, what to do?
Bioinformaticians
❤️
Probabilistic
Models
Our Data Scattered Back and
Forth
Across Space by this Gadget
Why Are We Still Defining
File Formats By Hand?
• Instead of defining custom file
formats for each data type and
access pattern…
• Parquet creates a
compressed format for each
Avro-defined data model.
• Improvement over existing
formats¹
• 20-22% for BAM
• ~95% for VCF
¹ Compression % quoted from 1K Genomes
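To make the Avro and Parquet bullets concrete: an Avro schema names the fields of a record, and a columnar format such as Parquet stores each field's values together rather than record by record. The sketch below imitates that idea with plain Python data structures only. The schema and its field names are hypothetical (they are not ADAM's actual schema), and no Avro or Parquet library is involved.

```python
# Hypothetical Avro-style record schema (illustrative, not ADAM's real one).
SCHEMA = {
    "type": "record",
    "name": "Read",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "sequence", "type": "string"},
    ],
}

def to_columns(rows, schema):
    """Pivot a list of row dicts into one list of values per schema field."""
    names = [f["name"] for f in schema["fields"]]
    return {name: [row[name] for row in rows] for name in names}

def to_rows(columns, schema):
    """Inverse of to_columns: rebuild the row dicts from the columns."""
    names = [f["name"] for f in schema["fields"]]
    length = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(length)]

rows = [
    {"contig": "chr1", "start": 100, "sequence": "ACGT"},
    {"contig": "chr1", "start": 104, "sequence": "GGCA"},
    {"contig": "chr2", "start": 7, "sequence": "TTAG"},
]
columns = to_columns(rows, SCHEMA)

# Reading a single field touches one list instead of every record.
starts = columns["start"]
```

Keeping similar values next to each other (for example, long runs of the same contig name) is generally what gives columnar layouts their compression advantage.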
Spark + Genomics =
ADAM
• Hosted at Berkeley and the
AMPLab
• Apache 2 License
• Contributors from both
research and commercial
organizations
• Core spatial primitives,
variant calling
• Avro and Parquet for data
models and file formats
Core Genomics Primitives:
The Needs of the Many
The Terrible Trouble
with Existing Pipelines
Cibulskis et al. “Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples” (2013)
“I think you know what the
problem is, just as well as I
do.”
A single piece of a
filtering stage for a
somatic variant caller
“11-base-pair window
centered on a candidate
mutation” actually turns
out to be optimized for
a particular file format
and sort order
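Stated on its own, the "11-base-pair window" is just an interval query, and it need not depend on how the reads are stored or sorted. A minimal sketch, assuming 1-based closed coordinates and reads reduced to hypothetical `(name, start, end)` tuples. This is not the code from the caller described in the Cibulskis et al. paper.

```python
from typing import Iterable, List, Tuple

Read = Tuple[str, int, int]  # (name, start, end), 1-based, closed interval

def centered_window(position: int, width: int = 11) -> Tuple[int, int]:
    """Return (start, end) of a window of `width` bases centered on position."""
    if width % 2 == 0:
        raise ValueError("width must be odd to be centered on one base")
    half = width // 2
    return position - half, position + half

def overlapping(reads: Iterable[Read], start: int, end: int) -> List[Read]:
    """Keep the reads that share at least one base with [start, end]."""
    return [r for r in reads if r[1] <= end and r[2] >= start]

# Deliberately unsorted input: the query does not rely on any sort order.
reads = [("r3", 120, 150), ("r1", 90, 101), ("r2", 95, 99), ("r4", 10, 94)]
window = centered_window(100)
hits = overlapping(reads, *window)
```

The linear scan is the simplest possible implementation. The point is only that the window is defined by coordinates, so any storage layout that can answer an overlap query would do.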
“Myths of Bioinformatics
Software”
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
We Can Make Our Own Myths
Thanks to...
And thank you! Questions?

Strata-Hadoop 2015 Presentation

Editor's Notes

  1. I’m nervous, so I’ll be speaking fast. Before we dive in, let me ask a couple of questions: biologists? Spark experts? This entire presentation is a lie. There are always at least three different constituencies in the room: biologists, programmers, and someone thinking about how to build a business around this. I am going to try to split the difference, but I won’t be able to satisfy everyone. In all the places where I have to skip over the truth, maybe there will be at least a breadcrumb back to it. This isn’t a technical talk. Let’s talk about the title:
  2. Next generation? I didn’t realize that there was a *first* generation! Bioinformatics is a field with a long history, thirty or more years as a separate discipline. At the same time, the fundamental technology is changing. So if I talk about ‘problems’ today, it’s OK: [animation] I come in peace! Bioinformatics software development has been *remarkably* effective, for decades. If there are problems to be solved, they are the result of new technologies and new conceptions of scale. So that’s “next generation,” but what about…
  3. Genomics? What even is genomics? Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference? So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
  4. Fundamentally, we’re interested in studying individuals (and populations of individuals). Each individual is *itself* a population of cells, but each of those cells has, ideally, an identical genome. The genome is a collection of 23 molecules, the chromosomes. These are called ‘polymers’; they’re built (like Legos) out of a small number of repeated interlocking parts: the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  5. Now, not only do all the cells in your body have identical genomes… [ANIMATE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means [ANIMATE] That we can define a ‘base’ or a ‘reference’ chromosome [ANIMATE] And a concept of ‘location’ across chromosomes. This is maybe the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  9. Here is Bill Clinton (with Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project.
  10. 1570: Theatrum Orbis Terrarum, “Theater of the World.” The first modern atlas. A direct byproduct of the first 100 years of PRINTING, and a tool for describing and exploring the world around us. Its direct descendants are still with us today!
  11. Google Maps! But what does the genomic version of this look like?
  12. Mapmakers today focus on *annotation* of the maps themselves. The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes.
  13. This is a Manhattan plot of Alzheimer’s-related genes and sequence markers. Now let’s shift gears, and talk about how this was performed – through sequencers. Sequencers are microscopes that read the genome.
  14. If there’s one graph you should remember, in order to understand the last (and the next) ten years of bioinformatics and genomics, it’s this one. The Human Genome Project was thousands of researchers and billions of dollars, spent over a decade, all to sequence on the order of half a dozen individuals. Today, we’re close to the “thousand dollar genome” – and already we’re seeing prototype sequencers with the form factor of a USB stick. So sequencers will drive everything before them – but sequencers are only ever half the story.
  15. Bioinformatics is a computational reversal of the sequencing process. [ANIMATE] But to most
  16. So… what’s in the box?
  17. It’s a pipeline! (Makes sense, since I’m also name-checking Spark, right?) It’s never *one* pipeline; we do this once for every person. Let me talk a little bit about the structure of one of these pipelines. Each step is typically written as a standalone program – passing files from stage to stage – often using something like Unix pipes. These are written as part of a globally distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem. But of course, it’s never one pipeline. [ANIMATE] It’s a pipeline per person. [ANIMATE] But since each pipeline runs (essentially) serially, scaling up is easy: scale out! [ANIMATE]
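A rough sketch of that shape, in Python: one serial chain of stages per person, with many people run side by side. Everything here is a stand-in. The stage functions (`trim`, `drop_empty`, `count_bases`) and sample names are invented for illustration, and a thread pool plays the role of what would really be separate command-line programs on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def trim(reads):
    """Toy stage: strip trailing 'N' (unknown) letters from each read."""
    return [r.rstrip("N") for r in reads]


def drop_empty(reads):
    """Toy stage: discard reads that are empty after trimming."""
    return [r for r in reads if r]


def count_bases(reads):
    """Toy final stage: total number of letters that survived."""
    return sum(len(r) for r in reads)


# One pipeline: each stage consumes the previous stage's output, strictly in order.
STAGES = [trim, drop_empty, count_bases]


def run_pipeline(reads):
    return reduce(lambda data, stage: stage(data), STAGES, reads)


def run_all(samples, max_workers=4):
    """Scale out: the same serial pipeline, once per person, with no shared state."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(run_pipeline, samples.values())
        return dict(zip(samples.keys(), results))


# Hypothetical input: a handful of reads per person.
samples = {
    "person_a": ["ACGTNN", "NN", "GGC"],
    "person_b": ["TTTT", "ACN"],
}

results = run_all(samples)
```

Note that nothing in `run_all` lets one person’s pipeline see another person’s data; that independence is exactly what makes scaling out easy.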
  20. That was the data side, but let’s open up the computation as well. Take one of those boxes that I drew earlier. Here’s alignment, but it could be… [ANIMATE] any bioinformatics tool. I assert that there are *two* things going on inside any bioinformatics tool – [ANIMATE] There is the method, and there is the implementation of that method. I think this is an important distinction to make… But even that is a lie, because there is a third thing… [ANIMATE] “Platform.” That’s why I’ve included this code snippet up above. So what’s the problem? Faster sequencers mean we sequence more people, but we have tools that work and a natural path to parallelism! Why does there need to be a “next generation”? The answer, of course, is that when you have all that data, you want to *USE* all that data.
  21. When you want to *use* all the data, your entire system will start to show cracks. Here is an example: variant calling. But [ANIMATE] God help you if you want to combine statistical information at an earlier phase of the process. And this is by no means a unique problem. So what is one solution? You might have guessed it from the title of my talk…
  22. There’s more parallelism that we can extract from our pipelines.
  23. Spark. The reason this works is that Spark naturally handles pipelines, and automatically performs shuffles when appropriate, but also…
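To make “shuffle” concrete, here is a minimal pure-Python sketch of the data movement involved; it does not use Spark itself. The records, field layout, and function names are assumptions for illustration. Data arrives grouped by person (one pipeline each), gets re-keyed by genomic location, and is then regrouped so that everything observed at one location sits together. In Spark terms this corresponds roughly to a map followed by a by-key grouping such as `groupByKey` or `reduceByKey`.

```python
from collections import defaultdict

# Hypothetical per-person observations: (sample, chromosome, position, observed letter).
observations = [
    ("person_a", "chr1", 100, "A"),
    ("person_a", "chr1", 200, "T"),
    ("person_b", "chr1", 100, "G"),
    ("person_b", "chr1", 200, "T"),
    ("person_c", "chr1", 100, "G"),
]


def map_to_locus(record):
    """Map step: re-key each record by location instead of by person."""
    _sample, chromosome, position, base = record
    return (chromosome, position), base


def shuffle(pairs):
    """Shuffle step: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def count_letters(bases):
    """Reduce step: summarize the evidence from all people at one location."""
    counts = defaultdict(int)
    for base in bases:
        counts[base] += 1
    return dict(counts)


by_locus = shuffle(map(map_to_locus, observations))
summary = {locus: count_letters(bases) for locus, bases in by_locus.items()}
```

The per-person pipelines never perform the regrouping step, which is why combining evidence across people is awkward when each pipeline only passes files to its own next stage.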