Scaling up genomic 
analysis with ADAM 
Frank Austin Nothaft, UC Berkeley AMPLab 
fnothaft@berkeley.edu, @fnothaft 
12/8/2014
Data Intensive Genomics 
• Scale of genomic analyses is growing rapidly: 
• New experiments sequence 10-100k samples 
• Use high-coverage WGS for variant analyses 
• 100k samples @ 60x WGS will generate ~20PB of 
read data and ~300TB of genotype data
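A rough back-of-the-envelope check of that figure (assuming a ~3.2 Gbp human genome and on the order of one byte per sequenced base after compression; these are assumptions, not numbers from the slides): 
3.2 Gbp × 60× coverage ≈ 192 Gbp of bases per sample, i.e. roughly 200 GB of read data per sample 
100,000 samples × ~200 GB ≈ 20 PB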
Petabytes Cause Problems 
1. Analysis systems must be horizontally scalable 
without substantial programmer overhead 
2. Data storage format must compress well while 
providing good read performance 
3. Need to efficiently slice and dice dataset: not all 
users want the same views or subsets of data
Analysis Characteristics 
• Current genomics pipelines are limited by I/O 
• Most genomics algorithms can be formulated as a 
data or graph parallel computation 
• Analysis algorithms use iteration and pipelining 
• Reference genome/experiment metadata access 
must be cheap! —> impacts analysis performance
What is ADAM? 
• An open source, high performance, distributed 
platform for genomic analysis 
• ADAM defines: 
1. A data schema and layout on disk* 
2. A Scala API 
3. A command line interface 
* Via Avro and Parquet
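To make the Scala API concrete, here is a minimal sketch of loading and summarizing reads. It assumes the 2014-era ADAM API, where ADAMContext's loadAlignments returned an RDD[AlignmentRecord] (package names, return types, and save methods have shifted across releases), and the input path is hypothetical: 

import org.apache.spark.{SparkConf, SparkContext}
// ADAMContext enriches SparkContext with genomics loaders via an implicit conversion.
// This sketch targets the 2014-era API; later releases wrap the result in a GenomicRDD.
import org.bdgenomics.adam.rdd.ADAMContext._

object AdamApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("adam-api-sketch"))

    // hypothetical path: aligned reads stored in ADAM's Avro/Parquet layout
    val reads = sc.loadAlignments("hdfs:///data/NA12878.alignments.adam")

    // ordinary Spark transformations over the AlignmentRecord schema shown on a later slide
    val mappedPerContig = reads
      .filter(r => r.getReadMapped)
      .map(r => (r.getContig.getContigName.toString, 1L))
      .reduceByKey(_ + _)

    mappedPerContig.collect().foreach { case (contig, n) => println(s"$contig: $n") }
    sc.stop()
  }
}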
Principles for Scalable 
Design in ADAM 
• Reuse commodity horizontally scalable systems 
• Parallel FS and data representation (HDFS + 
Parquet) combined with in-memory computing 
eliminates the disk bandwidth bottleneck 
• Spark provides horizontally scalable iterative/ 
pipelined Map-Reduce 
• Minimize data movement: send code to data, 
efficiently encode metadata
• An in-memory data parallel computing framework 
• Optimized for iterative jobs —> unlike Hadoop 
• Data maintained in memory unless inter-node 
movement needed (e.g., on repartitioning) 
• Presents a functional programming API, along with support 
for iterative programming via a REPL 
• Set Daytona Greysort record (100TB in 23 min, 206 nodes)
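To make the "optimized for iterative jobs" point concrete, here is a small, generic Spark example (not ADAM-specific, with a toy dataset): the working set is cached in cluster memory once, and each subsequent pass over it avoids re-reading from disk. 

import org.apache.spark.{SparkConf, SparkContext}

object IterativeSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))

    // toy dataset: points on the line y = 3x; cache() keeps them in memory,
    // so the iterations below never touch the input again
    val data = sc.parallelize(1 to 100000)
      .map { i => val x = i / 100000.0; (x, 3.0 * x) }
      .cache()

    // fit y ≈ w * x by gradient descent; each pass is a full scan of the cached RDD
    var w = 0.0
    for (_ <- 1 to 20) {
      val grad = data.map { case (x, y) => (w * x - y) * x }.mean()
      w -= grad
    }
    println(s"fitted w = $w") // converges toward 3.0

    sc.stop()
  }
}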
Data Format 
• Avro schema encoded by Parquet 
• Schema can be updated without 
breaking backwards compatibility 
• Normalize metadata fields into 
schema for O(1) metadata access 
• Genotype schema is strictly 
biallelic, a “cell in the matrix” 
record AlignmentRecord { 
union { null, Contig } contig = null; 
union { null, long } start = null; 
union { null, long } end = null; 
union { null, int } mapq = null; 
union { null, string } readName = null; 
union { null, string } sequence = null; 
union { null, string } mateReference = null; 
union { null, long } mateAlignmentStart = null; 
union { null, string } cigar = null; 
union { null, string } qual = null; 
union { null, string } recordGroupName = null; 
union { int, null } basesTrimmedFromStart = 0; 
union { int, null } basesTrimmedFromEnd = 0; 
union { boolean, null } readPaired = false; 
union { boolean, null } properPair = false; 
union { boolean, null } readMapped = false; 
union { boolean, null } mateMapped = false; 
union { boolean, null } firstOfPair = false; 
union { boolean, null } secondOfPair = false; 
union { boolean, null } failedVendorQualityChecks = false; 
union { boolean, null } duplicateRead = false; 
union { boolean, null } readNegativeStrand = false; 
union { boolean, null } mateNegativeStrand = false; 
union { boolean, null } primaryAlignment = false; 
union { boolean, null } secondaryAlignment = false; 
union { boolean, null } supplementaryAlignment = false; 
union { null, string } mismatchingPositions = null; 
union { null, string } origQual = null; 
union { null, string } attributes = null; 
union { null, string } recordGroupSequencingCenter = null; 
union { null, string } recordGroupDescription = null; 
union { null, long } recordGroupRunDateEpoch = null; 
union { null, string } recordGroupFlowOrder = null; 
union { null, string } recordGroupKeySequence = null; 
union { null, string } recordGroupLibrary = null; 
union { null, int } recordGroupPredictedMedianInsertSize = null; 
union { null, string } recordGroupPlatform = null; 
union { null, string } recordGroupPlatformUnit = null; 
union { null, string } recordGroupSample = null; 
union { null, Contig} mateContig = null; 
}
Parquet 
• ASF Incubator project, based on 
Google Dremel 
• http://www.parquet.io 
• High performance columnar 
store with support for projections 
and push-down predicates 
• 3 layers of parallelism: 
• File/row group 
• Column chunk 
• Page 
Image from Parquet format definition: https://github.com/Parquet/parquet-format
Big Data in Parquet 
• ADAM's Parquet files are ~25% smaller than 
compressed BAM 
• Enables efficient slice-and-dice: 
• Can select column projections —> reduce I/O 
• Support pushdown predicates for efficient filtering 
• Have Parquet/S3 integration to push computing 
down into remote block stores for cold data
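A hedged illustration of the slice-and-dice point, using the generic Spark DataFrame reader over an ADAM Parquet directory rather than ADAM's own projection/predicate classes (and a newer Spark API than existed when this talk was given; the path and thresholds are made up): only the selected columns are materialized from disk, and the filter can be pushed down to Parquet row groups where possible. 

import org.apache.spark.sql.SparkSession

object ParquetSliceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-slice-sketch").master("local[*]").getOrCreate()

    // hypothetical path: ADAM alignment records stored as Parquet;
    // column names follow the AlignmentRecord schema shown earlier
    val reads = spark.read.parquet("/data/NA12878.alignments.adam")

    val slice = reads
      // pushdown predicate: evaluated against Parquet statistics/pages where possible
      .where("mapq >= 30 AND contig.contigName = '20'")
      // column projection: only these columns are read from disk
      .select("contig.contigName", "start", "end", "mapq")

    println(slice.count())
    spark.stop()
  }
}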
Scalability 
• Evaluated on 1000G WGS 
NA12878, 234GB dataset 
• Used 32-128 m2.4xlarge and 1 
cr1.8xlarge instances from AWS 
• Achieve linear scalability out 
to 128 nodes for most tasks 
• 2-4x improvement vs {GATK, 
samtools/Picard} on single 
machine for most tasks
Long-read assembly 
with PacMin
The State of Analysis 
• Conventional pipelines based on short-read alignment 
are really good at calling SNPs 
• Need improvement at calling INDELs and SVs 
• And are slow: 2 weeks to sequence, 1 week to 
analyze. Not fast enough. 
• If we move away from short reads, do we have other 
options?
Opportunities 
• New read technologies are available 
• Provide much longer reads (>10 kbp vs. ~250 bp for short reads) 
• Different error model… (~15% INDEL errors, vs. ~2% 
SNP errors for short reads) 
• Generally, lower sequence-specific bias 
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available… 
• We can use conventional methods: 
Carneiro et al, Genome Biology 2012
But! 
• Why not make raw assemblies out of the reads? 
[Diagram: find overlapping reads (for all pairs of reads (i, j), test whether they overlap), then find the consensus sequence …ACACTGCGACTCATCGACTC…] 
• Problems: 
1. Overlapping is O(n²), and a single evaluation is expensive anyway 
2. Typical algorithms find a single consensus sequence; what if we’ve got 
polymorphisms?
Fast Overlapping with 
MinHashing 
• Wonderful realization by Berlin et al1: overlapping is 
similar to the document similarity problem 
• Use MinHashing to approximate similarity: 
1: Berlin et al, bioRxiv 2014 
Per document/read, compute a signature: 
1. Cut into shingles 
2. Apply random hashes to shingles 
3. Take the min over all random hashes 
Hash into buckets: 
Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l) (see the derivation below) 
Compare: 
For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l 
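The bucket threshold above follows from the standard LSH banding argument (a sketch, assuming the l hashes of a signature are split into b bands of r = l/b hashes each): if two reads have true Jaccard similarity s, each MinHash agrees with probability s, a whole band of r hashes agrees with probability s^r, and at least one of the b bands collides with probability 1 − (1 − s^r)^b. This S-curve rises sharply around s ≈ (1/b)^(1/r) = (1/b)^(b/l), so pairs above that similarity are very likely to share a bucket and be compared. 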
• Easy to implement in Spark: map, groupBy, map, filter
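A self-contained sketch of that map/groupBy/map/filter pipeline, with toy reads, made-up parameters (shingle size, signature length, band count, similarity cutoff), and MurmurHash3 standing in for a proper hash family; it illustrates the technique rather than PacMin's actual implementation: 

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random
import scala.util.hashing.MurmurHash3

object MinHashOverlapSketch {
  // cut a read into k-mer shingles
  def shingles(seq: String, k: Int): Set[String] = seq.sliding(k).toSet

  // MinHash signature: for each seeded hash, take the min over all shingles
  def signature(sh: Set[String], seeds: Array[Int]): Array[Int] =
    seeds.map(seed => sh.map(s => MurmurHash3.stringHash(s, seed)).min)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("minhash-overlap-sketch").setMaster("local[*]"))

    val k = 8              // shingle (k-mer) size
    val l = 64             // signature length
    val b = 16             // number of bands
    val r = l / b          // hashes per band
    val seeds = Array.fill(l)(Random.nextInt())

    // toy (readId, sequence) pairs; in practice these would be long reads
    val reads = sc.parallelize(Seq(
      (0L, "ACACTGCGACTCATCGACTCACACTGCGACTCATCGACTC"),
      (1L, "GACTCATCGACTCACACTGCGACTCATCGACTCTTTTTTT"),
      (2L, "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG")))

    // map: one MinHash signature per read
    val sigs = reads.mapValues(seq => signature(shingles(seq, k), seeds)).cache()

    // groupBy: reads whose signatures collide in any band become candidate overlaps
    val candidates = sigs
      .flatMap { case (id, sig) =>
        sig.grouped(r).zipWithIndex.map { case (band, i) => ((i, band.toSeq), id) }
      }
      .groupByKey()
      .flatMap { case (_, ids) =>
        val v = ids.toSeq.sorted
        for (i <- v.indices; j <- i + 1 until v.size) yield (v(i), v(j))
      }
      .distinct()

    // map + filter: estimate Jaccard similarity from the signatures, keep likely overlaps
    val sigMap = sc.broadcast(sigs.collectAsMap())
    val overlaps = candidates.filter { case (x, y) =>
      val (sx, sy) = (sigMap.value(x), sigMap.value(y))
      sx.zip(sy).count { case (u, w) => u == w }.toDouble / l >= 0.3
    }

    overlaps.collect().foreach(println)
    sc.stop()
  }
}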
Overlaps to Assemblies 
• Finding pairwise overlaps gives us a directed 
graph between reads (lots of edges!)
Transitive Reduction 
• We can find a consensus between clique members 
• Or, we can reduce down: 
• Via two iterations of Pregel!
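As an illustration of the reduction step, here is a minimal sketch in plain Spark RDD operations rather than the Pregel formulation referenced above: an edge u -> w is dropped whenever some intermediate v gives a path u -> v -> w, which removes the edges implied by length-2 paths (applied iteratively as needed). 

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransitiveReductionSketch {
  // drop edges (u, w) that are implied by some two-hop path u -> v -> w
  def reduceOnce(edges: RDD[(Long, Long)]): RDD[(Long, Long)] = {
    val byDst = edges.map(_.swap)                   // (v, u) for each edge u -> v
    val twoHop = byDst.join(edges)                  // (v, (u, w)): u -> v and v -> w
      .map { case (_, (u, w)) => (u, w) }
      .distinct()
    edges.subtract(twoHop)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transitive-reduction").setMaster("local[*]"))
    // toy overlap graph: 1 -> 2 -> 3 plus the redundant shortcut 1 -> 3
    val edges = sc.parallelize(Seq((1L, 2L), (2L, 3L), (1L, 3L)))
    reduceOnce(edges).collect().sorted.foreach(println) // keeps (1,2) and (2,3) only
    sc.stop()
  }
}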
Monoallelic Sequence Model 
• Traditional probabilistic models assume independence 
at each site and a good reference model 
• This discards information about local sequence context 
• Can consider a different formulation of the problem: 
• Per reduced segment, build a graph of the alleles 
• Find the allelic copy numbers that maximize 
segment probability
Allele Graphs 
[Diagram: allele graph linking anchor sequences TCCACACT, ACACTCG, and TCTCA through single-base allele bubbles (C/A and G/C)] 
• Edges of graph define conditional probabilities 
• Can efficiently marginalize probabilities over graph using Eliminate 
algorithm1, exactly solve for argmax 
1. Jordan, “Probabilistic Graphical Models.” 
Notes: 
X = copy number of this allele 
Y = copy number of preceding allele 
k = number of reads observed 
j = number of reads supporting Y —> X transition 
Pi = probability that read i supports Y —> X transition
Output 
• Current assemblers emit FASTA contigs 
• We’ll emit “multigs”, which we’ll map back to a reference 
graph 
• Multig = multi-allelic (polymorphic) contig 
• Will include a confidence score per base 
• Working with UCSC, who’ve done some really neat work1 
deriving formalisms & building software for mapping 
between sequence graphs, and with the GA4GH ref. variation team 
1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.
Acknowledgements 
• UC Berkeley: Matt Massie, André Schumacher, 
Jey Kottalam, Christos Kozanitis, Adam Bloniarz 
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael 
Linderman, Jeff Hammerbacher 
• GenomeBridge: Timothy Danford, Carl Yeksigian 
• Cloudera: Uri Laserson 
• Microsoft Research: Jeremy Elson, Ravi Pandya 
• And many other open source contributors: 26 
contributors to ADAM/BDG from >11 institutions
