1© Cloudera, Inc. All rights reserved.
Scaling Up Genomics with Hadoop and Spark
Uri Laserson | @laserson | 14 November 2015
Petascale Genomics
2© Cloudera, Inc. All rights reserved.
We come in peace.
Pioneer plaque
3© Cloudera, Inc. All rights reserved.
What is genomics?
4© Cloudera, Inc. All rights reserved.
Organism
5© Cloudera, Inc. All rights reserved.
Organism Cell
6© Cloudera, Inc. All rights reserved.
Organism Cell Genome
7© Cloudera, Inc. All rights reserved.
8© Cloudera, Inc. All rights reserved.
9© Cloudera, Inc. All rights reserved.
Reference chromosome
10© Cloudera, Inc. All rights reserved.
Reference chromosome
Location
11© Cloudera, Inc. All rights reserved.
“
decoding the Book of Life”
12© Cloudera, Inc. All rights reserved.
...atatggaaccaaaaaagagcccgcatcgccaaggcaatcctaagccaaaagaacaaagctggaggcatcacactacctgacttcaaactatactaca
agcctacagtaaccaaaacagcatggtactggtaccaaaacagagatatagatcaatggaacagaacagagccctcagaaataacgccgcatatctacaa
ctatctgatctttgacgaacctgagaaaaacaagcaatggggaaaggattccctatttaataaatggtgctgggaaaactggctagccatatgtagaaag
ctgaaactggatcccttccttacaccttatacaaaaatcaattcaagatggattaaagacttaaacgttagacctaaaaccataaaaaccctagaagaaa
acctaggcagtaccattcaggacataggcatgggcaaggacttcatgtccaaaacaccaaaagcaatggcaacaaaagacaaaattgacaaatgggatct
aattaaactaaagagcttctgcacagcaaaagaaactaccatcagagtgaacaggaaacctacaaaatgggagaaaattttcgcaacctactcatctgac
aaagggctaatatccagaatctacaatgaactcaaacaaatttacaagaaaaaaacaaacaaccccatcaaaaagtgggcaaaggacatgaacagacact
tctcaaatgaagacatttatgcagccaaaaaacacatgaaaaaatgctcatcatcactggccatcagagaaatgcaaatcaaaaccacaatgagatacca
tctcacaccagttagaatggcaatcattaaaaagtcaggaaacaacaggtgctggagaggatgtggagaaataggaacacttttacactgttggtgggac
tgtaaactagttcaaccattgtggaagtcagtgtggtgattcctcagggatctagaactagaaataccatttgacccagccatcccattactgggtatat
acccaaaggactataaatcatgctgctataaagacacatgcacacgtatgtttattgcggcattattcacaatagcaaagacttggaaccaacccaaatg
tccaacaatgataaactggattaagaaaatgtggcacatatacaccatggaatactctgcagccataaaaaaggatgagttcatgtcctttgtagggaca
tggatgaaattggaaatcatcattctcagtaaactatcgcaagaataaaaaaccaaacaccgcatattctcactcataggtgggaattgaacaatgagat
cacatggacacaggaagaggaatatcacactctggggactgtggtggggtggggggaggggggagggatagcattgggagatatacctaatgctagatga
cgagttagtgggtgcagcgcaccagcatggcacatgtatacatatgtaactaacctgcacattgtgcacatgtaccctaaaacttaaagtataataaaaa
aataaaaaaaataaagtgtgtgtgtgtatgactttaattaacttgatcacccacacacacacaaacactgaccaaaattaatatcaagtcaggtctgtct
gaatgtaaagccaacagcaaacatccctctctccaaatggaaaagaaacagggggttatgggcagctacactgctaaatgttaaaactttatttttaaat
gtggccataaaaatcactaaataaaattgataatatatgtttttgatgaataaattttatatatgtctacactggaaactatatagcaataaaaactaac
catgtacaactaaactcataaatttcataaacataataagtaaaagaagccagacaaaaagtagtgtatactgttaaattccatttatataaaagttcaa
aaaagccaaaaagaaactatgctgttaaaagtaaggattatagttactattcagggaagagagtagtggctggaaagaaacataaagggggtctctgaag
tggaataatgttctgttttttgatctgggtattagggtgtttaatttcggaaaattattttatctttatacttattgtattattgattttttgcttaaca
aattactcaaaacttagaggtttaaaaaaaattaattattgtattaatttctctgggccaggaattggagagagcttagctgggtagttctggttcaaaa
tttctcatgagattaccgtcaagctgttggagggggctgcatcatctgaaggcttgaccgaggctagaggatctactttcaagatggcccactcacatgg
ctgttggcaagaagtttcagtttctcactagcttctagcaggaggccataatttctcaccacatagatctctctatagggctactcgagtgtcctcacag
caaggtagctggctttcttcagagccaagtgactcaaaggcaaagaggaagtcactatgccatttatgacctagttttggaactcacactttgttccgaa
ttgaccttccatcactttctagtcattaggatttaagtcactaactctgatccatagtcaaggggagtaaaatttggctttattgttggaggatggagta
gcaaagaatttgttgacacattttaaaactaccatacttaaacagttcatttttctgaatatgcttcaattagaagttaaaatgatgcaattttaaaaca
ttgtttcaaatgaacactgttagggagagaagtgcttcttctccatatctaatgtttcttccatatttagggagttccattagtttaacactttaag...
13© Cloudera, Inc. All rights reserved.
14© Cloudera, Inc. All rights reserved.
15© Cloudera, Inc. All rights reserved.
16© Cloudera, Inc. All rights reserved.
17© Cloudera, Inc. All rights reserved.
18© Cloudera, Inc. All rights reserved.
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
19© Cloudera, Inc. All rights reserved.
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
Bioinformatics!
20© Cloudera, Inc. All rights reserved.
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
Bioinformatics!
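The reads above are in FASTA format: a ">" header line followed by the sequence. A minimal Scala sketch of reading such a file, assuming one sequence line per record as in this toy example (an illustration, not a production parser):

import scala.io.Source

// Pair each ">name" header line with the sequence line that follows it.
case class FastaRead(name: String, sequence: String)

def parseFasta(path: String): Seq[FastaRead] =
  Source.fromFile(path).getLines().grouped(2).collect {
    case Seq(header, seq) if header.startsWith(">") =>
      FastaRead(header.drop(1), seq)
  }.toSeq

// e.g. parseFasta("reads.fasta") yields FastaRead("read1", "TTGGACATTTCGGGGTCTCAGATT"), ...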
21© Cloudera, Inc. All rights reserved.
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Pipelines!
22© Cloudera, Inc. All rights reserved.
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
Compressed text files (non-splittable)
Semi-structured
Poorly specified
23© Cloudera, Inc. All rights reserved.
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
Compressed text files (non-splittable)
Semi-structured
Poorly specified
Global sort order
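Because the body is tab-delimited text with a semi-structured INFO column, every tool ends up re-implementing ad hoc parsing. A hedged sketch of what that looks like for a single data line, with field positions per the VCF 4.1 header above; the record type is a simplified stand-in:

// Sketch: parse one VCF data line into fixed columns plus an INFO map.
// Assumes a well-formed, tab-delimited line as in the example records above.
case class VcfRecord(chrom: String, pos: Long, id: String, ref: String,
                     alt: String, qual: String, filter: String,
                     info: Map[String, String])

def parseVcfLine(line: String): VcfRecord = {
  val f = line.split("\t")
  val info = f(7).split(";").map { kv =>
    kv.split("=", 2) match {
      case Array(k, v) => k -> v
      case Array(k)    => k -> "true"   // flag fields like DB, H2
    }
  }.toMap
  VcfRecord(f(0), f(1).toLong, f(2), f(3), f(4), f(5), f(6), info)
}

// e.g. parseVcfLine(line).info.get("AF")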
24© Cloudera, Inc. All rights reserved.
C | HPC (scheduler) | POSIX filesystem
Java | HPC (Queue) | POSIX filesystem
C++ | Single-node | SQLite
It’s file formats all the way down!
25© Cloudera, Inc. All rights reserved.
Dedup
26© Cloudera, Inc. All rights reserved.
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);
    final SAMFileWriter out =
        new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
Method
Code
27© Cloudera, Inc. All rights reserved.
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);
    final SAMFileWriter out =
        new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
Method
Code
28© Cloudera, Inc. All rights reserved.
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
29© Cloudera, Inc. All rights reserved.
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
Dedup
Method/Algo
Code
Platform
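The method itself, group reads that share a 5' alignment position and orientation and keep one representative, is simple; the Picard code above entangles it with single-node platform concerns such as file handles and spilling to disk. A hedged sketch of the same idea stated as a data-parallel operation; the read type and the scoring rule here are simplified stand-ins, not the ADAM or Picard implementations:

// Sketch only: duplicate marking as a data-parallel group-by.
import org.apache.spark.rdd.RDD

case class AlignedRead(name: String, contig: String, fivePrimePos: Long,
                       negativeStrand: Boolean, baseQualitySum: Int,
                       duplicate: Boolean = false)

def markDuplicates(reads: RDD[AlignedRead]): RDD[AlignedRead] =
  reads
    .groupBy(r => (r.contig, r.fivePrimePos, r.negativeStrand))
    .flatMap { case (_, group) =>
      // keep the highest-quality read, flag the rest as duplicates
      val best = group.maxBy(_.baseQualitySum)
      group.map(r => if (r eq best) r else r.copy(duplicate = true))
    }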
30© Cloudera, Inc. All rights reserved.
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
31© Cloudera, Inc. All rights reserved.
It’s pipelines all the way down!
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
32© Cloudera, Inc. All rights reserved.
It’s pipelines all the way down!
Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
33© Cloudera, Inc. All rights reserved.
Manually running pipelines on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
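The same scatter/gather — split the genotypes, aggregate each chunk, merge the results — is a single job on Spark, with no manual submission and no intermediate files. A hedged sketch assuming the spark-shell `sc`; the aggregation here is just a per-chromosome record count and the paths are placeholders:

// Hedged sketch: split / per-chunk aggregate / merge as one Spark job.
val counts = sc.textFile("path/to/genotypes.vcf")
  .filter(!_.startsWith("#"))          // drop the VCF header
  .map(_.split("\t")(0))               // CHROM column
  .map(chrom => (chrom, 1L))
  .reduceByKey(_ + _)                  // partial sums are merged automatically

counts.saveAsTextFile("path/to/agg")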
34© Cloudera, Inc. All rights reserved.
35© Cloudera, Inc. All rights reserved.
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Alignment → Dedup → Recalibrate → QC/Filter
Alignment → Dedup → Recalibrate → QC/Filter
36© Cloudera, Inc. All rights reserved.
Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → Recalibrate → QC/Filter
Node 3: Alignment → Dedup → Recalibrate → QC/Filter
Node 4
37© Cloudera, Inc. All rights reserved.
Node 1: Alignment → Dedup → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → QC/Filter
Node 3: Alignment → Dedup → QC/Filter
Node 4: Recalibrate
38© Cloudera, Inc. All rights reserved.
Why Are We Still Defining File Formats By Hand?
‱ Instead of defining custom file formats for each data type and access pattern

‱ Parquet creates a compressed format for each Avro-defined data model
‱ Improvements over existing formats: ~20% for BAM, ~90% for VCF
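A hedged sketch of the idea: define the record model once and let Parquet produce the compressed, columnar layout. In ADAM/bdg-formats the models are defined in Avro and written with parquet-avro; the sketch below uses a simplified genotype case class and Spark SQL's Parquet writer as a stand-in:

// Hedged sketch: define the data model once, get a compressed columnar file.
// Simplified Genotype model -- not the actual bdg-formats schema.
import org.apache.spark.sql.SQLContext

case class SimpleGenotype(sampleId: String, contig: String, position: Long,
                          ref: String, alt: String, alleles: Seq[Int])

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val genotypes = Seq(
  SimpleGenotype("NA00001", "20", 14370L, "G", "A", Seq(0, 0)),
  SimpleGenotype("NA00002", "20", 14370L, "G", "A", Seq(1, 0)))

// Parquet applies column-wise encoding and compression for free.
sc.parallelize(genotypes).toDF().write.parquet("path/to/genotypes.parquet")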
39© Cloudera, Inc. All rights reserved.
YARN-managed Hadoop cluster
Spark executors compute partial sums: $\sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
Driver combines the partial sums: $\sum_{i=1}^{N} \sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
Application code
ContEst Algorithm
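The pattern in the diagram, each executor reducing its partition to a partial sum that the driver then combines, is a plain aggregate. A hedged sketch with placeholder per-read log-probabilities (not the actual ContEst model), assuming the spark-shell `sc`:

// Hedged sketch of the diagram's pattern: executors compute partial sums over
// their partitions and the driver combines them.
val perReadLogProbs = sc.parallelize(Seq(-0.02, -0.35, -0.11, -0.48))

val logLikelihood = perReadLogProbs.aggregate(0.0)(
  (acc, p) => acc + p,   // partial sum within each partition (executor side)
  (a, b)   => a + b)     // combine partial sums (driver side)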
40© Cloudera, Inc. All rights reserved.
Hadoop provides layered abstractions for data processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce | Impala (SQL) | Solr (search) | Spark
ADAM, quince, guacamole, 

bdg-formats (Avro/Parquet)
41© Cloudera, Inc. All rights reserved.
‱ Hosted at Berkeley and the AMPLab
‱ Apache 2 License
‱ Contributors from both research and commercial organizations
‱ Core spatial primitives, variant calling
‱ Avro and Parquet for data models and file formats
Spark + Genomics = ADAM
42© Cloudera, Inc. All rights reserved.
Core Genomics Primitives: Spatial Join
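A spatial (region) join pairs records whose genomic intervals overlap, e.g. genotypes against DNase features as in the query on the next slide. ADAM's RegionJoin partitions both sides by genomic bins; the hedged sketch below shows only the overlap predicate and a naive broadcast join, with a simplified Region type and the spark-shell `sc` assumed:

// Hedged sketch of a spatial join: the overlap predicate plus a naive
// broadcast-based join. ADAM's RegionJoin partitions both sides instead.
case class Region(contig: String, start: Long, end: Long)

def overlaps(a: Region, b: Region): Boolean =
  a.contig == b.contig && a.start < b.end && b.start < a.end

val features = Seq(Region("20", 14000L, 15000L), Region("20", 1110000L, 1111000L))
val featuresBc = sc.broadcast(features)

val sites = sc.parallelize(Seq(Region("20", 14370L, 14371L), Region("20", 17330L, 17331L)))

// keep each site paired with every overlapping feature
val joined = sites.flatMap { s =>
  featuresBc.value.filter(f => overlaps(s, f)).map(f => (s, f))
}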
43© Cloudera, Inc. All rights reserved.
Executing query in Hadoop: interactive Spark shell (ADAM)
def inDbSnp(g: Genotype): Boolean = ???       // stub: dbSNP membership test
def isDeleterious(g: Genotype): Boolean = ???  // stub: e.g. check g.getPolyPhen
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")
val filteredRDD = genotypesRDD
.filter(!inDbSnp(_))
.filter(isDeleterious(_))
.filter(isFramingham(_))
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)
val maf = joinedRDD
.keyBy(x => (x.getVariant, getPopulation(x)))
.groupByKey()
.map(computeMAF(_))
maf.saveAsNewAPIHadoopFile("path/to/output")
apply predicates
load data
join data
group-by
aggregate (MAF)
persist data
44© Cloudera, Inc. All rights reserved.
Executing query in Hadoop: distributed SQL
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
ON g.sample = s.sample
INNER JOIN dnase d
ON g.chr = d.chr
AND g.pos >= d.start
AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
ON g.chr = p.chr
AND g.pos = p.pos
AND g.ref = p.ref
AND g.alt = p.alt
WHERE
s.study = "framingham" AND
p.pos IS NULL AND
g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
apply predicates
“load” and join data
group-by
aggregate (UDAF)
45© Cloudera, Inc. All rights reserved.
ADAM preliminary performance
46© Cloudera, Inc. All rights reserved.
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your
software.
4. Making software free for commercial use shows you are not against
companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
“Myths of Bioinformatics Software”
47© Cloudera, Inc. All rights reserved.
48© Cloudera, Inc. All rights reserved.
Acknowledgements
UC Berkeley
Matt Massie
Frank Nothaft
Michael Heuer
Tamr
Timothy Danford
MSSM
Jeff Hammerbacher
Ryan Williams
Cloudera
Tom White
Sandy Ryza
49© Cloudera, Inc. All rights reserved.
Thank you
@laserson
laserson@cloudera.com

Weitere Àhnliche Inhalte

Was ist angesagt?

Implementing pseudo-keywords through Functional Programing
Implementing pseudo-keywords through Functional ProgramingImplementing pseudo-keywords through Functional Programing
Implementing pseudo-keywords through Functional ProgramingVincent Pradeilles
 
Bluetooth Beacon Tracking on a Budget
Bluetooth Beacon Tracking on a BudgetBluetooth Beacon Tracking on a Budget
Bluetooth Beacon Tracking on a BudgetBlaine Carter
 
Proxy OOP Pattern in PHP
Proxy OOP Pattern in PHPProxy OOP Pattern in PHP
Proxy OOP Pattern in PHPMarco Pivetta
 
System Calls
System CallsSystem Calls
System CallsDavid Evans
 
strace for Perl Mongers
strace for Perl Mongersstrace for Perl Mongers
strace for Perl MongersNaosuke Yokoe
 
Do we need Unsafe in Java?
Do we need Unsafe in Java?Do we need Unsafe in Java?
Do we need Unsafe in Java?Andrei Pangin
 
Make Sure Your Applications Crash
Make Sure Your  Applications CrashMake Sure Your  Applications Crash
Make Sure Your Applications CrashMoshe Zadka
 
Introduction to Kernel Programming
Introduction to Kernel ProgrammingIntroduction to Kernel Programming
Introduction to Kernel ProgrammingAhmed Mekkawy
 
ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†
ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†
ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†maruyama097
 
Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?
Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?
Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?tvaleev
 
Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5
Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5
Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5magoroku Yamamoto
 
Linux configer
Linux configerLinux configer
Linux configerMD. AL AMIN
 
The Ring programming language version 1.7 book - Part 12 of 196
The Ring programming language version 1.7 book - Part 12 of 196The Ring programming language version 1.7 book - Part 12 of 196
The Ring programming language version 1.7 book - Part 12 of 196Mahmoud Samir Fayed
 
SSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingSSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingDavid Evans
 
A comparison of apache spark supervised machine learning algorithms for dna s...
A comparison of apache spark supervised machine learning algorithms for dna s...A comparison of apache spark supervised machine learning algorithms for dna s...
A comparison of apache spark supervised machine learning algorithms for dna s...Valerio Morfino
 
Pokemon battle simulator (Java Program written on Blue J Editor)
Pokemon battle simulator (Java Program written on Blue J Editor)Pokemon battle simulator (Java Program written on Blue J Editor)
Pokemon battle simulator (Java Program written on Blue J Editor)Mahir Bathija
 
Replica Sets (NYC NoSQL Meetup)
Replica Sets (NYC NoSQL Meetup)Replica Sets (NYC NoSQL Meetup)
Replica Sets (NYC NoSQL Meetup)MongoDB
 
Python and sysadmin I
Python and sysadmin IPython and sysadmin I
Python and sysadmin IGuixing Bai
 
Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—…
Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—… Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—…
Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—… Akira Takahashi
 
Snakes for Camels
Snakes for CamelsSnakes for Camels
Snakes for Camelsmiquelruizm
 

Was ist angesagt? (20)

Implementing pseudo-keywords through Functional Programing
Implementing pseudo-keywords through Functional ProgramingImplementing pseudo-keywords through Functional Programing
Implementing pseudo-keywords through Functional Programing
 
Bluetooth Beacon Tracking on a Budget
Bluetooth Beacon Tracking on a BudgetBluetooth Beacon Tracking on a Budget
Bluetooth Beacon Tracking on a Budget
 
Proxy OOP Pattern in PHP
Proxy OOP Pattern in PHPProxy OOP Pattern in PHP
Proxy OOP Pattern in PHP
 
System Calls
System CallsSystem Calls
System Calls
 
strace for Perl Mongers
strace for Perl Mongersstrace for Perl Mongers
strace for Perl Mongers
 
Do we need Unsafe in Java?
Do we need Unsafe in Java?Do we need Unsafe in Java?
Do we need Unsafe in Java?
 
Make Sure Your Applications Crash
Make Sure Your  Applications CrashMake Sure Your  Applications Crash
Make Sure Your Applications Crash
 
Introduction to Kernel Programming
Introduction to Kernel ProgrammingIntroduction to Kernel Programming
Introduction to Kernel Programming
 
ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†
ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†
ă‚šăƒłă‚żăƒŒăƒ—ăƒ©ă‚€ă‚șăƒ»ă‚Żăƒ©ă‚Šăƒ‰ăš äžŠćˆ—ăƒ»ćˆ†æ•Łăƒ»éžćŒæœŸć‡Šç†
 
Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?
Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?
Joker 2015 - ВалДДĐČ ĐąĐ°ĐłĐžŃ€ - Đ§Ń‚ĐŸ жД ĐŒŃ‹ ĐžĐ·ĐŒĐ”Ń€ŃĐ”ĐŒ?
 
Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5
Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5
Unix v6 ă‚»ăƒŸăƒŠăƒŒ vol. 5
 
Linux configer
Linux configerLinux configer
Linux configer
 
The Ring programming language version 1.7 book - Part 12 of 196
The Ring programming language version 1.7 book - Part 12 of 196The Ring programming language version 1.7 book - Part 12 of 196
The Ring programming language version 1.7 book - Part 12 of 196
 
SSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and SchedulingSSL Failing, Sharing, and Scheduling
SSL Failing, Sharing, and Scheduling
 
A comparison of apache spark supervised machine learning algorithms for dna s...
A comparison of apache spark supervised machine learning algorithms for dna s...A comparison of apache spark supervised machine learning algorithms for dna s...
A comparison of apache spark supervised machine learning algorithms for dna s...
 
Pokemon battle simulator (Java Program written on Blue J Editor)
Pokemon battle simulator (Java Program written on Blue J Editor)Pokemon battle simulator (Java Program written on Blue J Editor)
Pokemon battle simulator (Java Program written on Blue J Editor)
 
Replica Sets (NYC NoSQL Meetup)
Replica Sets (NYC NoSQL Meetup)Replica Sets (NYC NoSQL Meetup)
Replica Sets (NYC NoSQL Meetup)
 
Python and sysadmin I
Python and sysadmin IPython and sysadmin I
Python and sysadmin I
 
Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—…
Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—… Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—…
Boostăƒ©ă‚€ăƒ–ăƒ©ăƒȘäž€ć‘šăźæ—…
 
Snakes for Camels
Snakes for CamelsSnakes for Camels
Snakes for Camels
 

Andere mochten auch

10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonJoe Stein
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogJoe Stein
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...Yahoo Developer Network
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 

Andere mochten auch (9)

10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
Hadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With PythonHadoop Streaming Tutorial With Python
Hadoop Streaming Tutorial With Python
 
Streaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit LogStreaming Processing with a Distributed Commit Log
Streaming Processing with a Distributed Commit Log
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Sparkℱ and Apache Ign...
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
A Multi Colored YARN
A Multi Colored YARNA Multi Colored YARN
A Multi Colored YARN
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 

Ähnlich wie Petascale Genomics (Strata Singapore 20151203)

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...Hakka Labs
 
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at ClouderaDataconomy Media
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing CoursePierre Lindenbaum
 
Nginx Scripting - Extending Nginx Functionalities with Lua
Nginx Scripting - Extending Nginx Functionalities with LuaNginx Scripting - Extending Nginx Functionalities with Lua
Nginx Scripting - Extending Nginx Functionalities with LuaTony Fabeen
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scriptingTony Fabeen
 
A brief introduction to PostgreSQL
A brief introduction to PostgreSQLA brief introduction to PostgreSQL
A brief introduction to PostgreSQLVu Hung Nguyen
 
Eff Plsql
Eff PlsqlEff Plsql
Eff Plsqlafa reg
 
2016ćčŽăźPerl (Long version)
2016ćčŽăźPerl (Long version)2016ćčŽăźPerl (Long version)
2016ćčŽăźPerl (Long version)charsbar
 
Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]
Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]
Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]Accumulo Summit
 
Easy R
Easy REasy R
Easy RAjay Ohri
 
Notes for SQLite3 Usage
Notes for SQLite3 UsageNotes for SQLite3 Usage
Notes for SQLite3 UsageWilliam Lee
 
11 Things About 11gr2
11 Things About 11gr211 Things About 11gr2
11 Things About 11gr2afa reg
 
Oracle Tracing
Oracle TracingOracle Tracing
Oracle TracingMerin Mathew
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)David Evans
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsCommand Prompt., Inc
 
Oracle RDBMS Workshop (Part1)
Oracle RDBMS Workshop (Part1)Oracle RDBMS Workshop (Part1)
Oracle RDBMS Workshop (Part1)Taras Lyuklyanchuk
 
Buenos Aires Drools Expert Presentation
Buenos Aires Drools Expert PresentationBuenos Aires Drools Expert Presentation
Buenos Aires Drools Expert PresentationMark Proctor
 

Ähnlich wie Petascale Genomics (Strata Singapore 20151203) (20)

DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
DataEngConf: Uri Laserson (Data Scientist, Cloudera) Scaling up Genomics with...
 
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
 
20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course20110114 Next Generation Sequencing Course
20110114 Next Generation Sequencing Course
 
Nginx Scripting - Extending Nginx Functionalities with Lua
Nginx Scripting - Extending Nginx Functionalities with LuaNginx Scripting - Extending Nginx Functionalities with Lua
Nginx Scripting - Extending Nginx Functionalities with Lua
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
A brief introduction to PostgreSQL
A brief introduction to PostgreSQLA brief introduction to PostgreSQL
A brief introduction to PostgreSQL
 
Eff Plsql
Eff PlsqlEff Plsql
Eff Plsql
 
2016ćčŽăźPerl (Long version)
2016ćčŽăźPerl (Long version)2016ćčŽăźPerl (Long version)
2016ćčŽăźPerl (Long version)
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]
Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]
Accumulo Summit 2015: Zookeeper, Accumulo, and You [Internals]
 
Easy R
Easy REasy R
Easy R
 
Notes for SQLite3 Usage
Notes for SQLite3 UsageNotes for SQLite3 Usage
Notes for SQLite3 Usage
 
11 Things About 11gr2
11 Things About 11gr211 Things About 11gr2
11 Things About 11gr2
 
Oracle Tracing
Oracle TracingOracle Tracing
Oracle Tracing
 
pm1
pm1pm1
pm1
 
Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)Putting a Fork in Fork (Linux Process and Memory Management)
Putting a Fork in Fork (Linux Process and Memory Management)
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
 
Learning Dtrace
Learning DtraceLearning Dtrace
Learning Dtrace
 
Oracle RDBMS Workshop (Part1)
Oracle RDBMS Workshop (Part1)Oracle RDBMS Workshop (Part1)
Oracle RDBMS Workshop (Part1)
 
Buenos Aires Drools Expert Presentation
Buenos Aires Drools Expert PresentationBuenos Aires Drools Expert Presentation
Buenos Aires Drools Expert Presentation
 

Mehr von Uri Laserson

Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic BiologyUri Laserson
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Uri Laserson
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesUri Laserson
 

Mehr von Uri Laserson (6)

Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 

KĂŒrzlich hochgeladen

Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...aartirawatdelhi
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...chandars293
 
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...Arohi Goyal
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Servicevidya singh
 
Vip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls AvailableVip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls AvailableNehru place Escorts
 
Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...
Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...
Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...narwatsonia7
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...astropune
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Call Girls in Nagpur High Profile
 
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Chandrapur Call girls 8617370543 Provides all area service COD available
Chandrapur Call girls 8617370543 Provides all area service COD availableChandrapur Call girls 8617370543 Provides all area service COD available
Chandrapur Call girls 8617370543 Provides all area service COD availableDipal Arora
 
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service KochiLow Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service KochiSuhani Kapoor
 
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...indiancallgirl4rent
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...jageshsingh5554
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...Taniya Sharma
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escortsvidya singh
 
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Call Girls Siliguri Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Siliguri Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 9907093804 Top Class Call Girl Service AvailableDipal Arora
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Dipal Arora
 

KĂŒrzlich hochgeladen (20)

Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Aurangabad Just Call 9907093804 Top Class Call Girl Service Available
 
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...
Night 7k to 12k Navi Mumbai Call Girl Photo 👉 BOOK NOW 9833363713 👈 ♀ night ...
 
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
The Most Attractive Hyderabad Call Girls Kothapet 𖠋 6297143586 𖠋 Will You Mis...
 
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
All Time Service Available Call Girls Marine Drive 📳 9820252231 For 18+ VIP C...
 
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort ServicePremium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
Premium Call Girls Cottonpet Whatsapp 7001035870 Independent Escort Service
 
Vip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls AvailableVip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls Available
Vip Call Girls Anna Salai Chennai 👉 8250192130 âŁïžđŸ’Ż Top Class Girls Available
 
Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...
Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...
Top Rated Bangalore Call Girls Mg Road ⟟ 8250192130 ⟟ Call Me For Genuine Sex...
 
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
Best Rate (Hyderabad) Call Girls Jahanuma ⟟ 8250192130 ⟟ High Class Call Girl...
 
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
Book Paid Powai Call Girls Mumbai 𖠋 9930245274 𖠋Low Budget Full Independent H...
 
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Jabalpur Just Call 9907093804 Top Class Call Girl Service Available
 
Chandrapur Call girls 8617370543 Provides all area service COD available
Chandrapur Call girls 8617370543 Provides all area service COD availableChandrapur Call girls 8617370543 Provides all area service COD available
Chandrapur Call girls 8617370543 Provides all area service COD available
 
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service KochiLow Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
Low Rate Call Girls Kochi Anika 8250192130 Independent Escort Service Kochi
 
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Ooty Just Call 9907093804 Top Class Call Girl Service Available
 
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
(Rocky) Jaipur Call Girl - 9521753030 Escorts Service 50% Off with Cash ON De...
 
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
VIP Service Call Girls Sindhi Colony 📳 7877925207 For 18+ VIP Call Girl At Th...
 
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
(👑VVIP ISHAAN ) Russian Call Girls Service Navi Mumbai🖕9920874524🖕Independent...
 
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore EscortsCall Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
Call Girls Horamavu WhatsApp Number 7001035870 Meeting With Bangalore Escorts
 
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Coimbatore Just Call 9907093804 Top Class Call Girl Service Available
 
Call Girls Siliguri Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 9907093804 Top Class Call Girl Service AvailableCall Girls Siliguri Just Call 9907093804 Top Class Call Girl Service Available
Call Girls Siliguri Just Call 9907093804 Top Class Call Girl Service Available
 
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
Best Rate (Guwahati ) Call Girls Guwahati ⟟ 8617370543 ⟟ High Class Call Girl...
 

Petascale Genomics (Strata Singapore 20151203)

  • 1. 1© Cloudera, Inc. All rights reserved. Scaling Up Genomics with Hadoop and Spark Uri Laserson | @laserson | 14 November 2015 Petascale Genomics
  • 2. 2© Cloudera, Inc. All rights reserved. We come in peace. Pioneer plaque
  • 3. 3© Cloudera, Inc. All rights reserved. What is genomics?
  • 4. 4© Cloudera, Inc. All rights reserved. Organism
  • 5. 5© Cloudera, Inc. All rights reserved. Organism Cell
  • 6. 6© Cloudera, Inc. All rights reserved. Organism Cell Genome
  • 7. 7© Cloudera, Inc. All rights reserved.
  • 8. 8© Cloudera, Inc. All rights reserved.
  • 9. 9© Cloudera, Inc. All rights reserved. Reference chromosome
  • 10. 10© Cloudera, Inc. All rights reserved. Reference chromosome Location
  • 11. 11© Cloudera, Inc. All rights reserved. “
 decoding the Book of Life”
  • 12. 12© Cloudera, Inc. All rights reserved. ...atatggaaccaaaaaagagcccgcatcgccaaggcaatcctaagccaaaagaacaaagctggaggcatcacactacctgacttcaaactatactaca agcctacagtaaccaaaacagcatggtactggtaccaaaacagagatatagatcaatggaacagaacagagccctcagaaataacgccgcatatctacaa ctatctgatctttgacgaacctgagaaaaacaagcaatggggaaaggattccctatttaataaatggtgctgggaaaactggctagccatatgtagaaag ctgaaactggatcccttccttacaccttatacaaaaatcaattcaagatggattaaagacttaaacgttagacctaaaaccataaaaaccctagaagaaa acctaggcagtaccattcaggacataggcatgggcaaggacttcatgtccaaaacaccaaaagcaatggcaacaaaagacaaaattgacaaatgggatct aattaaactaaagagcttctgcacagcaaaagaaactaccatcagagtgaacaggaaacctacaaaatgggagaaaattttcgcaacctactcatctgac aaagggctaatatccagaatctacaatgaactcaaacaaatttacaagaaaaaaacaaacaaccccatcaaaaagtgggcaaaggacatgaacagacact tctcaaatgaagacatttatgcagccaaaaaacacatgaaaaaatgctcatcatcactggccatcagagaaatgcaaatcaaaaccacaatgagatacca tctcacaccagttagaatggcaatcattaaaaagtcaggaaacaacaggtgctggagaggatgtggagaaataggaacacttttacactgttggtgggac tgtaaactagttcaaccattgtggaagtcagtgtggtgattcctcagggatctagaactagaaataccatttgacccagccatcccattactgggtatat acccaaaggactataaatcatgctgctataaagacacatgcacacgtatgtttattgcggcattattcacaatagcaaagacttggaaccaacccaaatg tccaacaatgataaactggattaagaaaatgtggcacatatacaccatggaatactctgcagccataaaaaaggatgagttcatgtcctttgtagggaca tggatgaaattggaaatcatcattctcagtaaactatcgcaagaataaaaaaccaaacaccgcatattctcactcataggtgggaattgaacaatgagat cacatggacacaggaagaggaatatcacactctggggactgtggtggggtggggggaggggggagggatagcattgggagatatacctaatgctagatga cgagttagtgggtgcagcgcaccagcatggcacatgtatacatatgtaactaacctgcacattgtgcacatgtaccctaaaacttaaagtataataaaaa aataaaaaaaataaagtgtgtgtgtgtatgactttaattaacttgatcacccacacacacacaaacactgaccaaaattaatatcaagtcaggtctgtct gaatgtaaagccaacagcaaacatccctctctccaaatggaaaagaaacagggggttatgggcagctacactgctaaatgttaaaactttatttttaaat gtggccataaaaatcactaaataaaattgataatatatgtttttgatgaataaattttatatatgtctacactggaaactatatagcaataaaaactaac catgtacaactaaactcataaatttcataaacataataagtaaaagaagccagacaaaaagtagtgtatactgttaaattccatttatataaaagttcaa aaaagccaaaaagaaactatgctgttaaaagtaaggattatagttactattcagggaagagagtagtggctggaaagaaacataaagggggtctctgaag tggaataatgttctgttttttgatctgggtattagggtgtttaatttcggaaaattattttatctttatacttattgtattattgattttttgcttaaca aattactcaaaacttagaggtttaaaaaaaattaattattgtattaatttctctgggccaggaattggagagagcttagctgggtagttctggttcaaaa tttctcatgagattaccgtcaagctgttggagggggctgcatcatctgaaggcttgaccgaggctagaggatctactttcaagatggcccactcacatgg ctgttggcaagaagtttcagtttctcactagcttctagcaggaggccataatttctcaccacatagatctctctatagggctactcgagtgtcctcacag caaggtagctggctttcttcagagccaagtgactcaaaggcaaagaggaagtcactatgccatttatgacctagttttggaactcacactttgttccgaa ttgaccttccatcactttctagtcattaggatttaagtcactaactctgatccatagtcaaggggagtaaaatttggctttattgttggaggatggagta gcaaagaatttgttgacacattttaaaactaccatacttaaacagttcatttttctgaatatgcttcaattagaagttaaaatgatgcaattttaaaaca ttgtttcaaatgaacactgttagggagagaagtgcttcttctccatatctaatgtttcttccatatttagggagttccattagtttaacactttaag...
  • 13. 13© Cloudera, Inc. All rights reserved.
  • 14. 14© Cloudera, Inc. All rights reserved.
  • 15. 15© Cloudera, Inc. All rights reserved.
  • 16. 16© Cloudera, Inc. All rights reserved.
  • 17. 17© Cloudera, Inc. All rights reserved.
  • 18. 18© Cloudera, Inc. All rights reserved. >read1 TTGGACATTTCGGGGTCTCAGATT >read2 AATGTTGTTAGAGATCCGGGATTT >read3 GGATTCCCCGCCGTTTGAGAGCCT >read4 AGGTTGGTACCGCGAAAAGCGCAT
  • 19. 19© Cloudera, Inc. All rights reserved. >read1 TTGGACATTTCGGGGTCTCAGATT >read2 AATGTTGTTAGAGATCCGGGATTT >read3 GGATTCCCCGCCGTTTGAGAGCCT >read4 AGGTTGGTACCGCGAAAAGCGCAT Bioinformatics!
  • 20. 20© Cloudera, Inc. All rights reserved. >read1 TTGGACATTTCGGGGTCTCAGATT >read2 AATGTTGTTAGAGATCCGGGATTT >read3 GGATTCCCCGCCGTTTGAGAGCCT >read4 AGGTTGGTACCGCGAAAAGCGCAT Bioinformatics!
  • 21. 21© Cloudera, Inc. All rights reserved. Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Pipelines!
  • 22. 22© Cloudera, Inc. All rights reserved. ##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 Compressed text files (non-splittable) Semi-structured Poorly specified
  • 23. 23© Cloudera, Inc. All rights reserved. ##fileformat=VCFv4.1 ##fileDate=20090805 ##source=myImputationProgramV3.1 ##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta ##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x> ##phasing=partial ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129"> ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership"> ##FILTER=<ID=q10,Description="Quality below 10"> ##FILTER=<ID=s50,Description="Less than 50% of samples have data"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth"> ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality"> #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4 20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2 Compressed text files (non-splittable) Semi-structured Poorly specified Global sort order
  • 24. 24© Cloudera, Inc. All rights reserved. C HPC (scheduler) POSIX filesystem Java HPC (Queue) POSIX filesystem C++ Single-node SQLite It’s file formats all the way down!
  • 25. 25© Cloudera, Inc. All rights reserved. Dedup
  • 26. 26© Cloudera, Inc. All rights reserved.
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);
    final SAMFileWriter out = new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);
    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
Method • Code
  • 27. 27© Cloudera, Inc. All rights reserved. (Same MarkDuplicates doWork() code as the previous slide.) Method • Code
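For contrast with the single-node Picard loop above, the same idea can be phrased as data-parallel transformations. The sketch below is not ADAM's actual implementation, just a hypothetical illustration: the Read case class, the (contig, 5' position, strand) duplicate key, and the base-quality tiebreak are simplifying assumptions, and it assumes a spark-shell session where sc is in scope.

import org.apache.spark.rdd.RDD

case class Read(name: String, contig: String, fivePrimePos: Long,
                negativeStrand: Boolean, baseQualSum: Int, duplicate: Boolean = false)

def markDuplicates(reads: RDD[Read]): RDD[Read] =
  reads
    .keyBy(r => (r.contig, r.fivePrimePos, r.negativeStrand))   // duplicate "signature"
    .groupByKey()                                               // one shuffle, no global sort required
    .flatMap { case (_, group) =>
      val best = group.maxBy(_.baseQualSum)                     // keep the highest-quality read per signature
      group.map(r => r.copy(duplicate = !(r eq best)))          // flag the rest as duplicates
    }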
  • 28. 28© Cloudera, Inc. All rights reserved.
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
  • 29. 29© Cloudera, Inc. All rights reserved. (Same MAX_FILE_HANDLES_FOR_READ_ENDS_MAP option as the previous slide.) Dedup • Method/Algo • Code • Platform
  • 30. 30© Cloudera, Inc. All rights reserved. Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation
  • 31. 31© Cloudera, Inc. All rights reserved. It’s pipelines all the way down! Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation
  • 32. 32© Cloudera, Inc. All rights reserved. It’s pipelines all the way down! Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Node 1 Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Node 2 Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Node 3
  • 33. 33© Cloudera, Inc. All rights reserved. Manually running pipelines on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
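For comparison only (not part of the deck): the same split / per-chunk aggregate / merge pattern sketched as a single Spark job, so that splitting, scheduling, retries, and the final merge are handled by the framework instead of by hand. The paths, the genotype parsing, and the allele-frequency arithmetic are placeholders, and a spark-shell session with sc in scope is assumed.

val genotypes = sc.textFile("hdfs:///path/to/genotypes_*.vcf")     // splitting handled by the framework
val altCounts = genotypes
  .filter(!_.startsWith("#"))                                      // drop VCF header lines
  .map { line =>
    val f = line.split("\t")
    val variant = (f(0), f(1), f(3), f(4))                         // (chr, pos, ref, alt)
    val calls = f.drop(9).map(_.split(":")(0))                     // GT field per sample, e.g. "0|1"
    val alt = calls.map(_.count(c => c != '0' && c != '|' && c != '/' && c != '.')).sum
    val total = calls.map(_.count(c => c != '|' && c != '/' && c != '.')).sum
    (variant, (alt, total))                                        // simplified: single-digit allele indices only
  }
  .reduceByKey { case ((a1, t1), (a2, t2)) => (a1 + a2, t1 + t2) } // the "merge" step
val maf = altCounts.mapValues { case (alt, total) =>
  if (total == 0) 0.0 else alt.toDouble / total
}
maf.saveAsTextFile("hdfs:///path/to/maf_output")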
  • 34. 34© Cloudera, Inc. All rights reserved.
  • 35. 35© Cloudera, Inc. All rights reserved. Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Alignment Dedup Recalibrate QC/Filter Alignment Dedup Recalibrate QC/Filter
  • 36. 36© Cloudera, Inc. All rights reserved. Node 1 Alignment Dedup Recalibrate QC/Filter Variant Calling Variant Annotation Node 2 Node 3 Alignment Dedup Recalibrate QC/Filter Alignment Dedup Recalibrate QC/Filter Node 4
  • 37. 37© Cloudera, Inc. All rights reserved. Node 1 Alignment Dedup QC/Filter Variant Calling Variant Annotation Node 2 Node 3 Alignment Dedup QC/Filter Alignment Dedup QC/Filter Node 4 Recalibrate
  • 38. 38© Cloudera, Inc. All rights reserved. Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model
• Improvements over existing formats: ~20% for BAM, ~90% for VCF
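A minimal sketch of the idea on the slide: define the record once and let Parquet derive a compressed, splittable, columnar layout from it. This sketch uses a hand-rolled case class with Spark SQL rather than the Avro IDL that bdg-formats actually uses; GenotypeRecord, its fields, and the paths are illustrative only, and a spark-shell session with a SparkSession named spark is assumed. The sample values echo the VCF example shown earlier.

import spark.implicits._

case class GenotypeRecord(sampleId: String, contig: String, position: Long,
                          ref: String, alt: String, genotype: String, quality: Int)

val records = Seq(
  GenotypeRecord("NA00001", "20", 14370L, "G", "A", "0|0", 48),
  GenotypeRecord("NA00002", "20", 14370L, "G", "A", "1|0", 48)
).toDS()

records.write.mode("overwrite").parquet("/tmp/genotypes.parquet")  // columnar, compressed, splittable
val back = spark.read.parquet("/tmp/genotypes.parquet")
back.filter($"position" === 14370L).show()                          // predicate pushdown on a typed column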
  • 39. 39© Cloudera, Inc. All rights reserved. ContEst algorithm on a YARN-managed Hadoop cluster: Spark executors each compute a partial sum $\sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$; the driver (application code) combines them into $\sum_{i=1}^{N} \sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$.
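The data flow on the slide (executors compute per-read terms, partial sums are combined at the driver) maps directly onto Spark's aggregation primitives. The sketch below is a generic illustration of that pattern, not the ContEst code itself; Observation and logLikelihood are stand-ins for the P(b_ij | e_ij, f_i) terms, and a spark-shell session with sc in scope is assumed.

case class Observation(base: Char, errorRate: Double, popAlleleFreq: Double)

// placeholder model standing in for P(b_ij | e_ij, f_i): mixture of "clean" and "contaminant" reads
def logLikelihood(o: Observation, contamination: Double): Double = {
  val pContaminant = contamination * o.popAlleleFreq
  val pClean = (1.0 - contamination) * (1.0 - o.errorRate)
  math.log(pClean + pContaminant)
}

val observations = sc.parallelize(Seq(
  Observation('A', 0.01, 0.32), Observation('G', 0.02, 0.32), Observation('A', 0.01, 0.32)))

// Each executor sums the terms for its partitions; Spark combines the partial sums for the driver.
val c = 0.05
val totalLogLik = observations
  .map(o => logLikelihood(o, c))
  .treeAggregate(0.0)(_ + _, _ + _)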
  • 40. 40© Cloudera, Inc. All rights reserved. Hadoop provides layered abstractions for data processing:
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce • Impala (SQL) • Solr (search) • Spark
ADAM • quince • guacamole • …
bdg-formats (Avro/Parquet)
  • 41. 41© Cloudera, Inc. All rights reserved. Spark + Genomics = ADAM • Hosted at Berkeley and the AMPLab • Apache 2 License • Contributors from both research and commercial organizations • Core spatial primitives, variant calling • Avro and Parquet for data models and file formats
  • 42. 42© Cloudera, Inc. All rights reserved. Core Genomics Primitives: Spatial Join
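To illustrate what a spatial (region) join primitive does, here is a toy broadcast-based overlap join: every record is paired with the features whose genomic interval overlaps it. ADAM's RegionJoin partitions both sides rather than broadcasting; this sketch assumes a spark-shell session with sc in scope, reuses variant positions from the earlier VCF example, and the DNase intervals are made up.

case class Region(contig: String, start: Long, end: Long)   // half-open interval [start, end)

def overlaps(a: Region, b: Region): Boolean =
  a.contig == b.contig && a.start < b.end && b.start < a.end

val genotypes = sc.parallelize(Seq(
  (Region("20", 14370L, 14371L), "rs6054257"),
  (Region("20", 1110696L, 1110697L), "rs6040355")))

val dnase = Seq(Region("20", 14000L, 15000L), Region("20", 2000000L, 2100000L))
val dnaseB = sc.broadcast(dnase)                             // small feature set shipped to every executor

val joined = genotypes.flatMap { case (region, name) =>
  dnaseB.value.filter(overlaps(region, _)).map(feature => (name, feature))
}
joined.collect().foreach(println)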
  • 43. 43© Cloudera, Inc. All rights reserved. Executing query in Hadoop: interactive Spark shell (ADAM)
def inDbSnp(g: Genotype): Boolean = true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()    // load data
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

val filteredRDD = genotypesRDD                                              // apply predicates
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)      // join data

val maf = joinedRDD                                                         // group-by, aggregate (MAF)
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

maf.saveAsNewAPIHadoopFile("path/to/output")                                // persist data
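The computeMAF step in the query above is left abstract on the slide. One hedged guess at what such a helper could look like, assuming the grouped values have already been projected down to genotype-call strings like "0|1":

// Hypothetical body for the computeMAF step; the slide leaves it abstract.
// Counts alternate alleles across the calls grouped under one (variant, population) key.
def computeMAF[K](grouped: (K, Iterable[String])): (K, Double) = {
  val (key, calls) = grouped                        // calls like "0|1", "1/1", "0|0"
  val alleles = calls.toSeq.flatMap(_.split("[|/]"))
  val altCount = alleles.count(a => a != "0" && a != ".")
  val freq = if (alleles.isEmpty) 0.0 else altCount.toDouble / alleles.size
  (key, math.min(freq, 1.0 - freq))                 // "minor" allele frequency
}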
  • 44. 44© Cloudera, Inc. All rights reserved. Executing query in Hadoop: distributed SQL
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)            -- group-by, aggregate (UDAF)
FROM genotypes g                                                 -- "load" and join data
INNER JOIN samples s
  ON g.sample = s.sample
INNER JOIN dnase d
  ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
  ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt
WHERE s.study = "framingham"                                     -- apply predicates
  AND p.pos IS NULL
  AND g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
  • 45. 45© Cloudera, Inc. All rights reserved. ADAM preliminary performance
  • 46. 46© Cloudera, Inc. All rights reserved. 1. Somebody will build on your code 2. You should have assembled a team to build your software 3. If you choose the right license, more people will use and build on your software. 4. Making software free for commercial use shows you are not against companies. 5. You should maintain your software indefinitely 6. Your “stable URL” can exist forever 7. You should make your software “idiot proof” 8. You used the right programming language for the task. Lior Pachter https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/ “Myths of Bioinformatics Software”
  • 47. 47© Cloudera, Inc. All rights reserved.
  • 48. 48© Cloudera, Inc. All rights reserved. Acknowledgements. UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer. Tamr: Timothy Danford. MSSM: Jeff Hammerbacher, Ryan Williams. Cloudera: Tom White, Sandy Ryza.
  • 49. 49© Cloudera, Inc. All rights reserved. Thank you @laserson laserson@cloudera.com

Editor's Notes

  1. Before we dive in, let me ask a couple of questions: Biologists? Spark experts? There are always at least three different constituencies in the room: biologists, programmers, and someone thinking about how to build a business around this. Gonna tell you a lot of lies today. Won’t satisfy everyone. Where I skip over the truth, maybe there will be at least a breadcrumb of truth left over. This will not be a very technical talk.
  2. Scared/pissed off some bio people in the past. Bioinformatics is a field with a long history, thirty or more years as a separate discipline. At the same time, the fundamental technology is changing. So if I talk about ‘problems of bioinformatics’ today, it’s OK because WE COME IN PEACE! Bioinformatics software development has been *remarkably* effective, for decades. If there are problems to be solved, these are the result of new technologies, new ambitions of scale.
  3. What even is genomics? Who here has heard the terms ‘chromosome’ and ‘gene’ before, and could explain the difference? So before we dive into the main part of the talk, I’m going to spend a few minutes discussing some of the basic biological concepts.
  4. Fundamentally, we’re interested in studying individuals (and populations of individuals) [ADVANCE] But each individual is actually a population: of cells [ADVANCE] But each of those cells has, ideally, an identical genome. The genome is a collection of 23 linear molecules. These are called ‘polymers,’ they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  5. Fundamentally, we’re interested in studying individuals (and populations of individuals) [ADVANCE] But each individual is actually a population: of cells [ADVANCE] But each of those cells has, ideally, an identical genome. The genome is a collection of 23 linear molecules. These are called ‘polymers,’ they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  6. Fundamentally, we’re interested in studying individuals (and populations of individuals) [ADVANCE] But each individual is actually a population: of cells [ADVANCE] But each of those cells has, ideally, an identical genome. The genome is a collection of 23 linear molecules. These are called ‘polymers,’ they’re built (like Legos) out of a small number of repeated interlocking parts – these are the A, T, G, and C you’ve probably heard about. The content of the genome is determined by the linear order in which these letters are arranged. (Linear is important!)
  7. Without losing much, assume that our genomes are contained on just a single chromosome. Now, not only do all the cells in your body have identical genomes… [ADVANCE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means… [ADVANCE] That we can define a ‘base’ or a ‘reference’ chromosome. Now that there is a reference that all of us adhere to… [ADVANCE] We can define a concept of ‘location’ across chromosomes. This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This also means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  8. Without losing much, assume that our genomes are contained on just a single chromosome. Now, not only do all the cells in your body have identical genomes… [ADVANCE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means… [ADVANCE] That we can define a ‘base’ or a ‘reference’ chromosome. Now that there is a reference that all of us adhere to… [ADVANCE] We can define a concept of ‘location’ across chromosomes. This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This also means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  9. Without losing much, assume that our genomes are contained on just a single chromosome. Now, not only do all the cells in your body have identical genomes… [ADVANCE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means… [ADVANCE] That we can define a ‘base’ or a ‘reference’ chromosome. Now that there is a reference that all of us adhere to… [ADVANCE] We can define a concept of ‘location’ across chromosomes. This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This also means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  10. Without losing much, assume that our genomes are contained on just a single chromosome. Now, not only do all the cells in your body have identical genomes… [ADVANCE] But individual humans have genomes that are very similar to each other. So similar that I can define “the same” chromosome between individuals… and that means… [ADVANCE] That we can define a ‘base’ or a ‘reference’ chromosome. Now that there is a reference that all of us adhere to… [ADVANCE] We can define a concept of ‘location’ across chromosomes. This is possibly the most important concept in genome informatics, the idea that DNA defines a common linear coordinate system. This also means that we can talk about differences between individuals in terms of diffs to a common reference genome. But where does this reference genome come from?
  11. Here is Bill Clinton (and Craig Venter and Francis Collins), announcing in June of 2000 the “rough draft” of the Human Genome – this is the Human Genome Project. Took >10 years and $2 billion What did this actually do?
  12. An ASCII text file with a linear sequence of 3 billion ACGTs. This is the reference. Now go cure cancer. If this looks uninterpretable, it is!
  13. Anyone recognize this? Want to make an analogy. Difficult to understand. How do I make it more comprehensible?
  14. Mapmakers work to add ANNOTATIONS to the map. Annotations are keyed by geo coordinates. Points, lines, and polygons in 2d space
  15. And often, it’s only the annotations that are interesting, so mapmakers focus on *annotation* of the maps themselves. The core technologies are 2D planar and spherical geometry, geometric operations composed out of latitudes and longitudes. This is what we want to do for the genome. What does the annotated map of the genome look like?
  16. Chromosome on top. Highlighted red portion is what we’re zoomed in on. See the scale: total of about 600,000 bases (ACGTs) arranged from left to right. Multiple annotation “tracks” are overlaid on the genome sequence, marking functional elements, positions of observed human differences, similarity to other animals. In part it’s the product of numerous additional large biology annotation projects (e.g., HapMap project, 1000 Genomes, ENCODE). Lots of bioinformatics is computing these elements, or evaluating models on top of the elements. How are these annotations actually generated? Shift gears and talk about the technology.
  17. DNA SEQUENCING. If satellites provide images of the world for cartography, sequencers are the microscopes that give you “images” of the genome. Over the past decade, massive EXPONENTIAL increase in throughput (much faster than Moore’s law).
  18. Get sample. Extract DNA (possibly other manipulations). Dump into sequencer. Spits out text file (actually looks just like that). But how to get from the text file to an annotation track that reconstructs a genome or shows position of certain functional elements? [ADVANCE] Bioinformatics is the computational process to reconstruct the genomic information. But… [ADVANCE] Often considered simply a black box. What does it actually look like inside?
  19. Get sample. Extract DNA (possibly other manipulations). Dump into sequencer. Spits out text file (actually looks just like that). But how to get from the text file to an annotation track that reconstructs a genome or shows position of certain functional elements? [ADVANCE] Bioinformatics is the computational process to reconstruct the genomic information. But… [ADVANCE] Often considered simply a black box. What does it actually look like inside?
  20. Get sample. Extract DNA (possibly other manipulations). Dump into sequencer. Spits out text file (actually looks just like that). But how to get from the text file to an annotation track that reconstructs a genome or shows position of certain functional elements? [ADVANCE] Bioinformatics is the computational process to reconstruct the genomic information. But… [ADVANCE] Often considered simply a black box. What does it actually look like inside?
  21. Pipelines, of course. Example pipeline: raw sequencing data => a single individual’s “diff” from the reference. How are these typically structured? Each step is typically written as a standalone program, passing files from stage to stage. These are written as part of a globally-distributed research program, by researchers and grad students around the world, who have to assume the lowest common denominator: command line and filesystem. This has important implications for scalability. What does one of these files look like?
  22. Text is highly inefficient: compresses poorly; values must be parsed. Text is semi-structured: flexible schemas make parsing difficult; difficult to make assumptions on data structure. Text poorly separates the roles of delimiters and data: requires escaping of control characters (ASCII actually includes RS 0x1E and FS 0x1F, but they’re never used).
  23. Imposes severe constraint: global sort invariant. => Many impls depend on this, even if it’s not necessary or conducive to distributed computing.
  24. Bioinformaticians LOVE hand-coded file formats. But only store several fundamental data types. Strong assumptions in the formats. Inconsistent implementations in multiple languages. Doesn’t allow different storage backends. OK, we discussed what the data/files are like that are passed around. What about the computation itself?
  25. Let’s take one of the transformations in the pipeline. Basically a more complex version of a DISTINCT operation.
  26. Actual code from the standard Picard implementation of MarkDuplicates. Two things are going on: the algorithm/method overall, and the actual code implementation. Start by building some data structures from the input files. Then iterate over the file and rewrite it as necessary.
  27. But what if we jump into one of these functions? You’ll find a dependence on… [ADVANCE]
  28. An input option related to Unix file handle limits? WTF? Why should this METHOD need to know anything about the platform that this is running on? LEAKY ABSTRACTIONS
  29. Most bioinformatics tools make strong assumptions about their environments, and also the structure of the data (e.g., global sort), when it shouldn’t be necessary. Ok, but that’s not all… [ADVANCE]
  30. We’ve looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual. But of course, it’s never one pipeline… [ADVANCE] It’s a pipeline per person! But since each pipeline runs (essentially) serially, scaling it up is easy… [ADVANCE] Scale out! Typically managed with a pretty low-level job scheduler.
  31. We’ve looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual. But of course, it’s never one pipeline… [ADVANCE] It’s a pipeline per person! But since each pipeline runs (essentially) serially, scaling it up is easy… [ADVANCE] Scale out! Typically managed with a pretty low-level job scheduler.
  32. We’ve looked at the data and a bit of code for one of these tools. But this runs the pipeline on a single individual. But of course, it’s never one pipeline… [ADVANCE] It’s a pipeline per person! But since each pipeline runs (essentially) serially, scaling it up is easy… [ADVANCE] Scale out! Typically managed with a pretty low-level job scheduler.
  33. MANUAL split and merge. MANUAL resource requests. BABYSIT for failures/errors. CUSTOM intermediate ser/de. But this basically works and the parallelism is pretty simple. This architecture has kept up with the pace of sequencing for some time now. Pipelines. Managed by job schedulers. Passing files around. SO WHY AM I EVEN UP HERE TALKING? Two reasons…

  34. SCALE! New levels of ambition for large biology projects. 100k genomes at Genomics England in collaboration with the National Health Service. Raw data for a single individual can be in the hundreds of GB. But even before we hit that huge scale (which is soon)…

  35. For the latest algorithms, we don’t want to analyze each sample separately. We want to use ALL THE DATA we generate. Well, these pipelines often include lots of aggregation, perhaps we can just… [ADVANCE] Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer) and the number of files increases (we saw file handles). May start hitting the cracks. But even worse… [ADVANCE] God help you if you want to jointly use all the data in an earlier part of the pipeline. Two problems: large scale, and using all data simultaneously.
  36. For the latest algorithms, we don’t want to analyze each sample separately. We want to use ALL THE DATA we generate. Well, these pipelines often include lots of aggregation, perhaps we can just… [ADVANCE] Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer) and the number of files increases (we saw file handles). May start hitting the cracks. But even worse… [ADVANCE] God help you if you want to jointly use all the data in an earlier part of the pipeline. Two problems: large scale, and using all data simultaneously.
  37. For the latest algorithms, we don’t want to analyze each sample separately. We want to use ALL THE DATA we generate. Well, these pipelines often include lots of aggregation, perhaps we can just… [ADVANCE] Do the easy thing! Not ideal, especially as the amount of data goes up (data transfer) and the number of files increases (we saw file handles). May start hitting the cracks. But even worse… [ADVANCE] God help you if you want to jointly use all the data in an earlier part of the pipeline. Two problems: large scale, and using all data simultaneously. How do we solve these problems?
  38. Things like global sort order are overly restrictive and lead to algorithms relying on them when it’s not necessary.
  39. A lot of the problems go away with a tool like Spark. Example of an algorithm. Bioinformatics loves evaluating probabilistic models on the genome annotations. We can easily extract parallelism at different parts of our pipelines. In the simplest possible language, we can describe a high-level computation. Use higher-level distributed computing primitives and let the system figure out all the platform issues for you: storage, job scheduling, fault tolerance, shuffles, serde.
  40. Layered abstractions. Use multiple storage engines with different characteristics. Multiple execution engines. Avro ties it all together. Application code/algorithms should only touch the top of the abstraction layer. Cheap, scalable STORAGE at the bottom; resource management in the middle; EXECUTION engines that can run your code on the cluster and provide parallelism; a consistent SERIALIZATION framework. The scientist should NOT WORRY about lower levels (coordination, file formats, storage details, fault tolerance).
  41. We’ve implemented this vision with Spark, starting from the AMPLab (the same people that gave you Spark), in a project called ADAM. The reason this works is that Spark naturally handles pipelines and automatically performs shuffles when appropriate, but also…

  42. In addition to some of the standard pipeline transformations, implemented the core spatial join operations (analogous to a geospatial library).
  43. Another computation for a statistical aggregate on genome variant data. Details not important. Spark data flow: distributed data load; high-level joins/spatial computations that are parallelized as necessary. But the really nice thing is that, because our data is stored using the Avro data model… [ADVANCE]
  44. You can execute the exact same computation using, for example, SQL! Pick the best tool for the job.
  45. Single-node performance improvements. Free scalability: fixed price, significant wall-clock improvements. See the most recent SIGMOD paper.
  46. Controversial, and I disagree with many of them. #8 is similar to assuming a primitive lowest common denominator. Especially for the last “myth”, being able to achieve the ambition that people are proposing will require moving beyond “anything is OK” to making some important technical decisions.
  47. Not to be outdone, Craig Venter proposes 1 million genomes at Human Longevity Inc.
  48. Cloudera is hiring. Including the data science team.