Karen Feng, Databricks
Enabling Biobank-Scale Genomic Processing with Spark SQL
#UnifiedDataAnalytics #SparkAISummit
Agenda
• Genomics overview
– Big data problem
– Real-world applications
– Pain points at biobank scale
• Glow
– Datasources
– Built-in functions
– Extensibility
Genomics is a big data problem
• Sequencing cost per human genome: from $2.7B to <$1,000
  https://www.genome.gov/27541954/dna-sequencing-costs-data/
• Data volume: 40,000 petabytes / year by 2025
  https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
The power of big genomic data
Accelerate Target Discovery
• Goal: identify a biological target (e.g. a protein) that can be modulated with a drug
• Approach: large-scale regressions to correlate DNA variants with the trait
• Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA
  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/
[Figure: orthosteric inhibition]
The power of big genomic data
• Accelerate Target Discovery
• Reduce Costs via Precision Prevention
• Improve Survival with Optimized Treatment
Genomic analysis on big data is hard!
• Existing tools are often difficult to
– Scale
– Learn
– Integrate
for i in chr2L chr2R chr3L chr3R chr4 chrX; do
  GenomeAnalysisTK -R ${ref_seq} \
    -T SelectVariants \
    -V my_flies.vcf \
    -L $i \
    -o my_flies.${i}.vcf
done
for i in {1..16}; do
  vcftools --vcf VCF_FILE --chr $i \
    --recode --recode-INFO-all --out VCF_$i
done
bgzip -c myvcf.vcf > myvcf.vcf.gz
tabix -p vcf myvcf.vcf.gz
tabix myvcf.vcf.gz chr1 > chr1.vcf
java -jar SnpSift.jar split file.vcf
838 results
Data management
--make-bed
--recode
--output-chr
--zero-cluster
--split-x/--merge-x
--set-me-missing
--fill-missing-a2
--set-missing-var-ids
--update-map...
--update-ids...
--flip
--flip-scan
--keep-allele-order...
--indiv-sort
--write-covar...
--{,b}merge...
Merge failures
VCF reference merge
--merge-list
--write-snplist
--list-duplicate-vars
Basic statistics
--freq{,x}
--missing
--test-mishap
--hardy
--mendel
--het/--ibc
--check-sex/--impute-sex
--fst
Linkage disequilibrium
--indep...
--r/--r2
--show-tags
--blocks
Distance matrices
Identity-by-state/Hamming
(--distance...)
Relationship/covariance
(--make-grm-bin...)
--rel-cutoff
Distance-pheno. analysis
(--ibs-test...)
Identity-by-descent
--genome
--homozyg...
Population stratification
--cluster
--pca
--mds-plot
--neighbour
Association analysis
Basic case/control
(--assoc, --model)
Stratified case/control
(--mh, --mh2, --homog)
Quantitative trait
(--assoc, --gxe)
Regression w/ covariates
(--linear, --logistic)
--dosage
--lasso
--test-missing
Monte Carlo permutation
Set-based tests
REML additive heritability
Family-based association
--tdt
--dfam
--qfam...
--tucc
Report postprocessing
--annotate
--clump
--gene-report
--meta-analysis
Epistasis
--fast-epistasis
--epistasis
--twolocus
Allelic scoring (--score)
R plugins (--R)
1. Converting one file format to another file format.
2. Converting one file format to another file format.
3. Converting one file format to another file format.
“Give a statistical geneticist an
awk line, feed him for a day,
teach a statistical geneticist how
to awk, feed him for a lifetime...”
Glow
• Open-source toolkit for large-scale genomic analysis
• Built on Spark for biobank scale
• Query and use built-in commands with familiar languages using Spark SQL
• Compatible with existing genomic tools and formats, as well as big data and ML tools
Genomic variant data
• Always present
– Chromosome: StringType
• Variant information: depends on metadata
– MapType(StringType, StringType): {"DP" -> "14", "AF" -> "0.5"}
– The header declares ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
– A string map retains only ##INFO=<ID=AF,Number=?,Type=?,Description=?>
– MapType(StringType, StringType): loses metadata, slow querying
– Dynamic schema: preserves metadata, fast querying
StructField(
  name = "INFO_AF",
  dataType = DoubleType,
  nullable = true,
  metadata = new MetadataBuilder()
    .putString("vcf_header_count", "A")
    .putString("vcf_header_description", "Allele Frequency")
    .build())
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
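The mapping from a header line to a typed field can be sketched in plain Python. This is only an illustration of the idea on the slide, not Glow's implementation: the type table and the function name `info_header_to_field` are assumptions made for this example.

```python
import re

# Assumed mapping from VCF header types to Spark SQL type names (illustrative only).
VCF_TO_SPARK_TYPE = {
    "Integer": "IntegerType",
    "Float": "DoubleType",
    "Flag": "BooleanType",
    "Character": "StringType",
    "String": "StringType",
}

# Tolerates optional spaces after the commas, as written on the slide.
INFO_LINE = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),\s*Number=(?P<number>[^,]+),\s*'
    r'Type=(?P<type>[^,]+),\s*Description="(?P<description>[^"]*)"'
)


def info_header_to_field(line):
    """Turn one ##INFO header line into a dict describing a typed column."""
    match = INFO_LINE.match(line)
    if match is None:
        raise ValueError(f"not an INFO header line: {line!r}")
    return {
        "name": "INFO_" + match["id"],
        "dataType": VCF_TO_SPARK_TYPE[match["type"]],
        "nullable": True,
        # The header details travel with the column instead of being lost.
        "metadata": {
            "vcf_header_count": match["number"],
            "vcf_header_description": match["description"],
        },
    }


field = info_header_to_field(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
print(field)
```

Because the type and description are kept as column metadata, the same information is available later to rebuild the header line when writing back to VCF.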
• Genotype information: depends on metadata
• Genotype information: width depends on number of samples
Sample             NA00001  NA00002  ...
Genotype           0|0      0|0
Genotype quality   48       49
Depth              1        3
Haplotype quality  51,51    58,50
UK Biobank has 500,000 participants!
Sample   Genotype  Genotype quality  Depth  Haplotype quality
NA00001  0|0       48                1      51,51
NA00002  0|0       49                3      58,50
...
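The change of layout can be pictured as collapsing one column per sample into a single list with one record per sample, so the number of columns stays fixed as samples are added. The sketch below uses plain Python dictionaries; the field names are illustrative and are not claimed to match Glow's schema.

```python
# Wide layout: one column per sample (values taken from the slide).
wide = {
    "NA00001": {"genotype": "0|0", "genotype_quality": 48, "depth": 1,
                "haplotype_quality": [51, 51]},
    "NA00002": {"genotype": "0|0", "genotype_quality": 49, "depth": 3,
                "haplotype_quality": [58, 50]},
}


def to_genotype_array(wide_columns):
    """Collapse per-sample columns into one list of per-sample records."""
    return [
        {"sample": sample_id, **fields}
        for sample_id, fields in wide_columns.items()
    ]


# A single "genotypes" value now holds every sample for this variant.
genotypes = to_genotype_array(wide)
for record in genotypes:
    print(record)
```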
Genomic variant data
• Static fields
– e.g. Chromosome
• Dynamic fields
– Variant information
– Genotype information
• Preserves metadata
• Fast querying
• Limited number of columns
Genomic variant data
VCF → VCF rows
spark.read.format("vcf").load("genotypes.vcf")
VCF rows → VCF (df is the DataFrame of VCF rows)
df.write.format("vcf").save("genotypes.vcf")
VCF rows → Delta Lake
df.write.format("delta").save("genotypes.delta")
Delta Lake
• Genomic data
– VCF, BGEN, BED
• Medical images
• Electronic health records
• Waveform data
• Real world evidence
• ...
Built-in functions
• Convert genotype probabilities to hard calls
• Normalize variants
• Liftover between reference assemblies
• Annotate variants
• Genome-wide association studies
• ...
GWAS
• linear_regression_gwas
• logistic_regression_gwas
• Single-node bioinformatics tools
Single-node bioinformatics tools
• SAIGE
– R library
– VCF → CSV
– http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
• Require flat file splicing and combination
[Diagram: Text → Command line tool → Text, repeated for many inputs (...)]
rdd.pipe()
[Diagram: Text RDD → stdin → command line tool (on worker) → stdout → Text RDD]
• Input and output RDDs have single text column
– Input: set header as pipe context
– Output: mixed header and text data
• Convert between genomic file formats
– Changing specs
glow.transform('pipe')
[Diagram: DataFrame (VCF, CSV, text) → stdin → command line tool (SAIGE) on worker → stdout → DataFrame (VCF, CSV, text)]
glow.transform(
    "pipe",
    input_df,
    cmd=cmd,
    input_formatter='vcf',
    in_vcf_header='infer',
    output_formatter='csv',
    out_header='true',
    out_delimiter=' ')
• VCF input formatter (DataFrame → VCF)
– Set header based on schema
– Convert Spark Rows to Java objects
– Third-party library writes header and variant rows
StructField(
  name = "INFO_AF",
  dataType = DoubleType,
  nullable = true,
  metadata = new MetadataBuilder()
    .putString("vcf_header_count", "A")
    .putString("vcf_header_description", "Allele Frequency")
    .build())
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
• Command (cmd): Rscript step2_SPAtests.R
• For each partition
– Input formatter writes to the command's stdin
– Output formatter reads from the command's stdout
– If running the command triggers an exception, the error is propagated to the driver
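The per-partition behaviour can be sketched without Spark, using only the Python standard library. This is a rough model of the idea, not Glow's code: `pipe_partition` is a hypothetical helper, and the "tool" is a stand-in Python one-liner that upper-cases its input.

```python
import subprocess
import sys


def pipe_partition(rows, cmd, header=None):
    """Feed one partition's text rows to a command and return its output lines."""
    lines = ([header] if header is not None else []) + list(rows)
    result = subprocess.run(
        cmd,
        input="\n".join(lines) + "\n",  # written to the command's stdin
        capture_output=True,            # stdout and stderr are collected
        text=True,
    )
    if result.returncode != 0:
        # Surface the failure to the caller instead of silently dropping it.
        raise RuntimeError(f"command failed: {result.stderr.strip()}")
    return result.stdout.splitlines()


# Hypothetical stand-in for a command line tool: upper-case every input line.
upper_cmd = [
    sys.executable, "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

# Two "partitions" of tab-separated text rows, each piped independently.
partitions = [["chr1\t100\ta"], ["chr2\t200\tc", "chr2\t300\tg"]]
outputs = [pipe_partition(rows, upper_cmd) for rows in partitions]
print(outputs)
```

Each partition gets its own process, which is why a header (when the tool needs one) has to be written at the start of every partition's input rather than once per dataset.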
• CSV output formatter (CSV → DataFrame)
– Write schema to first element in iterator
– Write remaining rows to iterator
CHR  POS       BETA   SE     p.value
22   35292447  1.206  3.285  0.714
22   35292456  1.358  2.534  0.592
StructType(
  Seq("CHR", "POS", "BETA", "SE", "p.value").map(
    StructField(_, StringType)))
InternalRow("22", "35292447", "1.206", "3.285", "0.714")
InternalRow("22", "35292456", "1.358", "2.534", "0.592")
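The same header-then-rows parsing can be sketched in a few lines of Python. The function name `parse_delimited_output` is hypothetical; the sketch assumes, as on the slide, that the first line is a header and that every column is read as a string.

```python
def parse_delimited_output(lines, delimiter=" "):
    """Split piped text into (schema, rows), treating the first line as the header."""
    iterator = iter(lines)
    header = next(iterator).split(delimiter)
    # Every column is typed as a string, as on the slide.
    schema = [(name, "StringType") for name in header]
    rows = [tuple(line.split(delimiter)) for line in iterator]
    return schema, rows


# Lines as they might arrive on the command's stdout (values from the slide).
stdout_lines = [
    "CHR POS BETA SE p.value",
    "22 35292447 1.206 3.285 0.714",
    "22 35292456 1.358 2.534 0.592",
]
schema, rows = parse_delimited_output(stdout_lines)
print(schema)
print(rows)
```

Casting columns such as `BETA` or `p.value` to numeric types is left as a separate step after parsing.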
glow.transform('pipe')
• Input and output DataFrames
– Input: infer header from schema
– Output: infer schema from header
• Convert genomic data under the hood
– Spark Row ↔ Java object ↔ text
GWAS
• Load variants
• Perform quality control
• Control for ancestry
• Run regression against trait
• Log Manhattan plot
GWAS: Load variants
spark.read.format("vcf").load("genotypes.vcf")
GWAS: Perform quality control
from pyspark.sql.functions import col

variant_df.selectExpr("*",
    "expand_struct(call_summary_stats(genotypes))",
    "expand_struct(hardy_weinberg(genotypes))") \
  .where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) &
         (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) &
         (col("pValueHwe") >= hwe_cutoff))
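To make the two filters concrete, here is a small pure-Python sketch of what they check for a single biallelic variant. It is an assumption-laden illustration: a chi-square test with one degree of freedom is swapped in for brevity and is not necessarily the test behind Glow's `hardy_weinberg` function, and the helper names are made up for this example.

```python
import math


def allele_frequencies(calls):
    """Reference and alternate allele frequencies from diploid calls like (0, 1)."""
    alleles = [allele for call in calls for allele in call]
    alt = sum(alleles) / len(alleles)
    return [1.0 - alt, alt]


def hwe_chi_square_p_value(calls):
    """Chi-square (1 degree of freedom) test of Hardy-Weinberg proportions."""
    n = len(calls)
    observed = [0, 0, 0]  # counts of hom-ref, het, hom-alt
    for call in calls:
        observed[sum(call)] += 1
    p, q = allele_frequencies(calls)
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(calls, allele_freq_cutoff, hwe_cutoff):
    """Same shape as the filter above: frequency window plus HWE p-value floor."""
    ref_freq = allele_frequencies(calls)[0]
    return (allele_freq_cutoff <= ref_freq <= 1.0 - allele_freq_cutoff
            and hwe_chi_square_p_value(calls) >= hwe_cutoff)


# 25 hom-ref, 50 het, 25 hom-alt: exactly the Hardy-Weinberg proportions.
balanced = [(0, 0)] * 25 + [(0, 1)] * 50 + [(1, 1)] * 25
# 100 heterozygotes: right frequency, but far from Hardy-Weinberg proportions.
all_het = [(0, 1)] * 100

print(passes_qc(balanced, 0.05, 1e-6))
print(passes_qc(all_het, 0.05, 1e-6))
```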
qc_df.write.format("delta").save(delta_path)
GWAS: Control for ancestry
matrix.computeSVD(num_pcs)
GWAS: Run regression against trait
genotypes.crossJoin(phenotypeAndCovariates) \
  .selectExpr(
    "expand_struct("
    "linear_regression_gwas("
    "genotype_states(genotypes), "
    "phenotype_values, covariates))")
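For intuition, the per-variant computation can be reduced to a simple least-squares fit of the phenotype on the alternate-allele count. The sketch below is a simplification under stated assumptions: it fits only an intercept and one slope, ignores the covariates that the call above passes in, assumes the variant is not monomorphic, and uses output names chosen for this example.

```python
import math


def simple_linear_regression(genotype_states, phenotype_values):
    """Ordinary least squares of phenotype on alt-allele count, with an intercept."""
    n = len(genotype_states)
    mean_x = sum(genotype_states) / n
    mean_y = sum(phenotype_values) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_states)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(genotype_states, phenotype_values))
    beta = sxy / sxx  # effect size per copy of the alternate allele
    intercept = mean_y - beta * mean_x
    residual_ss = sum((y - intercept - beta * x) ** 2
                      for x, y in zip(genotype_states, phenotype_values))
    # Residual variance uses n - 2 degrees of freedom (intercept and slope).
    standard_error = math.sqrt(residual_ss / (n - 2) / sxx)
    return {"beta": beta, "standardError": standard_error}


# Alt-allele counts (0, 1 or 2) for six samples and a made-up quantitative trait.
states = [0, 1, 2, 0, 1, 2]
trait = [1.0, 2.1, 2.9, 1.1, 1.9, 3.0]
result = simple_linear_regression(states, trait)
print(result)
```

In a GWAS this fit is repeated independently for every variant, which is what makes the problem a natural match for a row-wise Spark SQL expression.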
GWAS: Log Manhattan plot
gwas_results_rdf <- as.data.frame(gwas_results)
install.packages("qqman", repos="http://cran.us.r-project.org")
library(qqman)
png('/databricks/driver/manhattan.png')
manhattan(gwas_results_rdf)
dev.off()
[Figure: Manhattan plot, http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250]
import mlflow
mlflow.log_artifact('/databricks/driver/manhattan.png')
GWAS pipeline
[Diagram nodes: VCF DF, QC'd DataFrame, GWAS hits, Phenotypes, Ancestry]
projectglow.io
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT

Weitere ähnliche Inhalte

Was ist angesagt?

MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRPivotalOpenSourceHub
 
Open source analytics
Open source analyticsOpen source analytics
Open source analyticsAjay Ohri
 
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Databricks
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionDatabricks
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about HadoopDonald Miner
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...Srivatsan Ramanujam
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Simplilearn
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library EMC
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabImpetus Technologies
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and HadoopJosh Patterson
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data ScientistsDataWorks Summit
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Uri Laserson
 

Was ist angesagt? (18)

MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalRMADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
MADlib Architecture and Functional Demo on How to Use MADlib/PivotalR
 
Open source analytics
Open source analyticsOpen source analytics
Open source analytics
 
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
 
AI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat DetectionAI on Spark for Malware Analysis and Anomalous Threat Detection
AI on Spark for Malware Analysis and Anomalous Threat Detection
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop10 concepts the enterprise decision maker needs to understand about Hadoop
10 concepts the enterprise decision maker needs to understand about Hadoop
 
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
PyMADlib - A Python wrapper for MADlib : in-database, parallel, machine learn...
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
Pig Tutorial | Apache Pig Tutorial | What Is Pig In Hadoop? | Apache Pig Arch...
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 
The MADlib Analytics Library
The MADlib Analytics Library The MADlib Analytics Library
The MADlib Analytics Library
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
Big Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLabBig Data Analytics with Storm, Spark and GraphLab
Big Data Analytics with Storm, Spark and GraphLab
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Apache Pig for Data Scientists
Apache Pig for Data ScientistsApache Pig for Data Scientists
Apache Pig for Data Scientists
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 

Ähnlich wie Enabling Biobank-Scale Genomic Processing with Spark SQL

Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchDavid Ruau
 
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceVenice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceGigaScience, BGI Hong Kong
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...GenomeInABottle
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchAnshika Bansal
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSHAMNAHAMNA8
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016Prof. Wim Van Criekinge
 
SHORT TERM BIOINFORMATICS TRAINING PROGRAM
SHORT TERM BIOINFORMATICS TRAINING PROGRAMSHORT TERM BIOINFORMATICS TRAINING PROGRAM
SHORT TERM BIOINFORMATICS TRAINING PROGRAMArraygenrajeshmahato
 
Bioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxBioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxxRowlet
 
Accelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkAccelerating Genomics SNPs Processing and Interpretation with Apache Spark
Accelerating Genomics SNPs Processing and Interpretation with Apache SparkDatabricks
 
ECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsECCB10 talk - Nextgen sequencing and SNPs
ECCB10 talk - Nextgen sequencing and SNPsJan Aerts
 
Standarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesStandarization in Proteomics: From raw data to metadata files
Standarization in Proteomics: From raw data to metadata filesYasset Perez-Riverol
 
CS Guest Lecture 2015 10-05 advanced databases
CS Guest Lecture 2015 10-05 advanced databasesCS Guest Lecture 2015 10-05 advanced databases
CS Guest Lecture 2015 10-05 advanced databasesGabe Rudy
 
Aug2013 bioinformatics working group
Aug2013 bioinformatics working groupAug2013 bioinformatics working group
Aug2013 bioinformatics working groupGenomeInABottle
 
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...Lucidworks
 
Extreme Scripting July 2009
Extreme Scripting July 2009Extreme Scripting July 2009
Extreme Scripting July 2009Ian Foster
 
Mar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupMar2013 Performance Metrics Working Group
Mar2013 Performance Metrics Working GroupGenomeInABottle
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
Reproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookReproducible Workflow with Cytoscape and Jupyter Notebook
Reproducible Workflow with Cytoscape and Jupyter NotebookKeiichiro Ono
 

Ähnlich wie Enabling Biobank-Scale Genomic Processing with Spark SQL (20)

Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant scienceVenice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
Venice Juanillas at #ICG13: Rice Galaxy: an open resource for plant science
 
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
Genome in a Bottle - Towards new benchmarks for the “dark matter” of the huma...
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
RNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGSRNA sequencing analysis tutorial with NGS
RNA sequencing analysis tutorial with NGS
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Galaxy dna-seq-variant calling-presentationandpractical_gent_april-2016
Enabling Biobank-Scale Genomic Processing with Spark SQL

  • 2. Enabling Biobank-Scale Genomic Processing with Spark SQL. Karen Feng, Databricks. #UnifiedDataAnalytics #SparkAISummit
  • 3. Agenda. Genomics overview: big data problem, real-world applications, pain points at biobank scale. Glow: datasources, built-in functions, extensibility.
  • 5. Genomics is a big data problem. Sequencing cost: from $2.7B to <$1,000 (https://www.genome.gov/27541954/dna-sequencing-costs-data/). Data volume: 40,000 petabytes / year by 2025 (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195).
  • 7. The power of big genomic data: accelerate target discovery. Goal: identify a biological target (e.g. a protein) that can be modulated with a drug. Approach: large-scale regressions to correlate DNA variants and the trait. Result: clinical trials with genomic evidence are 2x more likely to be approved by the FDA (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4547496/). [Figure label: Orthosteric inhibition]
  • 10. The power of big genomic data: accelerate target discovery; reduce costs via precision prevention; improve survival with optimized treatment.
  • 12. Genomic analysis on big data is hard! Existing tools are often difficult to scale, learn, and integrate.
  • 17. Genomic analysis on big data is hard! Existing tools are often difficult to scale, learn, and integrate. Example commands for splitting a VCF by chromosome with four different tools:
        for i in chr2L chr2R chr3L chr3R chr4 chrX; do
          GenomeAnalysisTK -R ${ref_seq} -T SelectVariants -V my_flies.vcf -L $i -o my_flies.${i}.vcf
        done

        for i in {1..16}; do
          vcftools --vcf VCF_FILE --chr $i --recode --recode-INFO-all --out VCF_$i
        done

        bgzip -c myvcf.vcf > myvcf.vcf.gz
        tabix -p vcf myvcf.vcf.gz
        tabix myvcf.vcf.gz chr1 > chr1.vcf

        java -jar SnpSift.jar split file.vcf
  • 19. Genomic analysis on big data is hard! Existing tools are often difficult to scale, learn, and integrate. [Screenshot: 838 results]
  • 20. Genomic analysis on big data is hard! Existing tools are often difficult to scale, learn, and integrate. Example option list:
        Data management: --make-bed, --recode, --output-chr, --zero-cluster, --split-x/--merge-x, --set-me-missing, --fill-missing-a2, --set-missing-var-ids, --update-map..., --update-ids..., --flip, --flip-scan, --keep-allele-order..., --indiv-sort, --write-covar..., --{,b}merge..., Merge failures, VCF reference merge, --merge-list, --write-snplist, --list-duplicate-vars
        Basic statistics: --freq{,x}, --missing, --test-mishap, --hardy, --mendel, --het/--ibc, --check-sex/--impute-sex, --fst
        Linkage disequilibrium: --indep..., --r/--r2, --show-tags, --blocks
        Distance matrices: Identity-by-state/Hamming (--distance...), Relationship/covariance (--make-grm-bin...), --rel-cutoff, Distance-pheno. analysis (--ibs-test...)
        Identity-by-descent: --genome, --homozyg...
        Population stratification: --cluster, --pca, --mds-plot, --neighbour
        Association analysis: Basic case/control (--assoc, --model), Stratified case/control (--mh, --mh2, --homog), Quantitative trait (--assoc, --gxe), Regression w/ covariates (--linear, --logistic), --dosage, --lasso, --test-missing, Monte Carlo permutation, Set-based tests, REML additive heritability
        Family-based association: --tdt, --dfam, --qfam..., --tucc
        Report postprocessing: --annotate, --clump, --gene-report, --meta-analysis
        Epistasis: --fast-epistasis, --epistasis, --twolocus
        Allelic scoring (--score)
        R plugins (--R)
  • 26. Genomic analysis on big data is hard! Existing tools are often difficult to scale, learn, and integrate. (1) Converting one file format to another file format. (2) Converting one file format to another file format. (3) Converting one file format to another file format.
  • 27. Genomic analysis on big data is hard! Existing tools are often difficult to scale, learn, and integrate. "Give a statistical geneticist an awk line, feed him for a day, teach a statistical geneticist how to awk, feed him for a lifetime..."
  • 31. Glow: an open-source toolkit for large-scale genomic analysis. Built on Spark for biobank scale. Query and use built-in commands with familiar languages using Spark SQL. Compatible with existing genomic tools and formats, as well as big data and ML tools.
  • 37. Genomic variant data. Variant information: depends on metadata
  • 38. Genomic variant data. MapType(StringType, StringType): {"DP" -> "14", "AF" -> "0.5"}
  • 40. Genomic variant data. With MapType(StringType, StringType), the header line ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency"> becomes ##INFO=<ID=AF,Number=?,Type=?,Description=?>
  • 41. Genomic variant data. MapType(StringType, StringType): lose metadata and slow querying
  • 42. Genomic variant data. Dynamic schema: preserve metadata and fast querying
  • 43. Genomic variant data. The header line
        ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
      maps to the field
        StructField(
          name = "INFO_AF",
          dataType = DoubleType,
          nullable = true,
          metadata = Map(
            "vcf_header_count" -> "A",
            "vcf_header_description" -> "Allele Frequency"))
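The mapping from an INFO header line to a typed, metadata-carrying field can be illustrated in plain Python. This is only a sketch of the idea, not Glow's implementation: the function name `info_header_to_field`, the dictionary layout, and the type table are assumptions made for illustration, while the `INFO_` prefix and the two metadata keys are taken from the slide.

```python
import re

# Illustrative mapping from VCF header types to Python types (an assumption,
# not the mapping used by any particular library).
VCF_TYPE_TO_PYTHON = {
    "Integer": int,
    "Float": float,
    "Flag": bool,
    "Character": str,
    "String": str,
}

_INFO_LINE = re.compile(r'^##INFO=<(?P<body>.*)>$')
# key=value pairs; a value is either a quoted string or runs up to the next comma
_KEY_VALUE = re.compile(r'(\w+)\s*=\s*("([^"]*)"|[^,]*)')


def info_header_to_field(line):
    """Describe one ##INFO header line as a typed field with metadata."""
    match = _INFO_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an INFO header line: {line!r}")
    attrs = {}
    for key, raw, quoted in _KEY_VALUE.findall(match.group("body")):
        attrs[key] = quoted if raw.startswith('"') else raw.strip()
    return {
        "name": "INFO_" + attrs["ID"],
        "type": VCF_TYPE_TO_PYTHON[attrs["Type"]],
        "nullable": True,
        "metadata": {
            "vcf_header_count": attrs["Number"],
            "vcf_header_description": attrs["Description"],
        },
    }


field = info_header_to_field(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
# The typed field can parse the raw text value instead of keeping it as a string.
allele_frequency = field["type"]("0.5")
```

Because the type and the header attributes travel with the field, a value such as "0.5" can be stored as a number and the header can be rebuilt later, which is the contrast with the string-to-string map drawn on the earlier slides.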
  • 45. Genomic variant data. Genotype information: depends on metadata
  • 46. Genomic variant data. Genotype information: width depends on number of samples
  • 49. Genomic variant data. One column per sample (UK Biobank has 500,000 participants!):
        Sample             NA00001  NA0002  ...
        Genotype           0|0      0|0
        Genotype quality   48       49
        Depth              1        3
        Haplotype quality  51,51    58,50
  • 50. Genomic variant data. One row per sample:
        Sample  Genotype  Genotype quality  Depth  Haplotype quality
        NA0001  0|0       48                1      51,51
        NA0002  0|0       49                3      58,50
        ...
  • 51. Genomic variant data. Static fields, e.g. chromosome. Dynamic fields: variant information, genotype information. Preserves metadata. Fast querying. Limited number of columns.
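One way to picture the "limited number of columns" point is to fold the per-sample VCF columns into a single list of per-sample records, so that adding samples lengthens the list instead of widening the table. The sketch below is plain Python under that assumption; `parse_genotypes` and the `sampleId` key are hypothetical names, and the values are the ones shown on the slides.

```python
def parse_genotypes(format_keys, sample_ids, sample_columns):
    """Turn per-sample VCF columns into one list of per-sample records."""
    keys = format_keys.split(":")
    genotypes = []
    for sample_id, column in zip(sample_ids, sample_columns):
        record = {"sampleId": sample_id}
        # Pair each FORMAT key (e.g. GT, GQ) with this sample's value.
        record.update(zip(keys, column.split(":")))
        genotypes.append(record)
    return genotypes


genotypes = parse_genotypes(
    "GT:GQ:DP:HQ",
    ["NA0001", "NA0002"],
    ["0|0:48:1:51,51", "0|0:49:3:58,50"],
)
```

The row keeps the same shape whether it holds two samples or 500,000; only the length of `genotypes` changes.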
  • 52. Genomic variant data. VCF → VCF rows: spark.read.format("vcf").load("genotypes.vcf")
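  • 54. Genomic variant data. VCF rows → Delta: df.write.format("delta").save("genotypes.delta"), where df is the DataFrame of VCF rows (the writer is reached through the DataFrame, not the SparkSession)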
  • 55. Delta Lake. Genomic data (VCF, BGEN, BED); medical images; electronic health records; waveform data; real world evidence; ...
  • 57. Built-in functions. Convert genotype probabilities to hard calls; normalize variants; liftover between reference assemblies; annotate variants; genome-wide association studies; ...
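As an illustration of the first item, converting genotype probabilities to hard calls can be sketched in a few lines of plain Python. This is not Glow's function: the name `hard_call`, the 0.9 threshold, the use of -1 for a missing call, and the restriction to unphased, diploid, biallelic sites are all assumptions made for the example.

```python
def hard_call(probabilities, threshold=0.9):
    """Pick the most likely genotype, or a missing call if none is confident.

    probabilities: [P(0/0), P(0/1), P(1/1)] for an unphased, diploid,
    biallelic site. Returns a pair of allele indices; (-1, -1) means missing.
    """
    calls = [(0, 0), (0, 1), (1, 1)]
    if len(probabilities) != len(calls):
        raise ValueError("expected exactly three genotype probabilities")
    best = max(range(len(calls)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return (-1, -1)
    return calls[best]


confident = hard_call([0.02, 0.95, 0.03])
ambiguous = hard_call([0.40, 0.35, 0.25])
```

A lower threshold keeps more calls at the cost of more wrong ones, so in practice it would be a tunable parameter rather than a constant.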
  • 60. Single-node bioinformatics tools. Example: SAIGE, an R library; VCF → CSV. http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
  • 61. Single-node bioinformatics tools require flat file splicing and combination
  • 62. Single-node bioinformatics tools. [Diagram: text inputs and text outputs around several instances of a command line tool]
  • 64. rdd.pipe(). Input and output RDDs have a single text column. Input: set header as pipe context. Output: mixed header and text data. Convert between genomic file formats: changing specs.
  • 66. glow.transform('pipe'). The input DataFrame is formatted as VCF:
        glow.transform(
            "pipe",
            input_df,
            cmd=cmd,
            input_formatter='vcf',
            in_vcf_header='infer',
            output_formatter='csv',
            out_header='true',
            out_delimiter=' ')
  • 68. glow.transform('pipe'). VCF input formatter: set header based on schema; convert Spark Rows to Java objects; third-party library writes header and variant rows. For example, StructField(name = "INFO_AF", dataType = DoubleType, nullable = true, metadata = Map("vcf_header_count" -> "A", "vcf_header_description" -> "Allele Frequency")) is written as ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
  • 69. glow.transform('pipe') • cmd: Rscript step2_SPAtests.R
  • 70. glow.transform('pipe') • For each partition – Input formatter writes to the command’s stdin – Output formatter reads from the command’s stdout – If running the command triggers an exception, the error is propagated to the driver
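The per-partition flow above can be sketched in plain Python with `subprocess`. This is only an illustration of the stdin/stdout pattern, not Glow's code: `pipe_partition` is a hypothetical helper, the partitions are plain lists of text lines, and the command is a stand-in that upper-cases its input.

```python
import subprocess
import sys


def pipe_partition(lines, cmd):
    """Feed one partition's text lines to cmd's stdin; return its stdout lines."""
    result = subprocess.run(
        cmd,
        input="\n".join(lines) + "\n",  # plays the role of the input formatter
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the failure to the caller, analogous to propagating
        # the error from an executor to the driver.
        raise RuntimeError(
            f"command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout.splitlines()  # plays the role of the output formatter


# Stand-in command (assumption): a tiny Python program that upper-cases stdin.
upper_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

# Two hypothetical partitions of text rows.
partitions = [["chr22\t100", "chr22\t200"], ["chr22\t300"]]
outputs = [pipe_partition(p, upper_cmd) for p in partitions]
print(outputs)
```

Each partition gets its own process, so the command never needs to know that the data is distributed.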
  • 71. glow.transform('pipe') • output_formatter='csv', out_header='true', out_delimiter=' ': DataFrame ← CSV
  • 72. glow.transform('pipe') • CSV output formatter – Write schema to first element in iterator – Write remaining rows to iterator • Command output (header, then rows): CHR POS BETA SE p.value / 22 35292447 1.206 3.285 0.714 / 22 35292456 1.358 2.534 0.592 • Schema: StructType(Seq("CHR", "POS", "BETA", "SE", "p.value").map(StructField(_, StringType))) • Rows: InternalRow("22", "35292447", "1.206", "3.285", "0.714"), InternalRow("22", "35292456", "1.358", "2.534", "0.592")
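As a rough illustration of the "schema first, rows after" iterator described on this slide, the following Python sketch parses the same sample output. The `csv_output_formatter` name and the `(column, "string")` schema representation are assumptions; Glow itself emits a Spark `StructType` and `InternalRow`s.

```python
def csv_output_formatter(stdout_lines, delimiter=" "):
    """Yield the schema first, then one tuple of strings per data row."""
    lines = iter(stdout_lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # the command produced no output at all
    header = header_line.rstrip("\n").split(delimiter)
    # Every column is typed as a string, as in the slide's StructType.
    yield [(name, "string") for name in header]
    for line in lines:
        yield tuple(line.rstrip("\n").split(delimiter))


# The command output shown on the slide.
stdout_lines = [
    "CHR POS BETA SE p.value",
    "22 35292447 1.206 3.285 0.714",
    "22 35292456 1.358 2.534 0.592",
]

formatted = csv_output_formatter(stdout_lines)
schema = next(formatted)  # first element: the inferred schema
rows = list(formatted)    # remaining elements: the data rows
print(schema)
print(rows)
```

Because every value stays a string, any numeric casting (for example of `BETA` or `p.value`) is left to the caller.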
  • 73. glow.transform('pipe') • Input and output DataFrames – Input: infer header from schema – Output: infer schema from header • Convert genomic data under the hood – Spark Row ↔ Java object ↔ text
  • 75. GWAS • Load variants • Perform quality control • Control for ancestry • Run regression against trait • Log Manhattan plot
  • 76. GWAS • Load variants: spark.read.format("vcf").load("genotypes.vcf")
  • 77. GWAS • Perform quality control: variant_df.selectExpr("*", "expand_struct(call_summary_stats(genotypes))", "expand_struct(hardy_weinberg(genotypes))").where((col("alleleFrequencies").getItem(0) >= allele_freq_cutoff) & (col("alleleFrequencies").getItem(0) <= (1.0 - allele_freq_cutoff)) & (col("pValueHwe") >= hwe_cutoff))
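To show what the quality-control filter above is checking, here is a small pure-Python sketch for biallelic, diploid calls. It is not Glow's implementation: the function names are hypothetical, and the Hardy-Weinberg p-value uses a one-degree-of-freedom chi-square test as a simple stand-in, which will differ from an exact test at low counts.

```python
import math


def allele_frequencies(calls):
    """Return [ref_freq, alt_freq] for biallelic diploid calls such as (0, 1)."""
    alleles = [a for call in calls for a in call if a >= 0]  # -1 marks missing
    alt = sum(alleles)
    return [(len(alleles) - alt) / len(alleles), alt / len(alleles)]


def hwe_chi_square_p(calls):
    """Chi-square (1 d.o.f.) Hardy-Weinberg p-value; a stand-in for an exact test."""
    complete = [c for c in calls if min(c) >= 0]
    n = len(complete)
    # Observed counts of hom-ref, het, and hom-alt genotypes.
    observed = [sum(1 for c in complete if sum(c) == k) for k in (0, 1, 2)]
    q = (observed[1] + 2 * observed[2]) / (2 * n)
    p = 1 - q
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.o.f.
    return math.erfc(math.sqrt(chi2 / 2))


def passes_qc(calls, allele_freq_cutoff, hwe_cutoff):
    """Mirror the slide's filter: frequency within bounds and HWE p-value high enough."""
    ref_freq = allele_frequencies(calls)[0]
    in_range = allele_freq_cutoff <= ref_freq <= 1.0 - allele_freq_cutoff
    return in_range and hwe_chi_square_p(calls) >= hwe_cutoff


# Hypothetical variants: one common, one rare, one far from Hardy-Weinberg.
common = [(0, 0)] * 4 + [(0, 1)] * 4 + [(1, 1)] * 2
rare = [(0, 0)] * 99 + [(0, 1)]
skewed = [(0, 0)] * 50 + [(1, 1)] * 50

results = [passes_qc(v, 0.05, 1e-6) for v in (common, rare, skewed)]
print(results)
```

The rare variant is dropped by the frequency bounds, while the variant with no heterozygotes is dropped by the Hardy-Weinberg cutoff even though its allele frequency is 0.5.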
  • 78. GWAS • qc_df.write.format("delta").save(delta_path)
  • 79. GWAS • Control for ancestry: matrix.computeSVD(num_pcs)
  • 80. GWAS • Run regression against trait: genotypes.crossJoin(phenotypeAndCovariates).selectExpr("expand_struct(" "linear_regression_gwas(" "genotype_states(genotypes), " "phenotype_values, covariates))")
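The per-variant regression can be sketched in pure Python for the simplest case. This is an assumption-laden illustration rather than what `linear_regression_gwas` does internally: it fits only an intercept and the genotype (no covariates), the `regress_variant` name is hypothetical, and the p-value is omitted because it needs a t-distribution that the standard library does not provide.

```python
import math


def regress_variant(genotype_states, phenotype_values):
    """Ordinary least squares of phenotype on genotype with an intercept.

    Returns the effect size (beta) and its standard error.
    """
    n = len(genotype_states)
    mean_x = sum(genotype_states) / n
    mean_y = sum(phenotype_values) / n
    sxx = sum((x - mean_x) ** 2 for x in genotype_states)
    sxy = sum(
        (x - mean_x) * (y - mean_y)
        for x, y in zip(genotype_states, phenotype_values)
    )
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum(
        (y - intercept - beta * x) ** 2
        for x, y in zip(genotype_states, phenotype_values)
    )
    standard_error = math.sqrt(rss / (n - 2) / sxx)
    return {"beta": beta, "standardError": standard_error}


# Hypothetical data: alternate-allele counts per sample and a quantitative trait.
genotype_states = [0, 1, 2, 0, 1, 2]
phenotype_values = [1.0, 2.1, 2.9, 1.2, 1.9, 3.1]

fit = regress_variant(genotype_states, phenotype_values)
print(fit)
```

In a GWAS this fit is repeated independently for every variant, which is why it distributes naturally across the rows of a DataFrame.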
  • 81. GWAS • Log Manhattan plot (R): gwas_results_rdf <- as.data.frame(gwas_results); install.packages("qqman", repos="http://cran.us.r-project.org"); library(qqman); png('/databricks/driver/manhattan.png'); manhattan(gwas_results_rdf); dev.off()
  • 82. GWAS • http://pheweb.sph.umich.edu/SAIGE-UKB/pheno/250
  • 83. GWAS • Log Manhattan plot: mlflow.log_artifact('/databricks/driver/manhattan.png')