by Data Fellas,
Spark London Meetup, July 1st ‘15
Share and analyse genomic data
at scale with Spark, Adam, Tachyon and the Spark Notebook
PART I
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
Outline
PART II
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services
Andy Petrella
@noootsab
Maths
Scala
Apache Spark
Spark Notebook
Trainer
Data Banana
Xavier Tordoir
@xtordoir
Physics
Bioinformatics
Scala
Spark
PART I
Spark & Genomics
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
So that’s the
thing that
separates us?
Adam
What is genomics data
Okay, sounds
good. Give me
two of them!
Genome is an important factor in health:
Medical Diagnostics
Drug response
Disease mechanisms
…
Adam
What is genomics data
You mean devs
are slacking
off?
On the data production:
Fast biotech progress
Not so fast IT progress?
Adam
What is genomics data
No! They’re
just sticky
bubbles...
On the data production:
Sequence {A, T, G, C}
3 billion bases
Adam
What is genomics data
Okay, a lot of
bubbles.
On the data production:
Sequence {A, T, G, C}
3 billion bases
… x 30 (x 60?)
Adam
What is genomics data
C’mon. a big
mess of plenty
of lil’ bubbles
then.
On the data production: massively parallel
Sequence {A, T, G, C}
3 billion bases
… x 30 (x 60?)
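The order of magnitude behind these numbers can be sketched with a little arithmetic. The snippet below is only an illustration: it assumes one byte per base and ignores quality scores and read metadata, neither of which is stated on the slides.

```python
GENOME_BASES = 3_000_000_000  # ~3 billion bases per human genome (from the slide)


def raw_bases(coverage: int) -> int:
    """Number of sequenced bases at a given coverage depth (30x, 60x, ...)."""
    return GENOME_BASES * coverage


def approx_gigabytes(coverage: int, bytes_per_base: int = 1) -> float:
    """Rough size in decimal GB; `bytes_per_base` is an assumption, not a measured value."""
    return raw_bases(coverage) * bytes_per_base / 1e9


for cov in (30, 60):
    print(f"{cov}x coverage: {raw_bases(cov):,} bases, ~{approx_gigabytes(cov):.0f} GB")
```

Even under this optimistic encoding, a single genome at 30x to 60x coverage lands in the tens to hundreds of gigabytes before any per-read metadata is counted.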
Adam
What is genomics data
Ah, that
explains why
the black bars
are different
Adam
What is genomics data
Dude... Tens of
millions
Adam
What is genomics data
Staaaaaaph Tens of
millions
1000’s
1,000,000’s
…
Adam
What is genomics data
‘coz it makes
sparkling
bubbles, right?
Ok, looks like Apache Spark
makes a lot of sense here …
Adam
An understandable model
Well done, a
spec as text in
a PDF…
Adam
An understandable model
Take that
Adam
An understandable model
Dunno what is
a Genotype but
it contains a
Variant.
Apparently.
Adam
An understandable model
Yeaaah:
generate
client == more
slack
Adam provides an Avro schema
Adam
An efficient storage
Machismo in
I.T., what a
flaw!
● Distribute data
● Schema based
● Read/query efficient
● Compact
Adam
An efficient storage
That’s a quick
step
● Distribute data
● Schema based
● Read/query efficient
● Compact
PARQUET!
Adam
An efficient storage
Is Eve okay to
use the
parquet for
that?
● Distribute data
● Schema based
● Read/query efficient
● Compact
PARQUET!
Adam provides Parquet as its storage format
Adam
A clean API
Object
Wrappedy
adam Context
Adam
A clean API
I could have
done this as a
one liner
adam Context
IO methods
Adam
A clean API
At least, it’s
going to be
simpler than
the chemistry
● Scala classes generated from Avro
● Data loaded as RDDs
● Functions on RDDs
○ Write to HDFS
○ Genomic object manipulations
○ Primitives to query genomics datasets
Adam
Part of a pipeline
human | Seq | SNAP | Avocado | Adam | GA4GH
ADAM is a JVM library leveraging
- Spark
- Avro
- Parquet
It still needs to be combined with sources (SNAP).
ADAM data is part of processes (Avocado).
It can also be the source for external processing and learning (like MLlib).
Thousand Genomes
Open Data Set
Games without
Frontiers
1000 genomes: http://www.1000genomes.org/
Produces BAMs, VCFs, ...
Thousand Genomes
Why do you
complain, they
are
compressed …
Thousand Genomes
Where are the data
DNA Russian
roulette:
which is
fastest?
● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
● S3: http://aws.amazon.com/1000genomes/
● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp
Thousand Genomes
Adam that shit on S3
Hmmm like in
the good old
days of HPC
The bad part …
● get the vcf.gz file on local disk (& time for a
coffee)
● uncompress (& go for lunch)
● put in HDFS (& take dessert)
Thousand Genomes
Adam that shit on S3
what?
No grappa?
The good part …
the Notebook (this one)
Thousand Genomes
Adam that shit on S3
Okay, good
enough to wait
a bit…
What did we gain?
● Before: 152 GB (gzipped) in 23 files
● After: 71 GB in 9172 partitions
(43,372,735,220 genotypes)
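To put these figures in perspective, here is a small back-of-the-envelope calculation using only the numbers quoted on the slide. It assumes decimal gigabytes (1 GB = 1000 MB) and evenly sized partitions, which the slide does not confirm.

```python
# Figures quoted on the slide.
before_gb, before_files = 152, 23
after_gb, after_partitions = 71, 9172
genotypes = 43_372_735_220

# Size of the ADAM/Parquet data relative to the gzipped VCFs.
ratio = after_gb / before_gb

# Average load per partition, assuming partitions are evenly sized (assumption).
per_partition = genotypes / after_partitions
mb_per_partition = after_gb * 1000 / after_partitions

print(f"size ratio after/before: {ratio:.2f}")
print(f"genotypes per partition: {per_partition:,.0f}")
print(f"MB per partition: {mb_per_partition:.1f}")
```

The storage is less than half the size of the gzipped input, and it is split into thousands of small partitions that can be read in parallel rather than 23 large compressed files.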
Explore Genomics
Access the data
Just in case,
you don’t
believe us -_-’
Access data from this notebook
Explore Genomics
Compute statistics
We’re there to
compute,
right?
Compute frequencies from this Spark
Notebook
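The notebook itself is not reproduced in this transcript. As a rough illustration of what computing allele frequencies means, here is a minimal plain-Python sketch. The record layout (variant id, sample id, list of alleles) is hypothetical and is not the ADAM schema; the minor allele frequency helper assumes biallelic variants.

```python
from collections import Counter, defaultdict

# Hypothetical genotype records: (variant_id, sample_id, alleles carried by the sample).
genotypes = [
    ("chr1:1000", "s1", ["A", "A"]),
    ("chr1:1000", "s2", ["A", "G"]),
    ("chr1:1000", "s3", ["G", "G"]),
    ("chr1:2000", "s1", ["C", "C"]),
    ("chr1:2000", "s2", ["C", "T"]),
    ("chr1:2000", "s3", ["C", "C"]),
]


def allele_frequencies(records):
    """Count alleles per variant, then normalise the counts to frequencies."""
    counts = defaultdict(Counter)
    for variant, _sample, alleles in records:
        counts[variant].update(alleles)
    return {
        variant: {allele: n / sum(c.values()) for allele, n in c.items()}
        for variant, c in counts.items()
    }


def minor_allele_frequency(freqs):
    """Frequency of the rarer allele per variant (0.0 if only one allele is seen)."""
    return {v: min(f.values()) if len(f) > 1 else 0.0 for v, f in freqs.items()}


freqs = allele_frequencies(genotypes)
maf = minor_allele_frequency(freqs)
print(freqs)
print(maf)
```

On a cluster, the same group-by-variant and count pattern would be expressed as a distributed aggregation over the genotype RDD rather than an in-memory dictionary.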
Learn Genomics
The problem
Insane, you’ll
have hard time
with me |:-[
How to deal with heterogeneous data?
● Population stratification
● Identify natural clusters
● Assign genomes to these clusters
Learn Genomics
The dimensions
Wiiiiiiiiiiiiiiiiide
rows
● 1000 Samples (Rows)
● 30,000,000 variants (columns or
variables)
Hard to explore such a feature space…
Learn Genomics
The dimensions
*LDA for
Latent
Dirichlet
Allocation…
Dimensionality reduction?
● Ideal would be a “Genetic” Mixture
measure (LDA* would do that…)
● Or a genetic distance (edit distance)
KMeans & distances to centroids
Learn Genomics
The model
Reduce, train,
validate, infer
● Split training/validation set
● Train KMeans with 25 clusters
● Compute distances to each centroid as
new features
● Train Random Forest
● Validation
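The dimensionality reduction step in the list above (KMeans, then distances to each centroid as new features) can be sketched in plain Python. This is a toy illustration, not the notebook's code: the 0/1/2 genotype encoding, the tiny data set, and `k=2` (instead of the 25 clusters on the slide) are assumptions, and the Random Forest and validation steps are left out because they need a machine learning library.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10, seed=0):
    """Basic Lloyd's algorithm; returns the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            [sum(col) / len(cluster) for col in zip(*cluster)] if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids


def centroid_features(point, centroids):
    """Replace a wide genotype vector by its distances to the centroids."""
    return [distance(point, c) for c in centroids]


# Hypothetical samples: one row per genome, one column per variant (0/1/2 alternate alleles).
samples = [[0, 0, 1, 0], [0, 1, 0, 0], [2, 2, 1, 2], [2, 1, 2, 2]]
centroids = kmeans(samples, k=2)
features = [centroid_features(s, centroids) for s in samples]
print(features)
```

Each sample shrinks from one value per variant to one value per cluster, which is what makes a classifier trainable on top of a 30,000,000-column input.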
Learn Genomics
The notebook
Define and train the model in this
Notebook
The whole
shebang?
Adam
Our pipeline
I am a Llama
Convert VCFs to ADAM
Store ADAM to S3
Compute allele frequencies
Store allele frequencies to S3
Compute minor allele frequency distribution
Train a model for stratification
Hmmm… quite some missing pieces, right?
PART II
Standards & Micro Services
Wake up!
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services
GA4GH
Let’s fix the baseline
In I.T. it’s easy
everything is
standardized…
Global Alliance for Genomics and Health
http://genomicsandhealth.org/
http://ga4gh.org/
Framework for responsible data sharing
● Define schemas
● Define services
Along with ethical, legal, security and clinical aspects
GA4GH
models
… everybody
has his own
standard
GA4GH
Services
But a shared
schema is a bit
better!
GA4GH
Metadata
The data of my
data is also
my data
Work In Progress
● Individual
● Sample
● Experiment
● Dataset
● IndividualGroup
● Analysis
But still very young
and too much centered on data
Beacon ⁽*⁾
Tells the world you have data.
Clearly not enough
Med At Scale
By Data Fellas
Existing scalable implementation:
Google Genomics
Uses
● BigQuery
● Google cloud computing
● Dremel
● …
That’s what
happens when
you think you
have…
Med At Scale
By Data Fellas
Google Genomics is pushing hard
…
Med At Scale
Scalability first
BIG
There is another scalable implementation:
Med At Scale, by Data Fellas
Uses
● Apache Spark
● Adam
● S3
● HDFS
● …
Med At Scale
Scalability first
Data Fellas is pushing TOO
BIG
Med At Scale
Composability
very BIG
GA4GH defines quite a few methods, or services.
They don’t all have the same requirements in terms of exposure and data processing
→ micro services for the win
This allows granular deployment and composition/chaining of methods to answer a global question.
Med At Scale
Customization
Data Fellas is a data science company
Thus our goal is to expose data analyses
A data analysis is
● elaborated in a notebook
● validated on a cluster
● deployed as a micro service itself
Still defining a Schema and Service
VERY VERY BIG
Med At Scale
Ready for the load
Balls!
We saw that one row has
30,000,000 columns
The queries are slicing and dicing those columns
→ views are huge
Hence, Tachyon via RDD.persist/save will optimize
the co-located queries in space and time.
The hard part (will be/)is to size the Tachyon cluster
Med At Scale
Ad Hoc Analytics
Who left the
rats out?
Standards are very important
However, they cannot define everything,
mostly OLAP.
Ad-Hoc analytics are thus allowed on the raw
data using Apache Spark directly.
Of course, interactivity is a key to
performance… hence the Spark-Notebook is
involved.
Med At Scale
How it works
Finally…
Med At Scale
ADAM (and Spark)
Finally…
Med At Scale
MLlib (and Spark)
Finally…
Med At Scale
Efficient binary data
Finally…
Med At Scale
Micro Service
Finally…
Med At Scale
Cache and Collaboration
Finally…
Explore
Using GA4GH endpoints
notebook TIME!
Use the Scala/Java Avro client from
the browser.
I give you
Bananas
You give me
Ananas
Customize
Create and Use micro service (WIP)
Planning the
next gear
Remember the frequencies use case?
There is a custom endpoint manually created
We’re working on an Integrated Workflow
In a notebook:
● create the process
● create Cassandra schema
● persist (using connector)
● Define service Avro IDL
● Generate project for DCOS
● Log usage (see next)
Optimization
Query mining (Roadmap)
Always look
at the bright
side
Back to the high dimensionality problem
Caching beforehand is a good solution
but is not optimal.
Plan: analyse the Request/Response
objects and the gathered runtime metrics
to adapt the caching policies (query
mining processes)
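Since this item is on the roadmap, no implementation is shown in the deck. One very simple reading of "query mining" is to count which regions are requested most often and cache those first. The sketch below is only that: the request log format and the `regions_to_cache` helper are hypothetical, and a real policy would also use the runtime metrics mentioned above.

```python
from collections import Counter

# Hypothetical request log: each entry is the genomic region a query asked for.
requests = [
    "chr1:0-1M", "chr2:0-1M", "chr1:0-1M",
    "chr7:5M-6M", "chr1:0-1M", "chr2:0-1M",
]


def regions_to_cache(request_log, capacity):
    """Pick the `capacity` most frequently requested regions."""
    return [region for region, _count in Counter(request_log).most_common(capacity)]


hot = regions_to_cache(requests, capacity=2)
print(hot)  # ['chr1:0-1M', 'chr2:0-1M']
```

The point of the design is that the cache content follows observed usage instead of being fixed up front.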
References
Adam: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
Spark-Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Q/A⁽*⁾
THANKS!
⁽*⁾ or head to the pub (at least beers…)

Weitere ähnliche Inhalte

Was ist angesagt?

Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMfnothaft
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Andy Petrella
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaAndy Petrella
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014StampedeCon
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsYasin Memari
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Timothy Danford
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMMatt Massie
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleDatabricks
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...Sri Ambati
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...Sri Ambati
 
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB
 
2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekinge2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekingeProf. Wim Van Criekinge
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIAllen Day, PhD
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 

Was ist angesagt? (20)

Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Lightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and ScalaLightning fast genomics with Spark, Adam and Scala
Lightning fast genomics with Spark, Adam and Scala
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014
 
Challenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data GenomicsChallenges and Opportunities of Big Data Genomics
Challenges and Opportunities of Big Data Genomics
 
Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?Why is Bioinformatics a Good Fit for Spark?
Why is Bioinformatics a Good Fit for Spark?
 
Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
From Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at ScaleFrom Genomics to Medicine: Advancing Healthcare at Scale
From Genomics to Medicine: Advancing Healthcare at Scale
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
2016 02 23_biological_databases_part1
2016 02 23_biological_databases_part12016 02 23_biological_databases_part1
2016 02 23_biological_databases_part1
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
qconsf 2013: Top 10 Performance Gotchas for scaling in-memory Algorithms - Sr...
 
Tese phd
Tese phdTese phd
Tese phd
 
2017 biological databases_part1_vupload
2017 biological databases_part1_vupload2017 biological databases_part1_vupload
2017 biological databases_part1_vupload
 
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
MongoDB and the Connectivity Map: Making Connections Between Genetics and Dis...
 
2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekinge2016 bioinformatics i_databases_wim_vancriekinge
2016 bioinformatics i_databases_wim_vancriekinge
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 

Ähnlich wie Spark meetup london share and analyse genomic data at scale with spark, adam, tachyon and the spark notebook

Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsDomino Data Lab
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Codemotion
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...Andy Petrella
 
From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valuePeadar Coyle
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With PythonSarah Guido
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysisodsc
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesDatabricks
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...Daniel Zivkovic
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesLynn Langit
 
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Austin Ogilvie
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInCarl Steinbach
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesCodePolitan
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseFormulatedby
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami PresentationGreg Werner
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Ahmed Kamal
 

Ähnlich wie Spark meetup london share and analyse genomic data at scale with spark, adam, tachyon and the spark notebook (20)

Leveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science ToolsLeveraging Open Source Automated Data Science Tools
Leveraging Open Source Automated Data Science Tools
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
Artificial Intelligence in practice - Gerbert Kaandorp - Codemotion Amsterdam...
 
Distributed machine learning 101 using apache spark from a browser devoxx.b...
Distributed machine learning 101 using apache spark from a browser   devoxx.b...Distributed machine learning 101 using apache spark from a browser   devoxx.b...
Distributed machine learning 101 using apache spark from a browser devoxx.b...
 
From Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into valueFrom Lab to Factory: Or how to turn data into value
From Lab to Factory: Or how to turn data into value
 
Analyzing Data With Python
Analyzing Data With PythonAnalyzing Data With Python
Analyzing Data With Python
 
Data Science as Scale
Data Science as ScaleData Science as Scale
Data Science as Scale
 
Recurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text AnalysisRecurrent Neural Networks for Text Analysis
Recurrent Neural Networks for Text Analysis
 
Rental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean DownesRental Cars and Industrialized Learning to Rank with Sean Downes
Rental Cars and Industrialized Learning to Rank with Sean Downes
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
 
Cloud accounting software uk
Cloud accounting software ukCloud accounting software uk
Cloud accounting software uk
 
Understanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examplesUnderstanding Jupyter notebooks using bioinformatics examples
Understanding Jupyter notebooks using bioinformatics examples
 
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
Applied Data Science: Building a Beer Recommender | Data Science MD - Oct 2014
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Machine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & OpportunitiesMachine Learning - Challenges, Learnings & Opportunities
Machine Learning - Challenges, Learnings & Opportunities
 
LinkedIn
LinkedInLinkedIn
LinkedIn
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami Presentation
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 

Mehr von Andy Petrella

Data Observability Best Pracices
Data Observability Best PracicesData Observability Best Pracices
Data Observability Best PracicesAndy Petrella
 
How to Build a Global Data Mapping
How to Build a Global Data MappingHow to Build a Global Data Mapping
How to Build a Global Data MappingAndy Petrella
 
Interactive notebooks
Interactive notebooksInteractive notebooks
Interactive notebooksAndy Petrella
 
Governance compliance
Governance   complianceGovernance   compliance
Governance complianceAndy Petrella
 
Data science governance and GDPR
Data science governance and GDPRData science governance and GDPR
Data science governance and GDPRAndy Petrella
 
Data science governance : what and how
Data science governance : what and howData science governance : what and how
Data science governance : what and howAndy Petrella
 
Scala: the unpredicted lingua franca for data science
Scala: the unpredicted lingua franca  for data scienceScala: the unpredicted lingua franca  for data science
Scala: the unpredicted lingua franca for data scienceAndy Petrella
 
Agile data science with scala
Agile data science with scalaAgile data science with scala
Agile data science with scalaAndy Petrella
 
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...
Agile data science: Distributed, Interactive, Integrated, Semantic, Micro Ser...Andy Petrella
 
What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.What is a distributed data science pipeline. how with apache spark and friends.
What is a distributed data science pipeline. how with apache spark and friends.Andy Petrella
 
Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Towards a rebirth of data science (by Data Fellas)
Towards a rebirth of data science (by Data Fellas)Andy Petrella
 
Leveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformLeveraging mesos as the ultimate distributed data science platform
Leveraging mesos as the ultimate distributed data science platformAndy Petrella
 
Distributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserDistributed machine learning 101 using apache spark from the browser
Distributed machine learning 101 using apache spark from the browserAndy Petrella
 
Liège créative: Open Science
Liège créative: Open ScienceLiège créative: Open Science
Liège créative: Open ScienceAndy Petrella
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Quanti-litative Revolution in GIS
Quanti-litative Revolution in GISQuanti-litative Revolution in GIS
Quanti-litative Revolution in GISAndy Petrella
 
Scala and-fp-in-big-data
Scala and-fp-in-big-dataScala and-fp-in-big-data
Scala and-fp-in-big-dataAndy Petrella
 
Software Crafted And Libraries Available
Software Crafted And Libraries AvailableSoftware Crafted And Libraries Available
Software Crafted And Libraries AvailableAndy Petrella
 

Spark Meetup London: Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook

  • 1. Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook. By Data Fellas, Spark London Meetup, July 1st, 2015.
  • 2. Outline. PART I: Adam: genomics on Spark; 1K Genomes in Adam on S3; Explore: compute stats; Learn: train a model. PART II: GA4GH: standard for genomics; med-at-scale project; Explore: using standards; Create custom micro services.
  • 3. Andy Petrella (@noootsab): Maths, Scala, Apache Spark, Spark Notebook, Trainer, Data Banana. Xavier Tordoir (@xtordoir): Physics, Bioinformatics, Scala, Spark.
  • 4. PART I: Spark & Genomics. Adam: genomics on Spark; 1K Genomes in Adam on S3; Explore: compute stats; Learn: train a model. “So that’s the thing that separates us?”
  • 5. Adam: What is genomics data? “Okay, sounds good. Give me two of them!” The genome is an important factor in health: medical diagnostics, drug response, disease mechanisms, …
  • 6. Adam: What is genomics data? “You mean devs are slacking off?” On the data production: fast biotech progress. Not so fast IT progress?
  • 7. Adam: What is genomics data? “No! They’re just sticky bubbles...” On the data production: sequence {A, T, G, C}, 3 billion bases.
  • 8. Adam: What is genomics data? “Okay, a lot of bubbles.” On the data production: sequence {A, T, G, C}, 3 billion bases … x 30 (x 60?)
  • 9. Adam: What is genomics data? “C’mon. A big mess of plenty of lil’ bubbles then.” On the data production: massively parallel. Sequence {A, T, G, C}, 3 billion bases … x 30 (x 60?)
  • 10. Adam: What is genomics data? “Ah, that explains why the black bars are different.”
  • 11. Adam: What is genomics data? “Dude...” Tens of millions.
  • 12. Adam: What is genomics data? “Staaaaaaph.” Tens of millions, 1000’s, 1,000,000’s, …
  • 13. Adam: What is genomics data? “‘Coz it makes sparkling bubbles, right?” OK, looks like Apache Spark makes a lot of sense here…
  • 14. Adam: An understandable model. “Well done, a spec as text in a PDF…”
  • 15. Adam: An understandable model. “Take that.”
  • 16. Adam: An understandable model. “Dunno what a Genotype is, but it contains a Variant. Apparently.”
  • 17. Adam: An understandable model. “Yeaaah: generate client == more slack.” Adam provides an Avro schema.
  • 18. Adam: An efficient storage. “Machismo in I.T., what a flaw!” ● Distribute data ● Schema based ● Read/query efficient ● Compact
  • 19. Adam: An efficient storage. “That’s a quick step.” ● Distribute data ● Schema based ● Read/query efficient ● Compact. PARQUET!
  • 20. Adam: An efficient storage. “Is Eve okay to use the parquet for that?” ● Distribute data ● Schema based ● Read/query efficient ● Compact. PARQUET! Adam provides Parquet as storage format.
  • 22. Adam: A clean API. “I could have done this as a one-liner.” Adam context I/O methods.
  • 23. Adam: A clean API. “At least, it’s going to be simpler than the chemistry.” ● Scala classes generated from Avro ● Data loaded as RDDs ● Functions on RDDs: ○ write to HDFS ○ genomic object manipulations ○ primitives to query genomics datasets
  • 24. Adam: Part of a pipeline. human | Seq | SNAP | Avocado | Adam | GA4GH. ADAM is a JVM library leveraging Spark, Avro and Parquet. It still needs to be combined with sources (SNAP). Adam data is part of processes (Avocado). It can also be the source for external processing and learning (like MLlib).
  • 25. Thousand Genomes: Open data set. “Games without Frontiers.” 1000 Genomes: http://www.1000genomes.org/
  • 26. Thousand Genomes: Produces BAMs, VCFs, ... “Why do you complain, they are compressed…”
  • 27. Thousand Genomes: Where are the data? “DNA Russian roulette: which is fastest?” ● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/ ● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/ ● S3: http://aws.amazon.com/1000genomes/ ● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp
  • 28. Thousand Genomes: Adam that shit on S3. “Hmmm, like in the good old days of HPC.” The bad part… ● get the vcf.gz file on local disk (& time for a coffee) ● uncompress (& go for lunch) ● put in HDFS (& take dessert)
  • 29. Thousand Genomes: Adam that shit on S3. “What? No grappa?” The good part… the Notebook (this one).
  • 30. Thousand Genomes: Adam that shit on S3. “Okay, good enough to wait a bit…” What did we gain? ● Before: 152 GB (gzipped) in 23 files ● After: 71 GB in 9172 partitions (43,372,735,220 genotypes)
  • 31. Explore Genomics: Access the data. “Just in case you don’t believe us -_-’” Access data from this notebook.
  • 32. Explore Genomics: Compute statistics. “We’re there to compute, right?” Compute freqs from this Spark Notebook.
  • 33. Learn Genomics: The problem. “Insane, you’ll have a hard time with me |:-[” How to deal with heterogeneous data? ● Population stratification ● Identify natural clusters ● Assign genomes to these clusters
  • 34. Learn Genomics: The dimensions. “Wiiiiiiiiiiiiiiiiide rows.” ● 1000 samples (rows) ● 30,000,000 variants (columns or variables). Hard to explore such a feature space…
  • 35. Learn Genomics: The dimensions. Dimensionality reduction? ● Ideal would be a “genetic” mixture measure (LDA* would do that…) ● Or a genetic distance (edit distance). KMeans & distances to centroids. *LDA stands for Latent Dirichlet Allocation.
  • 36. Learn Genomics: The model. Reduce, train, validate, infer. ● Split training/validation set ● Train KMeans with 25 clusters ● Compute distances to each centroid as new features ● Train Random Forest ● Validation
  • 37. Learn Genomics: The notebook. “The whole shebang?” Define and train the model in this Notebook.
  • 38. Adam: Our pipeline. “I am a Llama.” ● Convert VCFs to ADAM ● Store ADAM to S3 ● Compute allele frequencies ● Store allele frequencies to S3 ● Compute minor allele frequency distribution ● Train a model for stratification. Hmmm… quite some missing pieces, right?
  • 39. PART II: Standards & Micro Services. “Wake up!” GA4GH: standard for genomics; med-at-scale project; Explore: using standards; Create custom micro services.
  • 40. GA4GH: Let’s fix the baseline. “In I.T. it’s easy, everything is standardized…” Global Alliance for Genomics and Health: http://genomicsandhealth.org/ http://ga4gh.org/ Framework for responsible data sharing: ● Define schemas ● Define services. Along with ethical, legal, security and clinical aspects.
  • 43. GA4GH: Metadata. “The data of my data is also my data.” Work in progress: ● Individual ● Sample ● Experiment ● Dataset ● IndividualGroup ● Analysis. But still very young and too much centered on data. Beacon ⁽*⁾ Tells the world you have data. Clearly not enough.
  • 44. Med At Scale, by Data Fellas. Existing scalable implementation: Google Genomics. Uses ● BigQuery ● Google cloud computing ● Dremel ● … “That’s what happens when you think you have…”
  • 45. Med At Scale, by Data Fellas. Google Genomics is pushing hard…
  • 46. Med At Scale: Scalability first. “BIG.” There is another scalable implementation: Med At Scale, by Data Fellas. Uses ● Apache Spark ● Adam ● S3 ● HDFS ● …
  • 47. Med At Scale: Scalability first. Data Fellas is pushing. “TOO BIG.”
  • 48. Med At Scale: Composability. “Very BIG.” GA4GH defines quite a few methods, or services. They don’t all have the same requirements in terms of exposure and data processing → micro services for the win. This allows granular deployment and composition/chaining of methods to answer a global question.
  • 49. Med At Scale: Customization. “VERY VERY BIG.” Data Fellas is a data science company, thus our goal is to expose data analyses. A data analysis is ● elaborated in a notebook ● validated on a cluster ● deployed as a micro service itself. Still defining a schema and service.
  • 50. Med At Scale: Ready for the load. “Balls!” We saw that one row has 30,000,000 columns. The queries are slicing and dicing those columns → views are huge. Hence, Tachyon via RDD.persist/save will optimize the collocated queries in space and time. The hard part is (or will be) sizing the Tachyon cluster.
  • 51. Med At Scale: Ad hoc analytics. “Who left the rats out?” Standards are very important. However, they cannot define everything, mostly OLAP. Ad hoc analytics are thus allowed on the raw data using Apache Spark directly. Of course, interactivity is key to performance… hence the Spark Notebook is involved.
  • 52. Med At Scale: How it works. “Finally…”
  • 53. Med At Scale: ADAM (and Spark). “Finally…”
  • 54. Med At Scale: MLlib (and Spark). “Finally…”
  • 55. Med At Scale: Efficient binary data. “Finally…”
  • 56. Med At Scale: Micro Service. “Finally…”
  • 57. Med At Scale: Cache and Collaboration. “Finally…”
  • 58. Explore: Using GA4GH endpoints. “Notebook TIME!” Use the Scala/Java Avro client from the browser. “I give you Bananas, you give me Ananas.”
  • 59. Customize: Create and use micro service (WIP). “Planning the next gear.” Remember the frequencies use case? There is a custom endpoint, manually created. We’re working on an integrated workflow. In a notebook: ● create the process ● create Cassandra schema ● persist (using connector) ● define service Avro IDL ● generate project for DCOS ● log usage (see next)
  • 60. Optimization: Query mining (roadmap). “Always look at the bright side.” Back to the high dimensionality problem: caching beforehand is a good solution but is not optimal. Plan: analyse the request/response objects and the gathered runtime metrics to adapt the caching policies (query mining processes).
  • 61. References Adam: https://github.com/bigdatagenomics/adam Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats GA4GH website: http://genomicsandhealth.org/ GA4GH data working group: http://ga4gh.org/ Spark-Notebook: https://github.com/andypetrella/spark-notebook/ Med-At-Scale: https://github.com/med-at-scale/high-health Data Fellas: http://data-fellas.guru/
  • 62. Q/A⁽*⁾ THANKS! ⁽*⁾ or head to the pub (at least beers…)
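
Slides 32 and 38 mention computing allele frequencies and a minor allele frequency (MAF) distribution, but the notebooks themselves are not part of this transcript. The snippet below is only a minimal local sketch of that arithmetic in plain Python. It is not the speakers' notebook code and does not use the ADAM or Spark APIs. It assumes diploid samples, biallelic variants, and genotypes encoded as the number of alternate alleles per sample (0, 1 or 2, with None for a missing call). All function names and the toy data are hypothetical.

```python
from collections import Counter


def alt_allele_frequency(alt_counts):
    """Frequency of the alternate allele for one variant.

    alt_counts holds, per diploid sample, the number of alternate
    alleles carried (0, 1 or 2); None marks a missing call.
    """
    called = [c for c in alt_counts if c is not None]
    if not called:
        return None  # no usable genotype for this variant
    # Each called diploid sample contributes two alleles.
    return sum(called) / (2 * len(called))


def minor_allele_frequency(freq):
    """Frequency of the rarer allele at a biallelic site."""
    return min(freq, 1.0 - freq)


def maf_distribution(mafs, bins=5):
    """Histogram of MAF values over equal-width bins on [0, 0.5]."""
    width = 0.5 / bins
    # A MAF of exactly 0.5 is folded into the last bin.
    hist = Counter(min(int(m / width), bins - 1) for m in mafs)
    return [hist.get(i, 0) for i in range(bins)]


# Hypothetical toy data: variant id -> alternate-allele count per sample.
genotypes = {
    "var1": [0, 1, 2, 0],
    "var2": [2, 2, 1, None],
    "var3": [0, 0, 0, 1],
}

freqs = {v: alt_allele_frequency(g) for v, g in genotypes.items()}
mafs = {v: minor_allele_frequency(f) for v, f in freqs.items()}
histogram = maf_distribution(list(mafs.values()))
```

On the dataset described in slide 30, the same per-variant reduction would presumably run as a distributed aggregation over the genotype records rather than over an in-memory dictionary; the sketch only shows what is being computed for each variant.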