4. Generate Marker Fingerprint
Select & Recombine
Sample tissue
Breeding
Genotyping Lab
Extract DNA
Analyze & Model Data
Grow
Marker-Assisted Breeding Rapidly Increases Frequency of
Favorable Genes
Cloud ML
TensorFlow
5. AI & ML
what you need to know
Machine Learning:
Make Machines
Learn
Artificial Intelligence:
Make Intelligent
Machines
Programming a computer to be intelligent is hard.
Programming a computer to learn to be intelligent is easier, and progress is measurable.
6. Image understanding is (getting) better than human level
ImageNet Challenge: Given an image, predict one of 1000+ classes
[Chart: % errors]
* Human performance based on analysis done by Andrej Karpathy.
7. Deep Neural Networks: Algorithms that Learn
● Modernization of artificial neural networks
● Made of simple mathematical units,
organized in layers, that together can
compute some (arbitrary) function
● More layers = deeper = more general
● Learn from raw, heterogeneous data
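To make "simple mathematical units, organized in layers" concrete, here is a minimal sketch of a two-layer network that computes XOR, a function no single threshold unit can compute alone. The weights are hand-picked for illustration (an assumption of this sketch); a real deep network would learn them from data, and the helper names are hypothetical.

```python
# Minimal sketch: a two-layer network of simple units computing XOR.
# Weights are hand-picked for illustration (hypothetical), not learned.

def step(x):
    """Threshold activation: 1 when the input is positive, else 0."""
    return 1 if x > 0 else 0

def unit(inputs, weights, bias):
    """One simple mathematical unit: weighted sum, then a threshold."""
    return step(sum(w * i for w, i in zip(weights, inputs)) + bias)

def layer(inputs, units):
    """A layer applies several units to the same inputs."""
    return [unit(inputs, weights, bias) for weights, bias in units]

# Hidden layer: the first unit computes OR, the second computes AND.
HIDDEN = [([1.0, 1.0], -0.5), ([1.0, 1.0], -1.5)]
# Output layer: fires for "OR but not AND", which is XOR.
OUTPUT = [([1.0, -1.0], -0.5)]

def xor_net(a, b):
    """Feed two binary inputs through the hidden and output layers."""
    hidden = layer([a, b], HIDDEN)
    return layer(hidden, OUTPUT)[0]

results = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
print(results)
```

Stacking the second layer on top of the first is what lets the network express a function that neither layer could express alone.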
8. “Given an image,
predict one of
1000+ classes”
Image credit:
360phot0.blogspot.com
ImageNet
Challenge
9. TensorFlow
Released in Nov. 2015
#1 repository for the “machine learning” category on GitHub
11. Transfer Learning
Quickly able to Learn New Concepts
“t-rex” “quidditch”
Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
13. Generate Marker Fingerprint
Select & Recombine
Sample tissue
Breeding
Genotyping Lab
Extract DNA
Analyze & Model Data
Grow
Marker-Assisted Breeding Rapidly Increases Frequency of
Favorable Genes
Cloud ML
TensorFlow
14. Genomics & Genetics Problems:
How to Start Applying DNNs?
Must-haves for deep learning:
● Lots of data: >50k examples, >1M examples ideal
● High-quality input and labels for training
● Label ~ F(data): F is unknown, but such a function certainly exists
● High-quality previous efforts, so we know that DNNs are key
○ i.e. the problem is hard to solve with classical statistical
approaches
SNP and indel calling from NGS data
17. Creating a universal SNP and small indel
variant caller with deep neural networks
Ryan Poplin, Cory McLean, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy,
Sam Gross, Madeleine Cule, Peyton Greenside, Justin Zook, Marc Salit, Mark
DePristo, Verily Life Sciences, October 2016
18. DNN (Inception V3) Predicts True Genotype from Pileup Images
Input:
Millions of labeled pileup images (raw pixels) from gold standard samples
Output:
Probability of diploid genotype states { HOM_REF, HET, HOM_VAR }
{ 0.001, 0.994, 0.005 }
{ 0.001, 0.990, 0.009 }
{ 0.000, 0.001, 0.999 }
{ 0.600, 0.399, 0.001 }
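As a rough illustration of how the output probabilities on this slide could be turned into genotype calls, the sketch below picks the most probable state for each vector. The Phred-style quality score, its cap, and the function names are assumptions added for illustration; they are not taken from the slide or from DeepVariant itself.

```python
import math

# Genotype states in the order shown on the slide.
GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")

# Probability vectors copied from the slide, ordered as GENOTYPES.
SLIDE_OUTPUTS = [
    (0.001, 0.994, 0.005),
    (0.001, 0.990, 0.009),
    (0.000, 0.001, 0.999),
    (0.600, 0.399, 0.001),
]

def call_genotype(probs):
    """Return the most probable genotype state and its probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return GENOTYPES[best], probs[best]

def phred_quality(p_call, cap=99.0):
    """Phred-scale the chance that the call is wrong: -10 * log10(1 - p).

    Assumption: this scaling and the cap are illustrative choices only.
    """
    p_error = 1.0 - p_call
    if p_error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_error))

calls = [call_genotype(probs)[0] for probs in SLIDE_OUTPUTS]
for probs in SLIDE_OUTPUTS:
    genotype, p = call_genotype(probs)
    print(genotype, round(phred_quality(p), 1))
```

The last vector (0.600 vs 0.399) still yields a call, but with a much lower quality score than the others, which is the kind of site a downstream filter might flag.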
19. DeepVariant #1 in PrecisionFDA Truth Challenge
v2 => v3 truth set for unblinded sample
Unblinded => blinded sample with v3 truth set
99.85
99.70
98.91
24. Public Datasets Project
https://cloud.google.com/bigquery/public-data/
A public dataset is any dataset that is stored in BigQuery and made available to the general public. This URL lists a
special group of public datasets that Google BigQuery hosts for you to access and integrate into your applications.
Google pays for the storage of these datasets and provides public access to the data via BigQuery. You pay only for the
queries that you perform on the data (the first 1 TB per month is free).
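The pricing rule above reduces to simple arithmetic. The sketch below computes how much of a month's query volume falls outside the free 1 TB; the function name and the per-query sizes are hypothetical, and actual billing may depend on details not covered on this slide.

```python
# From the slide: the first 1 TB of queries per month is free.
FREE_TB_PER_MONTH = 1.0

def billable_tb(query_sizes_tb, free_tb=FREE_TB_PER_MONTH):
    """Return the query volume (in TB) beyond the monthly free allowance."""
    total = sum(query_sizes_tb)
    return max(0.0, total - free_tb)

# Hypothetical per-query scan sizes for one month, in TB.
month = [0.25, 0.5, 0.75]
print(billable_tb(month))
```

Storage does not appear in the calculation because, per the slide, Google covers it for public datasets.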
30. Google confidential │ Do not distribute
Google is good at handling massive volumes of data
uploads per minute: 300hrs
users: 500M+
search index: 100PB+
query response time: 0.25s
31. Google can Handle Massive Amounts of Genomic Data
uploads per minute: 300hrs (~6 Maize WGS)
users: 500M+ (>100x US PhDs)
search index: 100PB+ (~1M WGS)
query response time: 0.25s (0.25s)
33. New Public Dataset: 1K Cannabis
cloud.google.com/bigquery/public-data/1000-cannabis
Blog Post @ Medium:
DNA Sequencing of 1K Cannabis Strains publicly available in Google BigQuery
Open Source:
https://github.com/allenday/bfx-seq
Revise Models
DNA Reads
34. Build What’s Next
Thank You!
Allen Day, PhD // Science Advocate // @allenday // #genomics #ml #datascience