#Code2Cure: A field guide for software engineers on their journey to the world of genomics.

#Code2Cure: Engineering Genomics
A field guide for software engineers on their journey to the world of genomics.
Amirhossein Kiani
Sr. Lead Software Engineer
@mirkiani
amir@bina.com
Image courtesy of http://circos.ca
DISCLAIMER: The views expressed in this talk are mine alone and not those of my employer.
Bina products are for Research Use Only. Not for use in diagnostic procedures.
Also, I’m a computer scientist by training, trying to help people with a similar background learn about genomics. The scientific concepts in this talk are therefore heavily simplified.

 https://www.youtube.com/watch?v=G1ZLyGW8rKY

www.bina.com
Why Genomics?
Past: $3,000,000,000, 13 years
 http://en.wikipedia.org/wiki/Human_Genome_Project
Present: $1,000, 24 hours
Future

Why Genomics?
Some things we could do with genomics:
• Carrier Screening
• Prenatal Screening
• Newborn Screening
• Inherited Disease
• Infectious Disease
• Cancer Diagnostics
• Microbiome
• Personalized Medicine

But I have no genomics background!
It’s ok. 

My personal story…
Now
Then

What is a cell? What is DNA?
 http://en.wikipedia.org/wiki/Cell_%28biology%29
 http://en.wikipedia.org/wiki/DNA
Image courtesy of Pinterest
Image courtesy of Tumblr

Crash Course on Genomics
Genomics is the field that studies the structure and function of genomes.
 http://en.wikipedia.org/wiki/Genomics
 http://en.wikipedia.org/wiki/RNA
 http://en.wikipedia.org/wiki/Protein
DNA → RNA → Protein → You!

How do we figure out what’s in DNA?
As with everything else, we turn the analog signal into a digital one and then analyze it.
 http://en.wikipedia.org/wiki/DNA_sequencing
 http://en.wikipedia.org/wiki/FASTQ_format
Sequencers (Illumina, Ion Torrent, Genia, …) → Primary Analysis → FASTQ Format
Image courtesy of PersonalGenomes.org
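To make the FASTQ format concrete, here is a minimal Python sketch of a reader. It assumes the common four-line record layout and Phred+33 quality encoding; `read_fastq` and the sample record are made up for illustration and are not part of any tool mentioned in this talk.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]


# Illustrative record, not real sequencer output.
SAMPLE = "@read1\nGATTACA\n+\nIIIIII!\n"
records = list(read_fastq(io.StringIO(SAMPLE)))
for read_id, sequence, scores in records:
    # A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
    print(read_id, sequence, scores)
```

Each base comes with its own quality score, which is why FASTQ files are roughly twice the size of the bare sequence.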

Raw Data to Variants (Secondary Analysis)
Step 1. Alignment
 http://en.wikipedia.org/wiki/DNA_sequencing
 http://en.wikipedia.org/wiki/FASTQ_format
Image courtesy of Wall Woodworks
Image courtesy of Wallpaper Up

From “Raw” DNA to “Variants” (Secondary Analysis)
Step 1. Short-Read Sequence Alignment
 http://en.wikipedia.org/wiki/Reference_genome
 http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
 http://en.wikipedia.org/wiki/Indel
 http://en.wikipedia.org/wiki/Structural_variation
Reference Genome (~3B bases):
AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG
Aligned short reads:
ACTTTGGTCCACCCAAGG
AAGGGGGACACCCAAGGACACCC__GGGGGAAACT
GGACACCCAAGGGGGAA
ACCCAAGGGGGACACCC
ACCC__GGGGGAAACTTTG
AACACACCC__GGGGGAA
Figure labels: Coverage; Deletion (the "__" gaps); Single Nucleotide Polymorphism
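The idea behind the picture can be sketched in a few lines of Python. This is a brute-force, ungapped toy: it slides each read along the reference, keeps the placement with the fewest mismatches, and then counts coverage. The function names `align` and `pileup` are made up here, and the reads with "__" deletions are left out because this sketch does not handle gaps. Real aligners such as BWA use an index and gapped alignment instead.

```python
REFERENCE = "AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG"
READS = ["ACTTTGGTCCACCCAAGG", "GGACACCCAAGGGGGAA", "ACCCAAGGGGGACACCC"]


def align(reference, read):
    """Return (offset, mismatches) of the best ungapped placement of read."""
    best_offset, best_mismatches = -1, len(read) + 1
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best_mismatches:  # ties keep the leftmost placement
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


def pileup(reference, reads):
    """Return per-position coverage and a list of (position, ref_base, read_base) mismatches."""
    coverage = [0] * len(reference)
    mismatches = []
    for read in reads:
        offset, _ = align(reference, read)
        for i, base in enumerate(read):
            coverage[offset + i] += 1
            if base != reference[offset + i]:
                mismatches.append((offset + i, reference[offset + i], base))
    return coverage, mismatches


coverage, mismatches = pileup(REFERENCE, READS)
print("coverage:", coverage)
print("candidate SNP positions:", mismatches)
```

A position where several reads disagree with the reference in the same way is a candidate SNP; a single disagreeing read is more likely a sequencing error.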

• Burrows-Wheeler Aligner (BWA)
• Uses the Burrows-Wheeler transform (also used in bzip2)
• Uses the Smith-Waterman algorithm
• Written in C
• Uses ~4GB memory for human genome
 http://bio-bwa.sourceforge.net
 http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html
Example:
$ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
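For a feel of what the Burrows-Wheeler transform buys you, here is a toy Python version. It assumes a `$` sentinel that sorts before every base. The quadratic construction and the naive counting inside `count_occurrences` are for illustration only; BWA itself uses far more compact index structures.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (fine for toy inputs only)."""
    assert sentinel not in text
    text += sentinel
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(text, pattern):
    """Count exact matches of pattern in text using backward search on the BWT."""
    last = bwt(text)
    first = sorted(last)  # first column of the sorted rotation matrix
    lo, hi = 0, len(last)  # current range of matching rotations, [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        start = first.index(ch)  # rows starting with ch begin here
        lo = start + last[:lo].count(ch)
        hi = start + last[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


print(bwt("banana"))
print(count_occurrences("banana", "ana"))
```

The search walks the pattern from right to left and never rescans the text, which is what makes lookups against a ~3B base reference practical.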

FASTQ → Alignment → SAM
SAM → Convert to Binary, BGZF compression (samtools) → BAM File
BAM File → BAM File Index
 http://samtools.github.io/hts-specs/SAMv1.pdf
 http://samtools.github.io
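A SAM file is plain tab-separated text, so a first look needs nothing more than string splitting. The sketch below parses one alignment line in Python; the helper name `parse_sam_line` is made up, only a few of the mandatory columns are kept, and the sample line follows the style of the example in the SAM specification linked above.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line into a dict."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    cigar = [(int(length), op) for length, op in CIGAR_OP.findall(fields[5])]
    return {
        "qname": fields[0],
        "flag": flag,
        "is_unmapped": bool(flag & 0x4),
        "is_reverse": bool(flag & 0x10),
        "rname": fields[2],
        "pos": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": cigar,
        "reference_span": sum(n for n, op in cigar if op in REFERENCE_CONSUMING),
        "seq": fields[9],
    }


SAMPLE_LINE = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
record = parse_sam_line(SAMPLE_LINE)
print(record["rname"], record["pos"], record["cigar"], record["reference_span"])
```

For real BAM files you would use an existing library rather than hand-rolled parsing; the point here is only that the format itself is approachable.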

BAM File
BAM File Index
 http://www.broadinstitute.org/igv
 https://github.com/ekg/freebayes
 http://arxiv.org/abs/1207.3907
 https://www.broadinstitute.org/gatk
Visualize: Integrative Genomics Viewer (IGV)
Variant Calling
Example:
$ freebayes -f ref.fa aln.bam > var.vcf

… and here are your variants (VCF file)! 
 http://samtools.github.io/hts-specs/VCFv4.2.pdf
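Like SAM, VCF is tab-separated text. Here is a small Python sketch that reads one data line; `parse_vcf_line` is a made-up helper that handles only the eight fixed columns, and the sample line follows the style of the example in the VCF 4.2 specification.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    chrom, pos, variant_id, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    info_fields = {}
    if info != ".":
        for item in info.split(";"):
            key, has_value, value = item.partition("=")
            info_fields[key] = value if has_value else True  # bare keys are flags
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position on the reference
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles may be listed
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_fields,
    }


SAMPLE_LINE = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
record = parse_vcf_line(SAMPLE_LINE)
print(record["chrom"], record["pos"], record["ref"], record["alt"], record["info"])
```

Values in the INFO column are left as strings here; their types are declared in the VCF header, which this sketch ignores.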

What do we do with variant calls then?
Zooming in on the Central Dogma of Molecular Biology:
• The genetic code is redundant: several codons encode the same amino acid.
• But a mutation can still change the amino acid a codon encodes, and with it the protein.
Image courtesy of Wikipedia
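Both bullets can be shown with the standard codon table. The Python sketch below builds the table in its usual compact form and translates a short, made-up coding sequence before and after a single-base change. `translate` and `classify` are illustrative names, and the sequence is assumed to start in frame.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate an in-frame DNA coding sequence into one-letter amino acids."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def classify(reference, mutated):
    """Compare the proteins encoded by two equal-length sequences."""
    return "synonymous" if translate(reference) == translate(mutated) else "non-synonymous"


original = "ATGGAAGGG"  # made-up sequence: Met, Glu, Gly
print(translate(original))
print(classify(original, "ATGGAGGGG"))  # GAA -> GAG, still Glu
print(classify(original, "ATGGTAGGG"))  # GAA -> GTA, Glu becomes Val
```

The first change is invisible at the protein level, the second swaps one amino acid for another.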

What do we do with variant calls then?
Annotation & Interpretation
• Functional Annotation → Figure out if the mutation is dangerous (e.g., with SnpEff)
• Synonymous
• Non-Synonymous
• Frame-shift
• …
• Put in the context of existing findings
• dbSNP
• ClinVar
• COSMIC
• ESP
• 1000 Genomes
• …
 http://snpeff.sourceforge.net
 http://www.ncbi.nlm.nih.gov/SNP
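At its simplest, annotation is a classification step followed by a join against what is already known. The Python sketch below is a deliberately tiny stand-in: `KNOWN_VARIANTS` is a hypothetical in-memory table with an invented entry (not real dbSNP or ClinVar content), and the frame-shift rule assumes the variant falls inside a coding region. Tools like SnpEff do this with real gene models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical lookup table standing in for resources such as dbSNP or ClinVar.
KNOWN_VARIANTS = {
    Variant("chr1", 1000, "A", "G"): {"id": "example-0001", "note": "invented entry for illustration"},
}


def variant_class(variant):
    """Classify by allele lengths; 'frameshift' assumes a coding region."""
    if len(variant.ref) == len(variant.alt):
        return "SNV" if len(variant.ref) == 1 else "MNV"
    shift = abs(len(variant.alt) - len(variant.ref))
    return "in-frame indel" if shift % 3 == 0 else "frameshift indel"


def annotate(variant, known=KNOWN_VARIANTS):
    """Attach the variant class and any known-database entry (None if unseen)."""
    return {"variant": variant, "class": variant_class(variant), "known": known.get(variant)}


for v in (Variant("chr1", 1000, "A", "G"), Variant("chr2", 500, "AT", "A")):
    print(annotate(v))
```

At scale the dictionary lookup becomes a join against large static databases, which is where the data engineering starts.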

Bringing three disciplines together
Diagram labels: Statistics, Data Analytics, Bioinformatics, Genomics, Big Data Technologies, Compute and Data Science

Case Study: Bina GMS
Sequencing → 2º Analysis → 3º Analysis → Interpretation
Meaningful Results & Clinical Relevance
20+ DBs with 140+ annotations:
HGMD // PGMD // ClinVar // COSMIC // dbNSFP // TRANSFAC // 1000 Genomes and more.
Tools & Workflows for:
WGS // WES // RNAseq // Somatic Mutations // Multi-sample // Gene Panels
Bina Products are for Research Use Only

Bina RAVE Architecture (1)
Interactive UI // Command Line SDK
Secure REST Interface
Portal Server(s)
Portal Backend (Distributed)
• Workflow Definition
• Templates
• QC/Monitoring
• System Management/Updates
Workflow Generation
• Workflows
• Tools
• Commands
• Task Dependency Graphs
• Static Scheduling
Distributed Workflow Orchestration
Secure Push Interface
Execution Engine
• Executor Nodes / VMs
• Executor
• Dynamic Scheduling
• Local Storage
Network Storage – Input/Output Data

Bina RAVE Architecture (2)
Input (BAM, FASTQ) → Output (VCF, SV)
JSON Request (UI/CMD/SDK)
Genome-aware – Workflow Generation
• Workflows (DNA, RNA, ...)
• Tools (BWA, GATK, SVs)
• Commands (Samtools, GATK, URL, ...)
• Services (Logging, Storage, Caching, Streaming)
• Static Scheduling
• Task Graph
Distributed Coordination
• Dependency Graph
• Task Status
• Graph // Triggers // Updates
• Syncing all Nodes
Genome-aware – Distributed Execution Framework
• Nodes / VMs
• Executor
• Dynamic Scheduling
• Local storage
• Network storage – Input/output data
• Dependency Aware Execution
• Locality Aware Execution (Caching)
• Streaming Through “Engines”
• In-Memory Computation
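Dependency-aware execution over a task graph can be sketched with Python's standard `graphlib` module (Python 3.9+). This is not Bina's implementation: the task names and the per-chromosome split are invented for illustration, and `run_in_waves` only returns the order in which tasks become runnable instead of executing anything.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
TASKS = {
    "align": set(),
    "sort_bam": {"align"},
    "index_bam": {"sort_bam"},
    "call_chr1": {"index_bam"},
    "call_chr2": {"index_bam"},
    "merge_vcf": {"call_chr1", "call_chr2"},
    "annotate": {"merge_vcf"},
}


def run_in_waves(graph):
    """Return lists of tasks that could run in parallel, in dependency order."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # a real executor would mark tasks done as they finish
    return waves


for number, wave in enumerate(run_in_waves(TASKS), start=1):
    print(number, wave)
```

The two per-chromosome calling tasks land in the same wave, which is the kind of parallelism a dynamic scheduler can hand to separate nodes.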

Bina AAiM Architecture
Annotation and Indexing Engine
Input: VCF (UI/CMD) // Bina Secondary
Annotation sources
• Clinical Annotations
• Genomic Context
• Prediction Func. Impact
• Population Frequency
Distributed Execution Framework
• Annotation (Join static DBs)
• Indexing & Functional Filters
• MapReduce Jobs
Analytics Engine
• NoSQL Data Store
• Indices
• Metadata Store
Queries, Filters, Variant Sets, Reports
• Tumor/Normal
• Pedigree
• Cohort Study
• Proband

What next?
 http://www.genomicsengland.co.uk
 http://www.personalgenomes.org
• Apply this process to different domains and applications
• Come up with ways of ranking variants
• Keep learning from data
• Sequence everyone!
• Genomics England 100,000 Genomes Project
• Personal Genome Project
• Decrease cost
• Increase accuracy
• Make the technology faster and more usable!
Map of sequencers around the globe: http://omicsmaps.com

Challenges in Genomics
• Accuracy
• Gold standard? Which tool is best? There are so many!
• NIST, DREAM Challenge
• Need to speak the same language… interoperability
• Global Alliance
• API, format, metadata, …
• Regulations
• HIPAA, CLIA: security, accuracy, anonymity and encryption
• Scalability
• Storage
• Need terabytes
• Each genome could be up to 1 TB
• Computation
• We still have very little idea what most of the DNA is doing…
• Can’t run on a single machine. Need to scale to many nodes
• Need to leverage cloud technologies
• Provenance and auditability
• Importance of usability
• Different personas
• Errors are very expensive (life and death)
• Better visualization → faster discovery → faster cure
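The storage bullet above is easy to sanity-check with back-of-the-envelope arithmetic. The Python sketch below estimates the size of uncompressed FASTQ for one genome. Every default is an assumption chosen for illustration (3 billion bases, 30x coverage, 100-base reads, 40-byte read headers), and `estimate_fastq_bytes` is a made-up helper.

```python
def estimate_fastq_bytes(genome_bases=3_000_000_000, coverage=30, read_length=100, header_bytes=40):
    """Rough size of uncompressed FASTQ: header, bases, '+', qualities, one newline per line."""
    reads = genome_bases * coverage // read_length
    bytes_per_read = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    return reads * bytes_per_read


for coverage in (30, 100):
    size_gb = estimate_fastq_bytes(coverage=coverage) / 1e9
    print(f"{coverage}x coverage: about {size_gb:.1f} GB of raw FASTQ")
```

Under these assumptions the raw reads alone come to hundreds of gigabytes, before counting the aligned BAM and intermediate files, so terabyte-scale storage per genome is plausible.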

Why should software engineers move to genomics?
Because genomics needs you, and you need genomics.
Work on something that matters! (#Code2Cure)
Things that SWEs do very well:
• Automation
• Elegant solutions for complex problems
• Enabling non-expert users by making the technology robust and accessible
• Scale
• Optimization
• Building production-grade platforms
• Tested
• Robust
• Secure
THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!
Image courtesy of http://silvsoul.blogspot.com

Open projects/resources to check out and contribute to
Projects/Conferences
• Galaxy -- http://galaxyproject.org
• Arvados -- https://arvados.org
• Open Bio Conference -- http://www.open-bio.org
• BioVis -- http://www.biovis.net
• Biopython -- http://biopython.org
• Global Alliance for Genomics and Health -- http://ga4gh.org
• Rosalind Project -- http://rosalind.info
Blogs/Websites
• http://bcb.io
• http://nextgenseek.com/
• http://ngs-expert.com/
• http://seqanswers.com/
• http://core-genomics.blogspot.com
• http://www.genomesunzipped.org
• http://genomeweb.com

Thank you.
And I hope you consider moving to genomics! 
 http://info.bina.com/code2cure-community
Amirhossein Kiani
Sr. Lead Software Engineer
@mirkiani
amir@bina.com
