Recording on YouTube: https://www.youtube.com/watch?v=G419mmAL9qw
We are at the beginning of a pivotal chapter in the history of medicine.
It took more than a decade and billions of dollars to assemble the first human genome. Today, this tedious task can be done in only a few hours and at a cost as low as $1,000.
This historical advancement in the genomics world has created a significant challenge and opportunity for storing, analyzing and understanding genomics information. Fortunately, the software industry, while working on massive ad networks, video games, and social applications, has invented tools, approaches and solutions that can directly be applied to the genomics world and enable the future of medicine.
This talk serves as a preliminary field guide for general software engineers with no experience in genomics, covering what it takes to transition from the internet world to the genomics world: an industry ripe for innovation, with great potential for applying big data, algorithms, system design and user interface design best practices from the software world.
Lots of us are tired of working on yet another ad network, social game, or mobile game. We want to work on things that change the world and affect human life in a positive way. And what would be better than curing human diseases?
At the end of this talk you will know what skills you can bring to the genomics world from the software world, where to find the best resources for a software engineer exploring genomics, and which open-source tools and libraries are most widely used within the genomics industry.
Amirhossein Kiani
Sr. Lead Software Engineer
Bina Technologies Inc.
www.bina.com
#Code2Cure: A field guide for software engineers on their journey to the world of genomics.
1. #Code2Cure: Engineering Genomics
@mirkiani
A field guide for software engineers on their journey to the world of genomics.
Amirhossein Kiani
Sr. Lead Software Engineer
amir@bina.com
Image courtesy of http://circos.ca
DISCLAIMER: The views expressed in this talk are mine alone and not
those of my employer.
Bina products are for Research Use Only. Not for use in diagnostic
procedures.
Also, I’m a Computer Scientist by training, trying to help those with a
similar background learn about the field of genomics. The scientific
concepts in this talk are therefore highly simplified.
4. Why Genomics?
Some things we could do with genomics:
• Carrier Screening
• Prenatal Screening
• Newborn Screening
• Inherited Disease
• Infectious Disease
• Cancer Diagnostics
• Microbiome
• Personalized Medicine
5. But I have no genomics background!
It’s ok.
7. What is a cell? What is DNA?
http://en.wikipedia.org/wiki/Cell_%28biology%29
http://en.wikipedia.org/wiki/DNA
Image courtesy of Pinterest
Image courtesy of Tumblr
8. Crash Course on Genomics
Genomics: the study of the structure of genomes.
http://en.wikipedia.org/wiki/Genomics
http://en.wikipedia.org/wiki/RNA
http://en.wikipedia.org/wiki/Protein
DNA → RNA → Protein → You!
9. How do we figure out what’s in DNA?
Like everything else, we turn the analog signal into digital data, and then
analyze it.
http://en.wikipedia.org/wiki/DNA_sequencing
http://en.wikipedia.org/wiki/FASTQ_format
Sequencers (Illumina, Ion Torrent, Genia, …) → Primary Analysis → FASTQ Format
Image courtesy of PersonalGenomes.org
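To make the FASTQ output of primary analysis concrete, here is a minimal Python sketch of a FASTQ reader. It assumes the common four-lines-per-read layout and Phred+33 quality encoding; the names FastqRecord and parse_fastq and the example read are made up for illustration and are not part of any tool mentioned in the talk.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # Phred quality scores, one per base


def parse_fastq(lines: List[str], phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per 4 lines: @name, bases, '+' separator, quality string."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes a Phred score as ASCII code minus the offset.
        yield FastqRecord(header[1:], seq, [ord(c) - phred_offset for c in qual])


# Made-up example read (assumption: Phred+33 encoding).
example = ["@read1", "ACGT", "+", "II5!"]
records = list(parse_fastq(example))
print(records[0].name, records[0].sequence, records[0].qualities)
```

A Phred score Q corresponds to an estimated error probability of 10^(-Q/10), so a score of 40 means roughly a 1 in 10,000 chance that the base call is wrong.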
10. Raw Data to Variants (Secondary Analysis)
Step 1. Alignment
http://en.wikipedia.org/wiki/DNA_sequencing
http://en.wikipedia.org/wiki/FASTQ_format
Image courtesy of Wall Woodworks
Image courtesy of Wallpaper Up
11. From “Raw” DNA to “Variants” (Secondary Analysis)
Step 1. Short-Read Sequence Alignment
http://en.wikipedia.org/wiki/Reference_genome
http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
http://en.wikipedia.org/wiki/Indel
http://en.wikipedia.org/wiki/Structural_variation
Reference Genome (~3B bases):
AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG
Aligned short reads ("__" marks a deletion):
ACTTTGGTCCACCCAAGG
AAGGGGGACACCCAAGGACACCC__GGGGGAAACT
GGACACCCAAGGGGGAA
ACCCAAGGGGGACACCC
ACCC__GGGGGAAACTTTG
AACACACCC__GGGGGAA
Figure labels: Coverage, Deletion, Single Nucleotide Polymorphism
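As a rough illustration of what alignment and coverage mean, the Python sketch below places each read at the reference offset with the fewest mismatches and then counts how many reads cover each position. This is a deliberately naive, ungapped brute-force search on a made-up toy reference; it cannot detect deletions and is not how production aligners work (they use an index of the reference plus gapped alignment). The function names are hypothetical.

```python
from typing import List, Tuple


def best_offset(reference: str, read: str) -> Tuple[int, int]:
    """Return (offset, mismatches) for the best ungapped placement of the read."""
    best = (-1, len(read) + 1)
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches < best[1]:  # keep the first placement with the fewest mismatches
            best = (offset, mismatches)
    return best


def coverage(reference: str, reads: List[str]) -> List[int]:
    """Count how many reads overlap each reference position."""
    depth = [0] * len(reference)
    for read in reads:
        offset, _ = best_offset(reference, read)
        for pos in range(offset, offset + len(read)):
            depth[pos] += 1
    return depth


# Toy data, invented for this example.
toy_reference = "ACGTTAGCCGATAC"
toy_reads = ["CGTTAG", "TAGCAG"]  # the second read carries one mismatch

print(best_offset(toy_reference, toy_reads[1]))
print(coverage(toy_reference, toy_reads))
```

The brute-force scan costs time proportional to reference length times read length per read, which is why a ~3B base reference needs an index.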
12. From “Raw” DNA to “Variants” (Secondary Analysis)
• Burrows-Wheeler Aligner (BWA)
• Uses Burrows-Wheeler transform (also used in bzip2)
• Uses Smith-Waterman algorithm
• Written in C
• Uses ~4GB memory for human genome
http://bio-bwa.sourceforge.net
http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html
Example (the reference must be indexed once before aligning):
$ bwa index ref.fa
$ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
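For readers who have not met the Burrows-Wheeler transform, here is a small Python sketch of the textbook construction: sort all rotations of the text and keep the last column. It is only meant to show the idea; the sorted-rotations approach is far too slow and memory hungry for a genome, and BWA builds its index very differently. The "$" terminator is a common convention, assumed here.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (demo-sized inputs only)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Invert the transform by repeatedly prepending the BWT column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))               # annb$aa
print(inverse_bwt(bwt("GATTACA")))  # GATTACA
```

The transform groups identical characters together, which helps compression, and it is reversible, which is what makes it usable as a searchable index.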
13. From “Raw” DNA to “Variants” (Secondary Analysis)
FASTQ → Alignment → SAM
SAM → Convert to Binary, BGZF (gzip-compatible) compression (samtools) → BAM File
BAM File → BAM File Index
http://samtools.github.io/hts-specs/SAMv1.pdf
http://samtools.github.io
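A SAM file is tab-separated text with 11 mandatory columns per alignment. The Python sketch below parses one such line, following the column order in the SAM specification linked above. The SamRecord type, the helper names and the example line are invented for illustration; real code would normally use an existing library rather than hand parsing.

```python
import re
from typing import List, NamedTuple, Tuple


class SamRecord(NamedTuple):
    qname: str                    # read name
    flag: int                     # bitwise flags
    rname: str                    # reference sequence name
    pos: int                      # 1-based leftmost mapping position
    mapq: int                     # mapping quality
    cigar: List[Tuple[int, str]]  # e.g. [(5, "M"), (2, "D"), (3, "M")]
    seq: str
    qual: str


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> SamRecord:
    """Parse the mandatory fields of one alignment line (header lines start with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, _rnext, _pnext, _tlen, seq, qual = fields[:11]
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), ops, seq, qual)


def is_reverse_strand(record: SamRecord) -> bool:
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(record.flag & 0x10)


# Made-up alignment line: 5 matching bases, a 2-base deletion, 3 matching bases.
line = "read1\t16\tchr1\t100\t60\t5M2D3M\t*\t0\t0\tACGTACGT\tIIIIIIII"
record = parse_sam_line(line)
print(record.pos, record.cigar, is_reverse_strand(record))
```

BAM stores the same records in a compressed binary form, and the index lets tools jump straight to the reads overlapping a region.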
14. From “Raw” DNA to “Variants” (Secondary Analysis)
BAM File
BAM File Index
http://www.broadinstitute.org/igv
https://github.com/ekg/freebayes
http://arxiv.org/abs/1207.3907
https://www.broadinstitute.org/gatk
Visualize: Integrative Genomics Viewer (IGV)
Variant Calling
Example:
$ freebayes -f ref.fa aln.bam > var.vcf
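To give a feel for variant calling, the Python sketch below piles up aligned reads over a toy reference and reports positions where most reads disagree with the reference base. This is a simple majority vote with made-up thresholds and data, shown only to convey the idea; it is not the statistical model that freebayes or GATK use, and it ignores indels and base qualities.

```python
from collections import Counter
from typing import List, Tuple


def call_snps(
    reference: str,
    aligned_reads: List[Tuple[int, str]],  # (0-based offset, read sequence), ungapped
    min_depth: int = 2,
    min_fraction: float = 0.8,
) -> List[Tuple[int, str, str, int]]:
    """Return (1-based position, ref base, alt base, depth) for confident mismatches."""
    pileup = [Counter() for _ in reference]
    for offset, seq in aligned_reads:
        for i, base in enumerate(seq):
            pileup[offset + i][base] += 1

    calls = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, support = counts.most_common(1)[0]
        if base != reference[pos] and support / depth >= min_fraction:
            calls.append((pos + 1, reference[pos], base, depth))
    return calls


# Toy data, invented for this example: three reads agree on T where the reference has C.
toy_reference = "ACGTTAGCCGATAC"
toy_alignments = [(2, "GTTAGT"), (4, "TAGTCG"), (5, "AGTCGA")]
print(call_snps(toy_reference, toy_alignments))
```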
15. From “Raw” DNA to “Variants” (Secondary Analysis)
… and here are your variants (VCF file)!
http://samtools.github.io/hts-specs/VCFv4.2.pdf
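A VCF data line has eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), optionally followed by per-sample columns. The Python sketch below reads those eight columns; the VcfRecord type, the helper names and the example lines are made up for illustration.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Optional


class VcfRecord(NamedTuple):
    chrom: str
    pos: int                 # 1-based position
    variant_id: str
    ref: str
    alts: List[str]          # ALT may list several comma-separated alleles
    qual: Optional[float]    # "." means missing
    filters: str
    info: Dict[str, str]     # flag-style keys without "=" map to ""


def parse_vcf_line(line: str) -> VcfRecord:
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict: Dict[str, str] = {}
    if info != ".":
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            info_dict[key] = value
    return VcfRecord(
        chrom, int(pos), vid, ref, alt.split(","),
        None if qual == "." else float(qual), flt, info_dict,
    )


def read_vcf(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Skip header lines (they start with '#') and parse the rest."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_vcf_line(line)


# Made-up example content.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=30;DB",
]
variants = list(read_vcf(vcf_lines))
print(variants[0])
```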
16. What do we do with variant calls then?
Zooming in on the Central Dogma of Molecular Biology:
• The genetic code is redundant: several codons can encode the same amino acid.
• But a mutation can still change which protein is produced.
Image courtesy of Wikipedia
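The redundancy of the genetic code can be shown in a few lines of Python. The sketch below builds the standard codon table, translates a short made-up coding sequence, and checks whether a single-base change alters the protein. The function names are hypothetical, and the check ignores everything a real annotator handles (strands, splicing, frame-shifts from insertions and deletions).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; "*" is a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def classify_substitution(coding_seq: str, position: int, new_base: str) -> str:
    """Label a single-base change (0-based position) by its effect on the protein."""
    mutated = coding_seq[:position] + new_base + coding_seq[position + 1:]
    if translate(mutated) == translate(coding_seq):
        return "synonymous"
    return "non-synonymous"


toy_gene = "ATGGCTTAA"  # made-up sequence: Met, Ala, stop
print(translate(toy_gene))                      # MA*
print(classify_substitution(toy_gene, 5, "C"))  # GCT -> GCC, still Ala
print(classify_substitution(toy_gene, 4, "T"))  # GCT -> GTT, Ala becomes Val
```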
17. What do we do with variant calls then?
Annotation & Interpretation
• Functional Annotation: figure out whether the mutation is likely to be harmful (e.g., using SnpEff)
• Synonymous
• Non-Synonymous
• Frame-shift
• …
• Put in the context of existing findings
• dbSNP
• ClinVar
• COSMIC
• ESP
• 1000 Genomes
• …
http://snpeff.sourceforge.net
http://www.ncbi.nlm.nih.gov/SNP
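Putting variants "in the context of existing findings" is, at its core, a join between your variant calls and known-variant databases. The Python sketch below shows that join with in-memory dictionaries keyed by (chromosome, position, ref, alt). The database names, identifiers and labels are placeholders invented for this example; they do not reflect the real contents or file formats of dbSNP, ClinVar or the other resources listed above.

```python
from typing import Dict, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, 1-based position, ref, alt)


def annotate(
    variants: List[VariantKey],
    databases: Dict[str, Dict[VariantKey, str]],
) -> List[dict]:
    """Attach every matching database entry to each variant; flag unseen variants."""
    annotated = []
    for variant in variants:
        hits = {
            name: entries[variant]
            for name, entries in databases.items()
            if variant in entries
        }
        annotated.append({"variant": variant, "annotations": hits, "novel": not hits})
    return annotated


# Placeholder databases with made-up entries (assumption: not real identifiers).
toy_databases = {
    "known_ids": {("chr1", 12345, "A", "G"): "example_id_1"},
    "clinical_significance": {("chr1", 12345, "A", "G"): "example_label"},
}
calls = [("chr1", 12345, "A", "G"), ("chr2", 500, "C", "T")]

for row in annotate(calls, toy_databases):
    print(row)
```

At genome scale the same join is usually done against sorted, indexed files or a database rather than in-memory dictionaries.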
20. Case Study: Bina GMS
Sequencing → Secondary Analysis → Tertiary Analysis → Interpretation
Meaningful Results & Clinical Relevance
20+ DBs with 140+ annotations: HGMD // PGMD // ClinVar // COSMIC // dbNSFP // TRANSFAC // 1000 Genomes and more.
Tools & Workflows for: WGS // WES // RNAseq // Somatic Mutations // Multi-sample // Gene Panels
Bina Products are for Research Use Only
23. Bina AAiM Architecture
Annotation and Indexing Engine
Input: VCF, UI/CMD, Bina Secondary
Annotation sources: Clinical Annotations, Genomic Context, Func. Impact Prediction, Population Frequency
Distributed Execution Framework (MapReduce Jobs): Annotation (Join static DBs), Indexing & Functional Filters
Analytics Engine: NoSQL Data Store, Indices, Metadata Store
Analyses: Tumor/Normal, Pedigree, Cohort Study, Proband
Outputs: Queries, Filters, Variant Sets, Reports
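The indexing and filtering idea in this architecture can be sketched generically in Python: build an inverted index from annotation values to variants, then answer filter queries by intersecting the matching sets. This is only an illustration of the general technique with invented field names and data; it is not Bina's implementation, which the slide describes only at the block-diagram level.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(variants: List[dict], fields: List[str]) -> Dict[str, Dict[str, Set[int]]]:
    """Map each (field, value) pair to the set of row numbers that carry it."""
    index: Dict[str, Dict[str, Set[int]]] = {field: defaultdict(set) for field in fields}
    for row_id, variant in enumerate(variants):
        for field in fields:
            index[field][variant[field]].add(row_id)
    return index


def query(index: Dict[str, Dict[str, Set[int]]], variants: List[dict], **criteria: str) -> List[dict]:
    """Return variants matching all criteria by intersecting index lookups."""
    row_ids = set(range(len(variants)))
    for field, value in criteria.items():
        row_ids &= index[field].get(value, set())
    return [variants[i] for i in sorted(row_ids)]


# Made-up annotated variants; gene names and effects are placeholders.
annotated_variants = [
    {"pos": 101, "gene": "GENE_A", "effect": "synonymous"},
    {"pos": 202, "gene": "GENE_A", "effect": "non-synonymous"},
    {"pos": 303, "gene": "GENE_B", "effect": "non-synonymous"},
]
variant_index = build_index(annotated_variants, ["gene", "effect"])
print(query(variant_index, annotated_variants, gene="GENE_A", effect="non-synonymous"))
```

Building the index once up front is what keeps interactive filtering fast when the variant set is large.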
24. What next?
http://www.genomicsengland.co.uk
http://www.personalgenomes.org
• Apply this process to different domains and applications
• Come up with ways of ranking variants
• Keep learning from data
• Sequence everyone!
• Genomics England 100,000 Genomes Project
• Personal Genome Project
• Decrease cost
• Increase accuracy
• Make the technology faster and more usable!
Map of sequencers around the globe: http://omicsmaps.com
25. Challenges in Genomics
• Accuracy
• Gold standard? Which tool is best? There are so many!
• NIST, DREAM Challenge
• Need to speak the same language… interoperability
• Global Alliance
• API, format, metadata, …
• Regulations
• HIPAA, CLIA: security, accuracy, anonymity and encryption
• Scalability
• Storage
• Need terabytes
• Each genome could be up to 1 TB
• Computation
• We still pretty much have no idea what most of the DNA is doing…
• Can’t run on a single machine. Need to scale to many nodes
• Need to leverage cloud technologies
• Provenance and auditability
• Importance of usability
• Different personas
• Errors are very expensive (life and death)
• Better visualization → faster discovery → faster cure
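On the accuracy challenge: once a gold-standard variant set exists for a sample, comparing tools comes down to counting agreements and disagreements. The Python sketch below computes precision and recall for a call set against a truth set, matching variants exactly on (chromosome, position, ref, alt). The data is made up, and real benchmarking also has to reconcile different representations of the same variant, which this sketch ignores.

```python
from typing import Dict, Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, 1-based position, ref, alt)


def benchmark(called: Iterable[Variant], truth: Iterable[Variant]) -> Dict[str, float]:
    """Compare a call set with a gold-standard set using exact matching."""
    called_set, truth_set = set(called), set(truth)
    true_pos = len(called_set & truth_set)
    false_pos = len(called_set - truth_set)   # called, but not in the gold standard
    false_neg = len(truth_set - called_set)   # in the gold standard, but missed
    precision = true_pos / (true_pos + false_pos) if called_set else 0.0
    recall = true_pos / (true_pos + false_neg) if truth_set else 0.0
    return {
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


# Made-up example: three of four calls are correct, and one true variant is missed.
gold = [("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"), ("chr2", 5, "G", "A"), ("chr2", 9, "T", "C")]
calls = [("chr1", 10, "A", "G"), ("chr1", 20, "C", "T"), ("chr2", 5, "G", "A"), ("chr3", 7, "A", "T")]
print(benchmark(calls, gold))
```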
26. Why should software engineers move to genomics?
Because genomics needs you, and you need genomics.
Work on something that matters! (#Code2Cure)
Things that SWEs do very well:
• Automation
• Elegant solutions for complex problems
• Enabling non-expert users by making the technology robust and accessible
• Scale
• Optimization
• Building production-grade platforms
• Tested
• Robust
• Secure
THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!
Image courtesy of http://silvsoul.blogspot.com
27. Open projects/resources to check out/contribute to
Projects/Conferences
• Galaxy -- http://galaxyproject.org
• Arvados -- https://arvados.org
• Open Bio Conference -- http://www.open-bio.org
• BioVis -- http://www.biovis.net
• Biopython -- http://biopython.org
• Global Alliance for Genomics and Health -- http://ga4gh.org
• Rosalind Project -- http://rosalind.info
Blogs/Websites
• http://bcb.io
• http://nextgenseek.com/
• http://ngs-expert.com/
• http://seqanswers.com/
• http://core-genomics.blogspot.com
• http://www.genomesunzipped.org
• http://genomeweb.com
28. Thank you.
And I hope you consider moving to genomics!
http://info.bina.com/code2cure-community
@mirkiani
Amirhossein Kiani
Sr. Lead Software Engineer
amir@bina.com
Editor's Notes
Data scale problem
1000s of WGS
Customers at research and clinical
We are here to introduce our company, products, team
Get feedback
At Bina, we focus on the analysis of next-generation sequencing datasets and provide best-in-class tools for secondary and tertiary analysis of Whole Genome, Whole Exome, RNAseq and targeted panel datasets. By optimizing the workflows, and the tools incorporated in those workflows, to function as efficiently as possible on the hardware we supply, we can achieve speed and performance unparalleled in the industry. I will spend the majority of this presentation discussing our analytical workflows, the methods we use to benchmark the tools and workflows, and the performance of those tools running on our appliances.
We at Bina concentrate on exactly this challenge. We approach the problem from a different perspective than most traditional genomics companies. We have expertise across engineering, bioinformatics, software development and, more generally, managing large datasets. For example, we have engineers from companies like Yahoo and Google who are experts at dealing with this large-data challenge, and bioinformatics scientists from Stanford, UCSF, and 23andMe.
For customers with very large amounts of data and compute demand, or very sensitive data, and in countries with no access to secure public clouds.
In-memory computation: alignment and sorting (samsorter) run in memory and in parallel, as opposed to the sequential, on-disk approach (overlapping compute, minimizing I/O).