Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
1. Release Notes / Changes
Initially given on 12th February 2015 to the ‘Toronto Social, Mobile, Analytics & Cloud Meet-Up’.
Minor typing corrections made on 23rd February; slight change to advertised title.
Major change is the addition of the extra screenshots describing the -l parameter of the Aspera client that allows it to operate at high speed.
Contact: mjminformatics@gmail.com
(or secondary: michael.moorhouse@oicr.on.ca)
3. About Me – Michael Moorhouse
'Automation Engineer' at OICR in the 'Genomic
Informatics' Team
Bioinformatician by training
British
BSc Bioscience (1994)
MSc Bioinformatics (1998)
PhD Bioscience (2003)
Image from: http://www.clipartpanda.com/
Oh, Canada!
2x Netherlands
1x UK
~10 years
4. About me – Michael J Moorhouse
IT/Computer Sci.: what we use
Biology/Medicine: why… we care
Maths / Stats: what we use
Bioinformatician:
Aka 'Computational Molecular Biology'
or 'Bio-Computing'
5. We use everything we can borrow
from FOSS …
• Rapid development
• ‘Just Download it, try it’
• Google it if you need the manual / YouTube demo
• Linux (a lot)
• Though I’m still a Windoze guy (I love my
CorelDraw!)
• Eclipse / NetBeans
• GNU – anything!
• Perl/Python/Web Widgets/Apache
6. My Day Job
From: http://oicr.on.ca//files/public/OICRSlidedeck4February2015.pdf
7. ‘Big Data’ at OICR (DNA Sequencing)
Features:
Large: 500GB per sequencer run, 10 000s of files
~40-80TB per database submission
Variety: Raw scans, DNA sequences, ‘Biological’
Variation (aka: SNPs, CNVs, SSM), project data, sample
information
Velocity: created 5-8TB a week
Complexity: different sequencing platforms, levels of
analysis
Meta Data
Privacy / Confidentiality:
Mostly of human origin
Often linked to ‘clinical data’ (tumor, normal, survival)
sometimes PHI (Personal Health Information)
9. Current Data Storage: SeqProdBio
HDD Image from: http://www.wdbrand.com/images/products/img6/lores/wdfEnterprise_RESATA.jpg
2 such racks in room;
2 in other datacenters
Copy 4 TB per night
2 PB total
10. The world of Biology...
It's about life:
− No 'manual'
− Specification unclear
− Highly varied
− Very complex
Interaction even more complex
Always an exception
− Molecular Level Storage
Digital
− Self replicating
Typical 'Simple' Pathway: http://www.genome.jp/kegg-bin/show_pathway?map01010
11. Biological Terms & Technology I'll Explain First - Briefly
DNA, RNA, Protein
Gene, Genome, Genome Reference, Genomic
Location, Mutation / Variant (SNP), Annotation
DNA Sequencing, Alignment
Cancer, Tumour, Normal
12. Triplet Code, Folding, Structure and Function
Good mapping from:
DNA -> RNA -> Structure -> Function
− Mostly!
From: http://upload.wikimedia.org/wikipedia/commons/d/d4/RNA-codons.png and:
http://www.rcsb.org/pdb/101/motm.do?momID=126
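The first two steps of that mapping (DNA -> RNA -> protein sequence) can be sketched in a few lines. This is a minimal illustration, assuming the standard genetic code (NCBI translation table 1) and a reading frame that starts at the first base; the function names are made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), listed in TCAG codon order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """DNA -> RNA: same letters, with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """DNA -> protein: read triplets from the first base, stop at a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # '*' marks a stop codon in the table
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCCTAA"))
print(translate("ATGGCCTAA"))
```

The later steps (structure and function) are where the 'Mostly!' comes in: they cannot be read off a lookup table.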
13. An Aside: Computer Graphics
Rasmol – from 1993
Then 'Doom' = OpenGL drivers
Probably Bankrupted Silicon Graphics Inc!
From my MSc Thesis, 1998
14. DNA is packaged into Genes, Chromosomes and Genomes
From the Ensembl Genome Browser:
3,381,944,086 DNA Bases in the Human Genome
Chromosome: ~124 Mb (Mega Bases)
The current 'Reference Genome'
15. Genes & Genomic Locations
Gene is a tricky concept...but...:
'convenient conceptual unit of coordinated DNA
features'
− Usually makes one protein
7:55020388-55257987
Chr7:55020388-55257987
(Chromosome:Start-End)
And, yes, there is a name-space problem...
This is Bioinformatics.
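A small sketch of how such a location string might be parsed, tolerating both the bare '7' and the 'Chr7' spelling of the chromosome name. The class and function names are hypothetical, and the coordinates are assumed to be 1-based and inclusive (the usual genome-browser convention).

```python
import re
from typing import NamedTuple


class GenomicLocation(NamedTuple):
    chromosome: str
    start: int
    end: int

    @property
    def length(self):
        # Assumes 1-based, inclusive coordinates.
        return self.end - self.start + 1


# Optional 'chr' prefix (any case), then name:start-end.
_LOCATION = re.compile(r"^(?:chr)?([0-9A-Za-z]+):(\d+)-(\d+)$", re.IGNORECASE)


def parse_location(text):
    """Parse 'Chromosome:Start-End' into a GenomicLocation."""
    match = _LOCATION.match(text.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"Not a genomic location: {text!r}")
    chromosome, start, end = match.groups()
    start, end = int(start), int(end)
    if start > end:
        raise ValueError(f"Start is after end: {text!r}")
    return GenomicLocation(chromosome, start, end)


location = parse_location("Chr7:55020388-55257987")
print(location, location.length)
```

Stripping the prefix only hides one symptom of the name-space problem; different references can still use different names for the same sequence.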
19. Ensembl a ‘Genome Browser’
• Displays genomic information: sequence and
‘annotation’
• Annotation ~= ‘Signpost’ a form of ‘Markup’ of
interesting regions, what they mean.
• Created in LAMP: Linux/Apache/MySQL/Perl
• Now also a lot of JavaScript on the browser side
• The modules for manipulation and drawing are released as ‘BioPerl’
21. DNA Sequencing = Data Explosion
• The current dominant technology is ‘Sequencing By Synthesis’, patented & developed by Illumina
• Originally ‘Solexa’ a British company (~2007)
• Is essentially a microscope, a couple of lasers
+ detectors and coordinated chemical pumps
• Other technologies:
• Roche / 454; Thermo / Ion Torrent; Oxford
Nanopore; PacBio
90%+
22. General Workflow For A Sample
• Extract physical sample
• Easy if it is skin…otherwise: punch biopsy?
• Extract the DNA
• Sequence
• Align to a ‘Reference Genome’
• Look for differences
• Differences = causal, errors or general damage
• It's like a ‘bomb’ blast in some cancers
23. Actually: Tumor / Normal
• DNA from a non-cancerous part of the patient
is better than the ‘reference genome’
• Also it is available, if at double the cost
Tumor vs. HG19 Reference (composite): possible
Tumor vs. Normal: better
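Why the patient's own normal sample helps can be shown as a set subtraction: anything seen in both tumor and normal is inherited, not acquired by the cancer. This is only a sketch. The `Variant` tuple and the example variants are invented for illustration, and real callers also weigh coverage and sample purity.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    ref: str  # base in the reference genome
    alt: str  # base observed in the sample


def somatic_candidates(tumor, normal):
    """Variants seen in the tumor but not in the patient's normal sample."""
    return sorted(set(tumor) - set(normal))


# Hypothetical variant calls, each made against the reference genome.
normal_calls = [Variant("7", 1000, "A", "G"), Variant("12", 2500, "C", "T")]
tumor_calls = [
    Variant("7", 1000, "A", "G"),   # inherited: also present in normal
    Variant("12", 2500, "C", "T"),  # inherited: also present in normal
    Variant("17", 4200, "G", "A"),  # tumor only: somatic candidate
]

for variant in somatic_candidates(tumor_calls, normal_calls):
    print(variant)
```

Against the reference genome alone, all three tumor variants would look like differences worth investigating.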
24. Small Device: NextSeq 500…
2ft or 60cm
http://www.illumina.com/systems/nextseq-sequencer.html
26. …But Big Data
• Big: lots of individual DNA sequences ‘reads’
generated
• 800 000 sequences per run
• Long: allows easier identification of common
parts / disruptions
• 300 nucleotides
• The two combined give ‘High Coverage’
• Higher coverage =~ more confidence
• Higher coverage =~ better detection of rare events
in impure samples (as cancers tend to be)
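As a rough guide, mean coverage is the number of reads times the read length, divided by the genome size. The sketch below assumes every read aligns and that reads are spread evenly, neither of which holds in practice. The read count in the usage line is purely illustrative; the read length and genome size are the figures quoted on the slides.

```python
def mean_coverage(read_count, read_length, genome_size):
    """Average number of reads covering each base: N * L / G."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_count * read_length / genome_size


HUMAN_GENOME_BASES = 3_381_944_086  # figure quoted from the Ensembl Genome Browser

coverage = mean_coverage(
    read_count=800_000_000,  # illustrative value only
    read_length=300,
    genome_size=HUMAN_GENOME_BASES,
)
print(f"Mean coverage: {coverage:.1f}x")
```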
27. Vendor Supplied Software
• Generally, terrible!
• As in truly awful…
• Equipment manufacturers cannot seem to make good software.
• Illumina is the exception
• Finally: the Solexa GAP was Unix, CLI and used ‘GNU
make’
32. Read / Sequence Alignment
• Simply (?)
• Find the best match for each sequenced read against the sequence of the reference genome
• Essentially a pattern matching problem
• Problem is computationally expensive if done properly by a ‘dynamic programming’ algorithm (circa 1970): the work grows with read length × reference length
• Actually, by modern standards very little ‘dynamic’ about it…
• Hence, many aligners with their own ‘tricks’
• Typically, a lookup table or a fast indexing of ‘k-mers’ or
trade something for something
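One of those ‘tricks’, the k-mer lookup table, can be sketched as follows. This is a toy illustration, not how any particular aligner works: it indexes every k-mer of a short made-up reference, then lets each k-mer of a read vote for the start position it implies. Real aligners use compressed indexes and tolerate mismatches.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Each k-mer hit votes for the read start position it implies."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            start = ref_pos - offset
            if start >= 0:
                votes[start] += 1
    return votes.most_common()  # best-supported start positions first


reference = "ACGTACGTTAGCCGATAGGCTA"  # made-up sequence for the example
index = build_kmer_index(reference, k=4)
print(candidate_positions("TAGCCGAT", index, k=4))
```

The index is built once per reference, so each read costs only a handful of dictionary lookups; only the best candidate positions then need the expensive dynamic programming step.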
35. DNA – Mutation to Cancer (I)
So it is complex...but in three slides:
From: http://upload.wikimedia.org/wikipedia/commons/7/73/Cancer_requires_multiple_mutations_from_NIHen.png
Normal cells undergo
multiple uncorrected
DNA mutations (changes)
Many mutations are
corrected properly: copied
from the other strand!
36. DNA – Mutation to Cancer (II)
From: ‘The hallmarks of cancer’: Cell 2000, http://www.ncbi.nlm.nih.gov/pubmed/10647931
Multiple mutations are needed to cause cancer: 6-8 in key genes
37. DNA – Mutation to Cancer (III)
• Metastasis =
‘Whack-A-Mole’
• With whacking
molecules, radiation,
viruses, immune
stimulants…
"Rare odditity (2060587599)" by hawken king - rare odditity.
Licensed under CC BY 2.0 via Wikimedia Commons –
http://commons.wikimedia.org/wiki/File:Rare_odditity_(2060587599).jpg#
mediaviewer/File:Rare_odditity_(2060587599).jpg
39. Representation of DNA Sequence Data
ASCII text is terrible: 4 nucleotides (ATGC)
4 states need only 2 bits of each 8-bit byte
− So 6 bits wasted per nucleotide
Also pattern rich: the same motifs appear many
times
− Hence standard compression (zlib, gzip) works well
We do a lot of indexing
− Also much analysis done 'by chromosome'
Easy parallelisation!
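The 2-bits-per-nucleotide point can be made concrete with a small packing sketch. The bit layout here is an arbitrary choice for illustration, not a standard file format, and it only handles A, C, G and T (no 'N' for unknown bases).

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack(sequence):
    """Pack a DNA string into bytes, four nucleotides per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence.upper()):
        packed[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed, length):
    """Recover the DNA string; the length is needed because of padding."""
    return "".join(
        _BASE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )


sequence = "ACGTACGTTAGCCGATAGGCTA"
packed = pack(sequence)
print(len(sequence), "bytes as text ->", len(packed), "bytes packed")
assert unpack(packed, len(sequence)) == sequence
```

In practice the general-purpose compressors mentioned above are often preferred because they keep the files readable with standard tools.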
41. Common files – all ASCII
• FASTA (FASTQZ)
• (Q=With Quality Scores, Z=Zip)
• SAM (BAM)
• Sequence Alignment/Map Format (B=Binary, Compressed)
• 1 or 2 lines per read (800 Million Lines)
• VCF
• Variant Call Format
• 1 or 2 lines per variant (100 K Lines)
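Because these are plain text, reading them needs very little code. As an example, here is a minimal reader for the quality-score flavour (FASTQ), assuming the common four-lines-per-record layout and Phred+33 quality encoding; it is a sketch, not a validating parser.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_scores) from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"Malformed FASTQ record: {header!r}")
        # Assumes Phred+33 encoding: score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nII5I\n")
for name, sequence, scores in read_fastq(example):
    print(name, sequence, scores)
```

For a compressed file, the same function can be given a handle from `gzip.open(path, "rt")` in the standard library.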
42. SAM / BAM
Simple – but comprehensive
http://samtools.github.io/hts-specs/SAMv1.pdf
Header
Alignment, 1 or 2 per read: 800 000 lines
Normally they are much, much simpler than this!
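A sketch of reading one alignment line, based on the eleven mandatory tab-separated fields in the SAM specification linked above. The example line is invented and optional tag fields are ignored; for real work a dedicated library is the better choice.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference (chromosome) name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. '8M'
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # base qualities (Phred+33 text)

    @property
    def is_unmapped(self):
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self):
        return bool(self.flag & 0x10)


def parse_sam_line(line):
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("A SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual)


# Invented example line, for illustration only.
line = "read1\t16\t7\t55020388\t60\t8M\t*\t0\t0\tTAGCCGAT\tIIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, record.is_reverse)
```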
44. Output Files – ELAND Alignment
Output files are large (~1 GB each) and have ~10 million lines
Are machine / human readable (still) – see below
NB: variant of this format has individual base quality scores
Base Calls (Bustard)
Sequence mapping based on Genome
Physical Position (Firecrest)
47. Cloud Computing – (i.e. GNOS)
Map image from: http://pixabay.com/en/world-map-map-world-black-earth-297446/
Cloud Repo. Centres
Data is deposited
Analysis by different groups
49. ..or as a picture
https://seqware.github.io/about/
50. Or: Arvados
Has some interesting features regarding
‘auto re-run’ of workflows on file change
51. Illumina has a cloud too!
• Called ‘BaseSpace’
• Direct connect from Sequencer
• Illumina SLA
• (No good for us)
• ‘One-Way’ trip: can’t deposit data elsewhere.
• (No good for us)
• But it is free
• For the odd TB (not the 2PB we have)
Now with Apps!
Ok, it is great if you are a small lab, doing
standard things
54. Sharing: Part of the Scientific Method
• Share to improve analysis
• Allow experts to process the data in novel ways
• To support published conclusions
• ‘I ain’t making this stuff up…’
• Allows ‘meta-analysis’
• Bigger is better
• Allows International Consortia