Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics

Release Notes / Changes
Initially given on 12th February 2015 to the ‘Toronto Social, Mobile, Analytics & Cloud Meet-Up’.
Minor typing corrections made on 23rd February; slight change to advertised title.
Major change is the addition of the extra screenshots describing the -l parameter of the Aspera client that allows it to operate at high speed.
Contact: mjminformatics@gmail.com (or secondary: michael.moorhouse@oicr.on.ca)

Managing & Processing
Big Data for Cancer Genomics,
an Insight Into Bioinformatics

About Me – Michael Moorhouse

'Automation Engineer' at OICR in the 'Genomic
Informatics' Team

Bioinformatician by training

British
BSc Bioscience, 1994
MSc Bioinformatics, 1998
PhD Bioscience, 2003
Image from: http://www.clipartpanda.com/
Oh, Canada!
2x Netherlands
1x UK
~10 years

About me – Michael J Moorhouse
IT/ComputerSci.
Biology/Medicine
Maths / Stats
What we use
Why… we care
What we use
Bioinformatician:
Aka 'Computational Molecular Biology'
or 'Bio-Computing'

We use everything we can borrow
from FOSS …
• Rapid development
• ‘Just Download it, try it’
• Google it if you need the manual / YouTube demo
• Linux (a lot)
• Though I’m still a Windoze guy (I love my
CorelDraw!)
• Eclipse / NetBeans
• GNU – anything!
• Perl/Python/Web Widgets/Apache

My Day Job
From: http://oicr.on.ca//files/public/OICRSlidedeck4February2015.pdf

‘Big Data’ at OICR (DNA Sequencing)
Features:
Large: 500GB per sequencer run, 10 000s of files
~40-80TB per database submission
Variety: Raw scans, DNA sequences, ‘Biological’
Variation (aka: SNPs, CNVs, SSM), project data, sample
information
Velocity: created 5-8TB a week
Complexity: different sequencing platforms, levels of
analysis
Meta Data
Privacy / Confidentiality:
Mostly of human origin
Often linked to ‘clinical data’ (tumor, normal, survival)
sometimes PHI (Personal Health Information)

OICR Compute Resources
~8000+ processor nodes – i.e. typical

Current Data Storage: SeqProdBio
HDD Image from: http://www.wdbrand.com/images/products/img6/lores/wdfEnterprise_RESATA.jpg
2 such racks in room; 2 in other datacenters
Copy 4 TB per night
2 PB total

The world of Biology...

It's about life:
− No 'manual'
− Specification unclear
− Highly varied
− Very complex

Interaction even more complex

Always an exception
− Molecular Level Storage

Digital
− Self replicating
Typical 'Simple' Pathway: http://www.genome.jp/kegg-bin/show_pathway?map01010

Biological Terms & Technology I'll Explain First - Briefly

DNA, RNA, Protein

Gene, Genome, Genome Reference, Genomic
Location, Mutation / Variant (SNP), Annotation

DNA Sequencing, Alignment

Cancer, Tumour, Normal

Triplet Code, Folding, Structure and
Function

Good mapping from:
DNA -> RNA -> Structure -> Function
− Mostly!
From: http://upload.wikimedia.org/wikipedia/commons/d/d4/RNA-codons.png and:
http://www.rcsb.org/pdb/101/motm.do?momID=126

An Aside: Computer Graphics

Rasmol – from 1993

Then 'Doom' = OpenGL drivers
Probably Bankrupted Silicon Graphics Inc!
From my MSc Thesis, 1998

DNA is packaged into Genes,
Chromosomes and Genomes

From the Ensembl Genome Browser:
3,381,944,086 DNA Bases in the Human Genome
Chromosome: ~124 Mb (Mega Bases)
The current 'Reference Genome'
Genes & Genomic Locations

Gene is a tricky concept...but...:
'convenient conceptual unit of coordinated DNA
features'
− Usually makes one protein
7:55020388-55257987
Chr7:55020388-55257987
Chromosome Start End
And, yes, there is a name-space problem...
This is Bioinformatics.
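
The ‘Chromosome:Start-End’ notation above, and its name-space problem, can be illustrated with a small parser. This is only a sketch: the names `GenomicLocation` and `parse_location` are made up for this example, and it assumes coordinates are 1-based and inclusive (the convention Ensembl uses), treating ‘7’ and ‘Chr7’ as the same chromosome.

```python
import re
from typing import NamedTuple


class GenomicLocation(NamedTuple):
    """A hypothetical container for one 'Chromosome:Start-End' location."""
    chromosome: str
    start: int
    end: int

    @property
    def length(self) -> int:
        # Assumes 1-based, inclusive coordinates.
        return self.end - self.start + 1


# An optional 'chr' prefix (any case), then the name, then Start-End.
_LOCATION = re.compile(r"^(?:chr)?(\w+):(\d+)-(\d+)$", re.IGNORECASE)


def parse_location(text: str) -> GenomicLocation:
    """Parse '7:55020388-55257987' or 'Chr7:55020388-55257987'."""
    match = _LOCATION.match(text.strip().replace(",", ""))
    if match is None:
        raise ValueError(f"not a Chromosome:Start-End location: {text!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return GenomicLocation(match.group(1), start, end)


if __name__ == "__main__":
    plain = parse_location("7:55020388-55257987")
    prefixed = parse_location("Chr7:55020388-55257987")
    print(plain == prefixed, plain.length)
```

Normalising the chromosome name on the way in is one simple way to keep two naming styles from being treated as two different places.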

BioGraphics (BioPerl)
Location
Track
Feature
Paradigm: ‘Feature on Track at Location’
(Chromosome:Start-End)

And the Version does matter...
V.37
V.38
Feature
Track
Location
From: http://genome.ucsc.edu/
Also: http://www.ensembl.org/

Ensembl: a ‘Genome Browser’
• Displays genomic information: sequence and ‘annotation’
• Annotation ~= ‘Signpost’: a form of ‘Markup’ of interesting regions and what they mean.
• Created in LAMP: Linux/Apache/MySQL/Perl
• Now also a lot of JavaScript on the browser side
• The modules for manipulation and drawing are released as ‘BioPerl’

DNA Sequencing = Data Explosion
• The current dominant technology is ‘Sequencing By Synthesis’, patented & developed by Illumina
• Originally ‘Solexa’ a British company (~2007)
• Is essentially a microscope, a couple of lasers
+ detectors and coordinated chemical pumps
• Other technologies:
• Roche / 454; Thermo / Ion Torrent; Oxford
Nanopore; PacBio
90%+

General Workflow For A Sample
• Extract physical sample
• Easy if it is skin…otherwise: punch biopsy?
• Extract the DNA
• Sequence
• Align to a ‘Reference Genome’
• Look for differences
• Differences = causal, errors or general damage
• It's like a ‘bomb’ blast in some cancers

Actually: Tumor / Normal
• DNA from a non-cancerous part of the patient
is better than the ‘reference genome’
• Also it is available, if at double the cost
Tumor vs HG19 Reference (composite): possible
Tumor vs Normal: better

Small Device: NextSeq 500…
2ft or 60cm
http://www.illumina.com/systems/nextseq-sequencer.html

FlowCell: 8 ‘lanes’
Injection
point
From: http://www.illumina.com/systems/nextseq-sequencer.html

…But Big Data
• Big: lots of individual DNA sequences ‘reads’
generated
• 800 000 sequences per run
• Long: allows easier identification of common
parts / disruptions
• 300 nucleotides
• The two combined give ‘High Coverage’
• Higher coverage =~ more confidence
• Higher coverage =~ better detection of rare events
in impure samples (as cancers tend to be)
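
How read count and read length combine into coverage can be written as a one-line estimate: mean coverage = (number of reads × read length) / genome size. The sketch below assumes every read aligns and that reads are spread evenly, which real runs never quite achieve; the function names are made up for this example.

```python
import math


def mean_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced (idealised)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return read_count * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Reads required to reach a target mean coverage (idealised)."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    human_genome = 3_381_944_086   # base count quoted from Ensembl in this talk
    read_length = 300              # nucleotides, as quoted above
    # 30x is an illustrative target, not a figure from the talk.
    print(reads_needed(30, read_length, human_genome))
```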

Vendor Supplied Software
• Generally, terrible!
• As in truly awful…
• Equipment manufacturers cannot seem to make good software.
• Illumina is the exception
• Finally: the Solexa GAP was Unix, CLI and used ‘GNU
make’

Basic Process: Data Flow
CASAVA

Better now, but how it was…
Why these
were removed!

…as it is now
• Images not produced: ~10TB?
• Base Calls: 351 GB
• Alignments (BAM): 25GB
• VCF: <200 MB

Illumina – the Company
+450% in 5 years
P/E= 82

Read / Sequence Alignment
• Simply (?)
• Find the best match of each sequenced read from
the sequence on the reference genome
• Essentially a pattern matching problem
• Done properly, with a ‘dynamic programming’ algorithm (circa 1970), the cost grows with read length × reference length: far too slow for every read against a whole genome
• Actually, by modern standards very little ‘dynamic’ about
it…
• Hence, many aligners with their own ‘tricks’
• Typically, a lookup table or a fast indexing of ‘k-mers’ or
trade something for something
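
To make the ‘lookup table of k-mers’ trick concrete, here is a toy sketch. It is not how NovoAlign or BWA work internally (BWA, for one, is built on a Burrows-Wheeler index); it only shows the general idea of indexing the reference once and letting each k-mer of a read ‘vote’ for where the read starts. The function names and the tiny reference string are invented for this example.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict, k: int) -> list:
    """Return (reference_start, votes) pairs, best-supported first.

    Each k-mer of the read that is found in the index votes for the
    reference position at which the whole read would have to start.
    """
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    return votes.most_common()


if __name__ == "__main__":
    reference = "ACGTACGGTCATTGCAAGCTTAGGCCATAT"   # toy 'genome'
    index = build_kmer_index(reference, k=4)
    read_with_error = "TCATTGGAAGCT"                # one mismatch vs. the reference
    print(candidate_positions(read_with_error, index, k=4)[0])
```

The trade is memory for speed: the index is built once, and a read with a sequencing error still gathers enough votes from its undamaged k-mers. A real aligner would then check the few best candidates with a proper alignment.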

We use NovoAlign or BWA
http://bio-bwa.sourceforge.net/

DNA – Mutation to Cancer (I)

So it is complex...but in three slides:
From: http://upload.wikimedia.org/wikipedia/commons/7/73/Cancer_requires_multiple_mutations_from_NIHen.png
Normal cells undergo
multiple uncorrected
DNA mutations (changes)
Many mutations are
corrected properly: copied
from the other strand!

DNA – Mutation to Cancer (II)
From: ‘The hallmarks of cancer’: Cell 2000, http://www.ncbi.nlm.nih.gov/pubmed/10647931
Multiple mutations needed to cause Cancer: 6-8 ones in key genes

DNA – Mutation to Cancer (III)
• Metastasis =
‘Whack-A-Mole’
• With whacking
molecules, radiation,
viruses, immune
stimulants…
"Rare odditity (2060587599)" by hawken king - rare odditity.
Licensed under CC BY 2.0 via Wikimedia Commons –
http://commons.wikimedia.org/wiki/File:Rare_odditity_(2060587599).jpg#mediaviewer/File:Rare_odditity_(2060587599).jpg

Representation of DNA Sequence
Data

ASCII text is terrible: only 4 nucleotides (ATGC)

4 states need only 2 bits, but each character takes a byte
− So 6 bits wasted

Also pattern rich: the same motifs appear many
times
− Hence standard compression (zlib, gzip) works well

We do a lot of indexing
− Also much analysis done 'by chromosome'

Easy parallelisation!
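
The ‘2 bits instead of a byte’ point can be shown with a minimal packing sketch. The A/C/G/T-to-number mapping and the function names are arbitrary choices for this example, and it deliberately ignores everything real data also contains (such as ‘N’ for an unknown base), so it illustrates the idea rather than a usable file format.

```python
_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}   # arbitrary 2-bit assignment
_BASES = "ACGT"


def pack(sequence: str) -> bytes:
    """Pack a DNA string into 2 bits per base, 4 bases per byte."""
    packed = bytearray()
    for start in range(0, len(sequence), 4):
        chunk = sequence[start:start + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | _CODES[base]
        # Left-align a short final chunk so unpacking reads it first.
        byte <<= 2 * (4 - len(chunk))
        packed.append(byte)
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Reverse pack(); 'length' is needed to drop the padding bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(_BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


if __name__ == "__main__":
    sequence = "GATTACA"
    packed = pack(sequence)
    print(len(sequence), "characters ->", len(packed), "bytes")
    print(unpack(packed, len(sequence)) == sequence)
```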

FASTA / Q / Z
FASTA
FASTQ
FASTZ to ~ 30%

Common files – all ASCII
• FASTA (FASTQZ)
• (Q=With Quality Scores, Z=Zip)
• SAM (BAM)
• Sequence Alignment/Map Format (B=Binary, Compressed)
• 1 or 2 lines per read (800 Million Lines)
• VCF
• Variant Call Format
• 1 or 2 lines per variant (100 K Lines)
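
Because these files are plain ASCII, reading them takes very little code. The sketch below reads FASTQ under two assumptions that hold for most current data but not all: each record is exactly four lines (name, sequence, ‘+’, qualities) and quality scores are stored as Phred+33 characters. The names `FastqRecord` and `read_fastq` are made up for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ text stream."""
    while True:
        header = handle.readline()
        if not header:            # end of file
            return
        if not header.strip():    # tolerate blank lines between records
            continue
        sequence = handle.readline().rstrip("\n")
        handle.readline()         # the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:].strip(), sequence, scores)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in read_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

Reading one record at a time matters here: with hundreds of millions of lines, loading the whole file into memory is rarely an option.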

SAM / BAM
Simple – but comprehensive
http://samtools.github.io/hts-specs/SAMv1.pdf
Header
Alignment, 1 or 2
per read: 800 000
lines
Normally they are much, much simpler than this!
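
A single SAM alignment line is just tab-separated text with 11 mandatory columns, so the core of it can be pulled apart in a few lines. This sketch keeps only some of the columns and two of the flag bits (0x4 = read unmapped, 0x10 = read on the reverse strand); header lines, which start with ‘@’, are assumed to have been skipped by the caller. The class and function names and the example line are invented for illustration.

```python
from typing import NamedTuple


class SamAlignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int          # 1-based leftmost position on the reference
    mapping_quality: int
    cigar: str
    sequence: str

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse_strand(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamAlignment:
    """Parse one alignment line (not a header line) of a SAM file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory columns")
    return SamAlignment(
        read_name=fields[0],
        flag=int(fields[1]),
        reference=fields[2],
        position=int(fields[3]),
        mapping_quality=int(fields[4]),
        cigar=fields[5],
        sequence=fields[9],
    )


if __name__ == "__main__":
    line = "read1\t16\t7\t55020400\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII"
    alignment = parse_sam_line(line)
    print(alignment.reference, alignment.position, alignment.is_reverse_strand)
```

For BAM, the binary and compressed form, a library such as samtools or one of its bindings is the practical route rather than hand parsing.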

Output Files – ELAND Alignment
Output files are large (~1 GB each) and have ~10 million lines
Are machine / human readable (still) – see below
NB: variant of this format has individual base quality scores
Base Calls (Bustard)
Sequence mapping based on Genome
Physical Position (Firecrest)

VCF – Variant Call Format
• 3 samples (end columns), 4 variants
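
The ‘samples in the end columns’ layout can be read with a short sketch. A VCF data line has 8 fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), then a FORMAT column and one column per sample. The function name, the sample names and the example line below are illustrative only, and the sketch pulls out just the genotype (GT) for each sample.

```python
from typing import Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict:
    """Parse one VCF data line (not a '#' header line) into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    chrom, pos, ident, ref, alt = fields[:5]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": ident,
        "ref": ref,
        "alt": alt.split(","),      # several alternative alleles are allowed
        "filter": fields[6],
        "genotypes": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. 'GT:GQ'
        for name, column in zip(sample_names, fields[9:]):
            values = dict(zip(keys, column.split(":")))
            record["genotypes"][name] = values.get("GT")
    return record


if __name__ == "__main__":
    samples = ["sampleA", "sampleB", "sampleC"]   # hypothetical names
    line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14\tGT:GQ\t0|0:48\t1|0:48\t1/1:43"
    print(parse_vcf_line(line, samples)["genotypes"])
```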

CLOUD COMPUTING IN
BIOINFORMATICS

Cloud Computing – (i.e. GNOS)
Map image from: http://pixabay.com/en/world-map-map-world-black-earth-297446/
Cloud Repo. Centres
Data is deposited
Analysis by different groups

Analysis Frameworks: SeqWare
https://seqware.github.io/about/

..or as a picture
https://seqware.github.io/about/

Or: Arvados
Has some interesting features regarding
‘auto re-run’ of workflows on file change

Illumina has a cloud too!
• Called ‘BaseSpace’
• Direct connect from Sequencer
• Illumina SLA
• (No good for us)
• ‘One-Way’ trip: can’t deposit data elsewhere.
• (No good for us)
• But it is free
• For the odd TB (not the 2PB we have)
Now with Apps!
Ok, it is great if you are a small lab, doing
standard things

Sharing: Part of the Scientific
Method
• To share to improve analysis
• Allow experts to process the data in novel ways
• To support published conclusions
• ‘I ain’t making this stuff up…’
• Allows ‘meta-analysis’
• Bigger is better
• Allows International Consortia

The EGA
European Genome-phenome Archive
From: https://www.ebi.ac.uk/ega/about

General Arrangement
Goal is to transfer files for this evening
- ignore the metadata needed for them

FTP Server is Slow, but Functional

Introducing Aspera
http://asperasoft.com/technology/transport/fasp/

Aspera, not TCP, in the Real World
http://asperasoft.com/technology/transport/fasp/#faspsolution-464

Typical Aspera transfer command (here, a local file sent to an EGA box)
./ascp -C1:2 -l500M -QT /var/tmp/test.10GB.rand ega-box-358@fasp.ega.ebi.ac.uk:
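
Why the transfer rate matters at this scale is simple arithmetic: time = size × 8 / rate. The sketch below assumes that -l500M means a target of 500 megabits per second, that the link actually sustains that rate, and that sizes are in decimal units (1 GB = 10^9 bytes); the function names are made up for this example.

```python
def transfer_seconds(size_bytes: float, rate_bits_per_second: float) -> float:
    """Idealised transfer time: no protocol overhead, constant rate."""
    if rate_bits_per_second <= 0:
        raise ValueError("rate must be positive")
    return size_bytes * 8 / rate_bits_per_second


def format_duration(seconds: float) -> str:
    """Render seconds as hours, minutes and seconds."""
    hours, remainder = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


if __name__ == "__main__":
    rate = 500e6                      # assumed meaning of -l500M
    test_file = 10e9                  # the 10 GB test file in the command above
    submission = 40e12                # low end of the ~40-80TB per submission
    print(format_duration(transfer_seconds(test_file, rate)))
    print(format_duration(transfer_seconds(submission, rate)))
```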
