Release Notes / Changes
Initially given on 12th February 2015 to the ‘Toronto Social, Mobile, Analytics & Cloud Meet-Up’.
Minor typing corrections made on 23rd February; slight change to advertised title.
The major change is the addition of extra screenshots describing the –l parameter of the Aspera client that allows it to operate at high speed.
Contact: mjminformatics@gmail.com
(or secondary: michael.moorhouse@oicr.on.ca)
Managing & Processing
Big Data for Cancer Genomics,
an Insight Into Bioinformatics
About Me – Michael Moorhouse

'Automation Engineer' at OICR in the 'Genomic
Informatics' Team

Bioinformatician by training

British
1994: BSc Bioscience
1998: MSc Bioinformatics
2003: PhD Bioscience
Images from: http://www.clipartpanda.com/
Oh, Canada!
2x Netherlands
1x UK
~10 years
About me – Michael J Moorhouse
IT/ComputerSci.
Biology/Medicine
Maths / Stats
Why… we care
What we use
Bioinformatician:
Aka 'Computational Molecular Biology'
or 'Bio-Computing'
We use everything we can borrow
from FOSS …
• Rapid development
• ‘Just Download it, try it’
• Google it if you need the manual / YouTube demo
• Linux (a lot)
• Though I’m still a Windoze guy (I love my
CorelDraw!)
• Eclipse / NetBeans
• GNU – anything!
• Perl/Python/Web Widgets/Apache
My Day Job
From: http://oicr.on.ca//files/public/OICRSlidedeck4February2015.pdf
‘Big Data’ at OICR (DNA Sequencing)
Features:
Large: 500GB per sequencer run, 10 000s of files
~40-80TB per database submission
Variety: Raw scans, DNA sequences, ‘Biological’
Variation (aka: SNPs, CNVs, SSM), project data, sample
information
Velocity: created 5-8TB a week
Complexity: different sequencing platforms, levels of
analysis
Meta Data
Privacy / Confidentiality:
Mostly of human origin
Often linked to ‘clinical data’ (tumor, normal, survival)
sometimes PHI (Personal Health Information)
OICR Compute Resources
~8000+ processor nodes – i.e. typical
Current Data Storage: SeqProdBio
HDD Image from: http://www.wdbrand.com/images/products/img6/lores/wdfEnterprise_RESATA.jpg
2 such racks in room;
2 in other datacenters
TB TB
Copy 4
per night
2PB
Total
The world of Biology...

It's about life:
− No 'manual'
− Specification unclear
− Highly varied
− Very complex

Interaction even more complex

Always an exception
− Molecular Level Storage

Digital
− Self replicating
Typical 'Simple' Pathway
http://www.genome.jp/kegg-bin/show_pathway?map01010
Biological Terms & Technology I'll
Explain First - Briefly

DNA, RNA, Protein

Gene, Genome, Genome Reference, Genomic
Location, Mutation / Variant (SNP), Annotation

DNA Sequencing, Alignment

Cancer, Tumour, Normal
Triplet Code, Folding, Structure and
Function

Good mapping from:
DNA -> RNA -> Structure -> Function
− Mostly!
From: http://upload.wikimedia.org/wikipedia/commons/d/d4/RNA-codons.png and:
http://www.rcsb.org/pdb/101/motm.do?momID=126
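The DNA -> RNA -> protein mapping can be tried out in a few lines. This is a minimal sketch in Python, assuming the input is the coding strand and using the standard genetic code; the function names `transcribe` and `translate` are illustrative and not taken from any library.

```python
from itertools import product

# Standard genetic code, laid out in the conventional U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA to messenger RNA: T becomes U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Read the RNA three bases at a time until a stop codon ('*')."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate(transcribe("ATGGCCTAA")))
```

The table only covers the 'Good mapping' half of the slide: the step from sequence to structure and function is where the 'Mostly!' comes in.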
An Aside: Computer Graphics

Rasmol – from 1993

Then 'Doom' = OpenGL drivers
Probably Bankrupted Silicon Graphics Inc!
From my MSc Thesis,
1998
DNA is packaged into Genes,
Chromosomes and Genomes

From the Ensembl Genome Browser:
3,381,944,086
DNA Bases in
the Human
Genome
Chromosome
~124 Mb
(Mega Bases)
The current
'Reference
Genome'
Genes & Genomic Locations

Gene is a tricky concept...but...:
'convenient conceptual unit of coordinated DNA
features'
− Usually makes one protein
7:55020388-55257987
Chr7:55020388-55257987
Chromosome Start End
And, yes, there is a name-space problem...
This is Bioinformatics.
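The Chromosome:Start-End notation is easy to handle in code. A minimal sketch in Python follows; `GenomicLocation` and `parse_location` are hypothetical names made up for this example (they are not BioPerl or Ensembl calls). It accepts both '7' and 'Chr7', which is one small instance of the name-space problem.

```python
import re
from typing import NamedTuple


class GenomicLocation(NamedTuple):
    chromosome: str
    start: int
    end: int


# Optional 'chr' prefix in any case, then name, ':', start, '-', end.
_LOCATION = re.compile(r"^(?:chr)?([0-9A-Za-z]+):(\d+)-(\d+)$", re.IGNORECASE)


def parse_location(text: str) -> GenomicLocation:
    """Parse 'Chromosome:Start-End', tolerating a leading 'chr'."""
    match = _LOCATION.match(text.strip())
    if match is None:
        raise ValueError(f"not a Chromosome:Start-End location: {text!r}")
    chromosome = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("start must not exceed end")
    return GenomicLocation(chromosome, start, end)


if __name__ == "__main__":
    location = parse_location("Chr7:55020388-55257987")
    print(location, location.end - location.start + 1, "bases")
```

A location only means something together with the reference version it was measured against, so real code would carry that along too.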
BioGraphics (BioPerl)
Paradigm: ‘Feature on Track at Location’
(Chromosome:Start-End)
And the version does matter: V.37 vs V.38
[Screenshot labels: Feature, Track, Location]
From: http://genome.ucsc.edu/
Also: http://www.ensembl.org/
Ensembl a ‘Genome Browser’
• Displays genomic information: sequence and
‘annotation’
• Annotation ~= ‘Signpost’ a form of ‘Markup’ of
interesting regions, what they mean.
• Created in LAMP: Linux/Apache/MySQL/Perl
• Now also a lot of Javascript on the Browser side
• The modules for manipulation, drawing released as
‘Bioperl’
THE SEQUENCING TECHNOLOGY
DNA Sequencing = Data Explosion
• The current dominant technology is
‘Sequencing By Synthesis’ patented &
developed by Illumina
• Originally ‘Solexa’ a British company (~2007)
• Is essentially a microscope, a couple of lasers
+ detectors and coordinated chemical pumps
• Other technologies:
• Roche / 454; Thermo / Ion Torrent; Oxford
Nanopore; PacBio
90%+
General Workflow For A Sample
• Extract physical sample
• Easy if it is skin…otherwise: punch biopsy?
• Extract the DNA
• Sequence
• Align to a ‘Reference Genome’
• Look for differences
• Differences = causal, errors or general damage
• It’s like a ‘bomb’ blast in some cancers
Actually: Tumor / Normal
• DNA from a non-cancerous part of the patient
is better than the ‘reference genome’
• Also it is available, if at double the cost
HG19 Reference
(composite)
Normal
Tumor
Better
Possible
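The tumor / normal idea boils down to a set difference: whatever the tumor has that the patient's own normal tissue lacks. A minimal sketch in Python, assuming variants have already been called for both samples; the `Variant` tuple, the `somatic_candidates` function and the example values are all invented for illustration.

```python
from typing import Iterable, NamedTuple, Set


class Variant(NamedTuple):
    chromosome: str
    position: int
    reference: str
    alternate: str


def somatic_candidates(tumor: Iterable[Variant],
                       normal: Iterable[Variant]) -> Set[Variant]:
    """Variants seen in the tumor but not in the matched normal sample."""
    return set(tumor) - set(normal)


if __name__ == "__main__":
    # Made-up variants, purely for illustration.
    normal = {Variant("7", 100, "A", "G")}
    tumor = {Variant("7", 100, "A", "G"), Variant("7", 200, "C", "T")}
    print(somatic_candidates(tumor, normal))
```

Real callers also have to cope with sequencing errors and impure samples, so an exact-match subtraction like this is only the starting point.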
Small Device: NextSeq 500…
2ft or 60cm
http://www.illumina.com/systems/nextseq-sequencer.html
FlowCell: 8 ‘lanes’
Injection
point
From: http://www.illumina.com/systems/nextseq-sequencer.html
…But Big Data
• Big: lots of individual DNA sequences ‘reads’
generated
• 800 000 sequences per run
• Long: allows easier identification of common
parts / disruptions
• 300 nucleotides
• The two combined give ‘High Coverage’
• Higher coverage =~ more confidence
• Higher coverage =~ better detection of rare events
in impure samples (as cancers tend to be)
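Coverage is commonly estimated as number of reads × read length ÷ genome size. A minimal sketch in Python; the function name `mean_coverage` is made up here, the read length and genome size are the figures from these slides, and the read count in the usage line is an assumption (800 million, the line count quoted for a SAM file later on).

```python
def mean_coverage(read_count: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base of the genome."""
    return read_count * read_length / genome_size


if __name__ == "__main__":
    HUMAN_GENOME_BASES = 3_381_944_086  # figure from the Ensembl slide
    # Read count is an assumed value for illustration.
    print(round(mean_coverage(800_000_000, 300, HUMAN_GENOME_BASES), 1))
```

With those inputs the estimate comes out at roughly 71-fold. It is an average only: real coverage varies a lot from region to region.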
Vendor Supplied Software
• Generally, terrible!
• As in truly awful…
• Equipment manufacturers cannot seem to make
good software.
• Illumina is the exception
• Finally: the Solexa GAP was Unix, CLI and used ‘GNU
make’
Basic Process: Data Flow
CASAVA
Better now, but how it was…
Why these
were removed!
…as it is now
• Images not produced: ~10TB?
• Base Calls: 351 GB
• Alignments (BAM): 25GB
• VCF: <200 MB
Illumina – the Company
+450% in 5 years
P/E= 82
Read / Sequence Alignment
• Simply (?)
• Find the best match of each sequenced read from
the sequence on the reference genome
• Essentially a pattern matching problem
• Exact alignment by the ‘dynamic programming’ algorithm (circa 1970)
takes time proportional to read length × reference length: far too
slow for every read against a whole genome
• Actually, by modern standards very little ‘dynamic’ about
it…
• Hence, many aligners with their own ‘tricks’
• Typically, a lookup table or a fast indexing of ‘k-mers’ or
trade something for something
We use NovoAlign or BWA
http://bio-bwa.sourceforge.net/
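The 'lookup table of k-mers' trick can be shown on a toy scale. This is a minimal seed-and-verify sketch in Python, assuming substitutions only (no insertions or deletions); all function names are invented here. It is not how NovoAlign or BWA work internally: BWA, for instance, is built on a Burrows-Wheeler index rather than a plain hash of k-mers.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Map every k-mer in the reference to the positions where it starts."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read: str, reference: str, index: Dict[str, List[int]],
               k: int, max_mismatches: int = 2) -> Optional[Tuple[int, int]]:
    """Return (start, mismatches) of the best placement, or None."""
    best: Optional[Tuple[int, int]] = None
    tried = set()
    for offset in range(len(read) - k + 1):
        # Seed: exact k-mer hits suggest where the whole read might sit.
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            # Verify: compare the full read at the candidate position.
            mismatches = hamming(read, reference[start:start + len(read)])
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


if __name__ == "__main__":
    reference = "TTGACCATGCAGGCTAACGTTCAG"  # made-up sequence
    index = build_kmer_index(reference, 4)
    print(align_read("ATGCATGCTA", reference, index, 4))
```

The trade is memory for speed: the index is built once per reference and reused for every read.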
CANCER
DNA – Mutation to Cancer (I)

So it is complex...but in three slides:
From: http://upload.wikimedia.org/wikipedia/commons/7/73/Cancer_requires_multiple_mutations_from_NIHen.png
Normal cells undergo
multiple uncorrected
DNA mutations (changes)
Many mutations are
corrected properly: copied
from the other strand!
DNA – Mutation to Cancer (II)
From: ‘The hallmarks of cancer’: Cell 2000, http://www.ncbi.nlm.nih.gov/pubmed/10647931
Multiple mutations needed to cause Cancer: 6-8 in key genes
DNA – Mutation to Cancer (III)
• Metastasis =
‘Whack-A-Mole’
• With whacking
molecules, radiation,
viruses, immune
stimulants…
"Rare odditity (2060587599)" by hawken king - rare odditity.
Licensed under CC BY 2.0 via Wikimedia Commons –
http://commons.wikimedia.org/wiki/File:Rare_odditity_(2060587599).jpg#mediaviewer/File:Rare_odditity_(2060587599).jpg
REAL DATA
Representation of DNA Sequence
Data

ASCII text is terrible: 4 nucleotides (ATGC)

4 states need only 2 bits of each 8-bit byte
− So 6 wasted

Also pattern rich: the same motifs appear many
times
− Hence standard compression (zlib, gzip) works well

We do a lot of indexing
− Also much analysis done 'by chromosome'

Easy parallelisation!
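The '2 bits, not 8' point can be made concrete. A minimal sketch in Python that packs four bases into each byte and, for comparison, runs zlib over the plain ASCII; it assumes the sequence contains only A, C, G and T (no N or other ambiguity codes), and the function names are made up for this example.

```python
import zlib

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def pack(sequence: str) -> bytes:
    """Store four bases per byte, two bits each."""
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        packed.append(byte)
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Reverse of pack(); 'length' trims the padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


def zlib_size(sequence: str) -> int:
    """Size in bytes of the ASCII sequence after zlib compression."""
    return len(zlib.compress(sequence.encode("ascii")))


if __name__ == "__main__":
    sequence = "ACGTTGCA" * 500  # made-up, highly repetitive example
    print(len(sequence), len(pack(sequence)), zlib_size(sequence))
```

Packing always gives a quarter of the ASCII size; how much zlib saves depends on how repetitive the sequence is.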
FASTA / Q / Z
FASTA
FASTQ
FASTZ to ~ 30%
Common files – all ASCII
• FASTA (FASTQZ)
• (Q=With Quality Scores, Z=Zip)
• SAM (BAM)
• Sequence Alignment/Map format (B=Binary,
Compressed)
• 1 or 2 lines per read (800 Million Lines)
• VCF
• Variant Call Format
• 1 or 2 lines per variant (100 K Lines)
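Because these files are plain ASCII, a reader takes only a few lines. A minimal sketch in Python for FASTQ, assuming the common four-lines-per-record layout and Phred+33 quality encoding (older files may use a different offset); `FastqRecord` and `read_fastq` are names invented for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:].strip(), sequence, scores)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")  # made-up record
    for record in read_fastq(example):
        print(record)
```

For gzip-compressed files the same function works on a handle opened with `gzip.open(path, "rt")`.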
SAM / BAM
Simple – but comprehensive
http://samtools.github.io/hts-specs/SAMv1.pdf
Header
Alignment, 1 or 2
per read: 800 000
lines
Normally they are much, much simpler than this!
Try this instead…
Output Files – ELAND Alignment
Output files are large (~1 GB each) and have ~10 million lines
Are machine / human readable (still) – see below
NB: variant of this format has individual base quality scores
Base Calls
(Bustard)
Sequence mapping based on GenomePhysical Position
(Firecrest)
VCF – Variant Call Format
• 3 samples (end columns), 4 variants
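Reading such a file is mostly splitting on tabs. A minimal sketch in Python, assuming a tab-separated VCF whose fixed columns run CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, followed by one column per sample; it keeps only the genotype (GT) for each sample. `VcfRecord` and `parse_vcf` are invented names, and the example lines are made up.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple


class VcfRecord(NamedTuple):
    chromosome: str
    position: int
    reference: str
    alternates: List[str]
    genotypes: Dict[str, str]  # sample name -> GT string such as "0/1"


def parse_vcf(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Yield one record per variant line, skipping the header lines."""
    samples: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information header
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        keys = fields[8].split(":") if len(fields) > 8 else []
        genotypes = {
            sample: dict(zip(keys, value.split(":"))).get("GT", ".")
            for sample, value in zip(samples, fields[9:])
        }
        yield VcfRecord(fields[0], int(fields[1]), fields[3],
                        fields[4].split(","), genotypes)


if __name__ == "__main__":
    example = [  # made-up data: three samples, one variant
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
        "7\t55020500\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t1/1",
    ]
    for record in parse_vcf(example):
        print(record)
```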
CLOUD COMPUTING IN
BIOINFORMATICS
Cloud Computing – (i.e. GNOS)
Map image from: http://pixabay.com/en/world-map-map-world-black-earth-297446/
Cloud
Repo.
Centres
Data is
deposited
Analysis by
different groups
Analysis Frameworks: Seqware
https://seqware.github.io/about/
..or as a picture
https://seqware.github.io/about/
Or: Arvados
Has some interesting features regarding
‘auto re-run’ of workflows on file change
Illumina has a cloud too!
• Called ‘BaseSpace’
• Direct connect from Sequencer
• Illumina SLA
• (No good for us)
• ‘One-Way’ trip: can’t deposit data elsewhere.
• (No good for us)
• But it is free
• For the odd TB (not the 2PB we have)
Now with Apps!
Ok, it is great if you are a small lab, doing
standard things
DATA DEPOSITION
Sharing: Part of the Scientific
Method
• To share to improve analysis
• Allow experts to process the data in novel ways
• To support published conclusions
• ‘I ain’t making this stuff up…’
• Allows ‘meta-analysis’
• Bigger is better
• Allows International Consortia
The EGA
European Genome-phenome Archive
From: https://www.ebi.ac.uk/ega/about
Location: in the UK
1km
General Arrangement
Goal is to transfer files for this evening
- ignore the metadata needed for them
FTP Server is Slow, but Functional
Introducing Aspera
http://asperasoft.com/technology/transport/fasp/
Aspera not TCP in the Real World
http://asperasoft.com/technology/transport/fasp/#faspsolution-464
Typical Aspera upload command
./ascp -C1:2 -l500M -QT /var/tmp/test.10GB.rand ega-box-358@fasp.ega.ebi.ac.uk:
Tweakable: 750M
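A quick back-of-envelope check of what those rates mean. A minimal sketch in Python, assuming the value after -l is a target rate in megabits per second, that the test file is 10 GB in decimal units (as its name suggests), and that the link actually sustains the target; `transfer_seconds` is a name made up for this example.

```python
def transfer_seconds(size_bytes: int, rate_megabits_per_second: float) -> float:
    """Ideal transfer time: bits to send divided by bits per second."""
    return size_bytes * 8 / (rate_megabits_per_second * 1_000_000)


if __name__ == "__main__":
    ten_gigabytes = 10 * 10**9  # assumed size of test.10GB.rand
    for rate in (500, 750):
        print(rate, "Mbit/s:", round(transfer_seconds(ten_gigabytes, rate)), "s")
```

At 500 Mbit/s the 10 GB file needs 160 seconds at best; real transfers will be slower once disk and protocol overheads are counted.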
Updated: Aspera Full-Throttle
Aspera: no –l parameter

Weitere ähnliche Inhalte

Was ist angesagt?

How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeHow to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeLex Nederbragt
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingStephen Turner
 
The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writingKeith Bradnam
 
Knowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsKnowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsGolden Helix Inc
 
RNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionRNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionJatinder Singh
 
Creating a Planetary Scale OptIPuter
Creating a Planetary Scale OptIPuterCreating a Planetary Scale OptIPuter
Creating a Planetary Scale OptIPuterLarry Smarr
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...QBiC_Tue
 
Using field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomicsUsing field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomicsJoe Parker
 
Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Instituteinside-BigData.com
 

Was ist angesagt? (13)

2014 villefranche
2014 villefranche2014 villefranche
2014 villefranche
 
2014 naples
2014 naples2014 naples
2014 naples
 
How to sequence a large eukaryotic genome
How to sequence a large eukaryotic genomeHow to sequence a large eukaryotic genome
How to sequence a large eukaryotic genome
 
Examining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencingExamining gene expression and methylation with next gen sequencing
Examining gene expression and methylation with next gen sequencing
 
The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writing
 
Basics of Genome Assembly
Basics of Genome Assembly Basics of Genome Assembly
Basics of Genome Assembly
 
Knowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsKnowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and Variants
 
RNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential ExpressionRNASeq - Analysis Pipeline for Differential Expression
RNASeq - Analysis Pipeline for Differential Expression
 
Creating a Planetary Scale OptIPuter
Creating a Planetary Scale OptIPuterCreating a Planetary Scale OptIPuter
Creating a Planetary Scale OptIPuter
 
2014 ucl
2014 ucl2014 ucl
2014 ucl
 
Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...Data Management for Quantitative Biology - Data sources (Next generation tech...
Data Management for Quantitative Biology - Data sources (Next generation tech...
 
Using field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomicsUsing field-based DNA sequencing to accelerate phylogenomics
Using field-based DNA sequencing to accelerate phylogenomics
 
Managing Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger InstituteManaging Genomics Data at the Sanger Institute
Managing Genomics Data at the Sanger Institute
 

Andere mochten auch

Design thinking
Design thinkingDesign thinking
Design thinkingRaul Chong
 
How to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationHow to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationJoaquin Dopazo
 
BLU at IOD: Highlights from 2013
BLU at IOD: Highlights from 2013BLU at IOD: Highlights from 2013
BLU at IOD: Highlights from 2013IBM Analytics
 
What has IBM Watson been up to since the Jeopardy! challenge?
What has IBM Watson been up to since the Jeopardy! challenge?What has IBM Watson been up to since the Jeopardy! challenge?
What has IBM Watson been up to since the Jeopardy! challenge?Raul Chong
 
Risk and financial portfolio analytics - A technical Introduction
Risk and financial portfolio analytics - A technical IntroductionRisk and financial portfolio analytics - A technical Introduction
Risk and financial portfolio analytics - A technical IntroductionRaul Chong
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2Raul Chong
 
The need to redefine genomic data sharing - moving towards Open Science Oct ...
The need to redefine genomic data sharing - moving towards Open Science  Oct ...The need to redefine genomic data sharing - moving towards Open Science  Oct ...
The need to redefine genomic data sharing - moving towards Open Science Oct ...Fiona Nielsen
 
Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Learning and Development Freelancer
 
Genomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osuGenomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osuBen Busby
 
Sprint Review and Planning Template
Sprint Review and Planning TemplateSprint Review and Planning Template
Sprint Review and Planning TemplateMike Lally
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...Ilkay Altintas, Ph.D.
 
Optimising Big Data Analytics for Non-Aeronautical Activity
Optimising Big Data Analytics for Non-Aeronautical ActivityOptimising Big Data Analytics for Non-Aeronautical Activity
Optimising Big Data Analytics for Non-Aeronautical ActivityConcessionaire Analyzer+
 
Advanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osuAdvanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osuBen Busby
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilSpark Summit
 
Business Analytics Module 5 14MBA14 according to New VTU syllabus
Business Analytics  Module 5 14MBA14 according to New VTU syllabusBusiness Analytics  Module 5 14MBA14 according to New VTU syllabus
Business Analytics Module 5 14MBA14 according to New VTU syllabusprasadkulkarnigit
 
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking ExamplePredictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking ExamplePedro Ecija Serrano
 

Andere mochten auch (20)

Design thinking
Design thinkingDesign thinking
Design thinking
 
How to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical informationHow to transform genomic big data into valuable clinical information
How to transform genomic big data into valuable clinical information
 
BLU at IOD: Highlights from 2013
BLU at IOD: Highlights from 2013BLU at IOD: Highlights from 2013
BLU at IOD: Highlights from 2013
 
What has IBM Watson been up to since the Jeopardy! challenge?
What has IBM Watson been up to since the Jeopardy! challenge?What has IBM Watson been up to since the Jeopardy! challenge?
What has IBM Watson been up to since the Jeopardy! challenge?
 
Risk and financial portfolio analytics - A technical Introduction
Risk and financial portfolio analytics - A technical IntroductionRisk and financial portfolio analytics - A technical Introduction
Risk and financial portfolio analytics - A technical Introduction
 
Scrum in a nutshell
Scrum in a nutshellScrum in a nutshell
Scrum in a nutshell
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 
Coaching poster
Coaching posterCoaching poster
Coaching poster
 
The need to redefine genomic data sharing - moving towards Open Science Oct ...
The need to redefine genomic data sharing - moving towards Open Science  Oct ...The need to redefine genomic data sharing - moving towards Open Science  Oct ...
The need to redefine genomic data sharing - moving towards Open Science Oct ...
 
Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...Key knowledge, skills and behaviours required by Learning and Development Pro...
Key knowledge, skills and behaviours required by Learning and Development Pro...
 
Genomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osuGenomic futures v_pitt_kent_osu
Genomic futures v_pitt_kent_osu
 
Sprint Review and Planning Template
Sprint Review and Planning TemplateSprint Review and Planning Template
Sprint Review and Planning Template
 
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis o...
 
Optimising Big Data Analytics for Non-Aeronautical Activity
Optimising Big Data Analytics for Non-Aeronautical ActivityOptimising Big Data Analytics for Non-Aeronautical Activity
Optimising Big Data Analytics for Non-Aeronautical Activity
 
Advanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osuAdvanced genomics v_medical_pitt_kent_osu
Advanced genomics v_medical_pitt_kent_osu
 
Business Analytics
 Business Analytics  Business Analytics
Business Analytics
 
Processing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And ToilProcessing 70Tb Of Genomics Data With ADAM And Toil
Processing 70Tb Of Genomics Data With ADAM And Toil
 
Business Analytics Module 5 14MBA14 according to New VTU syllabus
Business Analytics  Module 5 14MBA14 according to New VTU syllabusBusiness Analytics  Module 5 14MBA14 according to New VTU syllabus
Business Analytics Module 5 14MBA14 according to New VTU syllabus
 
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking ExamplePredictive Analytics for Customer Targeting: A Telemarketing Banking Example
Predictive Analytics for Customer Targeting: A Telemarketing Banking Example
 
Module 5
Module 5Module 5
Module 5
 

Ähnlich wie Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics

Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesGuy Coates
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation SequencingEdizonJambormias2
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517M. Gonzalo Claros
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4c.titus.brown
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Microarray
MicroarrayMicroarray
Microarrayjain7177
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchDavid Ruau
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud ExperiencesGuy Coates
 
Benevolent machine learning
Benevolent machine learningBenevolent machine learning
Benevolent machine learningScott Turner
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Bioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxBioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxxRowlet
 
Hpai class 15 - genes, mini-modules, and learning
Hpai   class 15 - genes, mini-modules, and learningHpai   class 15 - genes, mini-modules, and learning
Hpai class 15 - genes, mini-modules, and learningmelendez321
 
DNA analysis on your laptop: Spot the differences
DNA analysis on your laptop: Spot the differencesDNA analysis on your laptop: Spot the differences
DNA analysis on your laptop: Spot the differencesBarbera van Schaik
 

Ähnlich wie Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics (20)

2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Next generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciencesNext generation genomics: Petascale data in the life sciences
Next generation genomics: Petascale data in the life sciences
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517De los rasgos poligénicos a los poligenómicos 250517
De los rasgos poligénicos a los poligenómicos 250517
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
Microarray
MicroarrayMicroarray
Microarray
 
Cool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical ResearchCool Informatics Tools and Services for Biomedical Research
Cool Informatics Tools and Services for Biomedical Research
 
Climb bath
Climb bathClimb bath
Climb bath
 
Cloud Experiences
Cloud ExperiencesCloud Experiences
Cloud Experiences
 
Benevolent machine learning
Benevolent machine learningBenevolent machine learning
Benevolent machine learning
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Bioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxBioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptx
 
Hpai class 15 - genes, mini-modules, and learning
Hpai   class 15 - genes, mini-modules, and learningHpai   class 15 - genes, mini-modules, and learning
Hpai class 15 - genes, mini-modules, and learning
 
DNA analysis on your laptop: Spot the differences
DNA analysis on your laptop: Spot the differencesDNA analysis on your laptop: Spot the differences
DNA analysis on your laptop: Spot the differences
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
T1 2018 bioinformatics
T1 2018 bioinformaticsT1 2018 bioinformatics
T1 2018 bioinformatics
 

Mehr von Raul Chong

Introducing Bluemix
Introducing BluemixIntroducing Bluemix
Introducing BluemixRaul Chong
 
Business Analytics and Optimization Introduction (part 2)
Business Analytics and Optimization Introduction (part 2)Business Analytics and Optimization Introduction (part 2)
Business Analytics and Optimization Introduction (part 2)Raul Chong
 
Business Analytics and Optimization Introduction
Business Analytics and Optimization IntroductionBusiness Analytics and Optimization Introduction
Business Analytics and Optimization IntroductionRaul Chong
 
SMAC projects - The best summer internship experience I ever had!
SMAC projects - The best summer internship experience I ever had!SMAC projects - The best summer internship experience I ever had!
SMAC projects - The best summer internship experience I ever had!Raul Chong
 
Starting your education in big data - Sneak peek to the new Big Data University
Starting your education in big data - Sneak peek to the new Big Data UniversityStarting your education in big data - Sneak peek to the new Big Data University
Starting your education in big data - Sneak peek to the new Big Data UniversityRaul Chong
 
Developing wearable technology apps quickly
Developing wearable technology apps quicklyDeveloping wearable technology apps quickly
Developing wearable technology apps quicklyRaul Chong
 
Mobile solutions for iOS (and other platforms) - Cloudant
Mobile solutions for iOS (and other platforms) - CloudantMobile solutions for iOS (and other platforms) - Cloudant
Mobile solutions for iOS (and other platforms) - CloudantRaul Chong
 
Mobile solutions for iOS (and other platforms) - Worklight
Mobile solutions for iOS (and other platforms) - WorklightMobile solutions for iOS (and other platforms) - Worklight
Mobile solutions for iOS (and other platforms) - WorklightRaul Chong
 
Rapidly developing IoT (Internet of Things) applications - Part 2: Arduino, B...
Rapidly developing IoT (Internet of Things) applications - Part 2: Arduino, B...Rapidly developing IoT (Internet of Things) applications - Part 2: Arduino, B...
Rapidly developing IoT (Internet of Things) applications - Part 2: Arduino, B...Raul Chong
 
An Intro to Text Analytics on Big Data with a use case
An Intro to Text Analytics on Big Data with a use caseAn Intro to Text Analytics on Big Data with a use case
An Intro to Text Analytics on Big Data with a use caseRaul Chong
 
0626 2014 01_toronto-smac meetup_io_t
0626 2014 01_toronto-smac meetup_io_t0626 2014 01_toronto-smac meetup_io_t
0626 2014 01_toronto-smac meetup_io_tRaul Chong
 
02 a holistic approach to big data
02 a holistic approach to big data02 a holistic approach to big data
02 a holistic approach to big dataRaul Chong
 
0430 toronto smac_meetup_worklight_intro_final
0430 toronto smac_meetup_worklight_intro_final0430 toronto smac_meetup_worklight_intro_final
0430 toronto smac_meetup_worklight_intro_finalRaul Chong
 

Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics

  • 1. Release Notes / Changes Initially given on 12th February 2015 to the ‘Toronto Social, Mobile, Analytics & Cloud Meet-Up’. Minor typing corrections made on 23rd February; slight change to advertised title. The major change is the addition of extra screenshots describing the –l parameter of the Aspera client that allows it to operate at high speed. Contact: mjminformatics@gmail.com (or secondary: michael.moorhouse@oicr.on.ca)
  • 2. Managing & Processing Big Data for Cancer Genomics, an Insight Into Bioinformatics ? ? ? ?
  • 3. About Me – Michael Moorhouse • 'Automation Engineer' at OICR in the 'Genomic Informatics' Team • Bioinformatician by training • British BSc Bioscience 1994 MSc Bioinformatics 1998 PhD Bioscience 2003 http://www.clipartpanda.com/ Oh, Canada! 2x Netherlands 1x UK ~10 years
  • 4. About me – Michael J Moorhouse IT/ComputerSci. Biology/Medicine Maths / Stats What we use Why… we care What we use Bioinformatician: Aka 'Computational Molecular Biology' or 'Bio-Computing'
  • 5. We use everything we can borrow from FOSS … • Rapid development • ‘Just Download it, try it’ • Google it if you need the manual / YouTube demo • Linux (a lot) • Though I’m still a Windoze guy (I love my CorelDraw!) • Eclipse / NetBeans • GNU – anything! • Perl/Python/Web Widgets/Apache
  • 6. My Day Job From: http://oicr.on.ca//files/public/OICRSlidedeck4February2015.pdf
  • 7. ‘Big Data’ at OICR (DNA Sequencing) Features: Large: 500GB per sequencer run, 10 000s of files ~40-80TB per database submission Variety: Raw scans, DNA sequences, ‘Biological’ Variation (aka: SNPs, CNVs, SSM), project data, sample information Velocity: created 5-8TB a week Complexity: different sequencing platforms, levels of analysis Meta Data Privacy / Confidentiality: Mostly of human origin Often linked to ‘clinical data’ (tumor, normal, survival) sometimes PHI (Personal Health Information)
  • 8. OICR Compute Resources ~8000+ processor nodes – i.e. typical
  • 9. Current Data Storage: SeqProdBio HDD Image from: http://www.wdbrand.com/images/products/img6/lores/wdfEnterprise_RESATA.jpg 2 such racks in room; 2 in other datacenters TB TB Copy 4 per night 2PB Total
  • 10. The world of Biology... • It's about life: − No 'manual' − Specification unclear − Highly varied − Very complex • Interaction even more complex • Always an exception − Molecular Level Storage • Digital − Self replicating Typical 'Simple' Pathway http://www.genome.jp/kegg-bin/show_pathway?map01010
  • 11. Biological Terms & Technology I'll Explain First - Briefly • DNA, RNA, Protein • Gene, Genome, Genome Reference, Genomic Location, Mutation / Variant (SNP), Annotation • DNA Sequencing, Alignment • Cancer, Tumour, Normal
  • 12. Triplet Code, Folding, Structure and Function • Good mapping from: DNA -> RNA -> Structure -> Function − Mostly! From: http://upload.wikimedia.org/wikipedia/commons/d/d4/RNA-codons.png and: http://www.rcsb.org/pdb/101/motm.do?momID=126
  • 13. An Aside: Computer Graphics • Rasmol – from 1993 • Then 'Doom' = OpenGL drivers Probably Bankrupted Silicon Graphics Inc! From my MSc Thesis, 1998
  • 14. DNA is packaged into Genes, Chromosomes and Genomes • From the Ensembl Genome Browser: 3,381,944,086 DNA Bases in the Human Genome Chromosome ~124 Mb (Mega Bases) The current 'Reference Genome'
  • 15. Genes & Genomic Locations • Gene is a tricky concept...but...: 'convenient conceptual unit of coordinated DNA features' − Usually make one protein 7:55020388-55257987 Chr7:55020388-55257987 Chromosome Start End And, yes, there is a name-space problem... This is Bioinformatics.
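The two spellings of the same location on slide 15 (`7:55020388-55257987` and `Chr7:55020388-55257987`) are the name-space problem in miniature. A minimal Python sketch of normalising them follows; the names `GenomicLocation` and `parse_location` are illustrative only and are not taken from BioPerl or Ensembl.

```python
import re
from typing import NamedTuple


class GenomicLocation(NamedTuple):
    """A 'Chromosome:Start-End' location, as used on the slides."""
    chromosome: str
    start: int
    end: int


# Optional 'chr'/'Chr' prefix, then the chromosome name, then start-end.
_REGION = re.compile(r"^(?:chr)?([0-9A-Za-z]+):(\d+)-(\d+)$", re.IGNORECASE)


def parse_location(text: str) -> GenomicLocation:
    """Parse 'Chr7:55020388-55257987' or '7:55020388-55257987'."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not a Chromosome:Start-End location: {text!r}")
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    # The prefix is dropped so that both spellings compare equal.
    return GenomicLocation(match.group(1), start, end)


with_prefix = parse_location("Chr7:55020388-55257987")
without_prefix = parse_location("7:55020388-55257987")
same_place = with_prefix == without_prefix
```

Dropping the prefix is only one possible convention; whichever is chosen, it has to match the naming used by the reference genome version in play.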
  • 16. BioGraphics (BioPerl) LocationTrack Feature Paradigm: ‘Feature on Track at Location’ (Chromosome:Start-End)
  • 17. And the Version does matter...V.37
  • 19. Ensembl a ‘Genome Browser’ • Displays genomic information: sequence and ‘annotation’ • Annotation ~= ‘Signpost’ a form of ‘Markup’ of interesting regions, what they mean. • Created in LAMP: Linux/Apache/MySQL/Perl • Now also a lot of Javascript on the Browser side • The modules for manipulation, drawing released as ‘Bioperl’
  • 21. DNA Sequencing = Data Explosion • The current dominant technology is ‘Sequencing By Synthesis’ patented & developed by Illumina • Originally ‘Solexa’ a British company (~2007) • Is essentially a microscope, a couple of lasers + detectors and coordinated chemical pumps • Other technologies: • Roche / 454; Thermo / Ion Torrent; Oxford Nanopore; PacBio 90%+
  • 22. General Workflow For A Sample • Extract physical sample • Easy if it is skin…otherwise: punch biopsy? • Extract the DNA • Sequence • Align to a ‘Reference Genome’ • Look for differences • Differences = causal, errors or general damage • It's like a ‘bomb’ blast in some cancers
  • 23. Actually: Tumor / Normal • DNA from a non-cancerous part of the patient is better than the ‘reference genome’ • Also it is available, if at double the cost HG19 Reference (composite) Normal Tumor Better Possible
  • 24. Small Device: NextSeq 500… 2ft or 60cm http://www.illumina.com/systems/nextseq-sequencer.html
  • 25. FlowCell: 8 ‘lanes’ Injection point From: http://www.illumina.com/systems/nextseq-sequencer.html
  • 26. …But Big Data • Big: lots of individual DNA sequences ‘reads’ generated • 800 000 sequences per run • Long: allows easier identification of common parts / disruptions • 300 nucleotides • The two combined give ‘High Coverage’ • Higher coverage =~ more confidence • Higher coverage =~ better detection of rare events in impure samples (as cancers tend to be)
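The 'number of reads × read length' idea on slide 26 can be written as the usual back-of-envelope estimate: mean coverage ≈ reads × read length / genome size. The sketch below is only that arithmetic. The 300-nucleotide read length and the 3,381,944,086-base genome size are the figures quoted on the slides; the read count and the 30x target are made-up inputs for illustration.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads covering each base, assuming an even spread."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest read count that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


HUMAN_GENOME_BASES = 3_381_944_086  # Ensembl figure quoted on slide 14
READ_LENGTH = 300                   # nucleotides, from slide 26

# Hypothetical inputs, chosen only to exercise the functions.
example_depth = mean_coverage(400_000_000, READ_LENGTH, HUMAN_GENOME_BASES)
example_reads = reads_needed(30, READ_LENGTH, HUMAN_GENOME_BASES)
```

Real coverage is uneven, so this is a planning number rather than a guarantee for any one position.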
  • 27. Vendor Supplied Software • Generally, terrible! • As in truly awful… • Equipment manufacturers cannot seem to make good software. • Illumina is the exception • Finally: the Solexa GAP was Unix, CLI and used ‘GNU make’
  • 28. Basic Process: Data Flow CASAVA
  • 29. Better now, but how it was… Why these were removed!
  • 30. …as it is now • Images not produced: ~10TB? • Base Calls: 351 GB • Alignments (BAM): 25GB • VCF: <200 MB
  • 31. Illumina – the Company +450% in 5 years P/E= 82
  • 32. Read / Sequence Alignment • Simply (?) • Find the best match for each sequenced read on the reference genome • Essentially a pattern matching problem • Done properly, with a ‘dynamic programming’ algorithm (circa 1970), the cost grows with read length × reference length: far too slow for millions of reads against a whole genome • Actually, by modern standards there is very little ‘dynamic’ about it… • Hence, many aligners with their own ‘tricks’ • Typically, a lookup table or a fast indexing of ‘k-mers’ or trade something for something
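The 'fast indexing of k-mers' trick from slide 32 can be sketched in a few lines of Python. This is a toy seed-and-vote lookup for illustration, not how NovoAlign or BWA are actually implemented; the function names and the tiny reference string are made up.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read: str, index: dict, k: int) -> list:
    """Rank possible read start positions by how many k-mers agree on them."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            # A k-mer found at ref_pos implies the read starts 'offset' earlier.
            votes[ref_pos - offset] += 1
    # Most votes first; ties broken by position.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


toy_reference = "ACGTTGCAACGGTACCTGA"   # made-up sequence
toy_index = build_kmer_index(toy_reference, k=4)
toy_hits = candidate_positions(toy_reference[5:13], toy_index, k=4)
```

Only the top few candidates would then be checked with the slower, exact comparison, which is the 'trade something for something' part: a read with errors in every k-mer gets no votes at all.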
  • 33. We use NovoAlign or BWA http://bio-bwa.sourceforge.net/
  • 35. DNA – Mutation to Cancer (I) • So it is complex...but in three slides: From: http://upload.wikimedia.org/wikipedia/commons/7/73/Cancer_requires_multiple_mutations_from_NIHen.png Normal cells undergo multiple uncorrected DNA mutations (changes) Many mutations are corrected properly: copied from the other strand!
  • 36. DNA – Mutation to Cancer (II) From: ‘The hallmarks of cancer’: Cell 2001, http://www.ncbi.nlm.nih.gov/pubmed/10647931 Multiple mutations needed to cause Cancer: 6-8 ones in key genes
  • 37. DNA – Mutation to Cancer (III) • Metastasis = ‘Whack-A-Mole’ • With whacking molecules, radiation, viruses, immune stimulants… "Rare odditity (2060587599)" by hawken king - rare odditity. Licensed under CC BY 2.0 via Wikimedia Commons – http://commons.wikimedia.org/wiki/File:Rare_odditity_(2060587599).jpg#mediaviewer/File:Rare_odditity_(2060587599).jpg
  • 39. Representation of DNA Sequence Data • ASCII text is terrible: 4 nucleotides (ATGC) • 4 states in a byte = 2 bits − So 6 wasted • Also pattern rich: the same motifs appear many times − Hence standard compression (zlib, gzip) works well • We do a lot of indexing − Also much analysis done 'by chromosome' • Easy parallelisation!
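The '2 bits used, 6 wasted' point on slide 39 can be made concrete by packing four bases into each byte. This is a minimal sketch with made-up function names; it handles only A, C, G and T, so ambiguity codes such as N would need a separate scheme.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack(sequence: str) -> bytes:
    """Store each base in 2 bits, four bases per byte."""
    packed = bytearray((len(sequence) + 3) // 4)
    for i, base in enumerate(sequence.upper()):
        packed[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(packed)


def unpack(packed: bytes, length: int) -> str:
    """Recover the sequence; the length is needed because the last byte may be padded."""
    return "".join(
        _BASE[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )


sequence = "GATTACA"
round_trip = unpack(pack(sequence), len(sequence))
```

Packed this way a sequence takes a quarter of its ASCII size before any general-purpose compression is applied.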
  • 40. FASTA / Q / Z FASTA FASTQ FASTZ to ~ 30%
  • 41. Common files – all ASCII • FASTA (FASTQZ) • (Q=With Quality Scores, Z=Zip) • SAM (BAM) • Sequence Alignment and Map Format (B=Binary, Compressed) • 1 or 2 lines per read (800 Million Lines) • VCF • Variant Calling Format • 1 or 2 lines per variant (100 K Lines)
  • 42. SAM / BAM Simple – but comprehensive http://samtools.github.io/hts-specs/SAMv1.pdf Header Alignment, 1 or 2 per read: 800 000 lines Normally they are much, much simpler than this!
  • 44. Output Files – ELAND Alignment Output files are large (~1Gb each) and have ~10 million lines Are machine / human readable (still) – see below NB: variant of this format has individual base quality scores Base Calls (Bustard) Sequence mapping based on GenomePhysical Position (Firecrest)
  • 45. VCF – Variant Calling Format • 3 samples (end columns), 4 variants
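Reading a VCF data line like those on slide 45 comes down to splitting tab-separated columns: eight fixed ones (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), then FORMAT, then one column per sample. The sketch below assumes that layout; the sample names and the example line are invented, and a real pipeline would normally use an existing VCF library rather than this.

```python
def parse_vcf_line(line: str, sample_names: list) -> dict:
    """Split one VCF data line into its fixed fields and per-sample values."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    record = {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),   # several alternative alleles are comma-separated
        "qual": qual,
        "filter": filt,
        "info": info,
        "samples": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:DP
        for name, value in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, value.split(":")))
    return record


# Invented example: one variant, three samples (the 'end columns').
example_line = "7\t55020500\t.\tA\tG\t50\tPASS\tDP=30\tGT:DP\t0/1:12\t0/0:10\t1/1:8\n"
example_record = parse_vcf_line(example_line, ["S1", "S2", "S3"])
```

The sample names normally come from the `#CHROM` header line that precedes the data lines.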
  • 47. Cloud Computing – (i.e. GNOS) Map image from: http://pixabay.com/en/world-map-map-world-black-earth-297446/ Cloud Repo. Centres Data is deposited Analysis by different groups
  • 49. ..or as a picture https://seqware.github.io/about/
  • 50. Or: Arvados Has some interesting features regarding ‘auto re-run’ of workflows on file change
  • 51. Illumina has a cloud too! • Called ‘BaseSpace’ • Direct connect from Sequencer • Illumina SLA • (No good for us) • ‘One-Way’ trip: can’t deposit data elsewhere. • (No good for us) • But it is free • For the odd TB (not the 2PB we have) Now with Apps! Ok, it is great if you are a small lab, doing standard things
  • 54. Sharing: Part of the Scientific Method • To share to improve analysis • Allow experts to process the data in novel ways • To support published conclusions • ‘I ain’t making this stuff up…’ • Allows ‘meta-analysis’ • Bigger is better • Allows International Consortia
  • 55. The EGA European Genome-phenome Archive From: https://www.ebi.ac.uk/ega/about
  • 57. General Arrangement Goal is to transfer files for this evening - ignore the metadata needed for them
  • 59. FTP Server is Slow, but Functional
  • 61. Aspera not TCP in the Real World http://asperasoft.com/technology/transport/fasp/#faspsolution-464
  • 62. Typical aspera download command ./ascp -C1:2 -l500M -QT /var/tmp/test.10GB.rand ega-box-358@fasp.ega.ebi.ac.uk:
  • 65. Aspera: no –l parameter
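Why the transfer rate matters can be shown with simple arithmetic. The command on slide 62 uses `-l500M` with a 10GB test file; assuming that value means a target of 500 megabits per second (the slides do not spell out the unit), the best-case transfer time is size × 8 / rate. The function name and the decimal units (1 GB = 10^9 bytes) are choices made for this sketch.

```python
def transfer_seconds(size_bytes: float, rate_bits_per_second: float) -> float:
    """Best-case time to move a file at a sustained rate, ignoring overheads."""
    return size_bytes * 8 / rate_bits_per_second


GB = 10**9   # decimal gigabyte, an assumption of this sketch
PB = 10**15  # decimal petabyte

ASSUMED_RATE = 500 * 10**6  # 500 megabits per second, assumed meaning of -l500M

test_file_seconds = transfer_seconds(10 * GB, ASSUMED_RATE)
archive_days = transfer_seconds(2 * PB, ASSUMED_RATE) / 86_400
```

Under these assumptions the 10GB test file takes a few minutes, while the 2PB total mentioned earlier would take roughly a year at the same sustained rate, which is why the achieved rate matters so much.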