An overview of genomic epidemiology, Canada's IRIDA project for genome-based outbreak investigation, and a breathless romp through the awesome potential of the MinION
7.
VIRUS: Influenza A
RNA genome (14,000 nucleotides)
Eight segments
(Image: Tao and Zheng, Science 2012)
BACTERIUM: S. Typhi CT18
DNA genome (~5,100,000 nucleotides)
One chromosome + two plasmids
Science (2001)
13. MiSeq projects at Dalhousie
• Bedford Basin microbial monitoring
• Pediatric Crohn’s disease samples
• Global microbial air sampling
• Mink genomes
• Sequencing Lactobacillus genomes from the poop of old mice
• Wastewater diversity and function in the Arctic
• Verifying ingredients in dog food
• Exercise and the Microbiome
14. Integrated Rapid Infectious Disease Analysis
www.irida.ca
1.56M, 3-year Genome Canada Large-Scale Applied
Platform Grant
SFU / BCCDC / PHAC-NML / Dalhousie
DNA sequencing and downstream applications
• data management / federation
• analysis workflows
• ontologies
• APIs
• 3rd-party applications
Implementation in provincial public health labs
Training
16. Ontologies and data standards
NCBI, MIxS, vegetables
Metadata
Data provenance
Data quality
Environmental information
17. Data sharing!
• BIG challenges – different jurisdictions, “ownership” of epi data. Privacy!
• Health service providers – concerns about privacy and data breach
• Technology outstrips policy
• What digital records could we get TODAY?
• Canada lagging in data sharing
18. Calling isolates based on genetic variation
Traditional:
Pulsed-field gel electrophoresis (PFGE)
Multi-locus sequence typing (MLST; standards! mlst.net)
Whole genomes:
Lots of information!
Too much information!
Lots of filtering and quality control required
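The contrast between a handful of MLST loci and a whole-genome comparison can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the function names, locus names, allele numbers and toy sequences are all made up, and the only "filtering" shown is skipping ambiguous (N) and gap (-) positions.

```python
from itertools import combinations


def mlst_differences(profile_a, profile_b):
    """Count loci at which two MLST allele profiles differ."""
    if profile_a.keys() != profile_b.keys():
        raise ValueError("profiles must cover the same loci")
    return sum(profile_a[locus] != profile_b[locus] for locus in profile_a)


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count differing positions in two aligned sequences.

    Positions where either sequence has an ambiguous base or a gap
    are skipped, as a stand-in for real quality filtering.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        a != b
        for a, b in zip(seq_a, seq_b)
        if a not in ignore and b not in ignore
    )


def pairwise_snp_distances(alignment):
    """Return {(name_x, name_y): distance} for every pair of isolates."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical data: generic locus names and toy 8-base "genomes".
profile_1 = {"locus_1": 10, "locus_2": 7, "locus_3": 12}
profile_2 = {"locus_1": 10, "locus_2": 7, "locus_3": 12}
alignment = {
    "iso1": "ACGTACGT",
    "iso2": "ACGTACGA",
    "iso3": "ACNTTCGA",
}

print(mlst_differences(profile_1, profile_2))
print(pairwise_snp_distances(alignment))
```

In this toy case the two MLST profiles are identical while the aligned sequences still differ, which is the sense in which whole genomes carry "lots of information" that then has to be filtered.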
19.
Workflow management
REST-like API (third-party applications)
Security: authentication / authorization
Data models & implementation
26. Full Privileges
Cluster  Line List ID  Patient Name   Prov. Health No.  Age  Sex  Location   Sample ID  Collection Date  Culture Result
A        1             John Smith     4513253244        26   M    Vancouver  F14231     14/03/21         Salmonella sp.
A        2             Sally Smith    4519567458        24   F    Vancouver  F14235     14/03/21         Salmonella sp.
B        3             Tom Jones      4517543216        35   M    Vancouver  M6542      14/03/24         Salmonella sp.
B        4             Helen Jones    9856321124        35   F    Vancouver  S1245      14/03/22         Salmonella sp.
C        5             Jennifer Lee   4516853122        29   F    Vancouver  S5642      14/03/22         Salmonella sp.
C        6             Michael Brown  9456534561        45   M    Victoria   T68954     14/03/25         Salmonella sp.
(Figure: Phylogenetic Tree, Genetic Distance)
27. Limited Privileges
Cluster  Line List ID  Patient Name   Prov. Health No.  Age  Sex  Location   Sample ID  Collection Date  Culture Result
A        1             John Smith     4513253244        26   M    Vancouver  F14231     14/03/21         Salmonella sp.
A        2             Sally Smith    4519567458        24   F    Vancouver  F14235     14/03/21         Salmonella sp.
B        3             Tom Jones      4517543216        35   M    Vancouver  M6542      14/03/24         Salmonella sp.
B        4             Helen Jones    9856321124        35   F    Vancouver  S1245      14/03/22         Salmonella sp.
C        5             Jennifer Lee   4516853122        29   F    Vancouver  S5642      14/03/22         Salmonella sp.
C        6             Michael Brown  9456534561        45   M    Victoria   T68954     14/03/25         Salmonella sp.
(Figure: Phylogenetic Tree, Genetic Distance)
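The difference between a full-privilege and a limited-privilege view of a line list could be expressed as a per-role field filter. The sketch below is only an assumption about how such a filter might look: the slide text does not say which columns a limited user loses, so the choice of patient name and provincial health number as the hidden fields, and all names in the code, are hypothetical.

```python
# Assumption: these are the fields a limited-privilege user may not see.
IDENTIFYING_FIELDS = {"patient_name", "prov_health_no"}


def view_for(record, privilege):
    """Return the part of a line-list record visible at a privilege level."""
    if privilege == "full":
        return dict(record)
    if privilege == "limited":
        return {
            key: value
            for key, value in record.items()
            if key not in IDENTIFYING_FIELDS
        }
    raise ValueError(f"unknown privilege level: {privilege!r}")


# First row of the example line list on the slide.
row = {
    "cluster": "A",
    "line_list_id": 1,
    "patient_name": "John Smith",
    "prov_health_no": "4513253244",
    "age": 26,
    "sex": "M",
    "location": "Vancouver",
    "sample_id": "F14231",
    "collection_date": "14/03/21",
    "culture_result": "Salmonella sp.",
}

print(view_for(row, "limited"))
```

Filtering on the server side, before the record leaves the installation that owns it, means a limited user never receives the identifying fields at all.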
30. Public Health England project
(>10,000 Salmonella so far)
• As of 2015, sequencing every sampled Salmonella isolate collected in England
• Over 10,000 sequenced to date
• 8,000 already available for download in the public databases
32. What’s next?
??? per run
$900 / run, 6 hours
Huge pieces (max so far – 200-300 kilobases)
Can stop / restart using same disposable flowcell
2015: Oxford Nanopore MinION
15 cm (-ish)
thehightechsociety.com
33. Quick et al. (2015)
“Using a novel streaming phylogenetic
placement method samples can be
assigned to a serotype in 40 minutes and
determined to be part of the outbreak in less
than 2 h.”
36. Challenges
• Sample extraction: getting DNA from stuff
• Clinical-grade evaluation
• Training
• Equipment reliability
• Sequencing errors
• Quality of reference data / attribution algorithms
• Database updates in real time
• Ethics / privacy (Genomes Sequenced While U Wait)
38. Acknowledgements
PIs
Fiona Brinkman – SFU
Will Hsiao – PHMRL
Gary Van Domselaar – NML
Morag Graham – NML
Rob Beiko – Dalhousie
University of Lisbon
João Carriço
National Microbiology Laboratory (NML)
Franklin Bristow
Aaron Petkau
Thomas Matthews
Josh Adam
Adam Olsen
Tara Lynch
Shaun Tyler
Philip Mabon
Philip Au
Celine Nadon
Matthew Stuart-Edwards
Chrystal Berry
Lorelee Tschetter
Laboratory for Foodborne Zoonoses (LFZ)
Eduardo Taboada
Peter Kruczkiewicz
Chad Laing
Vic Gannon
Matthew Whiteside
Ross Duncan
Steven Mutschall
Simon Fraser University (SFU)
Melanie Courtot
Emma Griffiths
Geoff Winsor
Julie Shay
Matthew Laird
Bhav Dhillon
Raymond Lo
BC Public Health Microbiology &
Reference Laboratory (PHMRL) and BC
Centre for Disease Control (BCCDC)
Judy Isaac-Renton
Patrick Tang
Natalie Prystajecky
Jennifer Gardy
Damion Dooley
Linda Hoang
Kim MacDonald
Yin Chang
Eleni Galanis
Marsha Taylor
Cletus D’Souza
Ana Paccagnella
University of Maryland
Lynn Schriml
Canadian Food Inspection Agency (CFIA)
Burton Blais
Catherine Carrillo
Dominic Lambert
Dalhousie University
Alex Keddy
McMaster University
Andrew McArthur
Daim Sardar
European Nucleotide Archive
Guy Cochrane
Petra ten Hoopen
Clara Amid
European Food Safety Authority
Ernesto Leibana Criado
Francesco Vernazza
Valentina Rizzi
40. Materials to be available on http://bioinformatics.ca/
June 24-26, 2015
41. The Bioinformatics Exam of the Future
tagc.com.au
commons.wikimedia.org/wiki/File:DNA_ahelatest_moodustunud_niit_katsuti_korgil..JPG
http://omicfrontiers.com/2014/06/11/diaryofaminion_part2/
42. 2009 was a long time ago
J. Craig Venter Institute
43. Photo credit: Emma Allen-Vercoe
Some slides courtesy of Gary Van Domselaar, NML
Editor's notes
The central issue facing bioinformaticians today can be summed up quite nicely with this graph charting the cost of generating biological sequencing data and the associated cost of computing this data.
The white line at the top represents Moore’s law, which describes the long-term trend towards decreased computing cost over time. It is named after Gordon Moore, a co-founder of Intel, who first described the trend over 50 years ago. It derives from the observation that the number of components that can be crammed into an integrated circuit, like a CPU, approximately doubles every year to a year and a half, which translates into the cost of computing decreasing by half over the same time period. The trend has held steady for five decades and is expected to continue at this rate for at least another 10 years.
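As a rough worked example of what "halving every year to a year and a half" implies, the relative cost after t years with halving period d is 0.5^(t/d). The snippet below is a back-of-the-envelope sketch only; the function name is made up, and the 10-year horizon is simply the figure mentioned in the notes.

```python
def relative_cost(years, halving_period_years):
    """Fraction of the starting cost left after `years` of steady halving."""
    return 0.5 ** (years / halving_period_years)


# Ten years of halving every 1.0 and every 1.5 years.
for period in (1.0, 1.5):
    print(f"halving every {period} years: {relative_cost(10, period):.4f}")
```

Halving every year leaves roughly a thousandth of the starting cost after a decade, and halving every year and a half leaves roughly a hundredth, so the two ends of the quoted range differ by about a factor of ten over that span.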
The cost of generating biological sequence data approximately followed this same trend until next-generation sequencing technology was introduced near the end of 2005 and widely adopted in biotechnology over the following two years. From then on the cost of generating biological sequence data has fallen dramatically, far faster than Moore’s law, to the point that today any microbiology lab can afford to routinely generate the sequences of the organisms that they study.
This drastic reduction in the cost and time to generate biological sequence data stands to revolutionize public health research, and Morag’s presentation provides some nice examples of this. Biologists look at this line and rejoice, but bioinformatics scientists look at the gap between this line and this line and, well, we panic.
When a user requests the samples that are available for a project, the local installation queries its own storage for what it holds, then also goes out to the remote APIs and asks what they can provide.
Each remote API decides from the request what the user has permission to see, and returns only those samples to the caller.
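The request flow in these notes can be sketched as follows. This is a toy in-process model under stated assumptions, not IRIDA's actual API: the `Installation` class, its methods, the per-sample permission sets and the lab names are all hypothetical, and a direct method call stands in for what would be an authenticated HTTP request to a remote REST-like API.

```python
from dataclasses import dataclass, field


@dataclass
class Installation:
    name: str
    # project -> {sample_id: set of user names allowed to see it}
    samples: dict
    remotes: list = field(default_factory=list)

    def permitted_samples(self, user, project):
        """Samples this installation is willing to show `user`."""
        project_samples = self.samples.get(project, {})
        return sorted(
            sample_id
            for sample_id, allowed_users in project_samples.items()
            if user in allowed_users
        )

    def list_samples(self, user, project):
        """Local samples first, then whatever each remote returns."""
        results = [
            (self.name, sample_id)
            for sample_id in self.permitted_samples(user, project)
        ]
        for remote in self.remotes:
            # Each remote applies its own permission check.
            results += [
                (remote.name, sample_id)
                for sample_id in remote.permitted_samples(user, project)
            ]
        return results


remote_lab = Installation(
    "remote-lab",
    {"outbreak-1": {"S5642": {"alice"}, "T68954": {"bob"}}},
)
local_lab = Installation(
    "local-lab",
    {"outbreak-1": {"F14231": {"alice", "bob"}}},
    remotes=[remote_lab],
)

print(local_lab.list_samples("alice", "outbreak-1"))
```

The point of the design is that the permission decision stays with the installation that owns the data: the caller only ever sees what each remote chose to return.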