SlideShare a Scribd company logo
1 of 32
Download to read offline
Building bioinformatics resources
for the global community
James Pettengill
james.pettengill@fda.hhs.gov
Biostatistics and Bioinformatics Staff
Office of Analytics and Outreach
FDA Center for Food Safety and Applied Nutrition
GMI9
May 24, 2016
Rome, Italy
CFSAN’s open-access peer reviewed methods for analyzing and differentiating
among samples based on WGS data.
Submitted 16 April 2014
Accepted 23 September 2014
Published 14 October 2014
Corresponding author
Errol Strain,
Errol.Strain@fda.hhs.gov
Academic editor
Keith Crandall
Additional Information and
Declarations can be found on
page 21
DOI 10.7717/peerj.620
An evaluation of alternative methods for
constructing phylogenies from whole
genome sequence data: a case study with
Salmonella
James B. Pettengill, Yan Luo, Steven Davis, Yi Chen,
Narjol Gonzalez-Escalona, Andrea Ottesen, Hugh Rand,
Marc W. Allard and Errol Strain
Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park,
MD, USA
ABSTRACT
Comparative genomics based on whole genome sequencing (WGS) is increasingly
being applied to investigate questions within evolutionary and molecular biology,
as well as questions concerning public health (e.g., pathogen outbreaks). Given the
impact that conclusions derived from such analyses may have, we have evaluated
the robustness of clustering individuals based on WGS data to three key factors: (1)
next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and
SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism)
matrix (reference-based and reference-free), and (3) phylogenetic inference method
(FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole
genome sequences representing 107 unique Salmonella enterica subsp. enterica ser.
Montevideo strains. Reference-based approaches for identifying SNPs produced trees
that were significantly more similar to one another than those produced under the
reference-free approach. Topologies inferred using a core matrix (i.e., no missing
data) were significantly more discordant than those inferred using a non-core matrix
that allows for some missing data. However, allowing for too much missing data likely
results in a high false discovery rate of SNPs. When analyzing the same SNP matrix,
we observed that the more thorough inference methods implemented in GARLI and
RAxML produced more similar topologies than FastTreeMP. Our results also confirm
that reproducibility varies among NGS platforms where the MiSeq had the lowest
number of pairwise diVerences among replicate runs. Our investigation into the ro-
bustness of clustering patterns illustrates the importance of carefully considering how
data from diVerent platforms are combined and analyzed. We found clear diVerences
in the topologies inferred, and certain methods performed significantly better than
others for discriminating between the highly clonal organisms investigated here. The
methods supported by our results represent a preliminary set of guidelines and a
step towards developing validated standards for clustering based on whole genome
sequence data.
Real-time pathogen detection in the era of whole-genome
sequencing and big data: K-mer and site-based methods for
inferring the distances among tens of thousands of
Salmonella samples
James Pettengill
james.pettengill@fda.hhs.gov
Biostatistics and Bioinformatics Staff
Office of Analytics and Outreach
FDA Center for Food Safety and Applied Nutrition
GMI9
May 24, 2016
Rome, Italy
•  The adoption of whole-genome sequencing within the public health realm has
resulted in large databases being populated in real-time.
Premise/Background of the project
•  The adoption of whole-genome sequencing within the public health realm has
resulted in large databases being populated in real-time.
•  These databases contain 60,000+ samples and are expected to grow to hundreds of
thousands within a few years.
Premise/Background of the project
•  The adoption of whole-genome sequencing within the public health realm has
resulted in large databases being populated in real-time.
•  These databases contain 60,000+ samples and are expected to grow to hundreds of
thousands within a few years.
•  For these databases to be of optimal use one must be able to quickly interrogate them
to accurately determine the genomic distances among a set of samples.
Premise/Background of the project
•  The adoption of whole-genome sequencing within the public health realm has
resulted in large databases being populated in real-time.
•  These databases contain 60,000+ samples and are expected to grow to hundreds of
thousands within a few years.
•  For these databases to be of optimal use one must be able to quickly interrogate them
to accurately determine the genomic distances among a set of samples.
•  Being able to do so is challenging due to both biological (evolutionary diverse
samples) and computational (petabytes of sequence data) issues.
Premise/Background of the project
•  The adoption of whole-genome sequencing within the public health realm has
resulted in large databases being populated in real-time.
•  These databases contain 60,000+ samples and are expected to grow to hundreds of
thousands within a few years.
•  For these databases to be of optimal use one must be able to quickly interrogate them
to accurately determine the genomic distances among a set of samples.
•  Being able to do so is challenging due to both biological (evolutionary diverse
samples) and computational (petabytes of sequence data) issues.
•  Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean,
Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and
whole-genome multi-locus sequence typing (wgMLST))
Premise/Background of the project
•  The adoption of whole-genome sequencing within the public health realm has
resulted in large databases being populated in real-time.
•  These databases contain 60,000+ samples and are expected to grow to hundreds of
thousands within a few years.
•  For these databases to be of optimal use one must be able to quickly interrogate them
to accurately determine the genomic distances among a set of samples.
•  Being able to do so is challenging due to both biological (evolutionary diverse
samples) and computational (petabytes of sequence data) issues.
•  Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean,
Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and
multi-locus sequence typing (MLST))
•  Empirical data: whole-genome sequence data from 18,997 Salmonella isolates
Premise/Background of the project
NutButter Outbreak
?
http://www.cdc.gov/salmonella/braenderup-08-14/index.html
NCBI GenomeTrakr Tree
Efficient method
inter-category comparisons
intra-category comparisons
genetic distances
Experimental design: based on a classification scheme determine how
well each distance measure performs
#
Inefficient method
genetic distances
#
Experimental design:
Simulated data:
Experimental design:
Empirical data:
•  Analyze different distance methods on de novo assemblies of all Salmonella
samples in GenomeTrakr
•  Use serovar as the classification scheme
Efficient method
inter-enteritidis comparisons
intra-enteritidis comparisons
genetic distances
#
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
Assembly workflow:
Obtain latest metadata file
from NCBI pathogen database
Assembly workflow:
Obtain latest metadata file
from NCBI pathogen database
Parse metadata and download raw data
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
Assembly workflow:
Obtain latest metadata file
from NCBI pathogen database
Parse metadata and download raw data
Quality filter using fastx toolkit
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
Assembly workflow:
Obtain latest metadata file
from NCBI pathogen database
Parse metadata and download raw data
Quality filter using fastx toolkit
Taxonomic/contamination filtering
using Kraken with custom db
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
Assembly workflow:
Obtain latest metadata file
from NCBI pathogen database
Parse metadata and download raw data
Quality filter using fastx toolkit
Taxonomic/contamination filtering
using Kraken with custom db
Assembly using
SPAdes
Experimental design:
Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
1.  Obtain an assembly for each sample within GenomeTrakr
•  Use pilot of cloud computing to accomplish assemblies – “cloudbursting”
Summary! –! We! have! successfully!completed! running! Use! Cases! 2! and! 3! on! AWS!
servers!via!the!CycleCloud!platform.!Even!without!time!for!extensive!optimization!of!
the!clusters,!we!were!able!to!complete!the!Use!Cases!rapidly!and!inexpensively.!!
!
!
!
Use)Case)2)–))Listeria)Isolates)
)
! A!workflow!was!designed!to!analyze!sequencing!data!from!all!of!the!publicly!
available! Listeria! isolates! (3645)! collected! by! the! GenomeTrackr! network.! This!
workflow! involves! downloading! data! from! the! NCBI! servers,! trimming! the!
sequencing!reads!based!on!quality!scores,!filtering!the!reads!based!on!quality!and!
taxonomy,!and!assembling!the!reads!into!contiguous!genome!segments.!The!results!
of! this! workflow! will! allow! us! to! improve! our! methods! of! identifying! outbreak!
isolates.!
! !
1.!!Cluster!Specs!–!!
! Max!cores! ! 4000!
! Max!parallel!jobs! 1000!
! Master!node! ! i2.4xlarge!
! Compute!nodes!! r3.2xlarge,!r3.4xlarge!
!
2.!Results!–!
! Jobs! ! ! 3645!
! Run!time! ! 8)hours!!
! Job!completion!rate! 99.8%!
! Approximate!cost! $1800.00!
!
3.!Additional!Notes!–!
! Local!runtime!! ! 3.5)days!
! Feasible!to!run!locally! YES!
! Anticipated!frequency! once/quarter!
! Estimated!yearly!cost! !$9000.00!
*!Assuming!the!current!growth!rate!of!this!dataset,!we!estimate!900!additional!samples!per!
quarter!for!the!next!year.!
!
3,645 Listeria assemblies
!
Use)Case)3)–))Salmonella)Isolates)
)
! Our!revised!Use!Case!3!applies!the!workflow!described!
publicly! available! Salmonella! isolates! (25765)! collected! by!
network.!The!analysis!of!this!dataset!is!much!more!difficult!due!
size!and!a!much!larger!number!of!isolates!and!is!not!feasible!o
resources.!
!
1.!!Cluster!Specs!–!!
! Max!cores! ! 12000!
! Max!parallel!jobs! 3000!
! Master!node! ! i2.4xlarge!
! Compute!nodes!! r3.2xlarge,!r3.4xlarge,!r3.8xlarge!
!
2.!Results!–!
! Jobs! ! ! 25765!
! Run)time! ! 20)hours!!
! Job!completion!rate! 99.1%!
! Approximate!cost! $8000.00!
!
3.!Additional!Notes!–!
! Estimated)local)runtime! 23)days!
! Feasible)to)run)locally! NO!
! Anticipated!frequency! once/quarter!
! Estimated!yearly!cost! !$56000.00!
*!Assuming!the!current!growth!rate!of!this!dataset,!we!estimate!12,50
per!quarter!for!the!next!year.!
!
25,765 Salmonella assemblies
Site-based:
Sample1: ACCTAGTACC
Sample2: ACGTACTACC
Requires statements about homology/
sequence alignment
Kmer-based (L = 9):
Sample1: ACCTAGTACC
kmer1: ACCTAGTAC
kmer2: CCTAGTACC
Sample2: ACGTACTACC
kmer1: ACGTACTAC
kmer2: CGTACTACC
Fast but loss/oversimplification of
information
Similarity = 0.8
Similarity = 0
Experimental design:
Distance measures
Summary of methods used to infer the relationships among samples.
Class Method Description Exec. time (s)
Site-based
Nucmer§
Pairwise genome alignment using suffix arrays 11.9
wgMLST¶
Gene based approach 46.95
K-mer
based
Jaccard Index§ The intersection divided by the union of all K-mers found between two
samples
9.4
Manhattan Distance§ Sum of the absolute differences between the abundance of each K-mer
present between two samples
45.1
Euclidean Distance§ The square root of the sum of square of all pairwise differences in K-
mer abundance
44.2
Mash Distance
MinHash (Broder 1998) technique to reduce genomes to sketches and
estimates a novel evolutionary distance metric among them
1.2
Mash Jaccard Distance
The Jaccard Distance (as described above) but based on the sketch size
(e.g., the number of hashes)
1.2
§
Performed using de novo assemblies and requires k-mer indexing, which with jellyfish takes 7.4s (0.8) per sample (2.1
days for 25,000 samples)
¶
Requires a reference genome
Classification of simulated data:
ROC curves identical across
different distance methods
* Simulated data is not complex/
noisy enough
Summary/Implications:
•  There are features (e.g., genomic, assembly, and contamination) that cause
k-mer based methods to fail to accurately capture the distance between
samples.
•  Treating absent data as informative may be problematic
Summary/Implications:
•  There are features (e.g., genomic, assembly, and contamination) that cause
k-mer based methods to fail to accurately capture the distance between
samples.
•  Treating absent data as informative may be problematic
•  Site-based methods, like NUCmer and MLST, tended to be superior in
performance
Summary/Implications:
•  There are features (e.g., genomic, assembly, and contamination) that cause
k-mer based methods to fail to accurately capture the distance between
samples.
•  Treating absent data as informative may be problematic
•  Site-based methods, like NUCmer and MLST, tended to be superior in
performance
•  Accessing the computing resources necessary to perform site-based
methods may be challenging when analyzing large databases.
Summary/Implications:
•  There are features (e.g., genomic, assembly, and contamination) that cause
k-mer based methods to fail to accurately capture the distance between
samples.
•  Treating absent data as informative may be problematic
•  Site-based methods, like NUCmer and MLST, tended to be superior in
performance
•  Accessing the computing resources necessary to perform site-based
methods may be challenging when analyzing large databases.
•  If working with k-mer distances err on the side of false positives
•  And have high quality assemblies
Acknowledgements
FDA
•  Center for Food Safety and Applied Nutrition
•  Biostats/Bioinformatics staff – J. Baugher, H. Rand,
J. Miller, Y. Luo, S. Davis, E. Strain
•  Center for Veterinary Medicine
•  Office of Regulatory Affairs
National Institutes of Health
•  National Center for Biotechnology Information
State Health and University Labs
•  Alaska
•  Arizona
•  California
•  Florida
•  Hawaii
•  Maryland
•  Minnesota
•  New Mexico
•  New York
•  South Dakota
•  Texas
•  Virginia
•  Washington
USDA/FSIS
•  Eastern Laboratory
CDC
•  Enteric Diseases Laboratory
•  INEI-ANLIS “Carolos Malbran Institute,”
Argentina
•  Centre for Food Safety, University College
Dublin, Ireland
•  Food Environmental Research Agency, UK
•  Public Health England, UK
•  WHO
•  Illumina
•  Pac Bio
•  CLC Bio
•  Other independent collaborators
•  False negatives are primarily
due to failure to meet
consensus frequency
threshold
ConsensusFrequency<0.9
Coverage<8
X20x_Coverage
X100x_Coverage
X20x_Coverage
X100x_Coverage
0 1000 2000 3000 4000 5000
value
variable
variable
X20x_Coverage
X100x_Coverage
Validation exercise key findings:
Number of false negatives
•  False negatives are not random
across the genome
Validation exercise of CFSAN SNP Pipeline key findings:
•  100× dataset
•  Recovered 98.9% of the introduced SNPs
•  False positive rate of 1.04 × 10−6
•  20× dataset
•  Recovered 98.8% of SNPs
•  False positive rate of 8.34 × 10−7

More Related Content

What's hot

Haendel clingenetics.3.14.14
Haendel clingenetics.3.14.14Haendel clingenetics.3.14.14
Haendel clingenetics.3.14.14
mhaendel
 

What's hot (20)

Next Generation Sequencing
Next Generation SequencingNext Generation Sequencing
Next Generation Sequencing
 
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
Next Generation Sequencing for Identification and Subtyping of Foodborne Pat...
 
20170209 ngs for_cancer_genomics_101
20170209 ngs for_cancer_genomics_10120170209 ngs for_cancer_genomics_101
20170209 ngs for_cancer_genomics_101
 
Next generation sequencing in preimplantation genetic screening (NGS in PGS)
Next generation sequencing in preimplantation genetic screening (NGS in PGS)Next generation sequencing in preimplantation genetic screening (NGS in PGS)
Next generation sequencing in preimplantation genetic screening (NGS in PGS)
 
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
Application of Whole Genome Sequencing in the infectious disease’ in vitro di...
 
Bioinformatics as a tool for understanding carcinogenesis
Bioinformatics as a tool for understanding carcinogenesisBioinformatics as a tool for understanding carcinogenesis
Bioinformatics as a tool for understanding carcinogenesis
 
Errors and Limitaions of Next Generation Sequencing
Errors and Limitaions of Next Generation SequencingErrors and Limitaions of Next Generation Sequencing
Errors and Limitaions of Next Generation Sequencing
 
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
Building a Community Cyberinfrastructure to Support Marine Microbial Ecology ...
 
Haendel clingenetics.3.14.14
Haendel clingenetics.3.14.14Haendel clingenetics.3.14.14
Haendel clingenetics.3.14.14
 
Pattemore 2015
Pattemore 2015Pattemore 2015
Pattemore 2015
 
Advancing the Metagenomics Revolution
Advancing the Metagenomics RevolutionAdvancing the Metagenomics Revolution
Advancing the Metagenomics Revolution
 
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008CAMERA Presentation at KNAW ICoMM Colloquium May 2008
CAMERA Presentation at KNAW ICoMM Colloquium May 2008
 
'Novel technologies to study the resistome'
'Novel technologies to study the resistome''Novel technologies to study the resistome'
'Novel technologies to study the resistome'
 
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
The OptIPlanet Collaboratory Supporting Microbial Metagenomics Researchers Wo...
 
High-Throughput Sequencing
High-Throughput SequencingHigh-Throughput Sequencing
High-Throughput Sequencing
 
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
NGS Applications I (UEB-UAT Bioinformatics Course - Session 2.1.2 - VHIR, Bar...
 
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
WGS in public health microbiology - MDU/VIDRL Seminar - wed 17 jun 2015
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
Making Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and AnnotationsMaking Use of NGS Data: From Reads to Trees and Annotations
Making Use of NGS Data: From Reads to Trees and Annotations
 
Viral Metagenomics (CABBIO 20150629 Buenos Aires)
Viral Metagenomics (CABBIO 20150629 Buenos Aires)Viral Metagenomics (CABBIO 20150629 Buenos Aires)
Viral Metagenomics (CABBIO 20150629 Buenos Aires)
 

Viewers also liked

Aug2013 Heidi Rehm integrating large scale sequencing into clinical practice
Aug2013 Heidi Rehm integrating large scale sequencing into clinical practiceAug2013 Heidi Rehm integrating large scale sequencing into clinical practice
Aug2013 Heidi Rehm integrating large scale sequencing into clinical practice
GenomeInABottle
 

Viewers also liked (20)

Aug2013 Heidi Rehm integrating large scale sequencing into clinical practice
Aug2013 Heidi Rehm integrating large scale sequencing into clinical practiceAug2013 Heidi Rehm integrating large scale sequencing into clinical practice
Aug2013 Heidi Rehm integrating large scale sequencing into clinical practice
 
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun SequencesTools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
Tools for Metagenomics with 16S/ITS and Whole Genome Shotgun Sequences
 
Plant genome sequencing and crop improvement
Plant genome sequencing and crop improvementPlant genome sequencing and crop improvement
Plant genome sequencing and crop improvement
 
Bioinformática Introdução (Basic NGS)
Bioinformática Introdução (Basic NGS)Bioinformática Introdução (Basic NGS)
Bioinformática Introdução (Basic NGS)
 
Rossen eccmid2015v1.5
Rossen eccmid2015v1.5Rossen eccmid2015v1.5
Rossen eccmid2015v1.5
 
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
Speeding up sequencing: Sequencing in an hour enables sample to answer in a w...
 
transforming clinical microbiology by next generation sequencing
transforming clinical microbiology by next generation sequencingtransforming clinical microbiology by next generation sequencing
transforming clinical microbiology by next generation sequencing
 
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
Introduction to Metagenomics Data Analysis - UEB-VHIR - 2013
 
Ngs presentation
Ngs presentationNgs presentation
Ngs presentation
 
NGS overview
NGS overviewNGS overview
NGS overview
 
Metagenomics
MetagenomicsMetagenomics
Metagenomics
 
Ngs intro_v6_public
 Ngs intro_v6_public Ngs intro_v6_public
Ngs intro_v6_public
 
QIAseq Technologies for Metagenomics and Microbiome NGS Library Prep
QIAseq Technologies for Metagenomics and Microbiome NGS Library PrepQIAseq Technologies for Metagenomics and Microbiome NGS Library Prep
QIAseq Technologies for Metagenomics and Microbiome NGS Library Prep
 
NGx Sequencing 101-platforms
NGx Sequencing 101-platformsNGx Sequencing 101-platforms
NGx Sequencing 101-platforms
 
Next-generation sequencing from 2005 to 2020
Next-generation sequencing from 2005 to 2020Next-generation sequencing from 2005 to 2020
Next-generation sequencing from 2005 to 2020
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
Exploring Spark for Scalable Metagenomics Analysis: Spark Summit East talk by...
 
NGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platformsNGS - Basic principles and sequencing platforms
NGS - Basic principles and sequencing platforms
 
Next Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and OpportunitiesNext Generation Sequencing Informatics - Challenges and Opportunities
Next Generation Sequencing Informatics - Challenges and Opportunities
 
Poster ESHG
Poster ESHGPoster ESHG
Poster ESHG
 

Similar to Building bioinformatics resources for the global community

FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
Yatpang Cheung
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
Michael Atkins
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
Benjamin Good
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009
Ian Foster
 

Similar to Building bioinformatics resources for the global community (20)

Giab for jax long read 190917
Giab for jax long read 190917Giab for jax long read 190917
Giab for jax long read 190917
 
FedCentric_Presentation
FedCentric_PresentationFedCentric_Presentation
FedCentric_Presentation
 
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
Big Data at Golden Helix: Scaling to Meet the Demand of Clinical and Research...
 
ASHG_2014_AP
ASHG_2014_APASHG_2014_AP
ASHG_2014_AP
 
Bioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysisBioinformatics tools for NGS data analysis
Bioinformatics tools for NGS data analysis
 
2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)2015 GU-ICBI Poster (third printing)
2015 GU-ICBI Poster (third printing)
 
Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™Affymetrix OncoScan®* data analysis with Nexus Copy Number™
Affymetrix OncoScan®* data analysis with Nexus Copy Number™
 
WikiPathways: how open source and open data can make omics technology more us...
WikiPathways: how open source and open data can make omics technology more us...WikiPathways: how open source and open data can make omics technology more us...
WikiPathways: how open source and open data can make omics technology more us...
 
Bioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptxBioinformatics_1_ChenS.pptx
Bioinformatics_1_ChenS.pptx
 
Open Science and Ecological meta-anlaysis
Open Science and Ecological meta-anlaysisOpen Science and Ecological meta-anlaysis
Open Science and Ecological meta-anlaysis
 
Multi-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/BioconductorMulti-omics infrastructure and data for R/Bioconductor
Multi-omics infrastructure and data for R/Bioconductor
 
Emerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomicsEmerging challenges in data-intensive genomics
Emerging challenges in data-intensive genomics
 
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbuJax bio dataworldcongress.ngs.20181128finalwithoutbu
Jax bio dataworldcongress.ngs.20181128finalwithoutbu
 
Branch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiersBranch: An interactive, web-based tool for building decision tree classifiers
Branch: An interactive, web-based tool for building decision tree classifiers
 
Processing Amplicon Sequence Data for the Analysis of Microbial Communities
Processing Amplicon Sequence Data for the Analysis of Microbial CommunitiesProcessing Amplicon Sequence Data for the Analysis of Microbial Communities
Processing Amplicon Sequence Data for the Analysis of Microbial Communities
 
Next Generation Sequencing methods
Next Generation Sequencing methods Next Generation Sequencing methods
Next Generation Sequencing methods
 
Brief introduction to Bioinformatics
Brief introduction to BioinformaticsBrief introduction to Bioinformatics
Brief introduction to Bioinformatics
 
Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009Quantitative Medicine Feb 2009
Quantitative Medicine Feb 2009
 
Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128Giab jan2016 intro and update 160128
Giab jan2016 intro and update 160128
 
10.1.1.80.2149
10.1.1.80.214910.1.1.80.2149
10.1.1.80.2149
 

More from ExternalEvents

More from ExternalEvents (20)

Mauritania
Mauritania Mauritania
Mauritania
 
Malawi - M. Munthali
Malawi - M. MunthaliMalawi - M. Munthali
Malawi - M. Munthali
 
Malawi (Mbewe)
Malawi (Mbewe)Malawi (Mbewe)
Malawi (Mbewe)
 
Malawi (Desideri)
Malawi (Desideri)Malawi (Desideri)
Malawi (Desideri)
 
Lesotho
LesothoLesotho
Lesotho
 
Kenya
KenyaKenya
Kenya
 
ICRAF: Soil-plant spectral diagnostics laboratory
ICRAF: Soil-plant spectral diagnostics laboratoryICRAF: Soil-plant spectral diagnostics laboratory
ICRAF: Soil-plant spectral diagnostics laboratory
 
Ghana
GhanaGhana
Ghana
 
Ethiopia
EthiopiaEthiopia
Ethiopia
 
Item 15
Item 15Item 15
Item 15
 
Item 14
Item 14Item 14
Item 14
 
Item 13
Item 13Item 13
Item 13
 
Item 7
Item 7Item 7
Item 7
 
Item 6
Item 6Item 6
Item 6
 
Item 3
Item 3Item 3
Item 3
 
Item 16
Item 16Item 16
Item 16
 
Item 9: Soil mapping to support sustainable agriculture
Item 9: Soil mapping to support sustainable agricultureItem 9: Soil mapping to support sustainable agriculture
Item 9: Soil mapping to support sustainable agriculture
 
Item 8: WRB, World Reference Base for Soil Resouces
Item 8: WRB, World Reference Base for Soil ResoucesItem 8: WRB, World Reference Base for Soil Resouces
Item 8: WRB, World Reference Base for Soil Resouces
 
Item 7: Progress made in Nepal
Item 7: Progress made in NepalItem 7: Progress made in Nepal
Item 7: Progress made in Nepal
 
Item 6: International Center for Biosaline Agriculture
Item 6: International Center for Biosaline AgricultureItem 6: International Center for Biosaline Agriculture
Item 6: International Center for Biosaline Agriculture
 

Recently uploaded

Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Recently uploaded (20)

Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdfUGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
UGC NET Paper 1 Mathematical Reasoning & Aptitude.pdf
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
Beyond_Borders_Understanding_Anime_and_Manga_Fandom_A_Comprehensive_Audience_...
 
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Hongkong ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)Jamworks pilot and AI at Jisc (20/03/2024)
Jamworks pilot and AI at Jisc (20/03/2024)
 
How to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POSHow to Manage Global Discount in Odoo 17 POS
How to Manage Global Discount in Odoo 17 POS
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 

Building bioinformatics resources for the global community

  • 1. Building bioinformatics resources for the global community James Pettengill james.pettengill@fda.hhs.gov Biostatistics and Bioinformatics Staff Office of Analytics and Outreach FDA Center for Food Safety and Applied Nutrition GMI9 May 24, 2016 Rome, Italy
  • 2. CFSAN’s open-access peer reviewed methods for analyzing and differentiating among samples based on WGS data. Submitted 16 April 2014 Accepted 23 September 2014 Published 14 October 2014 Corresponding author Errol Strain, Errol.Strain@fda.hhs.gov Academic editor Keith Crandall Additional Information and Declarations can be found on page 21 DOI 10.7717/peerj.620 An evaluation of alternative methods for constructing phylogenies from whole genome sequence data: a case study with Salmonella James B. Pettengill, Yan Luo, Steven Davis, Yi Chen, Narjol Gonzalez-Escalona, Andrea Ottesen, Hugh Rand, Marc W. Allard and Errol Strain Center for Food Safety & Applied Nutrition, U.S. Food & Drug Administration, College Park, MD, USA ABSTRACT Comparative genomics based on whole genome sequencing (WGS) is increasingly being applied to investigate questions within evolutionary and molecular biology, as well as questions concerning public health (e.g., pathogen outbreaks). Given the impact that conclusions derived from such analyses may have, we have evaluated the robustness of clustering individuals based on WGS data to three key factors: (1) next-generation sequencing (NGS) platform (HiSeq, MiSeq, IonTorrent, 454, and SOLiD), (2) algorithms used to construct a SNP (single nucleotide polymorphism) matrix (reference-based and reference-free), and (3) phylogenetic inference method (FastTreeMP, GARLI, and RAxML). We carried out these analyses on 194 whole genome sequences representing 107 unique Salmonella enterica subsp. enterica ser. Montevideo strains. Reference-based approaches for identifying SNPs produced trees that were significantly more similar to one another than those produced under the reference-free approach. Topologies inferred using a core matrix (i.e., no missing data) were significantly more discordant than those inferred using a non-core matrix that allows for some missing data. However, allowing for too much missing data likely results in a high false discovery rate of SNPs. When analyzing the same SNP matrix, we observed that the more thorough inference methods implemented in GARLI and RAxML produced more similar topologies than FastTreeMP. Our results also confirm that reproducibility varies among NGS platforms where the MiSeq had the lowest number of pairwise diVerences among replicate runs. Our investigation into the ro- bustness of clustering patterns illustrates the importance of carefully considering how data from diVerent platforms are combined and analyzed. We found clear diVerences in the topologies inferred, and certain methods performed significantly better than others for discriminating between the highly clonal organisms investigated here. The methods supported by our results represent a preliminary set of guidelines and a step towards developing validated standards for clustering based on whole genome sequence data.
  • 3. Real-time pathogen detection in the era of whole-genome sequencing and big data: K-mer and site-based methods for inferring the distances among tens of thousands of Salmonella samples James Pettengill james.pettengill@fda.hhs.gov Biostatistics and Bioinformatics Staff Office of Analytics and Outreach FDA Center for Food Safety and Applied Nutrition GMI9 May 24, 2016 Rome, Italy
  • 4. •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real-time. Premise/Background of the project
  • 5. •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real-time. •  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years. Premise/Background of the project
  • 6. •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real-time. •  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years. •  For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples. Premise/Background of the project
  • 7. •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real-time. •  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years. •  For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples. •  Being able to do so is challenging due to both biological (evolutionary diverse samples) and computational (petabytes of sequence data) issues. Premise/Background of the project
  • 8. •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real-time. •  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years. •  For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples. •  Being able to do so is challenging due to both biological (evolutionary diverse samples) and computational (petabytes of sequence data) issues. •  Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and whole-genome multi-locus sequence typing (wgMLST)) Premise/Background of the project
  • 9. •  The adoption of whole-genome sequencing within the public health realm has resulted in large databases being populated in real-time. •  These databases contain 60,000+ samples and are expected to grow to hundreds of thousands within a few years. •  For these databases to be of optimal use one must be able to quickly interrogate them to accurately determine the genomic distances among a set of samples. •  Being able to do so is challenging due to both biological (evolutionary diverse samples) and computational (petabytes of sequence data) issues. •  Evaluated 7 measures of genetic distance based on k-mer profiles (Jaccard, Euclidean, Manhattan, Mash Jaccard, and Mash distances) and nucleotide sites (NUCmer and multi-locus sequence typing (MLST)) •  Empirical data: whole-genome sequence data from 18,997 Salmonella isolates Premise/Background of the project
  • 11. Efficient method inter-category comparisons intra-category comparisons genetic distances Experimental design: based on a classification scheme determine how well each distance measure performs # Inefficient method genetic distances #
  • 13. Experimental design: Empirical data: •  Analyze different distance methods on de novo assemblies of all Salmonella samples in GenomeTrakr •  Use serovar as the classification scheme Efficient method inter-enteritidis comparisons intra-enteritidis comparisons genetic distances #
  • 14. Experimental design: Empirical data: using cloud computing to perform assemblies on GenomeTrakr data Assembly workflow: Obtain latest metadata file from NCBI pathogen database
  • 15. Assembly workflow: Obtain latest metadata file from NCBI pathogen database Parse metadata and download raw data Experimental design: Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
  • 16. Assembly workflow: Obtain latest metadata file from NCBI pathogen database Parse metadata and download raw data Quality filter using fastx toolkit Experimental design: Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
  • 17. Assembly workflow: Obtain latest metadata file from NCBI pathogen database Parse metadata and download raw data Quality filter using fastx toolkit Taxonomic/contamination filtering using Kraken with custom db Experimental design: Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
  • 18. Assembly workflow: Obtain latest metadata file from NCBI pathogen database Parse metadata and download raw data Quality filter using fastx toolkit Taxonomic/contamination filtering using Kraken with custom db Assembly using SPAdes Experimental design: Empirical data: using cloud computing to perform assemblies on GenomeTrakr data
  • 19. 1.  Obtain an assembly for each sample within GenomeTrakr •  Use pilot of cloud computing to accomplish assemblies – “cloudbursting” Summary! –! We! have! successfully!completed! running! Use! Cases! 2! and! 3! on! AWS! servers!via!the!CycleCloud!platform.!Even!without!time!for!extensive!optimization!of! the!clusters,!we!were!able!to!complete!the!Use!Cases!rapidly!and!inexpensively.!! ! ! ! Use)Case)2)–))Listeria)Isolates) ) ! A!workflow!was!designed!to!analyze!sequencing!data!from!all!of!the!publicly! available! Listeria! isolates! (3645)! collected! by! the! GenomeTrackr! network.! This! workflow! involves! downloading! data! from! the! NCBI! servers,! trimming! the! sequencing!reads!based!on!quality!scores,!filtering!the!reads!based!on!quality!and! taxonomy,!and!assembling!the!reads!into!contiguous!genome!segments.!The!results! of! this! workflow! will! allow! us! to! improve! our! methods! of! identifying! outbreak! isolates.! ! ! 1.!!Cluster!Specs!–!! ! Max!cores! ! 4000! ! Max!parallel!jobs! 1000! ! Master!node! ! i2.4xlarge! ! Compute!nodes!! r3.2xlarge,!r3.4xlarge! ! 2.!Results!–! ! Jobs! ! ! 3645! ! Run!time! ! 8)hours!! ! Job!completion!rate! 99.8%! ! Approximate!cost! $1800.00! ! 3.!Additional!Notes!–! ! Local!runtime!! ! 3.5)days! ! Feasible!to!run!locally! YES! ! Anticipated!frequency! once/quarter! ! Estimated!yearly!cost! !$9000.00! *!Assuming!the!current!growth!rate!of!this!dataset,!we!estimate!900!additional!samples!per! quarter!for!the!next!year.! ! 3,645 Listeria assemblies ! Use)Case)3)–))Salmonella)Isolates) ) ! Our!revised!Use!Case!3!applies!the!workflow!described! publicly! available! Salmonella! isolates! (25765)! collected! by! network.!The!analysis!of!this!dataset!is!much!more!difficult!due! size!and!a!much!larger!number!of!isolates!and!is!not!feasible!o resources.! ! 1.!!Cluster!Specs!–!! ! Max!cores! ! 12000! ! Max!parallel!jobs! 3000! ! Master!node! ! i2.4xlarge! ! Compute!nodes!! r3.2xlarge,!r3.4xlarge,!r3.8xlarge! ! 2.!Results!–! ! Jobs! ! ! 25765! ! Run)time! ! 20)hours!! ! Job!completion!rate! 99.1%! ! Approximate!cost! $8000.00! ! 3.!Additional!Notes!–! ! Estimated)local)runtime! 23)days! ! Feasible)to)run)locally! NO! ! Anticipated!frequency! once/quarter! ! Estimated!yearly!cost! !$56000.00! *!Assuming!the!current!growth!rate!of!this!dataset,!we!estimate!12,50 per!quarter!for!the!next!year.! ! 25,765 Salmonella assemblies
  • 20. Site-based: Sample1: ACCTAGTACC Sample2: ACGTACTACC Requires statements about homology/ sequence alignment Kmer-based (L = 9): Sample1: ACCTAGTACC kmer1: ACCTAGTAC kmer2: CCTAGTACC Sample2: ACGTACTACC kmer1: ACGTACTAC kmer2: CGTACTACC Fast but loss/oversimplification of information Similarity = 0.8 Similarity = 0 Experimental design: Distance measures
  • 21. Summary of methods used to infer the relationships among samples. Class Method Description Exec. time (s) Site-based Nucmer§ Pairwise genome alignment using suffix arrays 11.9 wgMLST¶ Gene based approach 46.95 K-mer based Jaccard Index§ The intersection divided by the union of all K-mers found between two samples 9.4 Manhattan Distance§ Sum of the absolute differences between the abundance of each K-mer present between two samples 45.1 Euclidean Distance§ The square root of the sum of square of all pairwise differences in K- mer abundance 44.2 Mash Distance MinHash (Broder 1998) technique to reduce genomes to sketches and estimates a novel evolutionary distance metric among them 1.2 Mash Jaccard Distance The Jaccard Distance (as described above) but based on the sketch size (e.g., the number of hashes) 1.2 § Performed using de novo assemblies and requires k-mer indexing, which with jellyfish takes 7.4s (0.8) per sample (2.1 days for 25,000 samples) ¶ Requires a reference genome
  • 22. Classification of simulated data: ROC curves identical across different distance methods * Simulated data is not complex/ noisy enough
  • 23.
  • 24. Summary/Implications: •  There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples. •  Treating absent data as informative may be problematic
  • 25. Summary/Implications: •  There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples. •  Treating absent data as informative may be problematic •  Site-based methods, like NUCmer and MLST, tended to be superior in performance
  • 26. Summary/Implications: •  There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples. •  Treating absent data as informative may be problematic •  Site-based methods, like NUCmer and MLST, tended to be superior in performance •  Accessing the computing resources necessary to perform site-based methods may be challenging when analyzing large databases.
  • 27. Summary/Implications: •  There are features (e.g., genomic, assembly, and contamination) that cause k-mer based methods to fail to accurately capture the distance between samples. •  Treating absent data as informative may be problematic •  Site-based methods, like NUCmer and MLST, tended to be superior in performance •  Accessing the computing resources necessary to perform site-based methods may be challenging when analyzing large databases. •  If working with k-mer distances err on the side of false positives •  And have high quality assemblies
  • 28. Acknowledgements FDA •  Center for Food Safety and Applied Nutrition •  Biostats/Bioinformatics staff – J. Baugher, H. Rand, J. Miller, Y. Luo, S. Davis, E. Strain •  Center for Veterinary Medicine •  Office of Regulatory Affairs National Institutes of Health •  National Center for Biotechnology Information State Health and University Labs •  Alaska •  Arizona •  California •  Florida •  Hawaii •  Maryland •  Minnesota •  New Mexico •  New York •  South Dakota •  Texas •  Virginia •  Washington USDA/FSIS •  Eastern Laboratory CDC •  Enteric Diseases Laboratory •  INEI-ANLIS “Carolos Malbran Institute,” Argentina •  Centre for Food Safety, University College Dublin, Ireland •  Food Environmental Research Agency, UK •  Public Health England, UK •  WHO •  Illumina •  Pac Bio •  CLC Bio •  Other independent collaborators
  • 29.
  • 30.
  • 31. •  False negatives are primarily due to failure to meet consensus frequency threshold ConsensusFrequency<0.9 Coverage<8 X20x_Coverage X100x_Coverage X20x_Coverage X100x_Coverage 0 1000 2000 3000 4000 5000 value variable variable X20x_Coverage X100x_Coverage Validation exercise key findings: Number of false negatives •  False negatives are not random across the genome
  • 32. Validation exercise of CFSAN SNP Pipeline key findings: •  100× dataset •  Recovered 98.9% of the introduced SNPs •  False positive rate of 1.04 × 10−6 •  20× dataset •  Recovered 98.8% of SNPs •  False positive rate of 8.34 × 10−7