The Use of K-mer Minimizers to Identify Bacterium Genomes in High Throughput DNA
Sequence Data
Mackenna Galicia - UC Davis Genome Center, Bioinformatics Core
B.S. Biotechnology, UC Davis 2019
Supervisor: Matthew Settles
Abstract
Background Information
Methods
Discussion/Conclusion
I would like to thank my supervisor and mentor, Matthew
Settles, for proposing this research project and guiding me
throughout it, and Zev Kronenberg for the support and for
providing me with the similarity search tool, Bevel.
My project uses a sequence analysis technique, k-mer
minimizers, to identify bacteria in a shotgun genomic
DNA sample. We used the Bevel algorithm to compare DNA
sequences against standardized reference genomes in the
PATRIC whole genome bacterial database. Bevel is a sequence
similarity tool that uses a minimizer database. Minimizers are
representative k-mers (subsequences of length k) that have
the minimum hash value across a genomic region and can
therefore serve as a compact, comparable signature of that
region. The query and reference minimizer databases are
queried against each other, resulting in a list
of positions where two or more sequences match. I am
developing two Python applications that first process the
results of the algorithm and then return a score that
enables the ranking of bacterial matches. The higher the
score, the better the match between the unknown bacterium and
the standardized reference genome.
Sample “Seqmatch” Output
What is Bioinformatics?
● Combines elements of biology, computer science, and
statistics to work with genome sequence data
● Large genomes are difficult to sequence due to their size
and complex structure, so bioinformatics provides efficient
ways to assemble and analyze the sequence data
What is Whole Genome Shotgun Sequencing?
● A quick and efficient way to sequence large
genomes
● Cuts the genome into small fragments of DNA that are
sequenced and then reassembled by computer programs
Reads are the sequences of the small DNA fragments produced by
whole genome shotgun sequencing. The reads are assembled to form
contiguous genomic sequences called contigs. Scaffolds consist of
one or more contigs, typically joined with runs of N's, which
represent sequencing gaps. The scaffolds are then ordered,
oriented, and assembled to form complete assemblies.
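The relationship between scaffolds and contigs can be illustrated with a minimal Python sketch. It assumes gaps are written as runs of the letter N; the function name split_scaffold, the min_gap parameter, and the toy sequence are hypothetical and are not part of the poster's pipeline.

```python
import re
from typing import List


def split_scaffold(scaffold: str, min_gap: int = 1) -> List[str]:
    """Split a scaffold into contigs at runs of 'N' (assumed gap placeholder)."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty strings produced by leading or trailing gaps.
    return [contig for contig in gap.split(scaffold) if contig]


# Toy scaffold (made up for illustration): three contigs joined by two gaps.
scaffold = "ACGTACGTNNNNNGGCTAGCTNNNTTAGC"
contigs = split_scaffold(scaffold)
print(contigs)  # ['ACGTACGT', 'GGCTAGCT', 'TTAGC']
```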
What are k-mer minimizers?
● A hash-based sampling method that reduces redundancy
among neighboring k-mers, which overlap each other in
all but one nucleotide position
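A minimal sketch of window-based minimizer selection in Python may help make this concrete. It is not Bevel's implementation: the hash function (blake2b), the tie-breaking rule, the function names, and the toy sequence are all assumptions chosen for illustration.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k=15, w=4):
    """Return sorted (position, k-mer) pairs, keeping the k-mer with the
    smallest hash in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    hashes = [kmer_hash(km) for km in kmers]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Ties are broken by taking the leftmost k-mer (an assumption).
        best = min(window, key=lambda i: (hashes[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)


# Toy sequence (made up); small k and w keep the output readable.
seq = "ACGTTGCATGCCGATAGCTAGGCTA"
mins = minimizers(seq, k=5, w=4)
print(len(seq) - 5 + 1, "k-mers reduced to", len(mins), "minimizers")
```

Because adjacent windows often share the same minimum, far fewer k-mers are stored than exist in the sequence, while every window is still represented.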
Future work to build upon this project would include:
1. Continuing to collect query minimizer scores for the
query-target sequence pairs remaining to be processed
2. Correlating the results of my project with previous
findings acquired in a laboratory
● The goal of this experiment is to show that minimizers
are a fast means of characterizing bacterial shotgun
assembly contigs
● Given assembled contigs, we can compare them to a
database of whole genome sequences
● The query sequence and the target sequence with the
most matches are likely from the same organism
● This minimizer approach can be used to identify unknown
samples or to check for contamination (samples containing
multiple organisms)
Sample “Bevel” Output
Whole Shotgun Sequencing
Dot Plot Results
K-mer Minimizers
Acknowledgements
● Running Bevel
○ Store every other match (-w 2)
○ K-mer/word size of 15 (-k 15)
○ Filter matches occurring > 10 times (-n 10)
● Tally the hits/matches and assign a score.
● Target sequences with a higher score
suggest a likely match with the query
organism.
Why use a Dot Plot?
● Makes it easy to identify long regions of strong similarity between two
sequences
● Clearly reveals the presence of insertions, deletions, and mutations that are
usually hard to identify with other methods
● Plot of target sequence accn|CP005975 and query
sequence NODE_2_length_654753_cov_26.8031_ID_3779
● The diagonal line of dots shows the regions of local
similarity between the two sequences
● The gaps in the diagonal lines represent mutations or
other differences between the sequences
● Isolated dots outside of the diagonal line represent random
matches
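The points of a dot plot can be computed with a short Python sketch. It assumes exact k-mer matches on the forward strand only; the function name dot_plot_points and the toy sequences are hypothetical, and this is not how the poster's plot was necessarily produced.

```python
from collections import defaultdict


def dot_plot_points(target, query, k=15):
    """Return (target_pos, query_pos) pairs where the two sequences share
    an identical k-mer (forward strand only, an assumed simplification)."""
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i + k]].append(i)
    points = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            points.append((i, j))
    return points


# Toy sequences (made up): the query is a copy of target positions 3..11.
target = "AAACCCGGGTTTACGT"
query = "CCCGGGTTT"
points = dot_plot_points(target, query, k=4)
print(points)  # [(3, 0), (4, 1), (5, 2), (6, 3), (7, 4), (8, 5)]
```

Every point here has the same offset (target position minus query position equals 3), which is what appears as an unbroken diagonal when the points are drawn.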
● The Bevel output provides a “raw” listing of all target
sequences (and query sequences) with more than one match
with an organism sequence
● Seqmatch is a Python application I created that takes the
Bevel output for each query/target sequence pair and
calculates and assigns a “minimizer score”
● The higher the score, the greater the number of “hits” or
matches between the two sequences
● The score is the total number of matched query minimizers
for each unique target/query sequence ID pair
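The scoring idea can be sketched in Python as follows. This is not the Seqmatch source code: the Bevel output format is not shown here, so the sketch assumes each match has already been parsed into a tuple beginning with (query_id, target_id). The function names and the toy records are hypothetical.

```python
from collections import Counter


def minimizer_scores(matches):
    """Count matched minimizers per (query_id, target_id) pair.

    Each record is assumed to start with (query_id, target_id); any
    further fields, such as positions, are ignored.
    """
    return Counter((query_id, target_id) for query_id, target_id, *_ in matches)


def rank_targets(scores, query_id):
    """Return (target_id, score) pairs for one query, best match first."""
    ranked = [(t, s) for (q, t), s in scores.items() if q == query_id]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Toy records (made up): (query_id, target_id, query_pos, target_pos).
matches = [
    ("query_1", "target_A", 10, 110),
    ("query_1", "target_A", 25, 125),
    ("query_1", "target_A", 40, 140),
    ("query_1", "target_B", 10, 900),
    ("query_2", "target_B", 5, 50),
]
scores = minimizer_scores(matches)
print(rank_targets(scores, "query_1"))  # [('target_A', 3), ('target_B', 1)]
```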
https://www.ncbi.nlm.nih.gov/nuccore/CP005975.1
https://en.wikipedia.org/wiki/Shotgun_sequencing
Sample GenBank Result
● Using the highest query minimizer scores (“best matches”), we
can look up the unidentified bacteria in NCBI GenBank using
the accession numbers of the best-matching target sequences
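As a small illustration of the lookup step, the Python sketch below turns a target ID of the form seen on this poster (accn|CP005975) into an NCBI nucleotide URL. The helper name genbank_url is hypothetical, and the sketch assumes that the accession follows the last "|" and that the unversioned accession resolves to the current record.

```python
def genbank_url(target_id):
    """Build an NCBI nucleotide URL from a target ID such as 'accn|CP005975'.

    Assumes the accession is the text after the last '|'.
    """
    accession = target_id.split("|")[-1].strip()
    return "https://www.ncbi.nlm.nih.gov/nuccore/" + accession


print(genbank_url("accn|CP005975"))
# https://www.ncbi.nlm.nih.gov/nuccore/CP005975
```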
Future Work
(A) The two sequences are broken down
into their constituent k-mers.
(B) All k-mers are converted into hash
values. In this example, the window
size is four (r1...r4).
(C) The lowest hash values
(minimizers/min-mers) are
extracted and listed.
(D) The fragments are compared
according to their four lowest
minimizers to find overlapping regions
http://dx.doi.org/10.1101/008003
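Steps (A) to (D) can be approximated with a short Python sketch. It interprets "the four lowest minimizers" as keeping the four smallest hash values of each sequence and counting how many the two sequences share. That interpretation, the blake2b hash, the function names, and the toy sequences are all assumptions made for illustration, not the method of the cited work or of Bevel.

```python
import hashlib


def kmer_hash(kmer):
    """Stable 64-bit hash of a k-mer (blake2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=15, size=4):
    """Steps A-C: split into k-mers, hash them, keep the `size` lowest hashes."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return set(sorted(hashes)[:size])


def shared_minimizers(a, b, k=15, size=4):
    """Step D: count the low-hash values the two sketches have in common."""
    return len(sketch(a, k, size) & sketch(b, k, size))


# Toy sequences (made up); a small k keeps the example short.
read_1 = "ACGTTGCATGCCGATAGCTAGGCTA"
read_2 = "TTTTTTTTTTTTTTTTTTTT"
print(shared_minimizers(read_1, read_1, k=5))  # identical input: all 4 shared
print(shared_minimizers(read_1, read_2, k=5))  # no k-mer in common
```

Sequences that overlap tend to share low-hash k-mers, so the shared count gives a cheap signal for which pairs deserve a closer comparison.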
