27. Possible directions
• Handle large genomes and larger datasets.
• Handle insertion and deletion errors.
• Correct hybrid datasets from multiple next-generation platforms.
• Develop error correction methods for datasets in population studies.
Tuesday, November 5, 2013
29. Short read error correction
1. Find similar pairs of reads with SlideSort.
2. Vote on each position using the paired reads.
3. Decide the new base.
4. Correct the erroneous bases.
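The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the presenter's implementation: reads have equal length, only substitution errors are considered, a brute-force Hamming-distance search stands in for SlideSort, and the "decide the new base" step is a simple strict-majority vote. The names `find_similar_pairs` and `correct_reads` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def find_similar_pairs(reads, d):
    """Brute-force stand-in for SlideSort: index pairs within Hamming distance d."""
    return [(i, j) for i, j in combinations(range(len(reads)), 2)
            if len(reads[i]) == len(reads[j]) and hamming(reads[i], reads[j]) <= d]


def correct_reads(reads, d):
    """Majority vote per position among each read and its similar neighbors."""
    neighbors = {i: [i] for i in range(len(reads))}
    for i, j in find_similar_pairs(reads, d):
        neighbors[i].append(j)
        neighbors[j].append(i)
    corrected = []
    for i, read in enumerate(reads):
        new_bases = []
        for pos, base in enumerate(read):
            votes = Counter(reads[j][pos] for j in neighbors[i])
            winner, count = votes.most_common(1)[0]
            # Keep the original base unless some base has a strict majority.
            new_bases.append(winner if count * 2 > len(neighbors[i]) else base)
        corrected.append("".join(new_bases))
    return corrected
```

With three copies of ATGCATA and one copy of ATGCTTA, the vote at the fifth position is three to one, so the fourth read is corrected to ATGCATA. When only two reads disagree, there is no strict majority and both are left unchanged.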
30. SlideSort
• All pairs similarity search (APSS) for sequence datasets.
• APSS: find all similar pairs in a dataset.
Performance of SlideSort
• 10 minutes for 10 million reads.
• 2 to 3 GB for 10 million reads.
Complexity of SlideSort
• Time: O(N+α)
• Equivalence classes are found in O(N).
• α is the number of neighbor pairs.
31. SlideSort
Input
• A set of short reads
• Distance threshold d
Example input reads:
ATGCATA ATTCATT
ATGCTCA ATGCCCA
AAGTCGG ATGTATT
AAGGTCG ATGCTTA
Output
• Alignments and distances of all similar pairs.
Example output pairs (ed = edit distance):
ATGCATA    ed = 1
ATGCTTA
ATGCATA    ed = 2
ATGCTCA
AAG-TCGG   ed = 2
AAGGTCG-
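The input/output contract on this slide can be illustrated with a brute-force baseline. This is only a sketch of what "all similar pairs" means: it compares every pair with the standard Levenshtein dynamic program, which is quadratic in the number of reads and is not how SlideSort works internally. It also reports distances only, not alignments. The function names are hypothetical.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # match or substitute
        prev = curr
    return prev[-1]


def all_similar_pairs(reads, d):
    """Brute-force APSS: every pair of reads with edit distance at most d."""
    result = []
    for a, b in combinations(reads, 2):
        dist = edit_distance(a, b)
        if dist <= d:
            result.append((a, b, dist))
    return result
```

Applied to the pairs shown on the slide, `edit_distance` gives 1 for ATGCATA/ATGCTTA, 2 for ATGCATA/ATGCTCA, and 2 for AAGTCGG/AAGGTCG (one insertion plus one deletion).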
34. Basic strategy
(Animation by Prof. Shimizu; example reads: ATGC…, AAGT…)
1. Filtering stage
Find subsets of reads sharing common substring(s).
2. Pairwise comparison stage
Compare all pairs within each subset.
46. SlideSort
Two reads S1 and S2 are each decomposed into m blocks.
If the edit distance between S1 and S2 is at most d, at least (m-d) blocks are shared by S1 and S2 at similar positions.
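The bound follows from a pigeonhole argument: each edit can damage at most one block, so d edits leave at least m-d blocks intact. The sketch below checks this for the simplified case of equal-length reads with substitutions only, where intact blocks stay at exactly the same block index. With insertions or deletions the intact blocks may be shifted, which this sketch does not handle. `split_blocks` and `common_blocks` are hypothetical helper names.

```python
def split_blocks(s: str, m: int):
    """Split s into m contiguous blocks whose lengths differ by at most one."""
    base, extra = divmod(len(s), m)
    blocks, start = [], 0
    for i in range(m):
        end = start + base + (1 if i < extra else 0)
        blocks.append(s[start:end])
        start = end
    return blocks


def common_blocks(s1: str, s2: str, m: int) -> int:
    """Count blocks that are identical at the same block index (equal-length reads)."""
    return sum(b1 == b2 for b1, b2 in zip(split_blocks(s1, m), split_blocks(s2, m)))
```

For example, with m = 3, ATGCATA splits into ATG, CA, TA and ATGCTCA splits into ATG, CT, CA. The two reads differ by d = 2 substitutions and share exactly m-d = 1 block.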
47. SlideSort
First step:
• Quickly finds a subset of short reads that share (m-d) common blocks (k-mers).
Second step:
• Calculates the edit distance between all pairs included in the subset (equivalence class).
• Outputs pairs whose edit distance is at most d, along with their alignments and scores.
Equivalence class
(Figure: among reads S1 to S6, the reads S1, S2, and S5 share the block ATGC… and form one equivalence class.)
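The two steps can be sketched as a filter-then-verify loop. This is a simplified illustration, not SlideSort's actual implementation: it assumes equal-length reads, uses Hamming distance instead of edit distance (so blocks never shift), requires m > d, and groups reads with a hash map rather than by sorting. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_blocks(s, m):
    """Split s into m contiguous blocks whose lengths differ by at most one."""
    base, extra = divmod(len(s), m)
    bounds = [0]
    for i in range(m):
        bounds.append(bounds[-1] + base + (1 if i < extra else 0))
    return [s[bounds[i]:bounds[i + 1]] for i in range(m)]


def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(reads, d, m):
    """Return {(i, j): distance} for equal-length reads within Hamming distance d."""
    blocks = [split_blocks(r, m) for r in reads]
    found = {}
    # First step: for each choice of m - d block positions, reads that agree on
    # all chosen blocks fall into the same equivalence class.
    for positions in combinations(range(m), m - d):
        classes = defaultdict(list)
        for idx, bl in enumerate(blocks):
            classes[tuple(bl[p] for p in positions)].append(idx)
        # Second step: verify every pair inside each class.
        for members in classes.values():
            for i, j in combinations(members, 2):
                if (i, j) not in found:
                    dist = hamming(reads[i], reads[j])
                    if dist <= d:
                        found[(i, j)] = dist
    return found
```

Because any pair within distance d must agree on at least m-d blocks, it lands in a common class for at least one choice of positions, so no qualifying pair is missed. Pairs that share no class are never compared, which is where the savings over brute force come from.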
48. Toy Experiment
Data: test.fasta
Simulator: Stampy (an open-source tool that can simulate short-read errors).
Number of sequences: 5
Max_seq_length: 51
Min_seq_length: 51
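Summary figures like these can be computed with a few lines of code. The sketch below is an assumption about how such a check might be done, not the script used for the experiment; `read_fasta` and `summarize` are hypothetical names, and the parser handles only plain FASTA with ">" header lines and at least one record.

```python
def read_fasta(lines):
    """Parse FASTA text lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def summarize(records):
    """Return (number of sequences, max length, min length)."""
    lengths = [len(seq) for _, seq in records]
    return len(lengths), max(lengths), min(lengths)
```

A file can be passed directly as the line source, for example `with open("test.fasta") as fh: print(summarize(read_fasta(fh)))`.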
50. Discussion
• It is unclear whether the test data generated by Stampy is of good quality.
• The dataset is far too small.
51. Future work
• Use a proper, larger dataset.
• Select datasets from real experiments in online databases instead of simulations.
• Try a Bayesian model.
52. References
• Elaine R. Mardis. A decade’s perspective on DNA sequencing technology.
• Next Generation Sequencing (NGS) Market [Platforms (Illumina HiSeq, MiSeq, Life Technologies Ion Proton/PGM, 454 Roche), Bioinformatics (RNA-Seq, ChIP-Seq), (Pyrosequencing, SBS, SMRT), (Diagnostics, Personalized Medicine)] - Global Forecast to 2017.
• Michael L. Metzker. Sequencing technologies — the next generation.
• Xiao Yang, Sriram P. Chockalingam, Srinivas Aluru. A survey of error-correction methods for next-generation sequencing. Briefings in Bioinformatics (2013) 14 (1): 56-66.
• Kana Shimizu, Koji Tsuda. SlideSort: all pairs similarity search for short reads. Bioinformatics (2011) 27 (4): 464-470.