Algorithm of NGS Data
1. Speaker: Eric C.Y., LEE
Advisor: I-Fang Chung
2011.Mar.21
Monday, March 21, 2011 1
2. Outline
• Motivation
• Workflow
• Result
• Conclusion
• My Comment
3. Motivation
• High-throughput sequencing (HTS) technology now plays an important role in the life sciences.
• Different high-throughput sequencing technologies are competing to sequence an individual human genome for less than $1,000 within a few years.
2006.Mar.17 Vol.311 Science
4. Motivation
• The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing, and sharing the data.
5. Workflow
Evaluate algorithms   Analysis datasets   Preliminary result
Golomb-Rice           Dataset1            For location
Elias Gamma           Dataset2            For mismatch
MOV                   Dataset3            ...
Huffman               ...
...
6. Coding Strategy
Optimal encoding of these integers from a
compression standpoint depends on their
distribution in order to assign shorter
binary codes to more probable symbols.
~ Shannon’s Entropy Coding Theory
Claude Shannon
1916~2001
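To make the link between distribution and code length concrete, here is a minimal sketch that computes the Shannon entropy (the lower bound, in bits per symbol, for any lossless code) of a list of integers. The function name and the two sample lists are illustrative assumptions, not data from the talk.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Return the Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the observed symbol probabilities
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical samples: one uniform, one dominated by a single value.
uniform = [1, 2, 3, 4]
skewed = [1, 1, 1, 1, 1, 1, 2, 3]

print(shannon_entropy(uniform))
print(shannon_entropy(skewed))
```

The skewed list has the lower entropy, which is why giving its most frequent value a short code pays off.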
7. Encoding Strategies
• Fixed Codes
  • Golomb-Rice Codes
  • Elias Gamma Codes
  • Monotone Value Codes
• Variable Codes
  • Huffman Code
8. Golomb-Rice Codes
Set m = 10 and encode n = 42.

Encoding of quotient part
q     output bits
0     0
1     10
2     110
3     1110
4     11110
5     111110
6     1111110
..    ..
N     <N repetitions of 1> followed by 0

Encoding of remainder part (for r >= 6 the binary column shows r + 6)
r     binary   output bits
0     0000     000
1     0001     001
2     0010     010
3     0011     011
4     0100     100
5     0101     101
6     1100     1100
7     1101     1101
8     1110     1110
9     1111     1111

n = 42, m = 10: n / m gives q = 4, r = 2
Output is 11110 010 = 11110010
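The worked example above can be sketched in a few lines: a unary quotient followed by a truncated-binary remainder. The function names are illustrative assumptions, and the sketch assumes m >= 2 and n >= 0.

```python
def unary(q):
    """Quotient part: q ones followed by a terminating zero."""
    return "1" * q + "0"

def truncated_binary(r, m):
    """Remainder part for 0 <= r < m, using truncated binary coding."""
    b = (m - 1).bit_length()      # ceil(log2(m)) for m >= 2
    cutoff = (1 << b) - m         # number of remainders that get b - 1 bits
    if r < cutoff:
        return format(r, "b").zfill(b - 1)
    return format(r + cutoff, "b").zfill(b)

def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with parameter m >= 2."""
    if m < 2 or n < 0:
        raise ValueError("this sketch assumes m >= 2 and n >= 0")
    q, r = divmod(n, m)
    return unary(q) + truncated_binary(r, m)

print(golomb_encode(42, 10))  # 11110010
```

When m is a power of two the cutoff is zero and every remainder uses the same number of bits, which is the Rice special case.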
11. Huffman Codes
“this is an example of a huffman tree”
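As a rough sketch of how such a tree is built, the following uses a priority queue to merge the two least frequent nodes repeatedly. The function name and the tie-breaking counter are assumptions for illustration; other tie-breaking rules give different but equally short codes.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Return a dict mapping each character of text to its Huffman code."""
    if not text:
        return {}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), sym) for sym, w in Counter(text).items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # single distinct symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)   # two least frequent nodes
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: a single character
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded = "".join(codes[c] for c in text)
print(len(encoded), "bits instead of", 8 * len(text))
```

Frequent characters such as the space end up with the shortest codes, and the code table (or tree) has to be kept alongside the encoded bits to decode them.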
12. Workflow
Evaluate algorithms   Analysis datasets   Preliminary result
Golomb-Rice           Dataset1            For location
Elias Gamma           Dataset2            For mismatch
MOV                   Dataset3            ...
Huffman               ...
...
13. Dataset1
• Retrotransposon Ty3 insertion sites in the yeast genome.
• 6,439,584 reads of 19 bp.
• Highly clustered.
• High degree of repetition.
• At most two substitutions.
Mismatches per read (pie chart): 0: 54%, 1: 14%, 2: 32%
14. Dataset2
• In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.
• Mapped to hg18.
• 1,697,990 reads of 25 bp.
• At most two substitutions.
Mismatches per read (pie chart, in label order): 1: 6%, 2: 18%, 0: 76%
16. Dataset3
• Corresponds to a full diploid human genome sequencing experiment for an Asian individual.
• Large dataset; only the reads mapped to chr. 22 are used.
• 31,118,531 reads of 30~40 bp.
Mismatches per read (pie chart): 0: 61%, 1: 20%, 2: 19%
17. Workflow
Evaluate algorithms   Analysis datasets   Preliminary result
Golomb-Rice           Dataset1            For location
Elias Gamma           Dataset2            For mismatch
MOV                   Dataset3            ...
Huffman               ...
...
18. Alignment Result Example
Fields of a Bowtie alignment line:
• Name of read that aligned
• Strand
• Name of reference sequence where the alignment occurs
• 0-based offset into the forward reference strand
• Read sequence
• Read quality
• Value of ceiling
• Mismatch descriptors
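A small parser makes the columns easier to see. This is a sketch that assumes Bowtie's default tab-separated output with the eight fields in the order read name, strand, reference name, offset, sequence, quality, ceiling value, and mismatch descriptors of the form `offset:ref>read`. The sample line, read name, and coordinates are hypothetical.

```python
from typing import NamedTuple

class Mismatch(NamedTuple):
    offset: int      # 0-based offset of the mismatch within the read
    ref_base: str
    read_base: str

class Alignment(NamedTuple):
    read_name: str
    strand: str
    ref_name: str
    offset: int      # 0-based offset into the forward reference strand
    sequence: str
    quality: str
    ceiling: int
    mismatches: list

def parse_alignment(line):
    """Parse one tab-separated alignment line into an Alignment."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))   # mismatch column may be missing
    name, strand, ref, offset, seq, qual, ceiling, mm = fields[:8]
    mismatches = []
    for desc in filter(None, mm.split(",")):
        pos, bases = desc.split(":")
        ref_base, read_base = bases.split(">")
        mismatches.append(Mismatch(int(pos), ref_base, read_base))
    return Alignment(name, strand, ref, int(offset), seq, qual,
                     int(ceiling), mismatches)

# Hypothetical 25 bp read with one mismatch at read offset 22.
sample = "\t".join(["read_1", "+", "chr22", "16050000",
                    "ACGT" * 6 + "A", "I" * 25, "0", "22:A>G"])
aln = parse_alignment(sample)
print(aln.ref_name, aln.strand, aln.offset, aln.mismatches)
```

Only the reference name, strand, offset, and mismatch columns are needed for the location and mismatch encodings discussed in the following slides.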
19. Encoding Location Information
• Standalone: encoding each column independently.
• Combined: merging the chromosome, strand, and mismatch columns and compressing them together.
20. Apply the Algorithms
• Elias Gamma (EG) Absolute
  • The sequences cannot be sorted.
  • Can be applied to Dataset3.
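For reference, a minimal sketch of the Elias gamma code itself: floor(log2 n) zeros followed by the binary form of n. Since gamma codes cannot represent 0, the sketch assumes 0-based positions are shifted by one before encoding; that shift and the function names are assumptions, not details from the talk.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer n, as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def encode_absolute(positions):
    """Encode each 0-based position independently, keeping the read order."""
    return "".join(elias_gamma(p + 1) for p in positions)

for n in (1, 2, 5, 22):
    print(n, elias_gamma(n))
```

A position near 1,000,000 costs 39 bits this way, so absolute coding gets expensive on large chromosomes.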
21. Apply the Algorithms
• Elias Gamma Relative (REG)
  • The sequences can be sorted, so compression performance is much better.
  • Sort the location addresses and encode relative offsets instead of absolute positions.
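The relative scheme can be sketched as sorting the positions and gamma-coding the gaps between neighbours. The `+ 1` shift (so that a repeated position, gap 0, stays encodable) and all names here are assumptions for illustration.

```python
def elias_gamma(n):
    """Elias gamma code of a positive integer n, as a bit string."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def reg_encode(positions):
    """Sort positions, then gamma-code each gap to the previous position."""
    bits = []
    previous = 0
    for pos in sorted(positions):
        bits.append(elias_gamma(pos - previous + 1))  # +1: gap 0 is allowed
        previous = pos
    return "".join(bits)

def reg_decode(bits):
    """Recover the sorted positions from a reg_encode bit string."""
    positions, previous, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the unary length prefix
            zeros += 1
            i += 1
        value = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        previous += value - 1          # undo the +1 shift
        positions.append(previous)
    return positions

positions = [1050, 1000, 1003, 1003]   # hypothetical read start offsets
encoded = reg_encode(positions)
print(len(encoded), reg_decode(encoded))
```

Clustered or repeated positions produce small gaps, and small gaps are exactly where gamma codes are shortest.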
22. Apply the Algorithms
• Relative Elias Gamma Indexed (REG Indexed)
  • Sort the reads and create an index file.
  • Combine chromosome, strand, and mismatches, then compress them by relative location.
  • Cannot be applied to Dataset3.
23. Apply the Algorithms
• Monotone Value (MOV)
  • Sort the sequences by chromosome and location.
  • Encode the absolute address.
24. Apply the Algorithms
• Huffman codes
  • Focused on the “relative” start position.
  • This algorithm has to store the Huffman tree for decompression.
25. Comments for Encoding Location
• REG suits all three datasets.
• For Dataset1, use the unique chromosome locations and count their frequencies for coding. REG is an ideal solution for highly repetitive datasets.
• Huffman coding is not a good fit for Dataset1.
26. Encoding Mismatch Information
• Each read may contain 1 or 2 mismatches, each with a nucleotide value.
• One line records the mismatch information for a read. If there is no mismatch, the line is left blank.
27. Mismatches of Dataset2
If the mismatch is at position 23 of a 25 bp read:
• From the start it is 22: 10110 (5 bits)
• From the end it is 2: 10 (2 bits)
Calculate the position from the end of the read.
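The arithmetic on this slide can be checked with a short sketch. It assumes the 25 bp read length of Dataset2 and a 1-based mismatch position; the function name is illustrative.

```python
def mismatch_offsets(position, read_length):
    """Return the 0-based offsets of a 1-based position from start and end."""
    from_start = position - 1
    from_end = read_length - position
    return from_start, from_end

from_start, from_end = mismatch_offsets(23, 25)
print(from_start, format(from_start, "b"))  # 22 10110
print(from_end, format(from_end, "b"))      # 2 10
```

Counting from the end is shorter whenever the mismatch falls in the second half of the read.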
28. Nucleotide Substitution
• Using numbers instead of characters.
Base   ASCII code   7-bit binary   2-bit code
A      65           1000001        00
C      67           1000011        01
G      71           1000111        10
T      84           1010100        11
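A minimal sketch of the 2-bit mapping, using the code table from this slide. The function names and the sample sequence are illustrative assumptions, and the sketch handles only A, C, G, and T.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}

def encode_bases(seq):
    """Replace each nucleotide with its 2-bit code."""
    return "".join(CODE[base] for base in seq)

def decode_bases(bits):
    """Read the bit string two bits at a time and map back to nucleotides."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))

seq = "GATTACA"                      # hypothetical sequence
packed = encode_bases(seq)
print(packed, len(packed), "bits instead of", 7 * len(seq))
```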
29. Combining Location and Mismatch
Example mismatches: 19G, 30A, 34T
Count the frequencies, coding the location and mismatch together.
19G: 00001010110 {11 bits}
19G: 10110 {5 bits}
30. Final Encoding
• Dataset1: Mismatches take up most of the space, because the dataset is already sorted.
• Dataset2: Locations are sparse, so they take up most of the storage.
• Dataset3: This dataset is balanced, because it has full coverage of the genome.
31. Implementation
• Based on REG Indexed for location information and combined encoding for mismatch information.
• Pass 1: Counting the mismatches.
• Pass 2: Actual encoding.
32. Result
Dataset1 (bytes)
Original           1,030,333,440
Best Compression      56,078,940
GenCompress           56,166,419
gzip                  41,378,624
bzip2                 42,233,336
7zip                  30,651,664
33. Result
Dataset2 (bytes)
Original             353,181,920
Best Compression      35,983,322
GenCompress           36,099,480
gzip                  95,688,992
bzip2                 94,030,320
7zip                  83,319,584
34. Result
Dataset3 (bytes)
Original           8,869,613,392
Best Compression     390,541,330
GenCompress          390,541,330
gzip                 618,818,824
bzip2                955,061,616
7zip                 411,811,520
35. Conclusion
• Any genome sequence can be used for
mapping the reads.
• In terms of time consumption, GenCompress is worth using.
38. Conclusion
• Hard drives are not expensive; the real cost is bandwidth.
• Quality scores are not considered.
• Read identifiers are also important.
• Mismatches may be contaminants or de novo, or the reference sequence may be unfinished.
• Only the best match is considered.
39. Conclusion
• Huffman tree in dataset 1 and 2.
40. My Comments
• They should release the tool as open source.
• Hardware configuration: why RAID1?