Algorithm of NGS Data

Speaker: Eric C.Y., LEE
Advisor: I-Fang Chung

2011.Mar.21

Monday, March 21, 2011

Outline

• Motivation
• Workflow
• Result
• Conclusion
• My Comment


Motivation
• High-throughput sequencing (HTS) technology now plays
an important role in the life sciences.
• Different high-throughput sequencing
technologies are competing to be able to
sequence an individual human genome for
less than $1,000 within a few years.

2006.Mar.17 Vol.311 Science


Motivation

• The amount of data produced by HTS
technologies creates a significant
bioinformatics challenge: understanding,
storing and sharing the data.


Workflow
• Evaluate algorithms: Golomb-Rice, Elias Gamma, MOV, Huffman, ...
• Analysis datasets: Dataset1, Dataset2, Dataset3, ...
• Preliminary result: for location, for mismatch, ...


Coding Strategy

Optimal encoding of these integers from a
compression standpoint depends on their
distribution in order to assign shorter
binary codes to more probable symbols.
~ Shannon’s Entropy Coding Theory

Claude Shannon
1916-2001
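
As a rough illustration of the point above: the Shannon entropy H = -sum(p_i * log2(p_i)) is a lower bound on the average number of bits per symbol that any lossless code can reach. The sketch below computes it for a list of integers; the function name and the sample values are made up for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Return the entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical integer streams: a skewed one and a uniform one.
    skewed = [1, 1, 1, 1, 1, 1, 2, 3]
    uniform = [1, 2, 3, 4, 5, 6, 7, 8]
    print(shannon_entropy(skewed))   # low entropy: short codes pay off
    print(shannon_entropy(uniform))  # 3 bits per symbol: nothing to gain
```

A skewed distribution has low entropy, which is why giving short codes to the most probable integers saves space.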


Encoding Strategies
• Fixed Codes
  • Golomb-Rice Codes
  • Elias Gamma Codes
  • Monotone Value Codes
• Variable Codes
  • Huffman Code

Golomb-Rice Codes
Set m = 10 and encode n = 42.

Encoding of quotient part
q     output bits
0     0
1     10
2     110
3     1110
4     11110
5     111110
6     1111110
..    ..
N     <N repetitions of 1> followed by 0

Encoding of remainder part
(for r >= 6 the binary column holds r + 6)
r     binary   output bits
0     0000     000
1     0001     001
2     0010     010
3     0011     011
4     0100     100
5     0101     101
6     1100     1100
7     1101     1101
8     1110     1110
9     1111     1111

n = 42: q = floor(n/m) = 4, r = n mod m = 2
output is 11110 010, i.e. 11110010
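
The m = 10, n = 42 example can be reproduced with a small sketch. It assumes the quotient is written in unary (q ones and a closing zero) and the remainder in truncated binary, which matches the tables on this slide; the function name is made up for illustration.

```python
def golomb_encode(n: int, m: int) -> str:
    """Golomb code of a non-negative integer n with parameter m, as a bit string."""
    if n < 0 or m < 1:
        raise ValueError("need n >= 0 and m >= 1")
    q, r = divmod(n, m)
    unary = "1" * q + "0"            # quotient part
    b = (m - 1).bit_length()         # ceil(log2(m)); 0 when m == 1
    if b == 0:
        return unary                 # m == 1: the remainder is always 0
    cutoff = (1 << b) - m            # remainders below this use b - 1 bits
    if r < cutoff:
        remainder = format(r, f"0{b - 1}b")
    else:
        remainder = format(r + cutoff, f"0{b}b")
    return unary + remainder


if __name__ == "__main__":
    print(golomb_encode(42, 10))  # 11110010
```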

Elias Gamma Codes
number   2^n + offset   output
1        2^0 + 0        1
2        2^1 + 0        010
3        2^1 + 1        011
4        2^2 + 0        00100
5        2^2 + 1        00101
6        2^2 + 2        00110
7        2^2 + 3        00111
8        2^3 + 0        0001000
9        2^3 + 1        0001001
10       2^3 + 2        0001010
11       2^3 + 3        0001011
12       2^3 + 4        0001100
13       2^3 + 5        0001101
14       2^3 + 6        0001110
15       2^3 + 7        0001111
16       2^4 + 0        000010000
17       2^4 + 1        000010001

Example
42 = 2^5 + 10
00000101010
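
A minimal sketch of the table's rule: write n in binary and prefix it with one zero for every bit after the leading 1. The function names are made up for illustration.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is defined for n >= 1")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits: str) -> int:
    """Decode a single Elias gamma codeword from the start of bits."""
    zeros = 0
    while bits[zeros] == "0":        # count the leading zeros
        zeros += 1
    return int(bits[zeros:2 * zeros + 1], 2)


if __name__ == "__main__":
    print(elias_gamma_encode(42))             # 00000101010
    print(elias_gamma_decode("00000101010"))  # 42
```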

MOV Coding
number   2^n + offset   output
1        2^0 + 0        1
2        2^1 + 0        10
3        2^1 + 1        11
4        2^2 + 0        100
5        2^2 + 1        101
6        2^2 + 2        110
7        2^2 + 3        111
8        2^3 + 0        1000
9        2^3 + 1        1001
10       2^3 + 2        1010
11       2^3 + 3        1011
12       2^3 + 4        1100
13       2^3 + 5        1101
14       2^3 + 6        1110
15       2^3 + 7        1111
16       2^4 + 0        10000
17       2^4 + 1        10001

The code begins at the Elias Gamma code's significant 1-bit.

Decode:
10001 = 1 followed by {4 bits}
2^4 + (0001)_2 = 17
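
The relation between the two tables can be sketched as follows: a MOV codeword is the Elias Gamma codeword with its leading zeros dropped. The slide does not say how codeword boundaries are marked in a stream, so this sketch only handles one isolated codeword; the function names are made up for illustration.

```python
def mov_codeword(n: int) -> str:
    """Codeword from the MOV table: n in plain binary, starting at its leading 1."""
    if n < 1:
        raise ValueError("defined for n >= 1")
    return format(n, "b")


def mov_decode_codeword(bits: str) -> int:
    """Decode one isolated codeword: 2^k plus the k bits after the leading 1."""
    k = len(bits) - 1
    offset = int(bits[1:], 2) if k else 0
    return (1 << k) + offset


if __name__ == "__main__":
    word = mov_codeword(17)
    gamma = "0" * (len(word) - 1) + word  # matching Elias Gamma codeword
    print(word, gamma, mov_decode_codeword(word))
```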

Huffman Codes
“this is an example of a huffman tree”
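
A minimal sketch of how a Huffman code can be built for the slide's sentence, using a priority queue of partial code tables. The exact bit patterns depend on how ties are broken, so only the total length is stable; the function name is made up for illustration.

```python
import heapq
from collections import Counter


def huffman_code(text: str) -> dict:
    """Return a {symbol: bit string} table for the symbols of text."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:               # a single distinct symbol still needs one bit
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # the two lightest subtrees
        w2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t1.items()}
        merged.update({s: "1" + c for s, c in t2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    text = "this is an example of a huffman tree"
    table = huffman_code(text)
    encoded = "".join(table[ch] for ch in text)
    print(len(encoded), "bits instead of", 8 * len(text))
```

For this sentence an optimal Huffman code needs 135 bits in total, compared with 288 bits at 8 bits per character.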




Dataset1
• Retrotransposon Ty3 insertion sites in the
yeast genome.
• 6,439,584 reads of 19 bp.
• Highly clustered.
• High degree of repetition.
• At most two substitutions.
• Pie chart (labels 0, 1, 2): 0: 54%, 1: 14%, 2: 32%.


Dataset2

• In vivo binding site locations of the neuron-restrictive
silencer factor (NRSF) in humans.
• Mapped to hg18.
• 1,697,990 reads of 25 bp.
• At most two substitutions.
• Pie chart (labels 0, 1, 2): 0: 76%; the remaining 18% and 6%
are split between 1 and 2.


Dataset2 Nucleotide Substitutions


Dataset3
• Corresponds to a full diploid human
genome sequencing experiment for an
Asian individual.
• Large dataset; only the reads mapped to chr22 are used.
• 31,118,531 reads of 30-40 bp.
• Pie chart (labels 0, 1, 2): 0: 61%, 1: 20%, 2: 19%.




Alignment Result Example (Bowtie)
Fields of one alignment line:
• Name of read that aligned
• Strand
• Name of reference sequence where the alignment occurs
• 0-based offset into the forward reference strand
• Read sequence
• Read quality
• Value of ceiling
• Mismatch descriptors
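
A minimal sketch of reading one such alignment line. It assumes Bowtie's default tab-separated output with eight columns (read name, strand, reference name, 0-based offset, read sequence, read qualities, ceiling value, mismatch descriptors) and mismatch descriptors of the form offset:reference-base>read-base. The sample line, the class and the function names are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class Alignment(NamedTuple):
    read_name: str
    strand: str
    ref_name: str
    offset: int                                 # 0-based, forward strand
    sequence: str
    qualities: str
    ceiling: int
    mismatches: List[Tuple[int, str, str]]      # (offset, ref base, read base)


def parse_bowtie_line(line: str) -> Alignment:
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))          # mismatch column may be missing
    name, strand, ref, offset, seq, qual, ceiling, mm = fields[:8]
    mismatches = []
    for desc in filter(None, mm.split(",")):
        pos, change = desc.split(":")
        ref_base, read_base = change.split(">")
        mismatches.append((int(pos), ref_base, read_base))
    return Alignment(name, strand, ref, int(offset), seq, qual,
                     int(ceiling), mismatches)


if __name__ == "__main__":
    # Hypothetical alignment line, not taken from any of the datasets.
    sample = "read_1\t+\tchr22\t14430\tACGTACGTAC\tIIIIIIIIII\t0\t7:G>T"
    print(parse_bowtie_line(sample))
```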

Encoding Location Information
• Standalone: each column is encoded
independently.

• Combined: the chromosome, strand and
mismatch columns are combined and
compressed together.


Apply the Algorithms

• Elias Gamma (EG) Absolute
  • The reads cannot be sorted.
  • Applied to Dataset3.



• Elias Gamma Relative (REG)
  • The reads can be sorted; compression
  performance is much better.
  • Sort the location addresses and encode relative
  offsets instead of absolute addresses.
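
A minimal sketch of why relative addresses help: after sorting, the gaps between neighbouring locations are small numbers, and Elias Gamma gives small numbers short codes. Because Elias Gamma cannot encode 0, this sketch stores gap + 1 so that repeated locations still work; that offset, the function names and the sample positions are assumptions made for illustration.

```python
def elias_gamma(n: int) -> str:
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def encode_absolute(positions) -> str:
    """EG absolute: each location (>= 1) is coded as it is, in input order."""
    return "".join(elias_gamma(p) for p in positions)


def encode_relative(positions) -> str:
    """REG: sort, then code the gap to the previous location (plus 1)."""
    bits, prev = [], 0
    for p in sorted(positions):
        bits.append(elias_gamma(p - prev + 1))
        prev = p
    return "".join(bits)


def decode_relative(bits: str) -> list:
    positions, prev, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i + zeros] == "0":
            zeros += 1
        value = int(bits[i + zeros:i + 2 * zeros + 1], 2)
        prev += value - 1
        positions.append(prev)
        i += 2 * zeros + 1
    return positions


if __name__ == "__main__":
    locations = [1010, 1000, 1003, 1003]  # hypothetical read start positions
    print(len(encode_absolute(locations)), "bits absolute")
    print(len(encode_relative(locations)), "bits relative")
```

The price is that the original read order is lost, which is why this only applies when the reads can be sorted.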


• Relative Elias Gamma Indexed (REG Indexed)
  • Sort the reads and create an index file.
  • Combine chromosome, strand and
  mismatches, and compress them
  by relative location.
  • Cannot be applied to Dataset3.



• Monotone Value (MOV)
  • Sort the reads by chromosome and
  location.
  • Encode the absolute address.



• Huffman codes
  • Focused on the "relative" start position.
  • This algorithm has to store the
  Huffman tree for decompression.


Comments for Encoding Location
• REG is suitable for all three datasets.
• For Dataset1, the unique chromosome locations
are used and their frequencies are counted
for coding. REG is an ideal solution for
highly repetitive datasets.
• Huffman coding does not work well for Dataset1.


Encoding Mismatch Information
• Each read may contain 1 or 2 mismatches,
each with its nucleotide value.
• One line records the mismatch
information of a read. If there is no mismatch,
the line is left blank.


Mismatches of Dataset2
If the mismatch is at position 23 of a 25 bp read:

• From the start, the offset is 22: 10110 (5 bits).
• From the end, the offset is 2: 10 (2 bits).

Calculate the position from the end of the read.
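
The arithmetic of this example, as a small sketch. It assumes 1-based mismatch positions and a fixed read length of 25 bp as in Dataset2; the function name is made up for illustration.

```python
def offset_bits(position: int, read_length: int):
    """Return the binary offsets of a mismatch counted from the start and from the end."""
    from_start = position - 1            # 0-based offset from the first base
    from_end = read_length - position    # distance to the last base
    return format(from_start, "b"), format(from_end, "b")


if __name__ == "__main__":
    start_bits, end_bits = offset_bits(23, 25)
    print(start_bits, len(start_bits))   # 10110 5
    print(end_bits, len(end_bits))       # 10 2
```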


Nucleotide Substitution
• Use numbers instead of characters.
base   ASCII   ASCII binary   2-bit code
A      65      1000001        00
C      67      1000011        01
G      71      1000111        10
T      84      1010100        11
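
A minimal sketch of the 2-bit substitution next to the 7-bit ASCII form; the function names are made up for illustration.

```python
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}


def pack_bases(sequence: str) -> str:
    """Encode a nucleotide string with the 2-bit table from the slide."""
    return "".join(TWO_BIT[base] for base in sequence)


def ascii_bits(sequence: str) -> str:
    """Encode the same string as 7-bit ASCII, for comparison."""
    return "".join(format(ord(base), "07b") for base in sequence)


if __name__ == "__main__":
    seq = "GATC"
    print(pack_bases(seq), len(pack_bases(seq)))   # 10001101 8
    print(len(ascii_bits(seq)))                    # 28
```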

Combining Location and Mismatch
• Example mismatch descriptors: 19G, 30A, 34T.
• Count the frequencies, coding the location and
mismatch together.
• 19G: 00001010110 {11 bits}
• 19G: 10110 {5 bits}

Final Encoding

• Dataset1: mismatches take up most of the
space, because the data is already sorted.
• Dataset2: locations are sparse, so they take up
most of the storage.
• Dataset3: this dataset is balanced, because
it has full coverage of the genome.


Implementation

• Based on REG Indexed for location
information and combined encoding for
mismatch information.
• Pass 1: count the mismatches.
• Pass 2: actual encoding.
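
A rough sketch of a two-pass scheme of this shape. The slide only names the two passes, so the code table used here (the Elias Gamma code of each descriptor's frequency rank) is an assumption for illustration and not necessarily what GenCompress does; the function names and the sample descriptors are also made up.

```python
from collections import Counter


def elias_gamma(n: int) -> str:
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def two_pass_encode(descriptors):
    """Return (code table, encoded descriptors) for combined symbols such as '19G'."""
    # Pass 1: count how often each combined location+base symbol occurs.
    counts = Counter(descriptors)
    # Rank 1 is the most frequent symbol; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda d: (-counts[d], d))
    table = {d: elias_gamma(rank) for rank, d in enumerate(ranked, start=1)}
    # Pass 2: the actual encoding, using the table built in pass 1.
    return table, [table[d] for d in descriptors]


if __name__ == "__main__":
    table, encoded = two_pass_encode(["19G", "19G", "19G", "30A", "34T", "30A"])
    print(table)
    print(encoded)
```

The table has to be stored with the output so that the decoder can map codes back to descriptors.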


Result
Dataset1, size in bytes
Original           1,030,333,440
Best Compression   56,078,940
GenCompress        56,166,419
gzip               41,378,624
bzip2              42,233,336
7zip               30,651,664


Result
Dataset2, size in bytes
Original   353,181,920
gzip       95,688,992
bzip2      94,030,320
7zip       83,319,584


Result
Dataset3, size in bytes
Original   8,869,613,392
gzip       618,818,824
bzip2      955,061,616
7zip       411,811,520
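
To compare the three charts on one scale, the sizes can be turned into compression ratios (original size divided by compressed size). The sketch below uses only the byte counts listed on the result slides; the variable and function names are made up for illustration.

```python
# Sizes in bytes, copied from the three result slides.
SIZES = {
    "Dataset1": {"Original": 1_030_333_440, "Best Compression": 56_078_940,
                 "GenCompress": 56_166_419, "gzip": 41_378_624,
                 "bzip2": 42_233_336, "7zip": 30_651_664},
    "Dataset2": {"Original": 353_181_920, "gzip": 95_688_992,
                 "bzip2": 94_030_320, "7zip": 83_319_584},
    "Dataset3": {"Original": 8_869_613_392, "gzip": 618_818_824,
                 "bzip2": 955_061_616, "7zip": 411_811_520},
}


def compression_ratios(sizes: dict) -> dict:
    """Return original size / compressed size for every listed method."""
    original = sizes["Original"]
    return {name: original / size
            for name, size in sizes.items() if name != "Original"}


if __name__ == "__main__":
    for dataset, sizes in SIZES.items():
        for name, ratio in compression_ratios(sizes).items():
            print(f"{dataset} {name}: {ratio:.1f}x")
```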


Conclusion

• Any genome sequence can be used for
mapping the reads.
• In terms of time consumption,
GenCompress is worth using.


Compression Time (sec)
Bar values are listed in legend order.
            GenCompress   gzip   bzip2   7zip
Dataset1    20            10     78      107
Dataset2    5             13     20      77
Dataset3    111           70     422     447


Decompression Time (sec)
Bar values are listed in legend order.
            GenCompress   gzip   bzip2   7zip
Dataset1    2             2      7       4
Dataset2    1             1      4       2
Dataset3    15            13     53      21


Conclusion
• Hard drives are not expensive; the real cost is
bandwidth.
• Quality scores are not considered.
• The read identifier is also important.
• Mismatches may be contaminants or de
novo, or the reference sequence may be
unfinished.
• Only the best match is considered.

Conclusion
• Huffman tree in dataset 1 and 2.


My Comments
• They should release the source code.

• Hardware configuration:
why RAID1?


Thanks for your attention!

