Speaker: Eric C.Y., LEE
Advisor: I-Fang Chung

2011.Mar.21

Monday, March 21, 2011                             1
Outline

                   • Motivation
                   • Workflow
                   • Result
                   • Conclusion
                   • My Comment

Motivation
• High-throughput sequencing (HTS) technology now plays an important role in the life sciences.
• Different HTS technologies are competing to sequence an individual human genome for less than $1,000 within a few years.

(Science, Vol. 311, 17 March 2006)

Motivation

• The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing, and sharing the data.

Workflow
Evaluate algorithms   Analysis datasets   Preliminary result
Golomb-Rice           Dataset1            For location
Elias Gamma           Dataset2            For mismatch
MOV                   Dataset3            ...
Huffman               ...
...

Coding Strategy

Optimal encoding of these integers from a compression standpoint depends on their distribution, in order to assign shorter binary codes to more probable symbols.
   ~ Shannon’s entropy coding theory

Claude Shannon (1916~2001)
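The dependence on the distribution can be made concrete with Shannon entropy, which bounds from below the average number of bits per symbol that any such code can reach. The following is a minimal sketch; the function name `shannon_entropy` and the sample gap values are illustrative assumptions, not data from the talk.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Entropy, in bits per symbol, of the empirical distribution of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical gaps between sorted read locations: small gaps dominate,
# so a code that gives them short code words pays off.
gaps = [1, 1, 1, 1, 2, 2, 3, 8]
print(shannon_entropy(gaps))          # bits per gap for this toy distribution
print(math.log2(len(set(gaps))))      # bits per gap with a fixed-length code
```

A skewed distribution has lower entropy than a uniform one over the same symbols, which is the room that the codes in the next slides try to exploit.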




Encoding Strategies
• Fixed Codes
  • Golomb-Rice Codes
  • Elias Gamma Codes
  • Monotone Value Codes
• Variable Codes
  • Huffman Code

Golomb-Rice Codes
Set m = 10 and encode n = 42.

Encoding of quotient part
q     output bits
0     0
1     10
2     110
3     1110
4     11110
5     111110
6     1111110
..    ..
N     <N repetitions of 1> followed by 0

Encoding of remainder part
r     binary   output bits
0     0000     000
1     0001     001
2     0010     010
3     0011     011
4     0100     100
5     0101     101
6     1100     1100
7     1101     1101
8     1110     1110
9     1111     1111

For r >= 6 the binary column shows r + 6, the offset used by the truncated binary encoding.

n = 42: q = floor(n / m) = 4, r = n mod m = 2
Output is 11110 followed by 010, that is 11110010.
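The two tables above can be sketched in a few lines of Python: the quotient is written in unary and the remainder in truncated binary. The function name `golomb_encode` is an illustrative choice; the code is a sketch of the general technique, not the implementation evaluated in the talk.

```python
def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with divisor m (m >= 1)."""
    q, r = divmod(n, m)
    unary = "1" * q + "0"            # quotient: q ones, then a terminating zero
    if m == 1:
        return unary                 # no remainder bits are needed
    b = (m - 1).bit_length()         # bits needed for the largest remainder
    cutoff = (1 << b) - m            # remainders below cutoff use b - 1 bits
    if r < cutoff:
        binary = format(r, "b").zfill(b - 1)
    else:
        binary = format(r + cutoff, "b").zfill(b)
    return unary + binary

print(golomb_encode(42, 10))  # 11110010, matching the worked example
```

When m is a power of two the cutoff is zero and every remainder uses exactly b bits, which is the Rice special case.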
Elias Gamma Codes
number   2^n       output
1        2^0 + 0   1
2        2^1 + 0   010
3        2^1 + 1   011
4        2^2 + 0   00100
5        2^2 + 1   00101
6        2^2 + 2   00110
7        2^2 + 3   00111
8        2^3 + 0   0001000
9        2^3 + 1   0001001
10       2^3 + 2   0001010
11       2^3 + 3   0001011
12       2^3 + 4   0001100
13       2^3 + 5   0001101
14       2^3 + 6   0001110
15       2^3 + 7   0001111
16       2^4 + 0   000010000
17       2^4 + 1   000010001

Example: 42 = 2^5 + 10, output 00000101010
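The pattern in the table (as many leading zeros as there are bits after the leading 1, then the binary form of the number) can be sketched as follows. The function names are illustrative; the decoder assumes a well-formed concatenation of code words.

```python
def elias_gamma_encode(n):
    """Elias gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes are defined for n >= 1")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def elias_gamma_decode(bits):
    """Decode a concatenation of gamma code words into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # count the zeros that announce the length
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

print(elias_gamma_encode(42))  # 00000101010, matching the example
```

Because the zero prefix announces the length, code words can be concatenated without separators and still be split apart again.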
MOV Coding
number   2^n       output
1        2^0 + 0   1
2        2^1 + 0   10
3        2^1 + 1   11
4        2^2 + 0   100
5        2^2 + 1   101
6        2^2 + 2   110
7        2^2 + 3   111
8        2^3 + 0   1000
9        2^3 + 1   1001
10       2^3 + 2   1010
11       2^3 + 3   1011
12       2^3 + 4   1100
13       2^3 + 5   1101
14       2^3 + 6   1110
15       2^3 + 7   1111
16       2^4 + 0   10000
17       2^4 + 1   10001

Each code word begins with the Elias Gamma code’s significant 1-bit (the leading zeros are dropped).

Decode: 10001, a leading 1 followed by {4 bits}: 2^4 + (0001)_2 = 17
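What the table shows can be sketched as below: the code word is the gamma code without its zero prefix. How the boundaries between code words are recovered is not stated on the slide, so the decoder here assumes the word has already been isolated; both function names are hypothetical.

```python
def mov_encode(n):
    """Code word from the table: binary form of n starting at its leading 1-bit."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return format(n, "b")

def mov_decode(word):
    """Decode one isolated code word: 2^width plus the bits after the leading 1."""
    width = len(word) - 1                 # number of bits after the leading 1
    tail = int(word[1:], 2) if width else 0
    return (1 << width) + tail

print(mov_encode(17))       # 10001
print(mov_decode("10001"))  # 17 = 2^4 + 1
```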
Huffman Codes
      “this is an example of a huffman tree”
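A Huffman code for the example sentence can be built with a priority queue. This is a minimal sketch of the standard construction; the function name `huffman_codes` is illustrative, and the exact bit patterns depend on how ties are broken, although the total encoded length does not.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table {symbol: bit string} for the symbols in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:                        # degenerate case: a single symbol
        return {symbol: "0" for symbol in freq}
    # Heap entries are (weight, tie_breaker, partial code table); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "this is an example of a huffman tree"
codes = huffman_codes(text)
encoded = "".join(codes[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))  # 135 bits instead of 288
```

Frequent symbols such as the space end up with short code words, rare ones such as "x" with long ones. The code table (or the tree) has to be kept alongside the encoded data so that it can be decoded.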




Dataset1
• Retrotransposon Ty3 insertion sites in the yeast genome.
• 6,439,584 reads of 19 bp.
• Highly clustered.
• High degree of repetition.
• At most two substitutions per read.
• Mismatches per read (pie chart): 0: 54%, 1: 14%, 2: 32%.

Dataset2

• In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.
• Mapped to hg18.
• 1,697,990 reads of 25 bp.
• At most two substitutions per read.
• Mismatches per read (pie chart): 0: 76%, 1: 18%, 2: 6%.

Dataset2 Nucleotide Substitutions




Dataset3
• Corresponds to a full diploid human genome sequencing experiment for an Asian individual.
• Large dataset; only the reads mapped to chr22 are used.
• 31,118,531 reads of 30 to 40 bp.
• Mismatches per read (pie chart): 0: 61%, 1: 20%, 2: 19%.

Alignment Result Example
Columns of one alignment record, in order:
• Name of read that aligned
• Strand
• Name of reference sequence where the alignment occurs
• 0-based offset into the forward reference strand
• Read sequence
• Read quality
• Value of ceiling
• Mismatch descriptors

Source: Bowtie
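A record with these eight columns can be split into typed fields as sketched below. The sketch assumes Bowtie's default tab-separated output, with mismatch descriptors written as `offset:reference-base>read-base` and separated by commas; the sample line, the class name `BowtieHit`, and the function name are made up for illustration.

```python
from typing import List, NamedTuple, Tuple

class BowtieHit(NamedTuple):
    read_name: str
    strand: str
    reference: str
    offset: int                               # 0-based, forward reference strand
    sequence: str
    quality: str
    ceiling: int
    mismatches: List[Tuple[int, str, str]]    # (offset in read, ref base, read base)

def parse_bowtie_line(line: str) -> BowtieHit:
    """Parse one tab-separated alignment record into a BowtieHit."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (8 - len(fields))        # the mismatch column may be absent
    name, strand, ref, offset, seq, qual, ceiling, mm = fields[:8]
    mismatches = []
    for descriptor in filter(None, mm.split(",")):
        pos, bases = descriptor.split(":")
        ref_base, read_base = bases.split(">")
        mismatches.append((int(pos), ref_base, read_base))
    return BowtieHit(name, strand, ref, int(offset), seq, qual,
                     int(ceiling), mismatches)

# Hypothetical record, not taken from any of the datasets.
sample = "read_1\t+\tchr22\t14430073\tACGTACGTACGTACGTACGTACGTA\t" + "I" * 25 + "\t0\t22:G>A"
hit = parse_bowtie_line(sample)
print(hit.reference, hit.strand, hit.offset, hit.mismatches)
```

The location columns (reference, strand, offset) and the mismatch column are the parts the following slides compress.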
Encoding Location Information
• Standalone: encode each column independently.
• Combine: combine the chromosome, strand, and mismatch columns and compress them together.

Apply the Algorithms

• Elias Gamma (EG) Absolute
  • The sequences cannot be sorted.
  • Applied to Dataset3.

Apply the Algorithms

• Relative Elias Gamma (REG)
  • The sequences can be sorted, so compression performance is much better.
  • Sort by location address and encode relative addresses instead of absolute ones.
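The gain from relative addresses can be sketched by gamma-coding the gaps between sorted locations. Two details below are assumptions, not statements from the talk: each gap is stored as gap + 1 so that repeated locations (gap 0) remain encodable, and the first location is stored relative to 0. All names and the sample locations are hypothetical.

```python
def gamma(n):
    """Elias gamma code of a positive integer as a bit string."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary

def reg_encode(locations):
    """Sort the locations and gamma-code each gap to the previous one."""
    bits, previous = [], 0
    for loc in sorted(locations):
        bits.append(gamma(loc - previous + 1))   # assumption: store gap + 1
        previous = loc
    return "".join(bits)

def reg_decode(bits):
    """Rebuild the sorted locations from the concatenated gap codes."""
    locations, previous, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        gap = int(bits[i:i + zeros + 1], 2) - 1
        i += zeros + 1
        previous += gap
        locations.append(previous)
    return locations

locs = [1000, 1003, 1003, 1010, 5000]             # hypothetical read start positions
relative_bits = len(reg_encode(locs))
absolute_bits = sum(len(gamma(loc + 1)) for loc in locs)
print(relative_bits, "bits relative vs", absolute_bits, "bits absolute")
```

Clustered or repeated locations produce small gaps, and small numbers get short gamma codes, which is why sorting matters here.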



Apply the Algorithms
• Relative Elias Gamma Indexed (REG Indexed)
  • Sort the sequences and create an index file.
  • Combine chromosome, strand, and mismatches, and compress them by relative location.
  • Cannot be applied to Dataset3.

Apply the Algorithms

• Monotone Value (MOV)
  • Sort the sequences by chromosome and location.
  • Encode the absolute address.

Apply the Algorithms

• Huffman codes
  • Focused on the “relative” start position.
  • The Huffman tree has to be stored for decompression.

Comments for Encoding Location
• REG suits all three datasets.
• For Dataset1, use the unique chromosome locations and count their frequencies for coding. REG is an ideal solution for a highly repetitive dataset.
• Huffman coding is not good for Dataset1.

Encoding Mismatch Information
• Each read may contain 1 or 2 mismatches, each with its nucleotide value.
• One line records the mismatch information of a read. If there is no mismatch, the line is left blank.

Mismatches of Dataset2
If the mismatch is at position 23 of a 25 bp read:
• From the start it is 22: 10110
• From the end it is 2: 10
Calculate the position from the end of the reads.
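The saving in this example can be checked with a few lines of Python. The sketch assumes a 1-based mismatch position and the 25 bp read length of Dataset2, which reproduces the numbers on the slide; the helper name `bits_needed` is illustrative.

```python
READ_LENGTH = 25  # Dataset2 reads are 25 bp long

def bits_needed(value):
    """Number of bits in the plain binary form of a non-negative integer."""
    return max(1, value.bit_length())

position = 23                          # 1-based mismatch position in the read
from_start = position - 1              # 0-based offset from the start
from_end = READ_LENGTH - position      # distance from the end of the read

for label, value in (("from start", from_start), ("from end", from_end)):
    print(label, value, format(value, "b"), bits_needed(value), "bits")
```

Counting from the end pays off when mismatches concentrate near the end of the reads, because the stored numbers are then small.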




Nucleotide Substitution
• Use numbers instead of characters.

base   ASCII   ASCII (binary)   2-bit code
A      65      1000001          00
C      67      1000011          01
G      71      1000111          10
T      84      1010100          11
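The 2-bit mapping can be applied to a whole sequence as sketched below. The function names are illustrative, and the sketch handles only the four bases listed on the slide.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASE = {bits: base for base, bits in CODE.items()}

def encode_bases(sequence):
    """Replace each nucleotide with its 2-bit code."""
    return "".join(CODE[base] for base in sequence)

def decode_bases(bits):
    """Read the bit string two bits at a time and map back to nucleotides."""
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

print(encode_bases("ACGT"))                                      # 00011011
print(len(encode_bases("GATTACA")), "bits vs", 7 * len("GATTACA"), "bits as 7-bit ASCII")
```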
Combining Location and Mismatch
Example mismatches: 19G, 30A, 34T
Count the frequencies, coding the location and mismatch together.
19G: 00001010110 {11 bits}
19G: 10110 {5 bits}

Final Encoding

• Dataset1: mismatches take up most of the space, because the dataset is already sorted.
• Dataset2: locations are sparse, so they take up most of the storage.
• Dataset3: the dataset is balanced, because it has full coverage of the genome.

Implementation

                   • Based on REG indexed for location
                         information and combined encoding for
                         mismatch information.
                   • Pass1: Counting the mismatches.
                   • Pass2: Actual encoding.

Result
Dataset1 (bytes)
Original           1,030,333,440
Best Compression      56,078,940
GenCompress           56,166,419
gzip                  41,378,624
bzip2                 42,233,336
7zip                  30,651,664

Result
Dataset2 (bytes)
Original             353,181,920
Best Compression      35,983,322
GenCompress           36,099,480
gzip                  95,688,992
bzip2                 94,030,320
7zip                  83,319,584

Result
Dataset3 (bytes)
Original           8,869,613,392
Best Compression     390,541,330
GenCompress          390,541,330
gzip                 618,818,824
bzip2                955,061,616
7zip                 411,811,520
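The byte counts on the three result slides are easier to compare as compression ratios (original size divided by compressed size). A small sketch using the figures as they appear on the slides; the dictionary layout and the function name are illustrative choices.

```python
SIZES = {  # bytes, copied from the result slides
    "Dataset1": {"Original": 1_030_333_440, "GenCompress": 56_166_419,
                 "gzip": 41_378_624, "bzip2": 42_233_336, "7zip": 30_651_664},
    "Dataset2": {"Original": 353_181_920, "GenCompress": 36_099_480,
                 "gzip": 95_688_992, "bzip2": 94_030_320, "7zip": 83_319_584},
    "Dataset3": {"Original": 8_869_613_392, "GenCompress": 390_541_330,
                 "gzip": 618_818_824, "bzip2": 955_061_616, "7zip": 411_811_520},
}

def compression_ratios(entry):
    """Original size divided by compressed size, per tool."""
    original = entry["Original"]
    return {tool: original / size
            for tool, size in entry.items() if tool != "Original"}

for dataset, entry in SIZES.items():
    summary = ", ".join(f"{tool} {ratio:.1f}x"
                        for tool, ratio in compression_ratios(entry).items())
    print(f"{dataset}: {summary}")
```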

Conclusion

• Any genome sequence can be used for mapping the reads.
• In terms of time consumption, GenCompress is worth using.

Compression Time
Compression time (sec)
           GenCompress   gzip   bzip2   7zip
Dataset1   20            10     78      107
Dataset2   5             13     20      77
Dataset3   111           70     422     447

Decompression Time
Decompression time (sec)
           GenCompress   gzip   bzip2   7zip
Dataset1   2             2      7       4
Dataset2   1             1      4       2
Dataset3   15            13     53      21

Conclusion
• Hard drives are not expensive; the real cost is bandwidth.
• Quality scores are not considered.
• Read identifiers are also important.
• Mismatches may be contaminants or de novo, or the reference sequence may be unfinished.
• Only the best match is considered.

Conclusion
                   • Huffman tree in dataset 1 and 2.




My Comments
• They should release the tool as open source.
• Hardware configuration: why RAID1?

Thanks for your attention!



Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfhans926745
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Kürzlich hochgeladen (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Algorithm of NGS Data

  • 1. Speaker: Eric C.Y., LEE Advisor: I-Fang Chung 2011.Mar.21 Monday, March 21, 2011 1
  • 2. Outline
      • Motivation
      • Workflow
      • Result
      • Conclusion
      • My Comment
  • 3. Motivation
      • High-throughput sequencing (HTS) technologies now play an important role in the life sciences.
      • Different high-throughput sequencing technologies are competing to sequence an individual human genome for less than $1,000 within a few years. (Science, Vol. 311, 17 March 2006)
  • 4. Motivation
      • The amount of data produced by HTS technologies creates a significant bioinformatics challenge: understanding, storing and sharing the data.
  • 5. Workflow
      Evaluate algorithms   Analysis datasets   Preliminary result
      Golomb-Rice           Dataset1            For location
      Elias Gamma           Dataset2            For mismatch
      MOV                   Dataset3            ...
      Huffman               ...                 ...
  • 6. Coding Strategy
      Optimal encoding of these integers, from a compression standpoint, depends on their distribution: shorter binary codes are assigned to more probable symbols.
      ~ Shannon's Entropy Coding Theory (Claude Shannon, 1916-2001)
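As a rough illustration of why the distribution matters, the sketch below computes the Shannon entropy of a list of symbols, which is the lower bound (in bits per symbol) that an ideal code could reach. The function name `entropy_bits` and the sample values are assumptions made for this example, not part of the slides.

```python
import math
from collections import Counter


def entropy_bits(symbols):
    """Shannon entropy of the empirical distribution, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A skewed distribution needs fewer bits per symbol than a uniform one.
uniform = [0, 1, 2, 3] * 25
skewed = [0] * 76 + [1] * 18 + [2] * 6
print(entropy_bits(uniform), entropy_bits(skewed))
```

The more skewed the values are, the further the entropy falls below the fixed-width cost, which is the room that the codes on the following slides try to exploit.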
  • 7. Encoding Strategies
      • Fixed codes
          • Golomb-Rice codes
          • Elias Gamma codes
          • Monotone Value (MOV) codes
      • Variable codes
          • Huffman codes
  • 8. Golomb-Rice Codes
      Set m = 10 and encode n = 42.
      Encoding of the quotient part:
        q    output bits
        0    0
        1    10
        2    110
        3    1110
        4    11110
        5    111110
        6    1111110
        ..   ..
        N    <N repetitions of 1> followed by 0
      Encoding of the remainder part:
        r    binary   output bits
        0    0000     000
        1    0001     001
        2    0010     010
        3    0011     011
        4    0100     100
        5    0101     101
        6    1100     1100
        7    1101     1101
        8    1110     1110
        9    1111     1111
      n = 42: n / m gives q = 4, r = 2, so the output is 11110 010 = 11110010.
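The Golomb example on this slide can be sketched as follows. Because m = 10 is not a power of two, the remainder is written in truncated binary, which matches the 3-bit and 4-bit remainder codes in the slide's table. The function names are assumptions for this example, and the decoder assumes it is handed exactly one code word.

```python
def golomb_encode(n, m):
    """Golomb code of a non-negative integer n with divisor m >= 2."""
    if n < 0 or m < 2:
        raise ValueError("need n >= 0 and m >= 2")
    q, r = divmod(n, m)
    unary = "1" * q + "0"              # quotient: q ones, then a closing zero
    b = (m - 1).bit_length()           # ceil(log2(m))
    cutoff = (1 << b) - m              # remainders below this use b - 1 bits
    if r < cutoff:
        return unary + format(r, f"0{b - 1}b")
    return unary + format(r + cutoff, f"0{b}b")


def golomb_decode(bits, m):
    """Decode a single Golomb code word produced by golomb_encode."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    q = bits.index("0")                # length of the unary run
    pos = q + 1
    r = int(bits[pos:pos + b - 1] or "0", 2)
    if r >= cutoff:                    # long form: re-read b bits
        r = int(bits[pos:pos + b], 2) - cutoff
    return q * m + r


print(golomb_encode(42, 10))  # 11110010, as on the slide
```

When m is a power of two the cutoff is zero and every remainder takes exactly log2(m) bits, which is the Rice special case.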
  • 9. Elias Gamma Codes
        number   2^n       output
        1        2^0 + 0   1
        2        2^1 + 0   010
        3        2^1 + 1   011
        4        2^2 + 0   00100
        5        2^2 + 1   00101
        6        2^2 + 2   00110
        7        2^2 + 3   00111
        8        2^3 + 0   0001000
        9        2^3 + 1   0001001
        10       2^3 + 2   0001010
        11       2^3 + 3   0001011
        12       2^3 + 4   0001100
        13       2^3 + 5   0001101
        14       2^3 + 6   0001110
        15       2^3 + 7   0001111
        16       2^4 + 0   000010000
        17       2^4 + 1   000010001
      Example: 42 = 2^5 + 10, so the output is 00000101010.
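The table above follows one rule: write the number in binary and put one zero in front for every bit after the leading 1. A minimal sketch, with function names chosen for this example:

```python
def elias_gamma_encode(n):
    """Elias gamma code of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma is only defined for n >= 1")
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits):
    """Decode one Elias gamma code word from the start of a bit string."""
    zeros = len(bits) - len(bits.lstrip("0"))   # prefix length = exponent
    return int(bits[zeros:2 * zeros + 1], 2)


print(elias_gamma_encode(42))  # 00000101010, the slide's example
```

The zero prefix tells the decoder how many bits to read, so code words can be concatenated without separators.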
  • 10. MOV Coding
      The code begins at the significant 1-bit of the Elias Gamma code.
        number   2^n       output
        1        2^0 + 0   1
        2        2^1 + 0   10
        3        2^1 + 1   11
        4        2^2 + 0   100
        5        2^2 + 1   101
        6        2^2 + 2   110
        7        2^2 + 3   111
        8        2^3 + 0   1000
        9        2^3 + 1   1001
        10       2^3 + 2   1010
        11       2^3 + 3   1011
        12       2^3 + 4   1100
        13       2^3 + 5   1101
        14       2^3 + 6   1110
        15       2^3 + 7   1111
        16       2^4 + 0   10000
        17       2^4 + 1   10001
      Decode: 10001 {4 bits after the leading 1} = 2^4 + (0001)_2 = 17
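The decode step on this slide can be sketched as below. The slides do not say how the decoder learns where one MOV code ends and the next begins, so this sketch simply assumes each code word arrives already delimited; the function names are also assumptions.

```python
def mov_encode(n):
    """MOV code as shown on the slide: the Elias gamma code without its zero prefix."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return format(n, "b")


def mov_decode(code):
    """Decode one delimited MOV code word: 2^k plus the k bits after the leading 1."""
    k = len(code) - 1
    return (1 << k) + int(code[1:] or "0", 2)


print(mov_decode("10001"))  # 2^4 + (0001)_2 = 17
```

Compared with Elias gamma, each value saves its zero prefix, at the price of needing the code length from somewhere else.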
  • 11. Huffman Codes
      Example text: "this is an example of a huffman tree"
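A Huffman code for the example sentence can be built with a priority queue, as in the sketch below. The exact bit patterns depend on how ties are broken, so they may differ from the tree drawn on the slide, but the total encoded length is the same for any valid Huffman tree. The function name `huffman_code` is an assumption, and the sketch expects at least two distinct symbols.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Return a dict mapping each symbol of text to its Huffman bit string."""
    # Heap entries: (weight, tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {ch: ""}) for i, (ch, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # two lightest subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


text = "this is an example of a huffman tree"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))
```

Frequent symbols such as the space end up with short codes, while rare letters get longer ones; the code table itself has to be stored alongside the data for decoding.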
  • 12. Workflow (same overview as slide 5)
  • 13. Dataset1
      • Retrotransposon Ty3 insertion sites in the yeast genome.
      • 6,439,584 reads of 19 bp.
      • Highly clustered.
      • High degree of repetition.
      • At most two substitutions.
      Pie chart (labels 0 / 1 / 2, presumably substitutions per read): 0: 54%, 1: 14%, 2: 32%
  • 14. Dataset2
      • In vivo binding site locations of the neuron-restrictive silencer factor (NRSF) in humans.
      • Mapped to hg18.
      • 1,697,990 reads of 25 bp.
      • At most two substitutions.
      Pie chart (labels 0 / 1 / 2, presumably substitutions per read): 0: 76%, 1: 18%, 2: 6%
  • 16. Dataset3
      • Corresponds to a full diploid human genome sequencing experiment for an Asian individual.
      • Large dataset; only the reads mapped to chr. 22 are used.
      • 31,118,531 reads of 30 to 40 bp.
      Pie chart (labels 0 / 1 / 2, presumably substitutions per read): 0: 61%, 1: 20%, 2: 19%
  • 17. Workflow (same overview as slide 5)
  • 18. Alignment Result Example (Bowtie output; field labels from the figure)
      • Name of the read that aligned
      • Strand
      • Name of the reference sequence where the alignment occurs
      • 0-based offset into the forward reference strand
      • Read sequence
      • Read quality
      • Value of ceiling
      • Mismatch descriptors
  • 19. Encoding Location Information
      • Standalone: each column is encoded independently.
      • Combined: the chromosome, strand and mismatch columns are merged and compressed together.
  • 20. Apply the Algorithms
      • Elias Gamma Absolute (EG)
      • The reads cannot be sorted.
      • Applied to Dataset3.
  • 21. Apply the Algorithms
      • Elias Gamma Relative (REG)
      • The reads can be sorted, which gives much better compression performance.
      • After sorting, each location is stored relative to the previous one instead of as an absolute address.
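The idea behind REG can be sketched as: sort the locations, then Elias-gamma-code the gap to the previous location rather than the location itself. Elias gamma cannot represent 0, so this sketch stores each gap plus one to allow repeated locations; that offset, like the function names, is an assumption of this example rather than something stated in the slides.

```python
def gamma(n):
    """Elias gamma code of a positive integer."""
    binary = format(n, "b")
    return "0" * (len(binary) - 1) + binary


def encode_relative(locations):
    """Sort 0-based locations and gamma-code the gaps (stored as gap + 1)."""
    out, prev = [], 0
    for loc in sorted(locations):
        out.append(gamma(loc - prev + 1))
        prev = loc
    return "".join(out)


def decode_relative(bits):
    """Inverse of encode_relative: returns the sorted locations."""
    locations, prev, i = [], 0, 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":          # count the gamma prefix
            zeros += 1
            i += 1
        value = int(bits[i:i + zeros + 1], 2)
        i += zeros + 1
        prev += value - 1              # undo the + 1 offset
        locations.append(prev)
    return locations


reads = [1010, 1000, 1003, 5000, 1003]
relative = encode_relative(reads)
absolute = "".join(gamma(loc + 1) for loc in reads)
print(len(relative), "bits relative vs", len(absolute), "bits absolute")
```

Clustered or repeated locations produce small gaps, and small numbers get short gamma codes, which is why the relative variant suits highly clustered data.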
  • 22. Apply the Algorithms
      • Relative Elias Gamma Indexed (REG Indexed)
      • Sort the reads and create an index file.
      • Combine chromosome, strand and mismatches, and compress them by relative location.
      • Cannot be applied to Dataset3.
  • 23. Apply the Algorithms
      • Monotone Value (MOV)
      • Sort the sequences by chromosome and location.
      • Code the absolute address.
  • 24. Apply the Algorithms
      • Huffman codes
      • Focused on the "relative" start position.
      • This algorithm has to store the Huffman tree for decompression.
  • 25. Comments for encoding location
      • REG is suitable for all three datasets.
      • For Dataset1, the unique chromosome locations are used and their frequencies are counted for coding. REG is an ideal solution for a highly repetitive dataset.
      • The Huffman code does not work well for Dataset1.
  • 26. Encoding Mismatch Information
      • Each read may contain one or two mismatches, each with a nucleotide value.
      • One line is used to record the mismatch information of a read. If there is no mismatch, the line is left blank.
  • 27. Mismatches of Dataset2
      If the mismatch is at position 23:
        From the start, the offset is 22: 10110
        From the end, the offset is 2: 10
      • Calculate the position from the end of the read.
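The arithmetic on this slide can be checked with a short sketch. It assumes the 25 bp read length of Dataset2 and a 1-based mismatch position, which is how the slide's numbers (22 from the start, 2 from the end) appear to be derived; the function names are made up for this example.

```python
READ_LENGTH = 25  # Dataset2 reads are 25 bp long


def offset_from_start(position):
    """0-based offset of a 1-based mismatch position."""
    return position - 1


def offset_from_end(position, read_length=READ_LENGTH):
    """Distance of a 1-based mismatch position from the end of the read."""
    return read_length - position


start = offset_from_start(23)
end = offset_from_end(23)
print(format(start, "b"), format(end, "b"))  # 10110 and 10
```

Counting from the end pays off when mismatches concentrate near the end of the read, because the offsets become small numbers with short binary codes.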
  • 28. Nucleotide Substitution
      • Use numbers instead of characters.
      ASCII codes: A: 65 = 1000001, C: 67 = 1000011, G: 71 = 1000111, T: 84 = 1010100
      2-bit codes: A: 00, C: 01, G: 10, T: 11
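A minimal sketch of the 2-bit substitution, using the mapping from the slide. The helper names and the sample sequence are assumptions for this example.

```python
TWO_BIT = {"A": "00", "C": "01", "G": "10", "T": "11"}
FROM_BITS = {bits: base for base, bits in TWO_BIT.items()}


def pack(sequence):
    """Encode a string of A/C/G/T as a bit string with two bits per base."""
    return "".join(TWO_BIT[base] for base in sequence)


def unpack(bits):
    """Decode a bit string produced by pack back into bases."""
    return "".join(FROM_BITS[bits[i:i + 2]] for i in range(0, len(bits), 2))


packed = pack("GATTACA")
print(packed, len(packed), "bits vs", 7 * len("GATTACA"), "bits as 7-bit ASCII")
```

With only four possible symbols, two bits per base are enough, compared with the seven bits of each ASCII code listed on the slide.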
  • 29. Combining Location and Mismatch
      Example mismatch descriptors: 19G, 30A, 34T
      • Count the frequencies and code the location and the mismatch together.
      19G: 00001010110 {11 bits}
      19G: 10110 {5 bits}
  • 30. Final Encoding
      • Dataset1: mismatches take up most of the space, because the data is already sorted.
      • Dataset2: the locations are sparse, so they take up a large share of the storage.
      • Dataset3: this dataset is balanced, because it has full coverage of the genome.
  • 31. Implementation
      • Based on REG Indexed for the location information and on combined encoding for the mismatch information.
      • Pass 1: count the mismatches.
      • Pass 2: perform the actual encoding.
  • 32. Result: Dataset1 (bytes)
        Original           1,030,333,440
        Best Compression      56,078,940
        GenCompress           56,166,419
        gzip                  41,378,624
        bzip2                 42,233,336
        7zip                  30,651,664
  • 33. Result: Dataset2 (bytes)
        Original             353,181,920
        Best Compression      35,983,322
        GenCompress           36,099,480
        gzip                  95,688,992
        bzip2                 94,030,320
        7zip                  83,319,584
  • 34. Result: Dataset3 (bytes)
        Original           8,869,613,392
        Best Compression     390,541,330
        GenCompress          390,541,330
        gzip                 618,818,824
        bzip2                955,061,616
        7zip                 411,811,520
  • 35. Conclusion
      • Any genome sequence can be used for mapping the reads.
      • In terms of time consumption, GenCompress is worth using.
  • 36. Compression Time
      [Bar chart: compression time in seconds (axis 0 to 500) of GenCompress, gzip, bzip2 and 7zip on Dataset1, Dataset2 and Dataset3. Bar labels as extracted: 20, 10, 78, 107, 5, 13, 20, 77, 111, 70, 422, 447.]
  • 37. Decompression Time
      [Bar chart: decompression time in seconds (axis 0 to 60) of GenCompress, gzip, bzip2 and 7zip on Dataset1, Dataset2 and Dataset3. Bar labels as extracted: 2, 2, 7, 4, 1, 1, 4, 2, 15, 13, 53, 21.]
  • 38. Conclusion
      • Hard drives are not expensive; the real cost is bandwidth.
      • The method does not consider the quality scores.
      • The read identifier is also important.
      • Mismatches may come from contaminants or de novo variation, or the reference sequence may be unfinished.
      • Only the best match is considered.
  • 39. Conclusion
      • Huffman tree in Dataset1 and Dataset2.
  • 40. My Comments
      • The authors should release the tool as open source.
      • Hardware configuration: why RAID1?
  • 41. Thanks for your attention!