Genome Assembly Forensics and Visualisation


                                  Nathan S. Watson-Haigh

                                 Fri 11th May 2012, ACPFG Journal Club




Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34.
Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55.
Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.
Overview


•   Genome Assembly
•   N50/N90/N95
•   Paired-end and Matepair Reads
•   Mis-assembly Signatures
•   Assembly Validation and Manual Editing
Genome Assembly – Shotgun Reads



                    DNA being sequenced

                    aligned shotgun reads
Genome Assembly – Repeats




[Figure labels: double coverage; reads from different repeats can’t be resolved]
Genome Assembly – Diploid
Assembly Metrics – N50


• The N50 is the most widely reported metric for de
  novo assemblies
• It is a single-number summary of the contig length
  distribution of an assembly
  – If contigs are sorted in descending order of length, the
    N50 is the length of the contig at which the running
    total first reaches at least 50% of the total length of
    all the contigs
  – Commonly reported with the N90 and N95
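
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not taken from the slides: the function name `nx` and the toy contig lengths are assumptions made for the example.

```python
def nx(contig_lengths, fraction):
    """Return the Nx length: walking the contigs from longest to shortest,
    the length of the contig at which the running total first reaches
    `fraction` of the total assembly length."""
    if not contig_lengths:
        raise ValueError("no contigs given")
    ordered = sorted(contig_lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length


# Hypothetical contig lengths (bp) for a toy assembly totalling 1,000 bp.
contigs = [300, 250, 200, 100, 80, 50, 20]
n50 = nx(contigs, 0.50)  # 300 + 250 = 550 >= 500, so N50 = 250
n90 = nx(contigs, 0.90)  # running total reaches 930 >= 900 at the 80 bp contig
n95 = nx(contigs, 0.95)  # running total reaches 980 >= 950 at the 50 bp contig
print(n50, n90, n95)
```

Note that N90 and N95 are always less than or equal to the N50, because more of the short contigs have to be included to reach the larger fraction.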
[Figure legend: markers for N50, N90 and N95]
• These stats DO NOT imply anything about
  assembly quality
  – Could simply concatenate contigs together to get a
    better N50!!
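
A small sketch of why a larger N50 says nothing about correctness: gluing contigs end to end, with no evidence that they are adjacent, inflates the metric. The helper `n50` and the contig lengths are hypothetical, chosen only for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total (longest first)
    first reaches half of the total assembly length."""
    ordered = sorted(contig_lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical assembly: seven contigs totalling 1,000 bp.
original = [300, 250, 200, 100, 80, 50, 20]

# "Improve" the assembly by concatenating everything into one sequence.
# No bases were added and nothing was validated.
glued = [sum(original)]

before = n50(original)
after = n50(glued)
print(before, after)
```

The total assembly length is unchanged, yet the N50 jumps from 250 to 1,000, which is why the statistic has to be read alongside validation evidence such as read-pair consistency.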
Paired-end Reads
Matepair Reads
Paired-end and Matepair Reads



Paired-end             Matepair




                           reverse
                           complement
So, Why are Pairs so Useful?
Pairs are Useful – Orientation and
Separation




Incorrect orientation
Incorrect distance
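
The two checks named above (orientation and separation) can be sketched as follows. This is an illustrative sketch only: the `PlacedPair` type, the expectation that reads point towards each other, and the cut-off of 3 standard deviations are all assumptions, not something specified in the slides. Libraries with a different expected orientation would need the orientation test adjusted.

```python
from dataclasses import dataclass


@dataclass
class PlacedPair:
    """A read pair placed on one contig (hypothetical structure)."""
    start_a: int      # leftmost contig coordinate of read A
    end_a: int        # rightmost contig coordinate of read A
    forward_a: bool   # True if read A aligns to the forward strand
    start_b: int
    end_b: int
    forward_b: bool


def classify_pair(pair, lib_mean, lib_sd, max_sd=3.0):
    """Label a pair as consistent with the library, or name the problem."""
    # Work out which read is leftmost on the contig.
    if pair.start_a <= pair.start_b:
        left_fwd, right_fwd = pair.forward_a, pair.forward_b
    else:
        left_fwd, right_fwd = pair.forward_b, pair.forward_a

    # Assumed expectation: left read forward, right read reverse.
    if not (left_fwd and not right_fwd):
        return "incorrect orientation"

    # Separation measured as the outer distance spanned by the pair.
    insert = max(pair.end_a, pair.end_b) - min(pair.start_a, pair.start_b)
    if abs(insert - lib_mean) > max_sd * lib_sd:
        return "incorrect distance"
    return "consistent"


# Hypothetical 3 kb library with a 300 bp standard deviation.
good = PlacedPair(100, 200, True, 3000, 3100, False)
flipped = PlacedPair(100, 200, True, 3000, 3100, True)
squeezed = PlacedPair(100, 200, True, 900, 1000, False)

for p in (good, flipped, squeezed):
    print(classify_pair(p, lib_mean=3000, lib_sd=300))
```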
Mis-assembly Signatures –
Collapsed Tandem Repeat

         Correct alignment




        Incorrect alignment
Mis-assembly Signatures –
Collapsed Tandem Repeat

         Correct assembly




          Mis-assembly
Mis-assembly Signatures –
Collapsed (small) Tandem Repeat

         Correct assembly




          Mis-assembly
Mis-assembly Signatures –
Collapsed Repeat

         Correct assembly




          Mis-assembly
Mis-assembly Signatures –
Rearrangement

         Correct assembly




          Mis-assembly
Automated Assemblies Are One
    Thing, Good Assemblies Are Another

• Given the computer resources you can generate
  an automated assembly in a few weeks
  – Not necessarily good
  – Need to optimise assembly parameters
• For small genomes (< ~15 Mbases)
  – Commodity hardware
  – OLC (overlap-layout-consensus) assemblers
• For larger genomes
  – More RAM (10s to 100s of Gbytes) for OLC assemblers
  – De Bruijn graph assemblers
  – Read mapping step to generate contig read alignments

• Automated assemblies need to be checked for
  mis-assemblies
  – Need paired-end/matepair reads
  – Need viewers to visualise paired-end data
  – Need editors to break/join/reassemble parts of the
    assembly deemed to be inconsistent with read pair info
  – Need enough computer hardware to allow all this data to
    be loaded – especially with large volumes of Illumina
    paired-end data

• Very time consuming and laborious to check/edit
  – Small assemblies (< ~15 Mbases)
     • Several weeks to a few months, working through 1
       scaffold/contig at a time
  – Large assemblies need a team to do the same thing
     • Need enough RAM to load all the paired-end data
     • Need ways to identify regions requiring closer inspection,
       i.e. possible mis-assemblies
• Major hurdles
  – Software inadequacies
  – Time
  – File formats! Grrrr!
Software Inadequacies

Software     Contig View   Scaffold View   Editing   Reassemble   Clipping Info   Other
SeqMan Pro   9             9               6         6            6               $$; buggy; not for large assemblies (32-bit); 1 template size
Gap5         6             NA              9         NA           8               Free; join editor; contig comparator; poor visual support for many contigs; shuffle pads; ACE; multiple template sizes
Consed       6             6               5         9            6               Free/US$2500/US$10k; poor visual support for many contigs; multiple template sizes
Hawkeye      9             9               NA        NA           7               Leverages AMOS; automated detection of mis-assemblies; large assemblies; modular
SeqMan Pro – Strategy View
SeqMan Pro
Gap5 – Template View
Gap5 – Contig Comparator
Gap5 – Join Editor
Gap5 – Contig Editor
Consed – Assembly View
Consed – Contig Viewer/Editor
Scaffold/Contig Length Distribution
Library Stats
Compression-Expansion (CE) Statistic

• A measure of how far the local distribution of insert
  sizes deviates from the global (library-wide)
  distribution of insert sizes
  – 0 indicates no deviation
  – ≤ −3 indicates strong compression
  – ≥ 3 indicates strong expansion
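
A minimal sketch of the CE statistic as it is commonly formulated: a z-score comparing the mean insert size of the pairs spanning a locus with the library mean, scaled by the standard error. The function name and the example insert sizes are assumptions for illustration; this is not code from AMOS or Hawkeye.

```python
import math


def ce_statistic(local_inserts, global_mean, global_sd):
    """Z-score of the local mean insert size against the library mean.

    local_inserts: insert sizes of the pairs spanning the locus of interest.
    global_mean, global_sd: mean and standard deviation for the whole library.
    """
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this locus")
    local_mean = sum(local_inserts) / n
    return (local_mean - global_mean) / (global_sd / math.sqrt(n))


# Hypothetical 3 kb library with a 300 bp standard deviation.
compressed = ce_statistic([2500, 2600, 2700] * 3, 3000, 300)
expanded = ce_statistic([3300, 3400, 3500] * 3, 3000, 300)
neutral = ce_statistic([2900, 3000, 3100] * 3, 3000, 300)
print(compressed, expanded, neutral)
```

With nine spanning inserts the standard error is 100 bp, so a local mean of 2,600 bp gives a CE value of −4 (compression, as expected over a collapsed repeat) and a local mean of 3,400 bp gives +4 (expansion).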
[Figure panels: Insert Coverage; Read Coverage]
[Figure panels: 500bp inserts; 3kb inserts; 20kb inserts]
AMOSvalidate


• An assembly analysis pipeline to identify possible
  mis-assemblies
   – Paired-end data
      • CE stats
      • Incorrect orientation
      • Missing mate
   – Coverage
   – SNP density
   – Singletons
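
As an illustration of the coverage signal in the list above, the sketch below flags windows whose mean read depth is well above the typical depth, as would be expected where two repeat copies have been collapsed into one. It is a toy example and not how AMOSvalidate is implemented: the function name, the window size and the 1.8× ratio are all assumptions.

```python
from statistics import median


def flag_high_coverage(depth, window=100, ratio=1.8):
    """Return (start, end) windows whose mean depth is at least
    `ratio` times the median depth of the whole contig."""
    typical = median(depth)
    flagged = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        if sum(chunk) / len(chunk) >= ratio * typical:
            flagged.append((start, start + len(chunk)))
    return flagged


# Hypothetical per-base depth: 20x background with a 100 bp stretch at 40x.
depth = [20] * 300 + [40] * 100 + [20] * 300
regions = flag_high_coverage(depth)
print(regions)
```

In practice a flagged window is only a candidate for inspection; the slides list coverage as one of several signals to be combined with the read-pair evidence.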
Hawkeye Cons


• Poor support for correcting mis-assemblies once
  detected
Closing Remarks


• Software exists to allow manual editing of
  assemblies
  – Time consuming
  – Different tools have different features
  – Most fall over with assemblies > ~15 Mbases or with
    many contigs/scaffolds (10k-100k)
• Ideal Tool
  – Contig/scaffold viewer capable of displaying
    compressed/expanded mates, and which contigs mates
    map to when they fall off the contig/scaffold (like
    SeqMan Pro and Hawkeye)
  – Contig join editor for manual alignment and editing of
    contigs (like Gap5)
  – Visualise clipped regions with consensus mismatches
    (like Gap5)
  – Automated analysis of assembly to identify regions
    requiring attention (like AMOSvalidate) and a way to
    navigate to those regions for editing
  – Minimise mouse-clicks and keyboard presses!!
Newbler Plant Genome Assemblies


• Pretty conservative in contig construction
• Seems to split out repetitive regions into their
  own contigs pretty well
• Heterozygosity issues
   – SNP alignment issues
   – Indels break contigs
   – Hidden in clipped regions
   – Manual joining of neighbouring contigs can reduce
     scaffolded contig numbers by 60-70%
   – Many unscaffolded contigs have high sequence similarity
     to scaffolded contigs – could collapse these and reduce
     the number of unscaffolded contigs by 50%
Gap5 – Contig Editor

More Related Content

Viewers also liked

Dna sequencing powerpoint
Dna sequencing powerpointDna sequencing powerpoint
Dna sequencing powerpoint
14cummke
 
State of the Cloud 2017
State of the Cloud 2017State of the Cloud 2017
State of the Cloud 2017
Bessemer Venture Partners
 

Viewers also liked (8)

IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent DataIonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
IonGAP - an Integrated Genome Assembly Platform for Ion Torrent Data
 
Bio153 microbial genomics 2012
Bio153 microbial genomics 2012Bio153 microbial genomics 2012
Bio153 microbial genomics 2012
 
Dna sequencing
Dna    sequencingDna    sequencing
Dna sequencing
 
DNA Sequencing
DNA SequencingDNA Sequencing
DNA Sequencing
 
DNA Sequencing : Maxam Gilbert and Sanger Sequencing
DNA Sequencing : Maxam Gilbert and Sanger SequencingDNA Sequencing : Maxam Gilbert and Sanger Sequencing
DNA Sequencing : Maxam Gilbert and Sanger Sequencing
 
Ngs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challengesNgs de novo assembly progresses and challenges
Ngs de novo assembly progresses and challenges
 
Dna sequencing powerpoint
Dna sequencing powerpointDna sequencing powerpoint
Dna sequencing powerpoint
 
State of the Cloud 2017
State of the Cloud 2017State of the Cloud 2017
State of the Cloud 2017
 

Similar to Genome Assembly Forensics

20110524zurichngs 2nd pub
20110524zurichngs 2nd pub20110524zurichngs 2nd pub
20110524zurichngs 2nd pub
sesejun
 
Under the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS ResearchersUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers
Golden Helix Inc
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
guest18a0f1
 
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
DevOpsDays Tel Aviv
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Cal Henderson
 
Domino server and application performance in the real world
Domino server and application performance in the real worldDomino server and application performance in the real world
Domino server and application performance in the real world
dominion
 

Similar to Genome Assembly Forensics (20)

20110524zurichngs 2nd pub
20110524zurichngs 2nd pub20110524zurichngs 2nd pub
20110524zurichngs 2nd pub
 
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
Webinar slides: The Holy Grail Webinar: Become a MySQL DBA - Database Perform...
 
Speed up sql
Speed up sqlSpeed up sql
Speed up sql
 
Under the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS ResearchersUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
10 Devops-Friendly Database Must-Haves - Dor Laor, ScyllaDB - DevOpsDays Tel ...
 
Webinar slides: Top 9 Tips for building a stable MySQL Replication environment
Webinar slides: Top 9 Tips for building a stable MySQL Replication environmentWebinar slides: Top 9 Tips for building a stable MySQL Replication environment
Webinar slides: Top 9 Tips for building a stable MySQL Replication environment
 
Cassandra Anti-Patterns
Cassandra Anti-PatternsCassandra Anti-Patterns
Cassandra Anti-Patterns
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
Performance By Design
Performance By DesignPerformance By Design
Performance By Design
 
Mysql talk
Mysql talkMysql talk
Mysql talk
 
MarvinSketch and MarvinView: Tips And Tricks: US UGM 2008
MarvinSketch and MarvinView: Tips And Tricks: US UGM 2008MarvinSketch and MarvinView: Tips And Tricks: US UGM 2008
MarvinSketch and MarvinView: Tips And Tricks: US UGM 2008
 
Oslo bekk2014
Oslo bekk2014Oslo bekk2014
Oslo bekk2014
 
groovy and concurrency
groovy and concurrencygroovy and concurrency
groovy and concurrency
 
Template Building Workshop
Template Building WorkshopTemplate Building Workshop
Template Building Workshop
 
Bridging the gap between designers and developers at the Guardian
Bridging the gap between designers and developers at the GuardianBridging the gap between designers and developers at the Guardian
Bridging the gap between designers and developers at the Guardian
 
Domino server and application performance in the real world
Domino server and application performance in the real worldDomino server and application performance in the real world
Domino server and application performance in the real world
 

Recently uploaded

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
heathfieldcps1
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
QucHHunhnh
 

Recently uploaded (20)

Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 

Genome Assembly Forensics

  • 1. Genome Assembly Forensics and Visualisation Nathan S. Watson-Haigh Fri 11th May 2012, ACPFG Journal Club Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34. Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55. Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.
  • 2. Overview • Genome Assembly • N50/N90/N95 • Paired-end and Matepair Reads • Mis-assembly Signatures • Assembly Validation and Manual Editing
  • 3. Genome Assembly – Shotgun Reads DNA being sequenced aligned shotgun reads
  • 6. Genome Assembly – Repeats reads from different double coverage repeats can’t be resolved
  • 9. Assembly Metrics – N50 • The N50 is the most widely reported metric for de novo assemblies • It is a single measure of the contig length size distribution of an assembly – If contigs are sorted into descending length order, the n50 is the size of the contig above which the assembly contains at least 50% of the total length of all the contigs – Commonly reported with the N90 and N95
  • 10. Assembly Metrics – N50 + = N50 + = N90 + = N95
  • 11. Assembly Metrics – N50 • The N50 is the most widely reported metric for de novo assemblies • It is a single measure of the contig length size distribution of an assembly – If contigs are sorted into descending length order, the n50 is the size of the contig above which the assembly contains at least 50% of the total length of all the contigs – Commonly reported with the N90 and N95 • These stats DO NOT imply anything about assembly quality – Could simply concatenate contigs together to get a better N50!!
  • 14. Paired-end and Matepair Reads Paired-end Matepair reverse compliment
  • 15. So, Why are Pairs so Useful?
  • 16. So, Why are Pairs so Useful?
  • 17. Pairs are Useful – Orientation and Separation
  • 18. Pairs are Useful – Orientation and Separation
  • 19. Pairs are Useful – Orientation and Separation
  • 20. Pairs are Useful – Orientation and Separation
  • 21. Pairs are Useful – Orientation and Separation Incorrect orientation Incorrect distance
  • 22. Mis-assembly Signatures – Collapsed Tandem Repeat Correct alignment Incorrect alignment
  • 23. Mis-assembly Signatures – Collapsed Tandem Repeat Correct assembly Mis-assembly
  • 24. Mis-assembly Signatures – Collapsed (small) Tandem Repeat Correct assembly Mis-assembly
  • 25. Mis-assembly Signatures – Collapsed Repeat Correct assembly Mis-assembly
  • 26. Mis-assembly Signatures – Rearrangement Correct assembly Mis-assembly
  • 27. Automated Assemblies Are One Thing, Good Assemblies Are Another • Given the computer resources you can generate an automated assembly in a few weeks – Not necessarily good – Need to optimise assembly parameters • For small organisms (< ~15Mbases) – Commodity hardware – OLC assemblers • For larger genomes – More RAM (10-100’s Gbytes) for OLC assemblers – De Bruijin Graph assemblers – Read Mapping step to generate contig read alignments
  • 28. Automated Assemblies Are One Thing, Good Assemblies Are Another • Automated assemblies need to be checked for mis-assemblies – Need paired-end/matepair reads – Need viewers to visualise paired-end data – Need editors to break/join/reassemble parts of the assembly deemed to be inconsistent with read pair info – Need enough computer hardware to allow all this data to be loaded – especially with large volumes of Illumina paired-end data
  • 29. Automated Assemblies Are One Thing, Good Assemblies Are Another • Very time consuming and laborious to check/edit – Small assemblies (< ~15Mbases) • Several weeks/few months to move 1 scaffold/contig at a time – Large assemblies need a team to do the same thing • Need enough RAM to load all the paired-end data • Need ways to identify regions requiring closer inspection • identify possible mis-assemblies • Major hurdles – Software inadequacies – Time – File formats! Grrrr!
  • 30. Software Inadequacies
      Software   | Contig View | Scaffold View | Editing | Reassemble | Clipping Info | Other
      SeqMan Pro | 9           | 9             | 6       | 6          | 6             | $$, buggy, not for large assemblies (32bit), 1 template size
      Gap5       | 6           | NA            | 9       | NA         | 8             | Free, join editor, contig comparator, poor visual support for many contigs, shuffle pads, ACE, multiple template sizes
      Consed     | 6           | 6             | 5       | 9          | 6             | Free/US$2500/US$10k, poor visual support for many contigs, multiple template sizes
      Hawkeye    | 9           | 9             | NA      | NA         | 7             | Leverages AMOS, automated detection of mis-assemblies, large assemblies, modular
  • 31. SeqMan Pro – Strategy View
  • 35. Gap5 – Contig Comparator
  • 36. Gap5 – Join Editor
  • 37. Gap5 – Contig Editor
  • 40. Consed – Contig Viewer/Editor
  • 50. Compression-Expansion (CE) Statistic • A measure of the deviation of the local distribution of insert sizes from the global distribution of insert sizes – 0 indicates no deviation – ≤ -3 indicates much compression – ≥ 3 indicates much expansion
  • 51. Insert Coverage Read Coverage
  • 52. 500bp inserts 3kb inserts 20kb inserts
  • 53. AMOSvalidate • An assembly analysis pipeline to identify possible mis-assemblies – Paired-end data • CE stats • Incorrect orientation • Missing mate – Coverage – SNP density – Singletons
  • 55. Hawkeye Cons • Poor support for correcting mis-assemblies once detected
  • 57. Closing Remarks • Software exists to allow manual editing of assemblies – Time consuming – Different tools have different features – Most fall over with assemblies > ~15Mbases or with many contigs/scaffolds (10k-100k)
  • 58. Closing Remarks • Ideal Tool – Contig/scaffold viewer capable of displaying compressed/expanded mates, which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye)
  • 61. Closing Remarks • Ideal Tool – Contig/scaffold viewer capable of displaying compressed/expanded mates, which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye) – Contig join editor for manual alignment and editing of contigs (like Gap5)
  • 62. Gap5 – Join Editor
  • 63. Closing Remarks • Ideal Tool – Contig/scaffold viewer capable of displaying compressed/expanded mates, which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye) – Contig join editor for manual alignment and editing of contigs (like Gap5) – Visualise clipped regions with consensus mismatches (like Gap5)
  • 64. Gap5 – Contig Editor
  • 65. Closing Remarks • Ideal Tool – Contig/scaffold viewer capable of displaying compressed/expanded mates, which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye) – Contig join editor for manual alignment and editing of contigs (like Gap5) – Visualise clipped regions with consensus mismatches (like Gap5) – Automated analysis of assembly to identify regions requiring attention (like AMOSvalidate) and a way to navigate to those regions for editing – Minimise mouse-clicks and keyboard presses!!
  • 67. Newbler Plant Genome Assemblies • Pretty conservative in contig construction • Seems to split out repetitive regions into their own contigs pretty well • Heterozygosity issues – SNP alignment issues – Indels break contigs – Hidden in clipped regions – Manual joining of neighbouring contigs can reduce scaffolded contig numbers by 60-70% – Many unscaffolded contigs have high sequence similarity to scaffolded contigs – could collapse these and reduce the number of unscaffolded contigs by 50%
  • 68. Gap5 – Contig Editor
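
The Compression-Expansion (CE) statistic on slide 50 can be illustrated with a short sketch. It assumes the usual z-score form, CE = (local mean insert size - library mean) / (library standard deviation / sqrt(N)), where N is the number of inserts spanning the position. The function names and the default threshold of 3 are illustrative choices, not part of AMOS or Hawkeye.

```python
import math
from statistics import mean


def ce_statistic(local_inserts, lib_mean, lib_sd):
    """CE statistic for the inserts spanning one position.

    local_inserts: observed insert sizes of mate pairs spanning the position
    lib_mean, lib_sd: global mean and standard deviation of the library
    (Hypothetical helper; assumes the z-score form described above.)
    """
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this position")
    if lib_sd <= 0:
        raise ValueError("library standard deviation must be positive")
    standard_error = lib_sd / math.sqrt(n)
    return (mean(local_inserts) - lib_mean) / standard_error


def classify_ce(ce, threshold=3.0):
    """Label a CE value using the +/- 3 rule of thumb from the slide."""
    if ce <= -threshold:
        return "compression"
    if ce >= threshold:
        return "expansion"
    return "consistent"


if __name__ == "__main__":
    # Nine 3 kb-library inserts that all look about 500 bp too short.
    ce = ce_statistic([2500] * 9, lib_mean=3000, lib_sd=300)
    print(ce, classify_ce(ce))
```

Because the standard error shrinks with the number of spanning inserts, the same shift in mean insert size gives a stronger signal at higher insert coverage, which is why coverage matters when separating real mis-assemblies from noise.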

Editor's Notes

  1. The DNA shown doesn’t imply that we are doing mapping, but is shown to exemplify what the correct answer should be for the de novo assembly.
  2. A repeat is: DNA with almost identical sequence. A repeat can arise through ...
  3. Double coverage
  4. Double coverage. Red line = consensus mismatches.
  5. Complicated by higher levels of heterozygosity. If this gets high enough, assemblers tend to split the different alleles. This is further exacerbated by heterozygous indels.
  6. Some genome assemblers can output the actual read alignments used in the generation of contigs. However, most do not. To generate an approximation to the alignment the assembler might have used, you need to map the reads back to the contig consensus sequences that the assembler generated. So for repeat-rich and highly heterozygous organisms it’s important to consider several alignments for any given read, as the distance constraint between it and its mate may not be satisfied by the best alignments for each of the reads independently.
  7. Mis-assembly signatures are determined by the ratio of consistent/inconsistent read pairs so coverage is important in identifying mis-assemblies over random artefacts.
  8. Incorrect orientation with varying stretched/shrunken/correct distance. Shrunken mates that span over the collapsed repeat. Small collapsed repeats are only detectable by incorrect orientation.
  9. Small collapsed repeats only detectable by incorrect orientation as the spanning mates may be within the expected size distribution of the mate library.
  10. Links to a contig outside the current scaffold. Compressed mates due to collapsed repeat.
  11. Compressed, expanded and incorrect orientation.
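
The read-pair signatures described in notes 7 to 11 (links to another contig, incorrect orientation, compressed or expanded distance) can be sketched as a simple classifier. This is a minimal illustration under stated assumptions: the `Read` type, the `classify_pair` function and the category labels are hypothetical, pairs are assumed to be in "innie" orientation (left read on the + strand, right read on the - strand), and a pair is treated as out of range when its insert size is more than 3 standard deviations from the library mean. Matepair libraries with the opposite orientation would need the orientation check flipped.

```python
from typing import NamedTuple


class Read(NamedTuple):
    contig: str
    start: int   # leftmost aligned base, 0-based
    end: int     # one past the rightmost aligned base
    strand: str  # '+' or '-'


def classify_pair(a, b, lib_mean, lib_sd, n_sd=3.0):
    """Classify one mate pair against the library's expected geometry.

    Assumes an innie library; all names here are illustrative.
    """
    # Mates on different contigs: a link outside the current contig.
    if a.contig != b.contig:
        return "linking"

    # Order the two reads by position along the contig.
    left, right = sorted((a, b), key=lambda r: r.start)

    # Innie pairs point towards each other: left is '+', right is '-'.
    if left.strand != "+" or right.strand != "-":
        return "misoriented"

    insert = right.end - left.start
    if insert < lib_mean - n_sd * lib_sd:
        return "compressed"
    if insert > lib_mean + n_sd * lib_sd:
        return "expanded"
    return "consistent"


if __name__ == "__main__":
    fwd = Read("contig1", 100, 200, "+")
    rev = Read("contig1", 900, 1000, "-")
    print(classify_pair(fwd, rev, lib_mean=3000, lib_sd=300))
```

Tallying these labels over all pairs that span a region gives the ratio of consistent to inconsistent pairs that note 7 refers to.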