http://iongap.hpc.iter.es
Computer Engineer Degree Final Project.
Universidad de La Laguna, Spain, July 2014.
Ion Torrent technology allows genome sequencing with reduced costs; however, its major drawback is the lack of tools dedicated to processing and assembling Ion Torrent reads.
IonGAP is a free graphical integrated pipeline designed for the assembly and subsequent analysis of Ion Torrent sequencing data. Both its components and their configuration are based on a research process aimed at finding the optimal combination of tools for obtaining good results from single-end reads generated by the Ion Torrent PGM sequencer, mainly from bacterial genomic material.
2. Contents
1. Introduction
2. Objective of the project
3. State of the art
4. The genome assembler
5. A genome assembly and analysis pipeline
6. IonGAP Web service
7. Parallel assembly of large genomes
8. Conclusions
IonGAP 1
6. Objective of the project
The development of an easy-to-use integrated software platform that offers optimally configured processing and de novo assembly of genomic data obtained by Ion Torrent sequencing, complemented with several result analysis stages.
7. Genome sequencing
Genome fragments → FASTQ file
• Most sequencing technologies: paired-end short reads (two reads of 25-250 bp each, separated by a gap)
• IUETSPC’s sequencing technology: single-end long reads (200-400 bp)
State of the art
9. Genome assembly
• Genome assembler
– Overlap-layout-consensus (OLC) assemblers
– De Bruijn graph (DBG) assemblers
Adapted from:
http://gcat.davidson.edu/phast
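The two assembler families differ in what they compare. As a minimal illustrative sketch (not taken from any of the assemblers discussed in this talk), the Python below computes a suffix-prefix overlap between two reads, which is the basic operation behind OLC assembly, and builds the edges of a De Bruijn graph from k-mers. The function names, the toy reads and the parameters `k` and `min_len` are assumptions made for the example.

```python
from collections import defaultdict


def overlap_length(a, b, min_len=3):
    """Longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def de_bruijn_edges(reads, k):
    """Map every (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Toy reads, invented for illustration only.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]

# OLC view: compare whole reads pairwise.
olc_overlap = overlap_length(reads[0], reads[1])

# DBG view: break reads into k-mers and link consecutive (k-1)-mers.
graph = de_bruijn_edges(reads, k=4)
```

The OLC view keeps each read whole, while the DBG view cuts every read into k-mers, so a read much longer than k contributes no more connectivity than its individual k-mers do.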
15. Genome finishing
• Scaffolding
• Correction of assembly errors
– Discrepancies with reads or reference genome
– Repeat correction
19. The genome assembler
Data set: Streptococcus agalactiae (686,800 reads)
Source:
http://ngm.nationalgeographic.com/wallpaper/img/2013/01/08-streptococcus_1600.jpg
20. The genome assembler
Comparative study of assemblers
• OLC assemblers
– MIRA
– Celera Assembler
– SGA
• DBG assemblers
– ABySS
– Ray
– Velvet
– SparseAssembler
– Minia
21. Results
• Number of contigs ≥ 500 bp
• N50 length
Conclusions
• MIRA is the most suitable assembler
• DBG is not suitable for long-read assembly
The genome assembler
N50: contigs of this length or longer contain at least 50% of the assembled bases
Source:
http://schatzlab.cshl.edu/teaching/2012/CSHL.Sequencing/Whole%20Genome%20Assembly%20and%20Alignment.pdf
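The N50 definition can be made concrete with a short Python sketch. It sorts contig lengths from longest to shortest and returns the length at which the running total first reaches half of the assembly. The function name and the example lengths are hypothetical, and applying the 500 bp cutoff to N50 as well as to the contig count is an assumption of this sketch.

```python
def n50(contig_lengths, min_length=500):
    """Return the N50 of the contigs that are at least `min_length` bp long."""
    lengths = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    if not lengths:
        return 0
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            # Contigs of this length or longer cover at least half the assembly.
            return length


# Hypothetical contig lengths in bp; the 100 bp contig is filtered out.
example = [100, 600, 800, 1000, 1600]
contigs_over_500 = sum(1 for n in example if n >= 500)
example_n50 = n50(example)
```

A higher N50 together with a lower contig count points to a less fragmented assembly, which is why the two metrics are reported side by side.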
26. MIRA assembler
Assembly workflow: data preprocessing → fast read comparison → Smith-Waterman alignment → contig assembly → automatic editing → finished project
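To illustrate the Smith-Waterman alignment stage named on this slide, here is a minimal textbook sketch in Python that returns the best local alignment score between two sequences. It is not MIRA's implementation. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between `a` and `b` with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]  # column 0 is always 0 in local alignment
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can restart anywhere.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Toy sequences sharing the local segment "ACGT".
score = smith_waterman_score("GGACGTCC", "TTACGTAA")
```

Only two rows of the matrix are kept in memory because the sketch reports the score alone; recovering the alignment itself would require storing the full matrix for traceback.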
27. Assembly parameter optimization
• Number of assembly iterations
• Uniform read distribution
• Separation of long repeats in different contigs
• Maximum number of times a contig can be rebuilt during an iteration
• Minimum number of reads per contig
• Minimum size of a contig for being considered as "large"
• Minimum read length
• Minimum repeat length
• Minimum overlap length
• Minimum overlap score
Conclusion
The assembler's default settings are already its optimal configuration
28. A genome assembly and analysis pipeline
Data preprocessing → Genome assembly → Genome finishing → Genome analysis
31. Data preprocessing
• Comparative study of trimmers (PRINSEQ, ERNE-filter, Trimmomatic)
– Removing adapters → 5’ trimming
– Discarding useless reads → Minimum length, maximum length
– Removing low-quality regions → Sliding window trimming (window length, quality threshold)
• Internal quality control of MIRA
– Sliding window trimming
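Sliding window trimming can be sketched in a few lines of Python. The version below scans a read from the 5’ end and cuts it at the first window whose mean quality falls below the threshold, then discards reads that end up too short. It is a simplified illustration, not the exact behaviour of PRINSEQ, ERNE-filter, Trimmomatic or MIRA. The function names and default values are assumptions.

```python
def sliding_window_cut(qualities, window=4, threshold=20):
    """Number of leading bases to keep: cut where a window's mean quality drops below threshold."""
    for start in range(len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < threshold:
            return start
    return len(qualities)


def trim_read(sequence, qualities, window=4, threshold=20, min_length=3):
    """Trim a read by quality; return None if the result is shorter than min_length."""
    keep = sliding_window_cut(qualities, window, threshold)
    if keep < min_length:
        return None  # read discarded as useless
    return sequence[:keep]


# Hypothetical read whose quality collapses after the fifth base.
trimmed = trim_read("ACGTACGT", [30, 30, 30, 30, 30, 10, 10, 10])
```

Window length and quality threshold trade off against each other: a longer window tolerates isolated low-quality bases, while a higher threshold cuts reads earlier.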
A genome assembly and analysis pipeline
32. A genome assembly and analysis pipeline
Data preprocessing
Mauve Assembly Metrics
33. Data preprocessing
Conclusion
Read preprocessing has negative effects on the assembly
• An extensive evaluation of read trimming effects on Illumina NGS data analysis
(Del Fabbro C, Scalabrin S, Morgante M, Giorgi FM. PLoS ONE 2013):
"For high quality values, trimmed datasets produce slightly more fragmented
assemblies, probably due to a more stringent trimming that reflects also on
lower computational needs."
• MIRA user manual (Chevreux B):
"For heavens' sake: do NOT try to clip or trim by quality yourself. Do NOT try to
remove standard sequencing adaptors yourself. Just leave the data alone!"
35. Genome finishing
• Scaffolding
– Impossible: no mate-pair reads
• Correction of assembly errors
– Simplifier: selective elimination of redundant
sequences
36. Genome finishing
Simplifier
• Only eliminates complete redundant contigs
• Time-consuming
• Natural repeats in genome → Risky
Conclusion
It is better to leave postprocessing in the user's hands
38. Genome analysis
• Quality analysis of reads and contigs (FastQC)
• Taxonomic classification (BLAST)
• Genome annotation (Prokka)
If reference sequence provided:
• Genome alignment and coverage analysis
(MUMmer, Circos, BLAST, Circoletto, Mauve, genoPlotR)
• Contig reordering (Mauve)
40. Genome analysis
• Genome annotation (Prokka)
UGENE genome viewer
43. Genome analysis
If reference sequence provided:
• Contig reordering (Mauve)
Mauve genome viewer
45. Functioning and implementation
• Web user interface
• Input Web form
• Two independent modules (daemons)
– Assembly module
– Analysis module
• User notification via email
IonGAP Web service
46. Functioning and implementation
• Hosting: ETSII’s Computing Center
– Virtual machine (Ubuntu 12.04)
– Dual-core 64-bit processor
– 17 GB RAM
53. Parallel assembly with Contrail
Conclusions
• Good performance
– Parallel computing is the future of assembly
• Bad results
– Contrail uses DBG → Not suitable for long reads
Parallel assembly of large genomes
54. Conclusions
• IonGAP meets IUETSPC's need for an automated tool for the assembly and preliminary analysis of Ion Torrent data
• Making it available to the scientific community is intended to stimulate low-cost genome research and the development of other customized solutions
• The S. agalactiae genome has been successfully assembled, and a manuscript is being prepared for publication in a scientific journal
55. Future work
• New options and features
• Cloud assembly with Amazon Web Services
• Parallel OLC assembly on Hadoop
• High performance computing
– ITER’s Teide HPC – September 2014
56. Conclusions
Multidisciplinary work is the way to tackle the new science of the 21st century
• Genomics: Instituto Universitario de Enfermedades Tropicales y Salud Pública de Canarias
• Computer Science: Escuela Técnica Superior de Ingeniería Informática
• Bioinformatics