This document provides an introduction to next generation sequencing (NGS) technologies. It begins with an outline of topics to be covered, including the evolution of NGS technologies, their descriptions and comparisons, bioinformatics challenges of NGS data analysis, and some aspects of NGS data analysis workflows and tools. The document then delves into explanations of specific NGS platforms, their performance characteristics, and the sequencing processes. It discusses the large computational infrastructure and data management needs of NGS, as well as quality control, preprocessing of NGS data, and popular analysis tools and workflows.
Introduction to next generation sequencing
1. Introduction to
Next Generation Sequencing
Alex Sánchez
Statistics and Bioinformatics Research Group
Statistics Department, Universitat de Barcelona
Statistics and Bioinformatics Unit
Vall d’Hebron Institut de Recerca
Introduction to NGS http://ueb.ir.vhebron.net/NGS
2. Outline
Introduction, Presentation, Goals.
Next generation sequencing technologies.
Evolution, Description, Comparison.
Bioinformatics challenges.
Some aspects of NGS data analysis.
NGS data, and data preprocessing (QC)
Types of analyses, workflows, tools
Conclusions and perspectives
4. Introduction
5. Why is NGS revolutionary?
• NGS has brought high speed not only to genome
sequencing and personal medicine,
• it has also changed the way we do genome research
Got a question on genome organization?
SEQUENCE IT !!!
Ana Conesa, bioinformatics researcher at
Principe Felipe Research Center
6. Sequencing: the Sanger Method (1977)
7. History of DNA sequencing is related to the combination of new technologies.
8. The human genome project
10. Next generation Sequencing
• Improvements in enzymes, chemistry and image
analysis, mature by the middle of the last decade,
dramatically increased sequencing capabilities.
• The newest type of technology, called “next-generation
sequencing”, appeared with the potential to dramatically
accelerate biological and biomedical research
– by enabling the comprehensive analysis of genomes,
transcriptomes and interactomes,
– by becoming inexpensive, routine and
widespread, rather than requiring very costly
production-scale efforts.
11. NGS technologies
13. Next-generation DNA sequencing
Sanger sequencing vs. next-generation sequencing
Advantages of NGS:
- Construction of a sequencing library, then clonal amplification
to generate sequencing features
No in vivo cloning, transformation, colony picking...
- Array-based sequencing
Higher degree of parallelism than capillary-based sequencing
17. NGS means high sequencing capacity
• GS FLX 454 (Roche)
• HiSeq 2000 (Illumina)
• 5500xl SOLiD (ABI)
• GS Junior
• Ion Torrent
22. Comparison of 2nd NGS
23. Some numbers
24. The sequencing process, in detail
1 Library preparation
– DNA fragmentation and in vitro adaptor ligation
2 Clonal amplification
– emulsion PCR
– bridge PCR
3 Cyclic array sequencing
– Pyrosequencing (454 sequencing)
– Sequencing-by-ligation (SOLiD platform)
– Sequencing-by-synthesis (Solexa technology)
30. Next next generation sequencing
• Pacific Biosciences
– Real time DNA
synthesis
– Up to 12000nt (?)
– 50 bases/second (?)
• Promises delivery of
human genome in
minutes?
– Company on track for
2013
32. I have my sequences/images. Now what?
33. NGS pushes (bio)informatics needs up
• Need for large amount of CPU power
– Informatics groups must manage compute clusters
– Challenges in parallelizing existing software or redesign of
algorithms to work in a parallel environment
– Another level of software complexity and challenges to
interoperability
• VERY large text files (~10 million lines long)
– Can’t do ‘business as usual’ with familiar tools such as
Perl/Python.
– Impossible memory usage and execution time
– Impossible to browse for problems
• Need sequence Quality filtering
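Much of the memory problem goes away when a file is read as a stream instead of being loaded whole. A minimal sketch in Python, assuming a FASTQ file with four lines per record (optionally gzip-compressed). The helper names `open_text` and `count_reads` and the demo content are hypothetical, chosen only for illustration:

```python
import gzip
import os
import tempfile


def open_text(path):
    """Open a plain or gzip-compressed text file for line-by-line reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count FASTQ records while holding only one line in memory at a time.

    Assumes the usual four lines per record (no wrapped sequences).
    """
    n_lines = 0
    with open_text(path) as handle:
        for _ in handle:  # the file object yields one line at a time
            n_lines += 1
    return n_lines // 4


# Tiny made-up demo file with two reads (not real data).
demo = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n"
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write(demo)
    demo_path = tmp.name

n_reads = count_reads(demo_path)
os.remove(demo_path)
print(n_reads)
```

The same loop works unchanged on a file of tens of GB, because memory use does not grow with file size; only the running time does.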
34. Data management issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
– 20 million reads (50bp) ~1Gb
• More of an issue for a facility: HiSeq recommends
32 CPU cores, each with 4GB RAM
• Certain studies are much more data intensive than others
– Whole genome sequencing
• A 30X coverage genome pair (tumor/normal) ~500 GB
• 50 genome pairs ~ 25 TB
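The figures above follow from simple arithmetic, which can be sketched in Python for planning other project sizes. The function names are hypothetical, and the 500 GB per tumor/normal pair is the slide's own rough estimate, not a measured value:

```python
GB_PER_TB = 1000  # decimal units, matching the rounding used on the slide


def cohort_storage_tb(n_pairs, gb_per_pair=500):
    """Storage in TB for n tumor/normal pairs at ~500 GB per 30X pair."""
    return n_pairs * gb_per_pair / GB_PER_TB


def total_bases(n_reads, read_length):
    """Total number of sequenced bases in a run."""
    return n_reads * read_length


storage_tb = cohort_storage_tb(50)
bases = total_bases(20_000_000, 50)
print(f"50 genome pairs: ~{storage_tb:.0f} TB")
print(f"20 million 50 bp reads: {bases:.1e} bases")
```

Changing `n_pairs` or `gb_per_pair` gives a first estimate of what a facility would need before committing to a study.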
35. So what?
• In NGS we have to process really big amounts of data,
which is not trivial in computing terms.
• Big NGS projects require supercomputing infrastructures
• Or put another way: it's not the case that anyone can do
everything.
– Small facilities must carefully choose projects that match
their computing capabilities.
36. Computational infrastructure for NGS
• There is great variety but a good point to start with:
– Computing cluster
• Multiple nodes (servers) with multiple cores
• High performance storage (TB, PB level)
• Fast networks (10Gb ethernet, infiniband)
– Enough space and suitable conditions for the equipment
("server room")
– Skilled people (sysadmin, developers)
• CNAG, in Barcelona: 36 people, more than 50% of them
informaticians
37. Big computing infrastructure
• Distributed memory cluster
– Starting at 20 computing nodes
– 160 to 240 cores
– amd64 (x86_64) is the most used cpu architecture
– At least 48 GB RAM per node
• Fast networks
– 10Gbit
– Infiniband
• Batch queue system (SGE, Condor, PBS, Slurm)
• Optional MPI and GPUs environment depending on
project requirements.
38. Big infrastructure is expensive
• Starting at 200.000€
– 200.000€ is just the hardware
– Plus data center (server room)
– Plus informaticians' salaries
• Not every partner knows about supercomputing.
– SGI
– Bull
– IBM
– HP
39. Middle size infrastructure
• “Small” distributed filesystem (around 50 TB).
• “Small” cluster (around 10 nodes, 80 to 120 cores).
• At least gigabit ethernet network.
• Price range: 50.000 – 100.000 € (just hardware)
– plus data center and informaticians' salaries
40. Small infrastructure
• At least 2 machines recommended
– 8 or 12 cores per machine.
– 48 GB RAM minimum per machine.
– BIG local disk: at least 4 TB per machine
• As many local disks as we can afford
• Price range: starting at 8.000€ - 10.000€ (2 machines)
41. Alternatives (1): Cloud Computing
• Pros
– Flexibility.
– You pay for what you use.
– No need to maintain a data center.
• Cons
– Transferring big datasets over the internet is
slow.
– You pay for consumed bandwidth.
That is a problem with big datasets.
– Lower performance, especially in disk
read/write.
– Privacy/security concerns.
– More expensive for big and long
term projects.
42. Alternatives (2): Grid Computing
• Pros
– Cheaper.
– More resources available.
• Cons
– Heterogeneous
environment.
– Slow connectivity (especially
in Spain).
– Much time required to find
good resources in the grid.
43. So what?
• Think before you NGS
• Decide what you …
– want to do,
– can afford
– know how to do
• Consider all alternatives
• Look for expert advice …
49. Transcriptomics by NGS: RNASeq
• Analog signal
– Easy to convey the signal’s information
– Continuous strength
– Signal loss and distortion
• Digital signal
– Harder to achieve & interpret
– Read counts: discrete values
– Weak background or no noise
50. Which software for NGS (data) analysis?
• Answer is not straightforward.
http://seqanswers.com/wiki/Software/list
• Many possible classifications
– Biological domains
• SNP discovery, Genomics, ChIP-Seq, De-novo assembly, …
– Bioinformatics methods
• Mapping, Assembly, Alignment, Seq-QC,…
– Technology
• Illumina, 454, ABI SOLiD, Helicos, …
– Operating system
• Linux, Mac OS X, Windows, …
– License type
• GPLv3, GPL, Commercial, Free for academic use,…
– Language
• C++, Perl, Java, C, Python
– Interface
• Web Based, Integrated solutions, command line tools, pipelines,…
51. Combining tools in a typical workflow
53. Quality control and preprocessing of
NGS data
54. Data types
55. Why QC and preprocessing
• Sequencer output:
– Reads + quality
• Natural questions
– Is the quality of my sequenced
data OK?
– If something is wrong can I fix it?
• Problem: HUGE files... What do
they look like?
• Files are flat files and big:
tens of GB (hard even to
browse)
57. How is quality measured?
• Sequencing systems assign a quality score to each peak
• Phred scores are log10-transformed error probabilities:
If p is the probability that the base call is wrong, the Phred score is
Q = -10·log10(p)
– score = 20 corresponds to a 1% error rate
– score = 30 corresponds to a 0.1% error rate
– score = 40 corresponds to a 0.01% error rate
• The base calling (A, T, G or C) is performed based on Phred scores.
• Ambiguous positions with Phred scores <= 20 are labeled with N.
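The relation between Phred score and error probability, Q = -10·log10(p), can be checked with a few lines of Python. This is a minimal sketch; the function names are hypothetical:

```python
import math


def phred_to_error(q):
    """Error probability p for a Phred score Q, from Q = -10*log10(p)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score Q for an error probability p (p must be > 0)."""
    return -10 * math.log10(p)


for q in (20, 30, 40):
    print(f"Q = {q}: error rate = {phred_to_error(q):.2%}")
```

Each additional 10 points of Q divides the error probability by ten, which is why Q20, Q30 and Q40 map to 1%, 0.1% and 0.01%.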
58. Data formats
• FastA format (everybody knows about it)
– Header line starts with “>” followed by a sequence ID
– Sequence (string of nt).
• FastQ format (http://maq.sourceforge.net/fastq.shtml)
– First comes the sequence ID line (like FastA, but starting with “@”),
then the sequence
– Then “+”, optionally followed by the sequence ID, and on the following
line the QVs encoded as single-byte ASCII codes
• Different quality encoding variants exist
• Nearly all downstream analyses take FastQ as input
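Because several quality encodings exist, it helps to guess which one a file uses before decoding it. The sketch below is a common heuristic, not something taken from these slides. It assumes only two families of encodings: offset 33 (Sanger) and offset 64 (some older Illumina/Solexa pipelines), and that offset-64 files never use characters below ';' (ASCII 59). The function name is hypothetical:

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from the lowest quality character.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in
    offset-64 encodings, so their presence points to offset 33.
    A file with only high characters is assumed to be offset 64,
    which can be wrong for very high quality offset-33 data.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        return 33
    return 64


offset_a = guess_quality_offset(["!''*((((***+))%%%++"])
offset_b = guess_quality_offset(["hhhhgggfffeee"])
print(offset_a, offset_b)
```

In practice the guess should be made over many reads, and confirmed against the sequencing facility's documentation when the result matters.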
59. The fastq format
• A FASTQ file normally uses four lines per sequence.
– Line 1 begins with a '@' character and is followed by a sequence
identifier and an optional description (like a FASTA title line).
– Line 2 is the raw sequence letters.
– Line 3 begins with a '+' character and is optionally followed by the same
sequence identifier (and any description) again.
– Line 4 encodes the quality values for the sequence in Line 2, and must
contain the same number of symbols as letters in the sequence.
• Different encodings are in use
• Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126
@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
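The four-line layout can be read with a short parser. The sketch below, in Python, assumes unwrapped four-line records and the Sanger encoding described above (ASCII code minus 33); the function name `parse_fastq` is hypothetical, and the record is the example shown on this slide:

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, qualities) from four-line FASTQ records.

    Qualities are decoded as Sanger Phred scores: ASCII code minus 33.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], seq, [ord(c) - 33 for c in qual]


record = """@Seq description
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""

name, seq, quals = next(parse_fastq(record.splitlines()))
print(name, len(seq), quals[:5])
```

Because `parse_fastq` is a generator, it can be fed an open file handle and will process one record at a time without loading the whole file.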
60. Some tools to deal with QC
• Use FastQC to see your starting state.
• Use Fastx-toolkit to optimize different datasets and then
visualize the result with FastQC to prove your success!
• Hints:
– Trimming, clipping and filtering may improve quality
– But beware of removing too many sequences…
Go to the tutorial and try the exercises...
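To make the trimming and filtering hints concrete, here is a minimal Python sketch of 3' quality trimming followed by a length filter. It illustrates the general idea only and is not the algorithm used by Fastx-toolkit or FastQC; the function names, the threshold of 20 and the minimum length are assumptions chosen for the example:

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end until one has quality >= min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def keep_read(seq, min_len=4):
    """Length filter: discard reads that became too short after trimming."""
    return len(seq) >= min_len


# Made-up read whose quality drops toward the 3' end.
seq = "ACGTACGTAC"
quals = [35, 36, 34, 33, 30, 28, 25, 18, 12, 5]

trimmed_seq, trimmed_quals = trim_3prime(seq, quals)
print(trimmed_seq, keep_read(trimmed_seq))
```

Raising `min_q` or `min_len` improves the average quality of what remains but discards more data, which is the trade-off the warning above refers to.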
61. Acknowledgements
Statistics and Bioinformatics Research Group, Statistics
Department, Universitat de Barcelona.
Xavier de Pedro and Ferran Briansó (but also Jose Luis
Mosquera and Israel Ortega) from the Statistics and
Bioinformatics Unit of VHIR (Vall d’Hebron Institut de
Recerca)
Scientific and Technical Services Unit (UCTS) of VHIR
(Vall d’Hebron Institut de Recerca)
People whose materials have been borrowed
Manel Comabella, Rosa Prieto, Paqui Gallego, Javier
Santoyo, Ana Conesa, Pablo Escobar, Thomas Girke
…
62. Thank you for your attention and patience