Course: Bioinformatics for Biomedical Research (2014).
Session: 2.3- Introduction to NGS Variant Calling Analysis.
Statistics and Bioinformatics Unit (UEB) & High Technology Unit (UAT) from Vall d'Hebron Research Institute (www.vhir.org), Barcelona.
1. Hospital Universitari Vall d’Hebron
Institut de Recerca - VHIR
Institut d’Investigació Sanitària de l’Instituto de Salud Carlos III (ISCIII)
Bioinformatics for
Biomedical Research
http://ueb.vhir.org/2014BRB
Ferran Briansó
ferran.brianso@vhir.org
15/05/2014
INTRODUCTION TO NGS
VARIANT CALLING ANALYSIS
5. LIBRARY PREPARATION
Select target
Hybridization-based capture or PCR
Add adapters
Contain binding sequences
Barcodes
Primer sequences
Amplify material
A) Fragment DNA
B) End-repair
C) A-tailing, adapter ligation and PCR
D) Final library contains
• sample insert
• indices (barcodes)
• flowcell binding sequences
• primer binding sequences
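The role of the indices (barcodes) can be illustrated with a minimal demultiplexing sketch: each read carries an index sequence, and reads are assigned to samples by comparing that index against the known barcodes. The function names, the barcode table and the one-mismatch tolerance are illustrative assumptions, not part of the original slides or of any vendor software.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=1):
    """Assign reads to samples by their index sequence.

    reads: iterable of (index_sequence, read_sequence) pairs (hypothetical input)
    barcodes: dict mapping barcode sequence -> sample name (hypothetical table)
    Reads matching no barcode, or more than one, go to "undetermined".
    """
    bins = defaultdict(list)
    for index_seq, read_seq in reads:
        hits = [
            sample
            for barcode, sample in barcodes.items()
            if len(barcode) == len(index_seq)
            and hamming(barcode, index_seq) <= max_mismatches
        ]
        sample = hits[0] if len(hits) == 1 else "undetermined"
        bins[sample].append(read_seq)
    return dict(bins)


# Made-up barcodes and reads, for illustration only
example_barcodes = {"ACGTAC": "sampleA", "TTGCAA": "sampleB"}
example_reads = [
    ("ACGTAC", "AAAA"),  # exact match to sampleA
    ("ACGTAA", "CCCC"),  # one mismatch to sampleA
    ("GGGGGG", "TTTT"),  # matches nothing
    ("TTGCAA", "GGGG"),  # exact match to sampleB
]
demultiplexed = demultiplex(example_reads, example_barcodes)
```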
17. READ MAPPING (BASIC ALIGNMENT)
Comparison against
reference genome
(! not assembly !)
Many aligners
(short reads, longer reads, RNAseq...)
Examples: BWA, Bowtie
SAM/BAM files
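As an illustration of what an aligner writes out, the sketch below parses one alignment line of a SAM file (the text form of BAM) into its main mandatory fields. The record type, function names and the example line are made up for illustration; real pipelines would normally use an established library rather than hand parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line (not a '@' header line)."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9])


def is_unmapped(rec):
    """Flag bit 0x4 marks an unmapped read."""
    return bool(rec.flag & 0x4)


def is_reverse_strand(rec):
    """Flag bit 0x10 marks a read aligned to the reverse strand."""
    return bool(rec.flag & 0x10)


# Made-up alignment line, for illustration only
example_line = "read1\t16\tchr1\t1000\t60\t5M\t*\t0\t0\tACGTA\tIIIII"
record = parse_sam_line(example_line)
```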
18. BURROWS-WHEELER ALIGNMENT TOOL (BWA)
Popular tool for genomic sequence
data (not RNASeq!)
Li and Durbin 2009 Bioinformatics
Challenge:
compare billions of short sequence
reads (.fastq file) against human
genome (3 Gb)
Burrows-Wheeler Transform to “index” the human genome and allow
memory-efficient and fast string matching between sequence read and
reference genome
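The idea of indexing a reference with the Burrows-Wheeler Transform and matching reads against it can be sketched on a toy scale. The code below is a teaching sketch only: it builds the transform by naive suffix sorting and supports exact matching by backward search. It is not how BWA is implemented (BWA uses compressed structures and allows mismatches and gaps), and all function names are illustrative.

```python
def bwt(text):
    """Return (BWT string, suffix array) of text + '$' using naive suffix sorting."""
    text += "$"  # unique terminator, sorts before A/C/G/T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa), sa


def build_index(bw):
    """Build the two tables needed for backward search.

    C[c]: number of characters in bw that are smaller than c
    occ[c][i]: number of occurrences of c in bw[:i]
    """
    alphabet = sorted(set(bw))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bw.count(c)
    occ = {c: [0] * (len(bw) + 1) for c in alphabet}
    for i, ch in enumerate(bw):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return C, occ


def find(pattern, bw, sa, C, occ):
    """Return sorted 0-based start positions of exact matches of pattern."""
    lo, hi = 0, len(bw)
    for ch in reversed(pattern):  # extend the match one base at a time, right to left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


# Toy "reference genome", for illustration only
reference = "GATTACAGATTA"
bw, sa = bwt(reference)
C, occ = build_index(bw)
hits = find("ATTA", bw, sa, C, occ)
```

The index is built once per reference, after which each read is matched in a number of steps proportional to the read length rather than the genome length.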
23. SEQUENCE VARIANTS
Sanger: is it real??
NGS: read count
Provides confidence (statistics!)
Sensitivity is a tunable parameter
(dependent on coverage)
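One way to see how read counts provide statistical confidence is a simple binomial model: if every read shows a wrong base at a site with some fixed probability, how likely is it that the observed number of variant-supporting reads arises from errors alone? The sketch below assumes independent reads and a uniform per-base error rate of 1%, both simplifying assumptions chosen for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ASSUMED_ERROR_RATE = 0.01  # illustrative per-base error rate, not a measured value

# Probability that sequencing errors alone yield this many variant reads at 30x coverage
p_three_reads = prob_at_least(3, 30, ASSUMED_ERROR_RATE)
p_fifteen_reads = prob_at_least(15, 30, ASSUMED_ERROR_RATE)
```

Under this model three variant reads out of 30 are still plausible as noise, while fifteen out of 30 are practically impossible to explain by errors alone. Raising or lowering the required read count is what tunes sensitivity.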
24. VARIANT CALLING: GATK
Genome Analysis Toolkit (Broad Institute)
• Initially developed for 1000 Genomes Project
• Single or multiple sample analysis (cohort)
• Popular tool for germline variant calling
• Evaluates probability of genotype given read data
see http://www.broadinstitute.org/gatk/
and McKenna et al. Genome Research 2010
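The phrase "probability of genotype given read data" can be made concrete with a deliberately simplified Bayesian model for a diploid site with one alternative allele. This is a textbook-style sketch, not GATK's actual implementation: it assumes independent reads, a single fixed error rate, and illustrative genotype priors.

```python
def genotype_posteriors(ref_count, alt_count, error=0.01,
                        priors=(0.998, 0.001, 0.001)):
    """Posterior probability of genotypes 0/0, 0/1 and 1/1 given read counts.

    error and priors are illustrative assumptions, not recommended values.
    The binomial coefficient is omitted because it is identical for all
    genotypes and cancels in the normalisation.
    """
    names = ("0/0", "0/1", "1/1")
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2
        # Probability that a single read shows the alternative allele
        p_alt = frac * (1 - error) + (1 - frac) * error
        likelihoods.append(p_alt**alt_count * (1 - p_alt) ** ref_count)
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(joint)
    return {name: j / total for name, j in zip(names, joint)}


def call_genotype(ref_count, alt_count):
    """Return the most probable genotype under the model above."""
    post = genotype_posteriors(ref_count, alt_count)
    return max(post, key=post.get)
```

For example, `call_genotype(15, 15)` favours the heterozygous genotype, while `call_genotype(30, 0)` favours homozygous reference.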
25. SOMATIC VARIANT CALLING
Somatic mutations can occur at low frequency (<10%) due to:
• Tumor heterogeneity (multiple clones)
• Low tumor purity (% normal cells in tumor sample)
Requires different thresholds than
germline variant calling when
evaluating signal vs noise
Trade-off between sensitivity
(ability to detect mutations) and
specificity (ability to avoid false positives)
Nature Reviews Cancer 12, 323-334 (May 2012)
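A small worked sketch shows why purity and heterogeneity push somatic variants to low allele fractions, and how coverage affects sensitivity. It assumes a heterozygous mutation in a copy-neutral diploid region and independent reads, and the threshold of five supporting reads is an arbitrary illustrative choice.

```python
from math import comb


def expected_vaf(purity, clone_fraction=1.0):
    """Expected variant allele fraction of a heterozygous somatic mutation.

    purity: fraction of tumor cells in the sample
    clone_fraction: fraction of tumor cells carrying the mutation
    Assumes a diploid, copy-neutral locus (simplifying assumption).
    """
    return purity * clone_fraction / 2


def detection_power(depth, vaf, min_alt_reads):
    """Probability of seeing at least min_alt_reads variant reads at this depth."""
    return sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i)
        for i in range(min_alt_reads, depth + 1)
    )


# 40% tumor purity, mutation present in half of the tumor cells
vaf = expected_vaf(0.4, 0.5)
power_30x = detection_power(30, 0.05, 5)
power_500x = detection_power(500, 0.05, 5)
```

With a 5% allele fraction, a fixed read-count threshold that is rarely met at 30x is almost always met at 500x. Lowering the threshold instead would raise sensitivity at the cost of more false positives.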
30. EVALUATING VARIANT QUALITY
TAKING INTO ACCOUNT:
• Coverage at position
• Number of independent reads supporting variant
• Observed allele fraction vs expected (somatic / germline)
• Strand bias
• Base qualities at variant position
• Mapping qualities of reads supporting variant
• Variant position within reads (near ends or at centre)
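Some of these criteria translate directly into simple filters. The sketch below checks coverage, the number of supporting reads, and strand bias, the latter with a two-sided Fisher's exact test on forward/reverse read counts written in plain Python. All thresholds and function names are hypothetical and chosen only for illustration. Real callers combine many more annotations.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def failed_filters(ref_fwd, ref_rev, alt_fwd, alt_rev,
                   min_depth=20, min_alt_reads=4, min_strand_p=0.01):
    """Return the names of the (hypothetical) filters a candidate variant fails."""
    failed = []
    if ref_fwd + ref_rev + alt_fwd + alt_rev < min_depth:
        failed.append("low_coverage")
    if alt_fwd + alt_rev < min_alt_reads:
        failed.append("few_supporting_reads")
    if fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev) < min_strand_p:
        failed.append("strand_bias")
    return failed
```

A variant whose supporting reads all come from one strand, for instance `failed_filters(20, 20, 15, 0)`, is flagged for strand bias even though coverage and read support are adequate.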
31. VCF FILES
Variant Call Format
Standard for reporting variants from NGS
Describes metadata of analysis and variant calls
Text file format (open in Text Editor or Excel)
!!! Not a MS Office vCard !!!
see http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
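Because VCF is plain tab-separated text, its layout is easy to inspect in code: lines starting with `##` carry metadata, the `#CHROM` line names the columns and samples, and every other line is one variant call. The minimal reader below covers only the fixed columns and the INFO field. The example content is made up, and the function name is illustrative.

```python
def parse_vcf(lines):
    """Split VCF text lines into metadata, sample names and variant records."""
    meta, samples, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line[2:])                # analysis metadata
        elif line.startswith("#CHROM"):
            samples = line.split("\t")[9:]       # sample columns follow FORMAT
        elif line:
            f = line.split("\t")
            # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags
            info = dict(
                kv.split("=", 1) if "=" in kv else (kv, True)
                for kv in f[7].split(";")
            )
            records.append({
                "chrom": f[0], "pos": int(f[1]), "id": f[2],
                "ref": f[3], "alt": f[4].split(","),
                "qual": f[5], "filter": f[6], "info": info,
            })
    return meta, samples, records


# Made-up VCF content, for illustration only
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t12345\t.\tG\tA\t50\tPASS\tDP=30;AF=0.5;DB\tGT\t0/1",
]
meta, samples, records = parse_vcf(example_vcf)
```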
41. CONCLUSIONS
NGS data - the new currency of (molecular) biology
Broad applications (ecology, evolution, ag sciences, medical research and
clinical diagnostics...).
Rapidly evolving (sequencing technologies, library preparation methods,
analysis approaches, software).
Different tools/pipelines/parametrizations give different results
(more standards needed).
Bioinformatics pipelines typically combine vendor software, third-party
tools and custom scripts.
Requires skills in scripting, Linux/Unix, HPC.
Requires advanced hardware (not always available).
Understanding of data (SE, PE, RNA-Seq) important for successful analysis.