1. WTAC NGS Course, Hinxton 12th
April 2014
Lecture 2: Identification of SNPs, Indels, and
structural variants
Thomas Keane
Sequence Variation Infrastructure Group
WTSI
Today's slides: ftp://ftp-mouse.sanger.ac.uk/other/tk2/WTAC-2014/Lecture2.pdf
➢ VCF Format
➢ SNP/indel Identification
➢ Structural Variation
VCF: Variant Call Format
VCF is a standardised format for storing DNA polymorphism data
● SNPs, insertions, deletions and structural variants
● With rich annotations (e.g. context, predicted function, sequence data support)
Indexed for fast data retrieval of variants from a range of positions
Store variant information across many samples
Record meta-data about the site
● dbSNP accession, filter status, validation status
Very flexible format
● Arbitrary tags can be introduced to describe new types of variants
● No two VCF files are necessarily the same
● User extensible annotation fields supported
● Same event can be expressed in multiple ways by including different numbers of reference (i.e. unchanged) bases
● Recommendation on VCF format website to ensure consistency
VCF Format
Header section and a data section
Header
● Arbitrary number of meta-data information lines
● Starting with characters ‘##’
● Column definition line starts with single ‘#’
Mandatory columns
● Chromosome (CHROM)
● Position of the start of the variant (POS)
● Unique identifiers of the variant (ID)
● Reference allele (REF)
● Comma separated list of alternate non-reference alleles (ALT)
● Phred-scaled quality score (QUAL)
● Site filtering information (FILTER)
● User extensible annotation (INFO)
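The column layout above can be sketched in Python. This is a minimal illustration, assuming a tab-delimited data line; the function names `parse_vcf_line` and `phred_to_error_prob` and the example record (written in the style of the VCF specification's example) are illustrative and not part of the lecture.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Split one tab-delimited VCF data line into the eight mandatory columns."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    # ALT is a comma-separated list of non-reference alleles
    record["ALT"] = record["ALT"].split(",")
    # QUAL is Phred-scaled; "." marks a missing value
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
    info = {}
    for item in record["INFO"].split(";"):
        if item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    record["INFO"] = info
    return record


def phred_to_error_prob(q):
    """Convert a Phred-scaled score Q to a probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Illustrative record (assumed for this sketch)
example = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2"
rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["ALT"], rec["INFO"]["DP"])
```

Values inside INFO are kept as strings here; a fuller parser would use the `Type` declared in the `##INFO` header lines to convert them.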
Example VCF (SNPs/indels)
VCF Trivia 1
What version of the human reference genome was used?
What does the DB INFO tag stand for?
What does the ALT column contain?
At position 17330, what is the total depth? What is the depth for sample NA00002?
At position 17330, what is the genotype of NA00002?
Which position is a tri-allelic SNP site?
What sort of variant is at position 1234567? What is the genotype of NA00002?
Functional Annotation
VCF can store arbitrary
● INFO tags per site
● Genotype FORMAT tags
Use tags to describe
● Genomic context of the variant (e.g. coding, intronic, non-coding, UTR, intergenic)
● Predicted functional consequence of the variant (e.g. synonymous/non-synonymous, protein structure change)
● Presence of the variant in other large resequencing studies
Several tools for annotating a VCF
● SnpEff: http://snpeff.sourceforge.net/
● Ensembl VEP: http://www.ensembl.org/info/docs/tools/vep/script/index.html
● FunSeq: http://funseq.gersteinlab.org/
Ensembl - VEP
"VEP determines the effect of your variants (SNPs, insertions, deletions, CNVs or structural variants)
on genes, transcripts, and protein sequence, as well as regulatory regions."
Species must be included in either Ensembl or Ensembl Genomes
Sequence Ontology (SO) terms to describe genomic context
PubMed IDs for cited variants
Can output only the most severe consequence per variant
Online or off-line mode
● Off-line recommended for large numbers of variants (download relevant cache)
Human specific annotations
● SIFT - predicts whether an amino acid substitution affects protein function
● PolyPhen - predicts impact of an amino acid substitution on the structure of human proteins
● 1000 Genomes allele frequencies - global or per population
VEP VCF
VEP INFO tag:
● ##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|AMR_MAF|ASN_MAF|EUR_MAF">
Example
● CSQ=T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant||||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,
T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17,
T|ENSG00000143870|ENST00000381611|Transcript|5_prime_UTR_variant|264|||||rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17
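The CSQ value is easier to read once each annotation is paired with the field names declared in the header. The following is a minimal sketch, assuming the header and value shown on this slide; the helper names `parse_csq_format` and `parse_csq` are hypothetical and not part of VEP.

```python
HEADER = ('##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as '
          'predicted by VEP. Format: '
          'Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|'
          'Protein_position|Amino_acids|Codons|Existing_variation|AA_MAF|EA_MAF|'
          'DISTANCE|STRAND|CLIN_SIG|SYMBOL|SYMBOL_SOURCE|SIFT|PolyPhen|AFR_MAF|'
          'AMR_MAF|ASN_MAF|EUR_MAF">')

# First two annotations from the slide's example (without the "CSQ=" prefix)
CSQ = ("T|ENSG00000238962|ENST00000458792|Transcript|upstream_gene_variant|"
       "|||||rs72779452|||3789|-1||RNU7-176P|HGNC|||0.02|0.10|0.07|0.17,"
       "T|ENSG00000143870|ENST00000404824|Transcript|synonymous_variant|474|102|"
       "34|A|gcC/gcA|rs72779452||||-1||PDIA6|HGNC|||0.02|0.10|0.07|0.17")


def parse_csq_format(header_line):
    """Extract the pipe-separated field names that follow 'Format:' in the header."""
    description = header_line.split("Format:", 1)[1]
    return description.strip().rstrip('">').strip().split("|")


def parse_csq(value, field_names):
    """Return one dict per comma-separated annotation (one per transcript)."""
    return [dict(zip(field_names, entry.split("|"))) for entry in value.split(",")]


fields = parse_csq_format(HEADER)
annotations = parse_csq(CSQ, fields)
for ann in annotations:
    print(ann["SYMBOL"], ann["Feature"], ann["Consequence"])
```

Reading the field names from the header rather than hard-coding them matters because the CSQ layout depends on which VEP options were used to produce the file.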
SNP Identification
SNP - single nucleotide polymorphism
● Examine the bases aligned to a position and look for differences from the reference
SNP discovery vs genotyping
● Finding new variant sites
● Determining the genotype at a set of already known sites
Factors to consider when calling SNPs
● Base call qualities of each supporting base
● Proximity to
○ Small indel
○ Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
● Mapping qualities of the reads supporting the SNP
○ Low mapping quality indicates repetitive sequence
● Read length
○ Longer reads can be aligned with high confidence to a larger portion of the genome
● Paired reads
● Sequencing depth
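How some of these factors interact can be shown with a toy example. The sketch below is not how production callers work (they use probabilistic genotype models); it only counts alleles at one position after discarding bases with low base quality or low mapping quality, then applies depth and allele-fraction cut-offs. The pileup representation and every threshold are assumptions chosen for illustration.

```python
from collections import Counter


def count_alleles(pileup, min_base_qual=20, min_map_qual=30):
    """Count bases at one position, ignoring low-quality observations.

    pileup: list of (base, base_quality, mapping_quality) tuples (assumed layout).
    """
    counts = Counter()
    for base, base_qual, map_qual in pileup:
        if base_qual >= min_base_qual and map_qual >= min_map_qual:
            counts[base.upper()] += 1
    return counts


def candidate_snp(pileup, ref_base, min_depth=8, min_alt_fraction=0.2):
    """Return the best-supported non-reference base, or None if no candidate."""
    counts = count_alleles(pileup)
    depth = sum(counts.values())
    if depth < min_depth:
        return None  # too little usable coverage to say anything
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None
    alt, support = max(alt_counts.items(), key=lambda kv: kv[1])
    return alt if support / depth >= min_alt_fraction else None


# Six reference reads, five good alternate reads, one low-quality base,
# and one read with mapping quality 0 (e.g. from a repeat)
pileup = [("A", 30, 60)] * 6 + [("G", 35, 60)] * 5 + [("T", 5, 60), ("G", 30, 0)]
print(count_alleles(pileup), candidate_snp(pileup, "A"))
```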
Evaluating SNPs
Specificity vs sensitivity
● False positives vs. false negatives
Desirable to have high sensitivity and specificity
Sensitivity
● External sources of validation
Specificity
● Test a random selection of SNPs by another technology
● e.g. Sequenom, Sanger sequencing…
Receiver operating characteristic (ROC) curves to investigate effects of varying parameters
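The two measurements can be sketched as simple set arithmetic. This is an illustration under stated assumptions: sites are (chromosome, position) tuples, the truth set comes from an external validation source, and the fraction of tested calls that validate is used as a proxy for specificity (strictly, it estimates precision). The function names and data are made up for the example.

```python
def sensitivity(called_sites, truth_sites):
    """Fraction of known true sites recovered by the caller: TP / (TP + FN)."""
    truth = set(truth_sites)
    return len(truth & set(called_sites)) / len(truth)


def validation_rate(tested_calls):
    """Fraction of re-tested calls confirmed by a second technology.

    tested_calls: dict mapping site -> True (confirmed) or False (not confirmed).
    """
    return sum(tested_calls.values()) / len(tested_calls)


called = {("1", 100), ("1", 200), ("1", 300)}
truth = {("1", 100), ("1", 200), ("1", 400), ("1", 500)}
tested = {("1", 100): True, ("1", 200): True, ("1", 300): False, ("2", 50): True}

print(sensitivity(called, truth))   # 2 of 4 true sites found
print(validation_rate(tested))      # 3 of 4 tested calls confirmed
```

Repeating both calculations while varying one calling parameter gives the points needed to draw a ROC-style curve.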
Known Systematic Biases
Many biases can be introduced during sample preparation, the sequencing
process, computational alignment steps, etc.
● Can generate false positive SNPs/indels
Potential biases
● Strand bias
● End distance bias
● Consistency across replicates/libraries
● Variant distance bias
VCF Tools
● Soft filter variants file for these biases
● Variants kept in the file - just annotated with potential bias affecting the
variant
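Strand bias is one bias that is straightforward to test for. The sketch below, a hedged illustration rather than the method of any particular tool, applies a two-sided Fisher's exact test to the counts of reference and alternate reads on each strand, then soft-filters by annotating the FILTER value instead of dropping the record. The filter name `StrandBias` and the p-value threshold are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lowest, highest + 1))
               if p <= p_observed * (1 + 1e-9))


def soft_filter_strand_bias(filter_value, ref_fwd, ref_rev, alt_fwd, alt_rev,
                            max_p=0.01):
    """Return the FILTER value, adding 'StrandBias' if the test is significant."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    if p >= max_p:
        return filter_value
    if filter_value in ("PASS", "."):
        return "StrandBias"
    return filter_value + ";StrandBias"


# Alternate allele seen only on forward-strand reads: flagged, but kept
print(soft_filter_strand_bias("PASS", 20, 20, 15, 0))
# Balanced strands: left untouched
print(soft_filter_strand_bias("PASS", 10, 10, 10, 10))
```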
Future of Variant Calling?
Current approaches
● Rely heavily on the supplied alignment
● Largely site based, don't examine local haplotype
Local de novo assembly-based variant callers
● Call SNPs, indels, MNPs and small SVs simultaneously
● Can remove mapping artifacts
● e.g. GATK HaplotypeCaller
Haplotype Based Calling - GATK
Genomic Structural Variation
Large DNA rearrangements (>100bp)
Frequent causes of disease
● Referred to as genomic disorders
● Mendelian diseases or complex traits such as behaviors
● E.g. increase in gene dosage due to increase in copy number
● Prevalent in cancer genomes
Many types of genomic structural variation (SV)
● Insertions, deletions, copy number changes, inversions, translocations & complex events
Comparative genomic hybridization (CGH) traditionally used for copy number discovery
● CNVs of 1-50 kb in size have been under-ascertained
Next-gen sequencing revolutionised field of SV discovery
● Parallel sequencing of ends of large numbers of DNA fragments
● Examine alignment distance of reads to discover presence of genomic rearrangements
● Resolution down to ~100bp
Human Disease
Stankiewicz and Lupski (2010) Ann. Rev. Med.
Structural Variation
Several types of structural variations (SVs)
● Large Insertions/deletions
● Inversions
● Translocations
Read pair information used to detect these events
● Paired end sequencing of either end of DNA fragment
● Observe deviations from the expected fragment size
● Presence/absence of mate pairs
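The read-pair signals listed above can be sketched as a small classifier. This is a simplified illustration, assuming a standard paired-end library in which concordant mates map to opposite strands of the same chromosome; the insert-size mean and standard deviation, the cut-off of three standard deviations, and the category names are all assumptions. The span is approximated from the two mapping positions and ignores read length.

```python
def classify_read_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                       mean_insert=400, sd_insert=40, n_sd=3):
    """Label a read pair by the structural variant signal it may carry."""
    if chrom2 is None:
        return "mate_unmapped"            # mate may lie in sequence absent from the reference
    if chrom1 != chrom2:
        return "translocation_candidate"  # ends map to different chromosomes
    if strand1 == strand2:
        return "inversion_candidate"      # unexpected relative orientation
    span = abs(pos2 - pos1)
    if span > mean_insert + n_sd * sd_insert:
        return "deletion_candidate"       # ends map further apart than the fragment size
    if span < mean_insert - n_sd * sd_insert:
        return "insertion_candidate"      # ends map closer together than expected
    return "concordant"


pairs = [
    ("1", 1000, "+", "1", 1350, "-"),
    ("1", 1000, "+", "1", 5000, "-"),
    ("1", 1000, "+", "1", 1100, "-"),
    ("1", 1000, "+", "2", 5000, "-"),
    ("1", 1000, "+", "1", 1400, "+"),
    ("1", 1000, "+", None, None, None),
]
for pair in pairs:
    print(pair, classify_read_pair(*pair))
```

A single discordant pair is weak evidence; callers normally require several pairs supporting the same event before reporting it.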
Mobile Element Insertions
Transposons are segments of DNA that can move within the genome
● A minimal ‘genome’ - ability to replicate and change location
● Relics of ancient viral infections
Dominate landscape of mammalian genomes
● 38-45% of rodent and primate genomes
● Genome size proportional to number of TEs
Class 1 (RNA intermediate) and 2 (DNA intermediate)
Potent genetic mutagens
● Disrupt expression of genes
● Genome reorganisation and evolution
● Transduction of flanking sequence
Species specific families
● Human: Alu, L1, SVA
● Mouse: SINE, LINE, ERV
Many other families in other species
Detecting Mobile Element Insertions
Most algorithms for locating non-reference mobile elements operate in a similar manner
Goal: Detect all read pairs where one end flanks the insertion point and the mate is in the
inserted sequence
Pseudo algorithm
● Read through BAM file and make list of all discordant read pairs
● Keep the pairs where one end is similar to a sequence in your library of mobile elements
● Remove anchor reads with low mapping quality
● Cluster the anchor reads and examine breakpoint
● Filter out any clusters close to annotated elements of the same type
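The pseudo algorithm can be sketched on in-memory records. This is a simplified illustration, not the implementation of any of the tools in this lecture: it assumes the discordant pairs have already been extracted from the BAM file and compared against the mobile element library, it skips the breakpoint examination step, and all field names and thresholds are made up for the example.

```python
from collections import defaultdict


def call_insertions(pairs, annotated, min_mapq=30, max_gap=500,
                    min_support=3, exclusion=200):
    """Cluster anchor reads whose mates match a mobile element family.

    pairs: dicts with anchor_chrom, anchor_pos, anchor_mapq, mate_element
           (element family matched by the mate, or None).
    annotated: (chrom, start, end, element) tuples for reference elements.
    """
    # Keep confidently mapped anchors whose mate hits the element library
    anchors = defaultdict(list)
    for p in pairs:
        if p["mate_element"] is not None and p["anchor_mapq"] >= min_mapq:
            anchors[(p["anchor_chrom"], p["mate_element"])].append(p["anchor_pos"])

    calls = []
    for (chrom, element), positions in sorted(anchors.items()):
        positions.sort()
        # Group anchors separated by no more than max_gap into clusters
        clusters, cluster = [], [positions[0]]
        for pos in positions[1:]:
            if pos - cluster[-1] <= max_gap:
                cluster.append(pos)
            else:
                clusters.append(cluster)
                cluster = [pos]
        clusters.append(cluster)

        for c in clusters:
            if len(c) < min_support:
                continue
            start, end = c[0], c[-1]
            # Discard clusters close to an annotated element of the same type
            near_known = any(
                a_chrom == chrom and a_elem == element
                and a_start - exclusion <= end and start <= a_end + exclusion
                for a_chrom, a_start, a_end, a_elem in annotated
            )
            if not near_known:
                calls.append((chrom, start, end, element, len(c)))
    return calls


def pair(chrom, pos, mapq, element):
    return {"anchor_chrom": chrom, "anchor_pos": pos,
            "anchor_mapq": mapq, "mate_element": element}


pairs = ([pair("chr1", p, 60, "Alu") for p in (1000, 1050, 1100, 1200)]
         + [pair("chr1", 1020, 5, "Alu")]                        # low mapping quality
         + [pair("chr1", p, 60, "L1") for p in (50000, 50100, 50200)]
         + [pair("chr2", p, 60, "Alu") for p in (300, 400)]      # too few reads
         + [pair("chr3", 700, 60, None)])                        # mate not in library
annotated = [("chr1", 50150, 56000, "L1")]
calls = call_insertions(pairs, annotated)
print(calls)
```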
1000 Genomes CEU Trio
Typical human sample ~900-1000 non-reference mobile elements
● ~800 Alu elements, ~100 L1
Why are there 44 calls private to the child?
Mobile Element Software
RetroSeq: https://github.com/tk2/RetroSeq
VariationHunter: http://compbio.cs.sfu.ca/strvar.htm
T-LEX: http://petrov.stanford.edu/cgi-bin/Tlex.html
Tea: http://compbio.med.harvard.edu/Tea/