VAAST: Deciphering Genetic Disease with Next-Generation Sequencing
1. VAAST
Deciphering Genetic Disease with Next-Generation Sequencing
Barry Moore, M.S.
Research Scientist
Department of Human Genetics
Department of Biomedical Informatics
2. Outline
• The VAAST Analysis Pipeline
• Ogden Syndrome: Application of VAAST to a Genetic Disease of Unknown Cause
• The Future of VAAST Development
10. Key Features of VAAST
• Probabilistic
• Feature Based
• Both Allele and AAS Frequencies
• Considers Inheritance Model
• Fast
• Standardized Ontology Based Format
• Modular and Flexible in Design
11. VAAST Uses Variant Frequencies in a
Probabilistic Fashion
Likelihood Ratio Test:
Maximum Likelihood of the Null Model (No Difference)
divided by
Maximum Likelihood of the Alternate Model (There Is a Difference)
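As a rough illustration of the likelihood ratio idea on this slide, the sketch below compares allele counts at a single site between target and background chromosomes. It is a generic binomial likelihood ratio test, not VAAST's actual composite likelihood; the function names and the example counts are hypothetical.

```python
import math


def binom_loglik(k, n, p):
    """Log-likelihood of seeing k variant alleles among n chromosomes
    at allele frequency p (the constant binomial coefficient is dropped)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def lrt_statistic(k_target, n_target, k_background, n_background):
    """Return -2 * ln(L_null / L_alt) for one site.

    Null model: target and background share one allele frequency.
    Alternate model: each group has its own allele frequency.
    """
    # Maximum likelihood estimate under the null is the pooled frequency.
    p_null = (k_target + k_background) / (n_target + n_background)
    ll_null = (binom_loglik(k_target, n_target, p_null)
               + binom_loglik(k_background, n_background, p_null))

    # Maximum likelihood estimates under the alternate are per-group frequencies.
    p_target = k_target / n_target
    p_background = k_background / n_background
    ll_alt = (binom_loglik(k_target, n_target, p_target)
              + binom_loglik(k_background, n_background, p_background))

    return -2.0 * (ll_null - ll_alt)


# Hypothetical counts: 4 of 4 target chromosomes carry the allele,
# 0 of 886 background chromosomes do.
print(lrt_statistic(4, 4, 0, 886))
```

A larger statistic means the data favor the alternate model, that is, a frequency difference between target and background. When both groups have the same observed frequency the statistic is zero.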
13. VAAST Uses Variant Frequencies in a
Probabilistic Fashion
• VAAST gives us the likelihood of the composite genotype
at GENE X in the target given the background.
• Do allele frequencies differ between Background and
Target genomes within a given gene or feature?
• Composite likelihood calculation assumes independence
across sites. To control for LD, statistical significance is
estimated by permutation test.
• Multiple test correction for the number of features (~20,000)
is roughly two orders of magnitude less stringent than for the
number of variants (~3,500,000).
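The last two bullets can be made concrete with a short sketch. It assumes a Bonferroni-style correction and a simple label-shuffling permutation test; both are generic illustrations rather than VAAST's own code, and the function names, the burden statistic, and the toy data are hypothetical. Shuffling whole individuals between the target and background groups keeps each individual's genotypes together, which is why a permutation test is not misled by LD between sites.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def mean_burden_difference(target, background):
    """Toy statistic: difference in mean per-individual variant burden."""
    return sum(target) / len(target) - sum(background) / len(background)


def permutation_p_value(statistic, target, background,
                        n_permutations=1000, seed=0):
    """Estimate a p-value by shuffling individuals between the two groups."""
    rng = random.Random(seed)
    observed = statistic(target, background)
    pooled = list(target) + list(background)
    n_target = len(target)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if statistic(pooled[:n_target], pooled[n_target:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Feature-level versus variant-level correction at alpha = 0.05.
print(bonferroni_threshold(0.05, 20_000))     # per-gene threshold
print(bonferroni_threshold(0.05, 3_500_000))  # per-variant threshold
print(3_500_000 / 20_000)                     # ratio between the two

# Hypothetical burdens: two cases with two variant alleles each,
# twenty background individuals with none.
print(permutation_p_value(mean_burden_difference, [2, 2], [0] * 20))
```

The ratio of the two test counts is 175, which is where the "two orders of magnitude" on the slide comes from.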
19. Alleles Responsible for Miller
Syndrome in Utah Kindred
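[Figure: pedigree of the Utah Miller syndrome quartet (Mom, Dad, Son, Daughter) with two panels, CHR 16: DHODH and CHR 5: DNAH5. Parental alleles shown: G:R and G:A (DHODH); R:Q and R:* (DNAH5). Son and Daughter each carry all four: G:R, G:A, R:Q, R:*.]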
• Ng et al., Nature Genetics 42, 30–35 (2010), doi:10.1038/ng.499
• Roach et al., Science 328, 636 (2010)
20. Schematic of VAAST Analysis of Utah
Miller Kindred Using a Single Quartet
[Figure: schematic of the analysis, with DHODH and DNAH5 labeled.]
21. Average Rank for 100 Dominant and
Recessive Diseases
[Figure: bar chart. Y-axis: average rank genome-wide. Groups: DOMINANT, RECESSIVE. Series (size of case cohort): 2, 4, and 6 allele copies. Bar labels as extracted: 156, 132, 21, 9, 8, 3. 443 genomes in background.]
22. Impact of Missing Data
[Figure: bar chart. Y-axis: average rank genome-wide. Groups: DOMINANT, RECESSIVE. Series: 2 of 6, 4 of 6, and 6 of 6 allele copies. Bar labels as extracted: 639, 373, 61, 21, 9, 3. 443 genomes in background.]
23. Outline
• The VAAST Analysis Pipeline
• Ogden Syndrome: Application of VAAST to a Genetic Disease of Unknown Cause
• The Future of VAAST Development
24. A Rare X-linked Mendelian Disorder
• A Utah family coming to the
University Hospital for 20+
years
• About half of the male offspring
die around 1 year of age
• Aged appearance
• Craniofacial anomalies
• Hypotonia
• Global developmental delays
• Cardiac arrhythmias
26. Exome Sequencing
• Agilent SureSelect In-Solution X Chromosome Capture
• Covaris S series Sonication (150-200 bp)
• 76 bp single-end reads on one lane each of the
Illumina GAIIx
Variant Calling
• Sequence alignment with BWA
• Remove duplicate reads with Picard
• Realign indel regions with GATK
• Variant calling with SAMtools and GATK
27. Identifying Candidate Genes
VAAST Identifies NAA10 as Candidate Gene
• About 20 minutes of run time
• 3 candidate genes (NAA10 ranked 2nd) using the proband only
• 1 candidate gene (NAA10) using the pedigree
28. Additional Analyses
• Microarray-based CNV analysis
   • No likely causal variants found
• Sanger sequencing confirmation
   • Variant segregates perfectly with disease in 13 family members
• Haplotype sharing (STR genotyping)
   • ~11 Mb shared between two affected boys
• A second family discovered with the same mutation
   • IBD relatedness analysis indicates independent mutational events
29. N(alpha)-acetyltransferase
• N-alpha-acetylation is one of the most common protein
modifications that occur during protein synthesis.
• NatA (catalytic subunit NAA10, also known as hARD1)
• Eight exons, Crick strand, highly conserved
• A:G transition causes p.Ser37Pro
30. Functional Analyses
• Quantitative in vitro N-terminal acetylation assay (RP-HPLC).
• Four peptide substrates previously shown to be
acetylated by NatA (NAA10)
• Assays indicate loss-of-function allele.
33. VAAST in Summary
• Probabilistic Disease Gene Finder
• Feature Based not Variant Based
• Both Allele and AAS Frequencies
• Considers Inheritance Model
• As few as two target genomes can be sufficient to
identify causative gene.
• Background Genomes are “Reusable”
• Not Limited to Human Analyses
34. VAAST: Future Directions
• Indel support
• Splice-site
• No-call support
• Pedigree support
• Phylogenetic conservation
36. Acknowledgements
VAAST Development
• Chad Huff
• Hao Hu
• Lynn Jorde
• Barry Moore
• Martin Reese
• Marc Singleton
• Jinchuan Xing
• Mark Yandell
Yandell Lab
• Michael Campbell
• Daniel Ence
• Guozhen Fan
• Steven Flygare
• Hao Hu
• Zev Kronenberg
• Barry Moore
• Marc Singleton
• Robert Ross
• Mark Yandell
Ogden Syndrome
• John Carey
• Steven Chin
• Heidi Deborah Fain
• Gholson Lyon
• John Opitz
• Theodore J. Pysher
• Alan Rope
• Reid Robison
• Sarah T. South
• Thomas Arnesen
• Rune Evjenth
• Johan R. Lillehaug
• Leslie G. Biesecker
• Jennifer J. Johnston
• Cathy A. Stevens
• Brian Dalley
• Tao Jiang
• Jeffrey Swensen
• Chad Huff
• Evan Johnson
• Hakon Hakonarson
• Barry Moore
• Lynn B. Jorde
• Christa Schank
• Mark Yandell
• Kai Wang
• Jinchuan Xing
I’m going to begin the discussion of VAAST with a simple description of how the pipeline runs.
Numerator = Null Model (No Difference)
Denominator = Alternate Model (Difference)
The maximum likelihood of the null model over the maximum likelihood of the alternate model, weighted by the frequency of the AAS in the healthy dataset over the frequency of that AAS in a disease dataset.
n = frequency of that AAS in the background
p = estimated probability of...
B =
T =
Y =
X =
a = frequency of this AAS in OMIM