20181016 grc presentation-pa
1. Using long-read data to reveal variation and advance the human reference genome
Peter Audano
2. Motivation
• GRCh38 – 3.10 Gbp
• 3.26 Gbp with patches/ALTs
• Needs diversity: 70% RP11
• Structural variants (SVs) affect many bases
• Indels ≥ 50 bp and inversions
• 11 Mbp / genome (7× more than SNPs/indels)
• More likely to be an eQTL
• Illumina cannot capture all SVs
• 53% DEL, 22% INS (Chaisson 2018)
Alkan (2011)
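The size threshold on this slide (indels ≥ 50 bp count as SVs) can be sketched as a small classifier. This is a minimal illustration, not the pipeline used in the talk: the function name `classify_variant` and the returned labels are hypothetical, and inversions, which need breakpoint orientation rather than a length comparison, are not handled.

```python
SV_MIN_LEN = 50  # size threshold from the slide: indels >= 50 bp are SVs


def classify_variant(ref: str, alt: str) -> str:
    """Classify a simple REF/ALT allele pair by its length difference.

    Hypothetical helper for illustration; assumes ref != alt and that
    both alleles are plain sequence strings (no symbolic alleles).
    """
    diff = len(alt) - len(ref)
    if diff == 0:
        # Same length: single- or multi-nucleotide substitution
        return "SNV" if len(ref) == 1 else "MNV"
    kind = "INS" if diff > 0 else "DEL"
    if abs(diff) >= SV_MIN_LEN:
        return "SV-" + kind
    return "indel-" + kind


if __name__ == "__main__":
    print(classify_variant("A", "A" + "T" * 50))   # 50 bp insertion
    print(classify_variant("A" + "T" * 49, "A"))   # 49 bp deletion
```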
3. Project goals
1. Sequence-resolve common structural variation
2. Correct errors and minor alleles in the reference
3. Build an alternate reference to support SV analysis with Illumina data
4. Constructing a diversity panel
GRC Sequenced
• CHM1 (Mole)
• CHM13 (Mole)
• HG00514 (CHB)
• HG00733 (PUR)
• NA19240 (YRI)
• HG02818 (GWD)
• NA19434 (LWK)
• HG01352 (CLM)
• HG02059 (KHV)
• NA12878 (CEU)
• HG04217 (ITU)
• HG02106 (PEL)
• HG00268 (FIN)
Public
• AK1 (Korean)
• Seo 2016
• HX1 (Chinese)
• Shi 2016
• New PacBio data on 11 genomes.
• 7 are new long-read biological samples.
• Selected females to balance X
5. Building a non-redundant discovery set
• 99,604 SVs
• 21.3 Mbp INS
• 18.5 Mbp DEL
• 2,238 shared
• 1.2 Mbp INS
• 0.1 Mbp DEL
• 5 coding
• 160 regulatory
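One way to collapse per-sample calls into a non-redundant set is a greedy merge on reciprocal overlap. The slide does not state the merging criterion, so the 50% reciprocal-overlap threshold, the `SV` record layout, and the function names below are all assumptions made for illustration.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive; for an insertion, start + inserted length
    svtype: str  # "INS" or "DEL"
    sample: str


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Fraction of the longer... and shorter call covered: min of both ratios."""
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def merge_calls(calls, min_ro=0.5):
    """Greedy merge: a call joins the first representative it matches.

    Returns a list of (representative SV, set of supporting samples).
    min_ro is an assumed threshold, not a value taken from the slides.
    """
    merged = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        for rep, samples in merged:
            if (rep.chrom == call.chrom and rep.svtype == call.svtype
                    and reciprocal_overlap(rep, call) >= min_ro):
                samples.add(call.sample)
                break
        else:
            # No existing representative matched: start a new merged record
            merged.append((call, {call.sample}))
    return merged
```

Under this sketch, a "shared" variant would be a merged record whose sample set contains every genome in the panel.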
11. VNTR distribution is non-random
• VNTR enrichment in subtelomeres
• 4.8-fold enrichment (Wilcoxon p = 2.9 × 10⁻⁹)
• Correlates with male meiotic recombination and double-strand breaks
Credit: Arvis Sulovari
12. Patching GRCh38
• Add SVs to reference on alternate contigs
• Map reads with an ALT-aware aligner
• Recovers 2.62% unmapped reads
• Improves mapping quality for 25.68% of SV-insertion mapped reads
• 2,228 SNPs and indels per sample within SV insertions (GQ 20+)
13. Genotyping SVs in Illumina samples
[Figure: GRCh38 primary contig and SV contig]
• Extract features around SVs
• Train a machine learning model to predict genotypes
• 91-95% accuracy
• 15% no-call
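The accuracy and no-call figures above can be related with a short sketch. The slides do not describe the model or how no-calls arise, so this assumes a classifier that emits per-genotype probabilities, that low-confidence predictions become no-calls, and that accuracy is measured only on called sites. The names `call_genotype`, `evaluate`, and the `min_prob` cutoff are hypothetical.

```python
NO_CALL = "./."


def call_genotype(probs, min_prob=0.9):
    """Pick the most probable genotype, or NO_CALL if confidence is low.

    probs: dict mapping genotype string (e.g. "0/1") to probability.
    min_prob is an assumed cutoff, not a value from the slides.
    """
    genotype, p = max(probs.items(), key=lambda kv: kv[1])
    return genotype if p >= min_prob else NO_CALL


def evaluate(calls, truth):
    """Return (accuracy over called sites, no-call rate over all sites).

    Assumes calls and truth are non-empty and the same length.
    """
    called = [(c, t) for c, t in zip(calls, truth) if c != NO_CALL]
    no_call_rate = 1 - len(called) / len(calls)
    if not called:
        return float("nan"), no_call_rate
    accuracy = sum(c == t for c, t in called) / len(called)
    return accuracy, no_call_rate
```

Raising `min_prob` in this sketch trades a higher no-call rate for higher accuracy on the sites that are called.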
15. Genotyping enables eQTL and sQTL analysis
• 376 samples (avg. 6-fold)
• 379 SV eQTLs (411 genes)
• 34 significant after accounting for SNP eQTLs
• 244 SV sQTLs (197 genes)
Credit: Yang Li and Ankeeta Shah
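At its simplest, an SV eQTL test asks whether expression of a gene changes with the number of SV alleles a sample carries. The sketch below fits an ordinary least-squares line of expression on genotype dosage (0, 1, or 2 alternate alleles). It is only an illustration of the idea: the slides do not describe the statistical model, and a real analysis would add covariates, significance testing, and conditioning on nearby SNP eQTLs. The function name `slope_and_r2` is hypothetical.

```python
def slope_and_r2(dosage, expression):
    """Least-squares slope and R^2 of expression regressed on SV dosage.

    dosage: alternate-allele counts per sample (0, 1, or 2).
    expression: expression values for one gene, in the same sample order.
    Assumes both inputs vary (non-zero variance) and have equal length.
    """
    n = len(dosage)
    mean_x = sum(dosage) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx                 # expression change per SV allele
    r2 = (sxy * sxy) / (sxx * syy)    # fraction of variance explained
    return slope, r2
```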
16. Resources available soon
• Variant calls
• VCF of SVs linked to contig breakpoints
• BAM of contigs
• SMRT-SV v2 genotyper
• Contigs (PRJNA481779)
• Variants in dbVar (nstd163)
• Patched reference
• ALTs for BWA-MEM
• Graph for vg (Garrison 2018)
17. Future work
• Sequence 50 additional genomes
• Phase genomes
• 10X and Strand-Seq
• Phased-SV (Chaisson 2017, bioRxiv)
• Genotype additional genomes
• 2,500 high-coverage 1000 Genomes samples
• 10,000 autism genomes
• Improve the human reference
• Patch GRCh38
• Build a human pan-genome reference
18. Acknowledgments
• Evan Eichler
• Tonia Brown
• Arvis Sulovari
• David Gordon
• Benson Hsieh
• Zev Kronenberg
• Tina Graves-Lindsay
• Susan Dutcher
• Wesley Warren
• Vince Magrini
• Sean McGrath
• Richard Wilson
• Yang Li
• Ankeeta Shah
25. An augmented reference reveals hidden variation
Type    Count     AC        SVs
DEL     3,582     8,835     206
INS     2,407     11,008    35
SNV     15,980    48,813    0
All     21,969    68,656    241