Presentation by Benedict Paten at the GRC/GIAB ASHG 2017 workshop "Getting the most from the reference assembly and reference materials" on updates to the human reference assembly, GRCh38.
Variation graphs and population assisted genome inference
1. Human Genome Variation Graphs
Benedict Paten - UC Santa Cruz Genomics Institute
benedict@soe.ucsc.edu
https://cgl.genomics.ucsc.edu/
Twitter: @BenedictPaten
2. Triumph of the reference human genome
• The publication of the human reference genome unleashed
the field of large-scale human genomics
• It offers a coordinate system to:
• describe gene sequences
• display annotations
• interpret molecular assays
• However, the reference genome represents only a single
instance among billions of unique human genomes...
3. Triumph of the reference human genome
[Figure: UCSC Genome Browser, Human Feb. 2009 (GRCh37/hg19), chr17:37,783,223-37,826,720 (6,701 bp). Genes shown: PPP1R1B, STARD3, TCAP, PNMT. Tracks: 100 vertebrates basewise conservation by PhyloP; UCSC Genes; ENCODE RNA-seq transcription levels on 9 cell lines; GTEx RNA-seq read coverage from Brain - Caudate (basal ganglia), Brain - Cortex, Muscle - Skeletal, Skin - Sun Exposed (Lower leg) and Thyroid]
Supplementary Figure 2 | 6700 bp exon-focused view of a 43 Kbp region of human chromosome 17 where GTEx RNA-seq highlights tissue-specific expression of two genes. The TCAP (titin cap protein) is highly expressed in muscle tissue, while PPP1R1B (a therapeutic target for neurologic disorders) shows expression in brain basal ganglia but not muscle (or brain cortex). In this UCSC Genome Browser view, 33 samples from 5 tissues were selected for display, from the total 7304 (in 53 tissues) available on the GTEx public track hub. The hub is available on both hg19 (GRCh37) and hg38 (GRCh38) human genome assemblies. The hg19 tracks were generated using the UCSC liftOver tool to transform coordinates from the hg38 bedGraph files generated by STAR2 in the Toil pipeline. The browser display was configured to use the Multi-region exon view.
Figure: UCSC Genome Browser view of GTEx data from Vivian et al. Nature Biotechnology 35, 314–316 (2017) doi:10.1038/nbt.3772 (bioRxiv preprint doi:10.1101/062497, first posted online Jul. 7, 2016; CC-BY-NC 4.0 International license)
7. Triumph of the reference human genome
• The publication of the human reference genome unleashed
the field of large-scale human genomics
• It offers a coordinate system to:
• describe gene sequences
• display annotations
• interpret molecular assays
• However, the primary reference genome represents only a single
instance among billions of unique germline human genomes...
Figure: UCSC Genome Browser view of GTEx data from Vivian et al. Nature Biotechnology 35, 314–316 (2017) doi:10.1038/nbt.3772
8. The problem with the reference
• Avg. 4-5 million point variants / individual
• 80 million point variants w/ >= 0.1% freq.
• Avg. > 10 megabases (Mb) in copy-number variants (CNVs) / individual
• 350-400 Mb in CNVs w/ >= 0.1% freq.
• Avg. > 6 Mb in large indels / individual
• > 100 Mb in large indels w/ >= 0.1% freq.
9. The problem with the reference
Sources:
Andrew J. Sharp, Ze Cheng, and Evan E. Eichler. Structural Variation of the Human Genome. Annual Reviews (2006)
Jeffrey M. Kidd et al. Characterization of Missing Human Genome Sequences and Copy-number Polymorphic Insertions. Nat Methods. 2010 May; 7(5): 365–371
10. The problem with the reference
• These differences create a failure of
representation, for example:
• Some functional (transcribed) genes
are either present in disabled form or
absent from the current reference (e.g.
some HLA genes)
• Reference Allele Bias: Mapping
algorithms are intrinsically biased
towards ignoring evidence of variants
• The current reference is largely derived
from one individual, making it less
suitable for the study of genomes that
derive from other subpopulations
• In summary: the current reference genome
has become an impediment to personal
genomics
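A toy sketch of reference allele bias, not a model of any real mapper: a read is counted as "mapped" only if it differs from the reference at one position or fewer. The sequence, SNP position, read length and mismatch limit are all made-up assumptions, and every read position is given a sequencing error in turn, which exaggerates the effect.

```python
# Toy model of reference allele bias. All values below are illustrative assumptions.
REFERENCE = "ACGTACGTTAGCCGATAGCTAGGCTA"
SNP_POS = 12          # 0-based position of a heterozygous SNP (reference base C)
ALT_BASE = "T"        # alternate allele carried by the second haplotype
READ_LEN = 10
MAX_MISMATCHES = 1    # a read "maps" only with at most this many mismatches


def substitute(seq, pos, base):
    """Return seq with the base at pos replaced."""
    return seq[:pos] + base + seq[pos + 1:]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def simulate_reads(haplotype):
    """Yield (start, read) for every window covering the SNP: one error-free
    read plus one read per single-base error at a non-SNP position."""
    first = max(0, SNP_POS - READ_LEN + 1)
    last = min(SNP_POS, len(haplotype) - READ_LEN)
    for start in range(first, last + 1):
        window = haplotype[start:start + READ_LEN]
        yield start, window
        for i in range(READ_LEN):
            if start + i == SNP_POS:
                continue
            wrong = "A" if window[i] != "A" else "C"
            yield start, substitute(window, i, wrong)


def count_mapped(haplotype):
    """Count reads whose mismatches against the reference stay within the limit."""
    return sum(
        hamming(read, REFERENCE[start:start + READ_LEN]) <= MAX_MISMATCHES
        for start, read in simulate_reads(haplotype)
    )


ref_mapped = count_mapped(REFERENCE)
alt_mapped = count_mapped(substitute(REFERENCE, SNP_POS, ALT_BASE))
alt_fraction = alt_mapped / (ref_mapped + alt_mapped)
print(ref_mapped, alt_mapped, round(alt_fraction, 3))
```

Both haplotypes produce the same number of reads, but reads carrying the alternate allele start with one mismatch already spent, so any sequencing error pushes them over the limit and the alternate allele appears in far fewer than half of the mapped reads.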
17. Human Genome Variation Graph Project
• Goals:
• Develop next generation human genetic reference that
includes known variation from all human ethnic
populations
• Provide tools to map, call, phase and represent genomes
Figure courtesy Kiran Garimella & Gil McVean
18. Existing Variation is Fragmented
Variants associated with phenotype
Genome- and locus-specific variation databases
Sequencing projects
Human reference genome
23. Variation Graphs – The Essentials
GTCCCAA
ACGTGG
ACTACCA
TTACTAC
Set of sequences (nodes)
Joins (edges) connect sides of sequences.
24. Variation Graphs – The Essentials
GTCCCAAACGTGG TTACTAC
Joins can connect either side of a sequence (bidirected edges)
Walks encode DNA strings, with side of entry determining strand
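A minimal sketch of the data structure described on these two slides, assuming a plain dictionary of node sequences and a set of joins between node sides. The class and method names are hypothetical and are not the vg API; the node sequences come from the slide, while the joins between them are an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


class VariationGraph:
    """Nodes hold sequences; joins connect sides ("L" or "R") of nodes."""

    def __init__(self):
        self.nodes = {}      # node id -> sequence
        self.joins = set()   # each join is a frozenset of two (node id, side) pairs

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_join(self, side_a, side_b):
        self.joins.add(frozenset((side_a, side_b)))

    def spell(self, walk):
        """Spell the DNA string of a walk given as (node id, entry side) steps.

        Entering on the left side reads the node forward; entering on the
        right side reads its reverse complement.
        """
        pieces = []
        previous_exit = None
        for node_id, entry in walk:
            step = (node_id, entry)
            if previous_exit is not None and frozenset((previous_exit, step)) not in self.joins:
                raise ValueError(f"no join between {previous_exit} and {step}")
            sequence = self.nodes[node_id]
            pieces.append(sequence if entry == "L" else reverse_complement(sequence))
            # A walk leaves a node through the side opposite to the one it entered.
            previous_exit = (node_id, "R" if entry == "L" else "L")
        return "".join(pieces)


graph = VariationGraph()
graph.add_node(1, "GTCCCAA")
graph.add_node(2, "ACGTGG")
graph.add_node(3, "TTACTAC")
graph.add_join((1, "R"), (2, "L"))   # assumed join
graph.add_join((2, "R"), (3, "L"))   # assumed join

forward = graph.spell([(1, "L"), (2, "L"), (3, "L")])
backward = graph.spell([(3, "R"), (2, "R"), (1, "R")])
print(forward)
print(backward)
```

Traversing the same joins in the opposite direction enters every node on its right side, so the second walk spells the reverse complement of the first: the other strand of the same DNA.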
25. Essential operations on variation graphs
• To switch to variation graphs a complete ecosystem must be redeveloped
• “rebooting genomics” - Erik Garrison
[Figure labels: variation graph; another variation graph]
Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016)
26. Essential operations on variation graphs
• To switch to variation graphs a complete ecosystem must be redeveloped
[Figure labels: variation graph; another variation graph]
Adapted from “Computational Pan-Genomics: Status, Promises and Challenges.” Computational Pan-Genomics Consortium. Briefings in Bioinformatics (2016)
https://github.com/vgteam/vg
33. Mapping improvements differ by population
[Figure: MHC. x-axis: 1000 Genomes Super Population; y-axis: % Diff. in perfect map., primary vs. 1KG]
34. Embedding Haplotypes
• Genome graphs do not encode linkage
• To restrict linkage, natural solution is to duplicate paths:
[Figure: graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp, and the same graph with node 4 duplicated as 4': 38 bp]
• But duplication creates mapping ambiguity
35. Embedding Haplotypes
• Instead maintain projection from haplotypes to graph:
[Figure: graph with nodes 1: 82 bp, 2: A, 3: G, 4: 38 bp, 5: C, 6: T, 7: 24 bp, and primed copies 1': 82 bp, 4': 38 bp, 7': 24 bp]
• The question then becomes how to encode this projection?
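The linkage problem can be sketched with the seven nodes from the figure. The edges and the two observed haplotypes below are assumptions made for illustration: the graph alone admits every combination of alleles, while haplotypes stored as paths over the unchanged graph record which combinations were actually seen.

```python
# Assumed topology: two SNP sites (2|3 and 5|6) separated by shared node 4.
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [7], 6: [7], 7: []}

# Hypothetical observed haplotypes, stored as paths (a projection onto the graph).
OBSERVED_HAPLOTYPES = {(1, 2, 4, 5, 7), (1, 3, 4, 6, 7)}


def enumerate_walks(edges, source, sink):
    """Return every walk from source to sink in an acyclic graph."""
    walks = []
    stack = [[source]]
    while stack:
        walk = stack.pop()
        if walk[-1] == sink:
            walks.append(tuple(walk))
            continue
        for next_node in edges[walk[-1]]:
            stack.append(walk + [next_node])
    return sorted(walks)


all_walks = enumerate_walks(EDGES, 1, 7)
unobserved = [walk for walk in all_walks if walk not in OBSERVED_HAPLOTYPES]
print(len(all_walks), "walks in the graph")
print("walks that match no observed haplotype:", unobserved)
```

Keeping the haplotypes as a separate set of paths leaves node 4 unduplicated, so a read from that 38 bp stretch still has a single place to map.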
36. Embedding Haplotypes
• The Graph Positional Burrows Wheeler Transform (gPBWT)
From “Novak et al, A Graph Extension of the Positional Burrows-Wheeler Transform and its Applications, WABI 2016” (bioRxiv preprint doi:10.1101/051409, first posted online May 2, 2016; CC-BY 4.0 International license)
[Figure excerpt from the paper. Fig. 1. An illustration of the B0[] array for a single side numbered 0. Threads visiting this side may enter their next nodes on sides 1, 2, or 3. The B0[] array records, for each visit of a thread to side 0, the side on which it enters its next node. This determines through which of the available edges it should leave the current node. Because threads tend to be similar to each other, they are likely to run in “ribbons” of multiple threads [...]]
• Reversible, compressible, enables efficient indexed queries
37. gPBWT Performance
• Experiment:
• chr22
• 50,818,468 bp
• 5004 Haplotypes
• Result:
• 356 MB gPBWT + vg graph
• 0.011 bits per base - 200x compression
• ~336 GB for whole genome w/80 million point variants @ 100,000 diploid genomes
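The bits-per-base figure can be checked with a few lines of arithmetic. This sketch assumes that "356 MB" means decimal megabytes and that the compression ratio is measured against 2 bits per base; the slide states neither, so the ratio is only an order-of-magnitude check.

```python
CHR22_LENGTH_BP = 50_818_468
N_HAPLOTYPES = 5004
INDEX_SIZE_BYTES = 356 * 10**6     # assumption: 356 MB read as decimal megabytes
BASELINE_BITS_PER_BASE = 2         # assumption: plain 2-bit encoding of A/C/G/T

total_bases = CHR22_LENGTH_BP * N_HAPLOTYPES
bits_per_base = INDEX_SIZE_BYTES * 8 / total_bases
compression_ratio = BASELINE_BITS_PER_BASE / bits_per_base

print(f"{total_bases:,} bases across all haplotypes")
print(f"{bits_per_base:.4f} bits per base")
print(f"about {compression_ratio:.0f}x smaller than a 2-bit encoding")
```

Under these assumptions the result rounds to 0.011 bits per base, matching the slide, and the ratio comes out somewhat under the quoted 200x, which suggests the slide rounded or used a slightly different baseline.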
40. Haplotype Probabilities
• Li & Stephens: Efficiently compute P(h|H), where h is a haplotype and H is the population of known haplotypes
[Figure: “Li and Stephens” on sequence graphs. Li and Stephens: sequences h are generated by walks x across the space of all haplotypes in H. Labels: H, x, h]
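A minimal sketch of the linear (non-graph) Li & Stephens copying model, computed with the forward algorithm in time proportional to panel size times haplotype length. It assumes biallelic sites coded "0"/"1", a uniform switch probability between sites and a uniform mismatch probability; the function name, parameter values and panel are hypothetical.

```python
from itertools import product


def li_stephens_probability(h, panel, switch=0.01, mismatch=0.001):
    """P(h | H): h is emitted by copying from panel haplotypes, switching the
    copied haplotype with probability `switch` between adjacent sites and
    miscopying an allele with probability `mismatch`."""
    n = len(panel)

    def emit(k, site):
        return 1 - mismatch if panel[k][site] == h[site] else mismatch

    # forward[k] = P(h[0..site], currently copying haplotype k)
    forward = [emit(k, 0) / n for k in range(n)]
    for site in range(1, len(h)):
        total = sum(forward)
        forward = [
            ((1 - switch) * forward[k] + switch * total / n) * emit(k, site)
            for k in range(n)
        ]
    return sum(forward)


PANEL = ["0000", "0011", "1111"]   # hypothetical population H

p_in_panel = li_stephens_probability("0011", PANEL)
p_not_in_panel = li_stephens_probability("0101", PANEL)

# Sanity check: probabilities over all possible haplotypes should sum to 1.
total_probability = sum(
    li_stephens_probability("".join(bits), PANEL)
    for bits in product("01", repeat=4)
)
print(p_in_panel, p_not_in_panel, total_probability)
```

A haplotype already present in the panel scores far higher than one that can only be explained by several switches or miscopied alleles, which is the sense in which H acts as a prior.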
41. Haplotype Probabilities
• Graph Li & Stephens: Efficiently compute P(x|H), where x is haplotype walk in a genome graph
[Figure: Li and Stephens: sequences h are generated by walks x across the space of all haplotypes in H. Graph model: sequences h are generated by walks x through G which follow segmen[...]. Labels: h; x c/w h; g1, g2, g3 ∈ H]
43. What’s a site and an allele in a genome graph?
What’s a site and an allele in a variation graph?
Bubble: Superbubble:
• Use subgraph decomposition to find single source/sink
subgraphs, set of paths are the alleles
[Figure: example bubble and superbubble subgraphs built from single-base nodes A, T and C]
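A sketch of the simplest case only: finding single-node bubbles in an acyclic graph and reading off their alleles. The node sequences and edges are hypothetical, and the function does not handle superbubbles, which need a proper subgraph decomposition.

```python
# Hypothetical graph: A, then a T|C choice, then A, then T.
NODES = {1: "A", 2: "T", 3: "C", 4: "A", 5: "T"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5], 5: []}


def predecessors(edges):
    """Invert the successor map."""
    preds = {node: [] for node in edges}
    for node, next_nodes in edges.items():
        for next_node in next_nodes:
            preds[next_node].append(node)
    return preds


def find_simple_bubbles(edges):
    """Return (source, sink, branches) for each site where every branch is a
    single node entered only from the source and leaving only to the sink."""
    preds = predecessors(edges)
    bubbles = []
    for source, branches in edges.items():
        if len(branches) < 2:
            continue
        sinks = set()
        simple = True
        for branch in branches:
            if preds[branch] != [source] or len(edges[branch]) != 1:
                simple = False
                break
            sinks.add(edges[branch][0])
        if not simple or len(sinks) != 1:
            continue
        sink = sinks.pop()
        # The sink must be reachable only through the branches of this bubble.
        if sorted(preds[sink]) == sorted(branches):
            bubbles.append((source, sink, sorted(branches)))
    return bubbles


def alleles(nodes, bubble):
    """The alleles of a site are the sequences of its alternative paths."""
    _, _, branches = bubble
    return [nodes[branch] for branch in branches]


bubbles = find_simple_bubbles(EDGES)
site_alleles = [alleles(NODES, bubble) for bubble in bubbles]
print(bubbles, site_alleles)
```

Here the single source/sink pair (1, 4) defines one site, and the two paths between them give its alleles.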
44. A haplotype phasing pipeline
[Diagram: Read mapping → Variant calling → Haplotype phasing, with Known population information]
Population Assisted Variant Calling
[Diagram labels: h; Haplotype likelihood; Read likelihood; genome posterior probability]
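One way to read the diagram labels is as Bayes' rule: the posterior for a candidate haplotype h is proportional to its haplotype likelihood (from the population) times its read likelihood (from the sequencing data). The slide does not spell this formula out, so the sketch below is an assumption, and the candidate names and numbers are invented.

```python
def genome_posterior(haplotype_likelihood, read_likelihood):
    """Combine per-candidate likelihoods and normalise to a posterior."""
    unnormalised = {
        candidate: haplotype_likelihood[candidate] * read_likelihood[candidate]
        for candidate in haplotype_likelihood
    }
    total = sum(unnormalised.values())
    return {candidate: value / total for candidate, value in unnormalised.items()}


# Hypothetical candidates named by their alleles at two sites.
haplotype_likelihood = {"A-C": 0.60, "A-T": 0.01, "G-T": 0.39}   # from the population
read_likelihood = {"A-C": 0.20, "A-T": 0.50, "G-T": 0.30}        # from the reads

posterior = genome_posterior(haplotype_likelihood, read_likelihood)
best_by_reads = max(read_likelihood, key=read_likelihood.get)
best_by_posterior = max(posterior, key=posterior.get)
print(best_by_reads, best_by_posterior, posterior)
```

In this invented case the reads alone favour a combination that is rare in the population, and the population information shifts the call to a common haplotype that the reads also support.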
45. Genome Variation Graphs Summary
• A shared reference graph will provide a single canonical naming scheme
for human variants: either it is already a (named) path in the graph, or it is a
new canonically named augmentation
• A better prior: Clear benefits for simplifying and improving read
mapping and variant calling - could ultimately lower cost of genome
inference
• Additional haplotype data can be embedded (gPBWT)
• The natural reference is a population cohort - we should build a public
cohort for hundreds of thousands of individuals - let’s change the
culture of de-identified sharing
• True population assisted genome inference is coming
• Still many open problems: repeatome, annotations, RNA
46. Thanks!
UCSC
Adam Novak
Glenn Hickey
Sean Blum
Yohei Rosen
Jordan Eizenga
Wolfgang Beyer
Karen Hayden
David Haussler
Team VG:
Erik Garrison
Eric Dawson
Mike Lin
Jouni Siren
(and many more)
GA4GH ref-var group:
Andres Kahles
Ben Murray
Goran Rakocevic
Alex Dilthey
Sarah Guthrie
Jerome Kelleher
Heng Li
Stephen Keenan
Richard Durbin
Gil McVean
Opportunities: https://cgl.genomics.ucsc.edu/ benedict@soe.ucsc.edu