There are a number of genetics and genomics initiatives underway in Australia, including the Australian node of the Human Variome Project (HVPA), as well as many active research collaborations including familial cancer, endocrine disease, and developmental delay. Most of these projects work with disease-specific databases on a research basis, with the risk that such archives may be ephemeral. HVPA is the only database that is directly integrated with accredited clinical reporting of variants. As such it is designed to capture variants that have passed scrutiny as diagnostically robust, and have therefore already been curated by qualified staff. Registered users access the HVPA database via a secure Internet portal.
I will describe three recent developments of the HVPA database and portal: the upgraded search interface, linkage to other datasets via BioGrid using hash-based de-identified case matching, and the introduction of a genome-wide database using LOVD3. Finally, I will discuss the future direction of the HVPA and the questions of utility, quality control and sustainability of genetic variation databases.
Search interface
The search interface has to provide useful tools for clinicians and lab scientists so that the HVPA project offers them direct benefits and incentivises them to participate. Following a request for feedback from users, a series of improvements were implemented, initially on a demonstration server and then on the live server following review by the Steering Committee. The highest priorities were for more information about numbers of times particular variants were recorded, the ability to search by range and to filter by pathogenicity. There was also interest in enabling direct uploading of VCF files and the automated calculation of pathogenicity scores. Many of these features are now implemented and examples will be presented.
Linkage to other datasets
We have implemented the hash key algorithm and work is in progress with BioGrid to link variation data to clinical data sets.
Genome-wide database
We have established an HVPA LOVD3 database and are working with the Human Genetics Society of Australasia on a pilot study to sequence the exomes of two trios and review the data using this database.
The Human Variome Database in Australia in 2014 - Graham Taylor
1.
2. Acknowledgments
Genomic Medicine & Translational Pathology, University of Melbourne:
Arthur Lian Chi Hsu, Renate Marquis-Nicholson, Sebastian Lunke, Clare Love, Kym Pham,
Olga Kondrashova, Matt Wakefield, Tiffany Cowie, Barney Rudzki and Paul Waring
Human Variome Project
Tim Smith, Alan Lo, Melvyn Leong, David Perkins, Heather Howard, Rania Horaitis
Dick Cotton
BioGrid
Maureen Turner, Leon Heffer
Royal College of Pathologists of Australasia
Vanessa Tyrrell
Peter MacCallum Cancer Centre
Ken Doig, Andrew Fellowes
Victorian Clinical Genetics Service
John-Paul Plazzer, Desiree Du Sart
3. Human Variome Project (Australasia)
• The bigger picture
• Infrastructure and search interface
• Linkage to other datasets
• Panel, exome and genome testing
• Database accreditation
• Next steps
4. The big picture
• Rediscovery at the genomics community level
that data sharing is win-win
• The Genomic Alliance, HGVS, HUGO
– Data standards
– Nomenclature
– Infrastructure
5. Nature (Perspective) 508 469-475 2014
Guidelines for investigating causality of sequence variants in human disease
D. G. MacArthur, T. A. Manolio, D. P. Dimmock, H. L. Rehm, J. Shendure, G. R. Abecasis, D. R. Adams, R. B. Altman, S. E. Antonarakis, E.
A. Ashley, J. C. Barrett, L. G. Biesecker, D. F. Conrad, G. M. Cooper, N. J. Cox, M. J. Daly, M. B. Gerstein, D. B. Goldstein, J. N. Hirschhorn,
S. M. Leal, L. A. Pennacchio, J. A. Stamatoyannopoulos, S. R. Sunyaev, D. Valle, B. F. Voight, W. Winckler & C. Gunter.
Priorities for research and infrastructure development
1. Improved public databases of human genetic variants incorporating explicit, up-to-date supporting
evidence for variant implication in disease and audit trails recording changes in interpretation.
2. Improved incentives, and ethical and logistical solutions, for sharing of genetic and phenotypic data from
both research and clinical diagnostic laboratories.
3. Public databases of variant and allele frequency data from large sets of population reference samples
from a wide range of ancestries.
4. Large-scale genotyping of reported human disease-causing variants in large, well-phenotyped
population cohorts, reducing biases in the assessment of the associated penetrance and phenotypic
heterogeneity.
5. Development and benchmarking of standardized, quantitative statistical approaches for objectively
assigning probability of causation to new candidate disease genes and variants.
Déjà vu all over again?
6. Nature Genetics 46, 107–115 (2014)
Application of a 5-tiered scheme for standardized classification of 2,360 unique mismatch
repair gene variants in the InSiGHT locus-specific database
Bryony A Thompson, Amanda B Spurdle, John-Paul Plazzer, Marc S Greenblatt, Kiwamu Akagi, Fahd Al-Mulla, Bharati Bapat, Inge
Bernstein, Gabriel Capellá, Johan T den Dunnen, Desiree du Sart, Aurelie Fabre, Michael P Farrell, Susan M Farrington, Ian M
Frayling, Thierry Frebourg, David E Goldgar, Christopher D Heinen, Elke Holinski-Feder, Maija Kohonen-Corish, Kristina Lagerstedt
Robinson, Suet Yi Leung, Alexandra Martins, Pal Moller, Monika Morak, Minna Nystrom, Paivi Peltomaki, Marta Pineda, Ming Qi,
Rajkumar Ramesar, Lene Juel Rasmussen, Brigitte Royer-Pokora, Rodney J Scott, Rolf Sijmons, Sean V Tavtigian, Carli M Tops,
Thomas Weber, Juul Wijnen, Michael O Woods, Finlay Macrae & Maurizio Genuardi, on behalf of InSiGHT.
1. Leiden Open Variation Database (LOVD)
2. Micro-attribution using Open Researcher and Contributor ID (ORCID)
3. Variant Interpretation Committee (VIC) applies the 5-tiered classification scheme
developed by the International Agency for Research on Cancer (IARC)
4. Endorsed by the Human Variome Project (HVP)
7. Not everything in the Nature portfolio is gold
It is good to supplement your pocket money
9. Translation into diagnostic practice
• 15 years ago Cotton predicted that the
majority of human genetic variants will be
detected in a diagnostic context
• As NGS moves into a service setting this
transition will become even clearer
• Genetic variants will become part of a
patient’s medical record
10. HVPA database
• Primarily for and of diagnostics
• Diagnostic services are busy
• And cash and time limited
• We have to make it easy for them
• And secure
• And useful
• Maybe even essential
11. HVPA Objective
A national data sharing facility for improving
clinical genetic testing services and supporting
medical research
Constitutional, not somatic, mutations
NECTAR project grant UoM FE31082
“Clinical and Molecular Data Linkage Tools”,
completion date 30th June 2014
12. Infrastructure and search interface
• Data repository (“the database”)
• Data handling tools that support data upload
from laboratories
• Portal through which the database can be
browsed
• Website for news and notifications
13. Human Variome Project Australian
Node
What We’ve Done
• NeAT Funding (2010-2011)
– Pilot Phase
– 4 labs, 3 diseases
• Breast Cancer
• Colon Cancer
• Huntington’s
– Portal Launched April 2011
– Molecular Data Only
– Collaboration with Mawson
• NeCTAR Funding (2012-2014)
– 12 more labs + all genes they test for
– Configuration Tool
– Clinical Data/Phenotype Linkage
– Transfer data internationally
What We Built
• Collection Tool
• Portal
• Data Model
• Ethics Processes
• Access & Usage Policy
• Data Sharing Agreements
14. How it works
• Software to interface with existing LIMS (or lack thereof)
• Collection occurs after report has been issued
• Data types:
– All classified variants reported by a lab
– Benign variants
– NGS/Incidental findings
– Not collecting negative results
• Secure data link between lab and Node
• (Semi)-automatic transfer of data
• Portal to allow interrogation of all Australian data
– http://www.hvpaustralia.org.au
• Linkage key generator
• Submission to BioGrid Platform
15. Open-Source Solutions
• HVP Portal (v1.0, r512) - A web application which features the basic
interface for browsing and querying an HVP node.
– Open source – MIT License
– Python/Django
• HVP Exporter (v1.0, r512) - Basic HVP exporting tool for
laboratories. Features a simple GUI and error checking interface,
a plug-in architecture for customisation between sites and common
libraries for working with MS Access and MS Excel data sources
– Open source – MIT License
– .NET C#, Python/IronPython
• HVP Importer (v1.0, r512) - A series of tools and web services that
receive, decrypt and process information sent by submitting laboratories
using the standard transaction XML format
– Open source – MIT License
– Python
17. HVPA Status at November 2013
Strengths
1. Database available on demand
for diagnostic labs
2. Tools for data sharing
3. Community engagement with
RCPA (QUPP), SA/Mawson,
BioGrid, VCGS
4. National reach with
international connections via
HVPI, WHO & UNESCO
Weaknesses
1. Performance of the existing
HVPA database is limited
2. Laboratory buy-in to the
database across Australia is
limited
3. The database itself has been
hard to access because of low
server bandwidth
4. The project has not anticipated
the likely impact of next
generation sequencing and risks
missing inclusion in genomic-scale
initiatives now underway.
19. Developments proposed in November
ID | Area | Idea | Priority
1 | B. Presentation | Statistics of number of variants for that gene as table or bar graph (# unique, # instances, top 5 qty submitted) | 1
15 | D. Feedback | Raise a concern about an instance's interpretation | 1
2 | A. Search | Search by range | 2
3 | A. Search | Search by genomic position | 2
4 | A. Search | Filter by pathogenicity | 2
5 | B. Presentation | Sort by ... (pathogenicity, other fields) | 2
6 | C. Relevant Info | Display links to related database for gene by referencing genenames.org | 2
7 | A. Search | Wildcard search of variants | 2
9 | A. Search | Search by disease which shows multiple genes and variant results | 2
10 | E. NGS | VCF data imports into HVP Australia | 2
13 | B. Presentation | VarVis - visualisation of gene and variants reported | 2
11 | B. Presentation | VCF data export from HVP Australia of a set of results | 3
12 | B. Presentation | At instance level - see other variants from this test/patient | 3
14 | C. Relevant Info | Capture & display SIFT score | 3
16 | D. Feedback | Notify labs when the general consensus on the pathogenicity of something they submitted has changed or been updated, i.e. they submitted benign and it is now likely pathogenic, or submitted unknown and it is now something else | 3
17 | B. Presentation | Integration with EBI/NCBI tools for queries and displays | 3
19 | B. Presentation | Display last date uploaded for this variant (or last 10 dates) | 3
20. Accessing the test database
http://115.146.85.61/
Username:
lab_tester
Password:
hvpaustralia2013
21. Search Interface
• The search interface has to provide useful tools for
clinicians and lab scientists so that the HVPA project offers
them direct benefits and incentivises them to participate.
Following a request for feedback from users, a series of
improvements were implemented, initially on a
demonstration server and then on the live server following
review by the Steering Committee. The highest priorities
were for more information about numbers of times
particular variants were recorded, the ability to search by
range and to filter by pathogenicity. There was also interest
in enabling direct uploading of VCF files and the automated
calculation of pathogenicity scores. Many of these features
are now implemented and examples will be presented.
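As an illustration of the three top-priority features (instance counts, search by range, filter by pathogenicity), here is a minimal Python sketch. It is not the portal's code: the `Variant` record, the function names and the example data are all hypothetical, and pathogenicity is assumed to use the 1 to 5 scale listed in the variant fields, where 1 is pathogenic.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Optional


class Variant(NamedTuple):
    """Hypothetical, simplified variant instance (not the HVPA data model)."""
    gene: str
    chrom: str
    pos: int            # genomic position of the variant
    hgvs_c: str         # HGVS cDNA name
    pathogenicity: int  # 1 = pathogenic ... 5 = certainly benign


def search_by_range(variants: Iterable[Variant], chrom: str, start: int, end: int,
                    max_class: Optional[int] = None) -> List[Variant]:
    """Return instances inside [start, end] on chrom, optionally keeping only
    those whose pathogenicity class is at most max_class."""
    hits = [v for v in variants if v.chrom == chrom and start <= v.pos <= end]
    if max_class is not None:
        hits = [v for v in hits if v.pathogenicity <= max_class]
    # Most pathogenic first, then by position.
    return sorted(hits, key=lambda v: (v.pathogenicity, v.pos))


def instance_counts(variants: Iterable[Variant]) -> Counter:
    """Count how many times each (gene, HGVS name) pair was recorded."""
    return Counter((v.gene, v.hgvs_c) for v in variants)


# Made-up example records, for illustration only.
records = [
    Variant("GENE_A", "1", 150, "c.10A>G", 1),
    Variant("GENE_A", "1", 150, "c.10A>G", 1),
    Variant("GENE_A", "1", 900, "c.99C>T", 4),
    Variant("GENE_B", "2", 160, "c.5G>A", 2),
]
likely_pathogenic = search_by_range(records, "1", 100, 1000, max_class=2)
counts = instance_counts(records)
```

In a real portal these filters would normally be pushed down into database queries rather than run over in-memory lists.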
22. Purpose of the HVPA Database
• Working database
– Record and share diagnostic-quality genetic variation
data
– Integrate with clinical phenotype data
– Integrate with international efforts
– Heads up for NGS gene panel data sets
• Test database
– Showcase enhancements
– Real world testing and feedback
– Uses data edited from actual database
– Not accurate or reliable: some parameters edited for test
purposes
27. Direct Import from Results Lists
• Can recover historical data sets
• Reformat on the fly
• Useful as low-overhead catch up to enable labs to
transition to using uploading tools as their IT
permits
– PathWest (John Bielby)
– Institute of Health and Biomedical Innovation,
Queensland (Lyn Griffiths)
– kConFab (Heather Thorne)
– Peter MacCallum Cancer Centre (Ken Doig)
28. Variant Fields (Mandatory)
Field | Description | Type | Requirement
GeneName | Official HGNC Symbol | VARCHAR(20) | Mandatory
RefSeqName | Name of reference sequence (NCBI's RefSeq project) | VARCHAR(20) | Mandatory
RefSeqVer | Version of reference sequence (RefSeq) | VARCHAR(20) | Mandatory
cDNA | HGVS variant name (c.) | VARCHAR(255) | At least one required
mRNA | HGVS variant name (m.) | VARCHAR(255) | At least one required
Genomic | HGVS variant name (g.) | VARCHAR(255) | At least one required
Protein | HGVS variant name (p.) | VARCHAR(255) | At least one required
Location | Exon or intron number | VARCHAR(255) | At least one required

Field | Description | Requirement
Pathogenicity | Level of pathogenicity (1=Pathogenic, 2=Possibly Pathogenic, 3=Unknown, 4=Possibly Benign, 5=Certainly Benign) | Mandatory
PatientID | Internal ID for the patient used within the lab | Mandatory
TestID | Internal ID for the test used within the lab | Mandatory
InstanceDate | Date instance was tested | Mandatory
GenomicRefSeq | Genomic reference sequence | Mandatory
GenomicRefSeqVer | Genomic reference sequence version | Mandatory
Types listed for this group: VARCHAR(20), DateTime, VARCHAR(255), VARCHAR(255)
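The mandatory-field rules above can be expressed as a simple validation step before upload. The following Python sketch is only an illustration: the class, attribute and function names are hypothetical, only a subset of the fields is modelled, and it assumes that "at least one required" covers the cDNA, mRNA, Genomic, Protein and Location fields, as the table layout suggests.

```python
from dataclasses import dataclass
from typing import List, Optional

# Pathogenicity codes as listed in the mandatory fields.
PATHOGENICITY = {
    1: "Pathogenic",
    2: "Possibly Pathogenic",
    3: "Unknown",
    4: "Possibly Benign",
    5: "Certainly Benign",
}


@dataclass
class VariantInstance:
    """Hypothetical record holding a subset of the mandatory fields."""
    gene_name: str
    refseq_name: str
    refseq_ver: str
    pathogenicity: int
    patient_id: str
    test_id: str
    instance_date: str
    cdna: Optional[str] = None
    mrna: Optional[str] = None
    genomic: Optional[str] = None
    protein: Optional[str] = None
    location: Optional[str] = None


def validation_errors(v: VariantInstance) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name in ("gene_name", "refseq_name", "refseq_ver",
                 "patient_id", "test_id", "instance_date"):
        if not getattr(v, name):
            errors.append(f"{name} is mandatory")
    if v.pathogenicity not in PATHOGENICITY:
        errors.append("pathogenicity must be an integer from 1 to 5")
    if not any((v.cdna, v.mrna, v.genomic, v.protein, v.location)):
        errors.append("at least one of cdna, mrna, genomic, protein, location is required")
    return errors


# Made-up values, for illustration only.
good = VariantInstance("GENE_A", "NM_000000", "1", 3, "P1", "T1", "2014-01-01",
                       cdna="c.1A>G")
bad = VariantInstance("", "NM_000000", "1", 9, "P1", "T1", "2014-01-01")
```

Returning a list of messages rather than raising on the first problem lets a submitting lab see every issue with a record in one pass.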
29. Variant Fields (Optional)
Field | Description | Type | Requirement
PatientAge | Age of patient when test was taken | INT32 | Optional
TestMethod | The name of the test method used | VARCHAR(20) | Optional
SampleTissue | Type of sample taken | VARCHAR(20) | Optional
SampleSource | The source of the sample e.g.: DNA, gDNA, RNA... | VARCHAR(20) | Optional
Justification | Justification by medical scientist | VARCHAR(65535) | Optional
PubMed | PubMed Identifier/Digital Object Identifier | VARCHAR(255) | Optional
RecordedInDatabase | Whether it is recorded in a disease-specific or gene-specific database | Boolean | Optional
SampleStored | Whether lab still has sample left | Boolean | Optional
VariantSegregatesWithDisease | Whether pedigree was considered during diagnosis of pathogenicity | Boolean | Optional
HistologyStored | Whether histology is stored | Boolean | Optional
PedigreeAvailable | Whether organisation has pedigree data | Boolean | Optional
SIFTScore | Calculated SIFT Score | INT32 | Optional
30. Linkage to other datasets
• HVPA have implemented the hash key
algorithm and work is in progress with BioGrid
to link variation data to clinical data sets.
• More details from Maureen Turner, BioGrid
CEO who is speaking at this meeting
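The slides do not describe the hash key algorithm itself, so the following Python sketch only illustrates the general idea of hash-based de-identified matching: both sites normalise the same identifiers and compute a keyed digest, and records are linked on the digest without exchanging names. The choice of fields, the normalisation and the use of HMAC-SHA256 with a shared secret are assumptions for illustration, not the HVPA or BioGrid method.

```python
import hashlib
import hmac
import unicodedata


def normalise(value: str) -> str:
    """Strip accents, drop non-alphanumeric characters and uppercase."""
    decomposed = unicodedata.normalize("NFKD", value)
    kept = (ch for ch in decomposed if ch.isascii() and ch.isalnum())
    return "".join(kept).upper()


def linkage_key(surname: str, given_name: str, dob_iso: str, sex: str,
                secret: bytes) -> str:
    """Return a keyed SHA-256 digest of the normalised identifiers.

    `secret` is a hypothetical key shared by the linking parties; without
    it the digest cannot be recomputed from guessed names and dates.
    """
    message = "|".join(
        [normalise(surname), normalise(given_name), dob_iso, sex.upper()]
    )
    return hmac.new(secret, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Made-up identifiers: spelling variants map to the same key.
key_lab = linkage_key("O'Brien", "José", "1970-01-31", "f", b"demo-secret")
key_clinic = linkage_key("obrien ", "Jose", "1970-01-31", "F", b"demo-secret")
```

Exact-match keys like this tolerate formatting differences but not typing errors in names or dates, which is one reason production linkage schemes are usually more elaborate.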
31. Cost and performance will force
diagnostic labs to adopt NGS as front-line approach
[Figures: cost per base; Illumina share price; hype cycle]
32. HVPA LOVD3 database pilot
• Established an HVPA LOVD3 database and
working with the Human Genetics Society of
Australasia on a pilot study to sequence the
exomes of two trios and review the data using
this database.
• Includes exome-scale data
• Open access to Coriell cases with no “consent”
issues
• Explore staging of variant “credibility
classification” and access
33. Relationship to Gene Panel Databases?
e.g. http://genomics.bio21.unimelb.edu.au/lovd/
35. Melbourne Genomics Health Alliance
Paradigm for implementing genomic medicine
• Clinically led, rather than technology driven
• Fostering ‘end use’ of genomic data
• Common clinical repository
• Prospective: first tier test
• Evaluation to inform implementation
• Engineering collaboration
• Fostering system change
• A/Prof Clara Gaff: Program Leader
37. How many variants per exome?
SNP count | Study
20,000 | Choi et al. PNAS 2009
142,000 | Mullikin NIH, unpublished 2010
50,000 | Clark et al. Nature Biotechnology 2011
125,000 | Smith et al. Genome Biology 2011
100,000 | Johnston & Biesecker Human Molecular Genetics 2013
200,000 to 400,000 | Yang et al. N Engl J Med 2013
• 20-fold range
• Exome designs vary
• Likely to be higher variant count in African populations as the
reference sequence is non-African
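The 20-fold figure follows directly from the extremes of the counts listed above, as this short Python check shows (the dictionary keys are just labels for the studies quoted on the slide):

```python
# Per-exome SNP counts quoted on the slide; for Yang et al. both ends
# of the reported range are listed.
snp_counts = {
    "Choi 2009": [20_000],
    "Mullikin 2010": [142_000],
    "Clark 2011": [50_000],
    "Smith 2011": [125_000],
    "Johnston & Biesecker 2013": [100_000],
    "Yang 2013": [200_000, 400_000],
}

all_counts = [n for values in snp_counts.values() for n in values]
lowest, highest = min(all_counts), max(all_counts)
fold_range = highest / lowest  # 400,000 / 20,000
```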
38. Low concordance of multiple
variant-calling pipelines
O’Rawe et al. Genome Medicine 2013
• 15 exomes
• 4 families
• HiSeq 2000
• Agilent SureSelect v.2
• ~120X mean coverage
• SOAP, BWA-GATK, BWA-SNVer,
GNUMAP, and BWA-SAMTools
• SNV concordance between five Illumina
pipelines across all 15 exomes was 57.4%
• 0.5-5.1% variants were called as unique to
each pipeline
• Indel concordance was only 26.8% between
three indel calling pipelines
• 11% of CG variants that fall within targeted
regions in exome sequencing were not called
by any of the Illumina-based exome analysis
pipelines
• 97.1%, 60.2% and 99.1% of the GATK-only,
SOAP-only and shared SNVs can be validated
• 54.0%, 44.6% and 78.1% of the GATK-only,
SOAP-only and shared indels can be validated
• Additional accuracy gained in variant
discovery by having access to genetic data
from a multi-generational family
39. Low concordance of multiple variant-calling pipelines
O’Rawe et al. Genome Medicine 2013, 5:28
SNV concordance: 57.4% Indel concordance 26.8%
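To make the concordance figures concrete, the Python sketch below computes concordance between call sets under one common definition, assumed here: the number of variants called by every pipeline divided by the number called by at least one. The function names and the toy call sets are hypothetical and the numbers are not taken from the paper.

```python
from typing import Dict, Set, Tuple

# A call is identified by (chromosome, position, reference allele, alternate allele).
Call = Tuple[str, int, str, str]


def concordance(call_sets: Dict[str, Set[Call]]) -> float:
    """Fraction of all distinct calls that every pipeline agrees on."""
    sets = list(call_sets.values())
    if not sets:
        return 0.0
    union = set.union(*sets)
    shared = set.intersection(*sets)
    return len(shared) / len(union) if union else 0.0


def unique_calls(call_sets: Dict[str, Set[Call]]) -> Dict[str, Set[Call]]:
    """Calls made by exactly one pipeline, keyed by pipeline name."""
    result = {}
    for name, calls in call_sets.items():
        others = set().union(*(c for n, c in call_sets.items() if n != name))
        result[name] = calls - others
    return result


# Toy call sets for two hypothetical pipelines.
toy = {
    "pipeline_a": {("1", 100, "A", "G"), ("1", 200, "C", "T")},
    "pipeline_b": {("1", 100, "A", "G"), ("2", 5, "G", "A")},
}
toy_concordance = concordance(toy)  # 1 shared call out of 3 distinct calls
toy_unique = unique_calls(toy)
```

Comparing indels this way is harder than comparing SNVs because the same indel can be written at different positions, so real comparisons normalise the representation first.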
40. Venn diagrams of selected CNV detection
methods in real data processing
Duan J, Zhang J-G, Deng H-W, Wang Y-P (2013) Comparative Studies of Copy Number Variation Detection Methods for Next-Generation Sequencing
Technologies. PLoS ONE 8(3): e59128. doi:10.1371/journal.pone.0059128
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0059128
45. Results – detection of variants
• Known SNV concordance 100%, all assays
• Known indel <6bp concordance 100%, all assays
• Not able to detect C9orf72 hexanucleotide expansion or PRNP
octapeptide region repeat with standard pipeline
• Diagnostic yield within appropriate clinical context (based on
very limited sample size)
- NimbleGen SeqCap EZ Neuro: 33% (2/6)
- Nextera Neuro: 23% (6/26)
46. Filtering Variants
All variants | None | Qual | Not in Blood
Blood | 9828 | 8551 | NA
Frozen | 9920 | 8736 | 126
FFPE | 9709 | 8163 | 199

Variants in Gene List | None | Qual | Not in Blood
Blood | 27 | 18 | NA
Frozen | 27 | 23 | 2 (EGFR)
FFPE | 25 | 19 | 3 (EGFR, ROS)
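The successive filters in the table (no filter, quality filter, then removal of variants also seen in blood) can be sketched as set operations in Python. The slide does not give the quality threshold or say whether the blood calls were themselves quality-filtered before subtraction; here the threshold is a parameter and all blood calls are subtracted, both as assumptions. Names and toy data are hypothetical.

```python
from typing import Dict, Set, Tuple

Call = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def passing(calls: Dict[Call, float], min_qual: float) -> Set[Call]:
    """Calls whose quality score is at least min_qual."""
    return {call for call, qual in calls.items() if qual >= min_qual}


def not_in_blood(tissue: Dict[Call, float], blood: Dict[Call, float],
                 min_qual: float) -> Set[Call]:
    """Quality-passing tissue calls that were not called in blood at all."""
    return passing(tissue, min_qual) - set(blood)


# Toy data: call -> quality score.
blood_calls = {("1", 100, "A", "G"): 60.0, ("2", 50, "C", "T"): 10.0}
ffpe_calls = {
    ("1", 100, "A", "G"): 55.0,  # germline, also in blood
    ("2", 50, "C", "T"): 45.0,   # in blood at low quality, still subtracted
    ("7", 900, "T", "G"): 70.0,  # tissue only
    ("9", 10, "G", "A"): 5.0,    # fails the quality filter
}
tissue_only = not_in_blood(ffpe_calls, blood_calls, min_qual=30.0)
```

Subtracting unfiltered blood calls is the conservative choice for finding tissue-only variants, since a low-quality germline call is still evidence that the variant is constitutional.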
56. Can capture coverage report dosage to
diagnostic standards?
[Figure: coverage by sample for autosomal targets and chrX targets]
Inter-sample variation is low, but low coverage
prevents dosage estimation
Chr X is a good first pass test for dosage
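The reason chromosome X works as a first-pass dosage test is that XX and XY samples differ by a known factor of two at X-linked targets. A minimal Python sketch of the comparison follows, assuming per-target mean coverage is available for each sample; the normalisation by autosomal median and all names and numbers are illustrative assumptions, not the method used in the talk.

```python
from statistics import median
from typing import Dict, Iterable


def normalised(coverage: Dict[str, float], autosomal: Iterable[str]) -> Dict[str, float]:
    """Divide each target's coverage by the sample's median autosomal coverage,
    which removes differences in overall sequencing depth between samples.
    The median autosomal coverage is assumed to be non-zero."""
    scale = median(coverage[t] for t in autosomal)
    return {target: cov / scale for target, cov in coverage.items()}


def x_dosage_ratio(sample: Dict[str, float], reference: Dict[str, float],
                   autosomal: Iterable[str], chrx: Iterable[str]) -> float:
    """Median ratio of normalised chrX coverage, sample over reference."""
    autosomal = list(autosomal)
    s = normalised(sample, autosomal)
    r = normalised(reference, autosomal)
    return median(s[t] / r[t] for t in chrx if r[t] > 0)


# Toy coverage values: an XX sample sequenced twice as deeply as an XY reference.
autosomal_targets = ["aut1", "aut2"]
chrx_targets = ["x1", "x2"]
xy_reference = {"aut1": 100.0, "aut2": 100.0, "x1": 50.0, "x2": 50.0}
xx_sample = {"aut1": 200.0, "aut2": 200.0, "x1": 200.0, "x2": 200.0}
ratio = x_dosage_ratio(xx_sample, xy_reference, autosomal_targets, chrx_targets)
```

If a pipeline cannot recover a ratio near 2 for XX against XY on well-covered X targets, it is unlikely to detect single-exon dosage changes elsewhere.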
57. XX vs. XY
8 Female cases and 16 Male cases showing reproducibility of coverage of X loci
within each group. Loci with higher SDs were associated with reduced coverage.
[Figure: two plots of average XX and average XY values across X loci; x-axis 0 to 80]
59. Sharing Experience with TruSight One
• In partnership with Illumina, RCPA and the HGSA,
Kim Flintoff (Wellington Regional Genetics
Laboratory) is leading an evaluation of exon
sequencing using Illumina’s TruSight One
panel. Two Coriell family trios will be sequenced
by New Zealand Genomics Limited and the data
will be shared on an HVPA database
• The VCF file will be available on the HVPA LOVD
database and performance stats will also be
made available.
60. Next Steps
• Robust standards for genomic medicine
• Databases and data content
– Access to identified and de-identified data (consent
and confidentiality)
– Database accreditation process in prep with RCPA
– Defining the performance of various aligners, variant
callers and annotation programs
– Clinical grade Variant Call Format (VCF)
– Metafile covering data trail: what was tested, what
was not tested
61. Standards for Accreditation of DNA
Sequence Variation Databases
Under the Quality Use of Pathology Program (QUPP), a national project for the Development of Standards for
Accreditation of DNA Sequence Variation Databases has been jointly initiated by the Royal College of
Pathologists of Australasia (RCPA) and the Human Variome Project (HVP).
Background
• There is a rapidly increasing volume, spectrum, and complexity of genetic tests emerging within
diagnostic pathology laboratories. In particular, high throughput sequencing methods such as
targeted panel, exome (WES), and whole genome sequencing (WGS), are producing an increasing
quantity of genetic data requiring analysis and interpretation, forming a substantial proportion of
the workload.
• Currently, there is a plethora of online mutation databases to refer to; however, there is a distinct
lack of databases that meet the stringent accuracy and reproducibility standards that the clinical
diagnostic environment demands. Additionally, the current databases are “fractured”, with varied
access to and sharing of the data within them, and variable quality due to errors and inaccurate data posting,
all of which are a clear risk to the quality of patient care. With more widespread, secure sharing of
variants and associated phenotypes, the value of cumulative variant information will accelerate the
delivery of accurate, actionable, and efficient clinical reports.
• There are currently no standards or equivalent mechanisms for accreditation of databases to
ensure the accuracy and quality of uploaded data into any central repository to meet the needs of
the clinical diagnostics environment.
62. Data quality classes
Differentiate between three classes of data:
The Clinically Reported data label would denote the class of data that the HVP
Australian Node was originally designed to collect and share: data that has been
generated in a NATA accredited Australian diagnostic laboratory and is able to be
included in a clinical report.
Unreported Clinical quality data would denote data that has been generated in a
NATA accredited diagnostic laboratory, but is not capable of being included in a
clinical report. This class would consist primarily of next-generation
sequencing (NGS) type data.
Unaccredited data would be used to denote data that has been generated by an
Australian laboratory that has not been NATA accredited.
A new filtering option would be made available to allow users to view only data
of a certain class
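The three classes and the proposed filter could be modelled as follows. This Python sketch is illustrative only: the enum, the two boolean inputs and the record layout are assumptions derived from the definitions above, not an existing HVPA schema.

```python
from enum import Enum
from typing import Dict, Iterable, List


class DataClass(Enum):
    CLINICALLY_REPORTED = "Clinically Reported"
    UNREPORTED_CLINICAL = "Unreported Clinical quality"
    UNACCREDITED = "Unaccredited"


def classify(nata_accredited: bool, reportable: bool) -> DataClass:
    """Assign a class from lab accreditation and clinical reportability."""
    if not nata_accredited:
        return DataClass.UNACCREDITED
    if reportable:
        return DataClass.CLINICALLY_REPORTED
    return DataClass.UNREPORTED_CLINICAL


def filter_by_class(records: Iterable[Dict], wanted: DataClass) -> List[Dict]:
    """Keep only records whose derived class matches `wanted`."""
    return [
        r for r in records
        if classify(r["nata_accredited"], r["reportable"]) is wanted
    ]


# Made-up records, for illustration only.
submissions = [
    {"id": "a", "nata_accredited": True, "reportable": True},
    {"id": "b", "nata_accredited": True, "reportable": False},  # e.g. NGS data
    {"id": "c", "nata_accredited": False, "reportable": False},
]
reported_only = filter_by_class(submissions, DataClass.CLINICALLY_REPORTED)
```

Deriving the class from two recorded facts, rather than storing a free-text label, keeps the filter consistent if a laboratory's accreditation status is later corrected.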
63. Beyond the NeCTAR funding
• Academic or charitable funding required
• Integrate NGS data resource into the HVPA
portfolio
• Move database development into a medical
academic centre of excellence
• Seek active partnerships with current and
future collaborators with investment and risk
sharing