Introduction to Jackson Labs, JMCRS, Clinical Services and Scientific Services at the Jackson Labs. Differences between long and short read sequencing. FAIR Data Action Plan. Metadata needs. Data Commons and the need to capture sample specific gene models discovered.
5. Disclaimer
This presentation was prepared by Anne Deslattes Mays, PhD in her personal capacity.
The opinions expressed in this presentation are the author's own and not necessarily
the views and opinions of the Jackson Laboratory.
11/29/2018 BioDataWorld Congress - Basel
6. Talk Overview
  1. Introduction to the Jackson Laboratory
  2. What is Next Generation Sequencing data used for today?
  3. How do we handle the disruptions new measurement technologies bring?
  4. What is proper data stewardship for data science?
  5. What does this mean for Data Commons?
  6. How do we capture the context and precision of measurements?
7. The Jackson Laboratory (https://www.jax.org/)
To discover precise genomic solutions for disease and empower the
global biomedical community in the shared quest to improve
human health.
10. JAX® Mice, Clinical and Research Services
  1. More than 10,000 mouse strains supporting biomedical research
  2. More than 80% of research publications citing mouse strains use JAX® Mice
  3. More than 30,000 peer-reviewed publications cite use of JAX® Mice
  4. More than 22,000 genetically diverse background strains cryopreserved
  5. More than 2,500 strains successfully cryorecovered by JAX each year
  6. More than 75 new models created with CRISPR on different genetic backgrounds
  7. Every month, hundreds of publications reference JAX® Mice strains
12. Who Do We Serve?
Clinicians
Pharma + Biotech
Academia + JAX PIs
- CLIA validated tests - CLIA validated tests
- Research Assays
- Research Assays
- Custom Assay Development
13. Assays for Confirmation of Variants
Types of variants
Nucleic Acid    Confirmatory Technology (Research or Orthogonal)    Variant(s) Identified
DNA             ddPCR ($570/sample)                                  SNPs, CNVs
DNA             Sanger ($400/sample)                                 SNPs, InDels
RNA             RT-PCR ($342/sample)                                 Fusions
48-60 samples/run, TAT of ~6 days if primers/probes are available in-house
14. Research Assays: PDX
A suite of assays for mutational and expression analysis of PDX tissue, including PDX filtering
16. Scientific Services at JAX
  1. JAX-GM Cellular Engineering
  2. Microbial Genomics
  3. Single Cell Biology
  4. Genome Technologies
  5. Center for Biometric Analysis
  6. PDX Research and Development
  7. Microscopy Services
  8. Flow Cytometry
  9. Mass Spectrometry and Protein Chemistry
  10. Monoclonal Antibody Services
17. Scientific Services at JAX
  1. JAX-GM Cellular Engineering ✔️
  2. Microbial Genomics
  3. Single Cell Biology
  4. Genome Technologies
  5. Center for Biometric Analysis
  6. PDX Research and Development
  7. Microscopy Services
  8. Flow Cytometry
  9. Mass Spectrometry and Protein Chemistry
  10. Monoclonal Antibody Services
✔️ - Using NGS Technologies
18. Gordon Bell Prize Super Computing 2018
19. Gordon Bell Prize Super Computing 2018
750,000 human genome types, associated with more
than a billion medical records over a 20-year period.
25. Human poly (A) transcriptome
Workman, Rachael E., et al. "Nanopore native RNA sequencing of a human poly (A) transcriptome." bioRxiv (2018): 459529.
26. Human poly (A) transcriptome
Workman, Rachael E., et al. "Nanopore native RNA sequencing of a human poly (A) transcriptome." bioRxiv (2018): 459529.
27. PacBio vs Oxford Nanopore Sequencing
https://blog.genohub.com/2017/06/16/pacbio-vs-oxford-nanopore-sequencing/
28. PacBio Consensus Accuracy > 99%
raw PacBio reads also differ in error types (more indels than mismatches) and
have a much higher abundance (∼13–15%, Table 1), though they are spread
randomly across the reads (25,26). This randomness enables highly accurate
consensuses (>99%) to be built up rapidly by sequencing multiple times the
same molecule (CCS reads)
Simon Ardui, Adam Ameur, Joris R Vermeesch, Matthew S Hestand; Single molecule real-time (SMRT) sequencing comes of
age: applications and utilities for medical diagnostics, Nucleic Acids Research, Volume 46, Issue 5, 16 March 2018, Pages
2159–2168, https://doi.org/10.1093/nar/gky066
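The consensus argument on this slide can be illustrated with a small calculation. The sketch below assumes a deliberately simplified model that is not taken from the cited paper: each pass over the molecule miscalls a given base independently with probability 0.14 (the middle of the 13–15% range quoted above), and the consensus is a plain majority vote that is wrong only when more than half of the passes are wrong. Real CCS consensus calling handles indels and uses quality-aware models, so the numbers are illustrative only. The function name `majority_error` is made up for this example.

```python
from math import comb


def majority_error(per_pass_error: float, passes: int) -> float:
    """Probability that more than half of `passes` independent reads
    are wrong at a single base (simplified majority-vote model)."""
    if passes % 2 == 0:
        raise ValueError("use an odd number of passes to avoid ties")
    needed = passes // 2 + 1  # smallest number of wrong passes that wins the vote
    return sum(
        comb(passes, k)
        * per_pass_error**k
        * (1 - per_pass_error) ** (passes - k)
        for k in range(needed, passes + 1)
    )


# Assumed per-pass error rate of 14%; print consensus accuracy per pass count.
for n in (1, 3, 5, 9, 15):
    print(f"{n:2d} passes: consensus accuracy {1 - majority_error(0.14, n):.4f}")
```

Under these assumptions the error rate falls quickly as passes are added, which is the intuition behind building a consensus from repeated reads of the same molecule.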
29. All measurements taken on biological samples are made within the context of
instrument limitations, the procedures followed in preparing samples for
measurement, and the condition and context of the samples being
measured.
Raw result data, quality data, metadata and the procedures used to transform
measurement data from the instrument and/or the experimental procedures
are best captured at the time of experimental design to aid in primary and
secondary processing.
Biological Sample Details Need Metadata
Library Construction Details Need Metadata
Instrument Details Need Metadata
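One way to picture the three metadata needs listed above is as a single record with one section per need. The sketch below is only an illustration: every class name, field name and value is a hypothetical placeholder, not a schema used by JAX or by any repository.

```python
from dataclasses import dataclass, asdict, field
import json


@dataclass
class BiologicalSample:
    sample_id: str
    organism: str
    tissue: str
    collected_on: str      # ISO 8601 date string
    collection_site: str


@dataclass
class LibraryConstruction:
    protocol: str
    molecule: str          # e.g. "cDNA"
    fragmented: bool       # False for full-length (long-read) libraries


@dataclass
class Instrument:
    platform: str
    model: str
    chemistry_version: str


@dataclass
class MeasurementRecord:
    sample: BiologicalSample
    library: LibraryConstruction
    instrument: Instrument
    raw_files: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Return dotted names of string fields that were left empty."""
        missing = []
        for section, values in asdict(self).items():
            if isinstance(values, dict):
                missing += [f"{section}.{k}" for k, v in values.items() if v == ""]
        return missing


# Hypothetical example values; collection_site is deliberately left blank.
record = MeasurementRecord(
    sample=BiologicalSample("SAMPLE-001", "Mus musculus", "liver", "2018-11-01", ""),
    library=LibraryConstruction("example-protocol", "cDNA", fragmented=False),
    instrument=Instrument("long-read", "example-model", "example-chemistry"),
    raw_files=["SAMPLE-001.fastq.gz"],
)
print(json.dumps(asdict(record), indent=2))
print("missing:", record.missing_fields())
```

Filling in such a record at experimental design time makes gaps visible before any data are generated.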
30. How do we handle disruptions new measurement technologies bring?
Long reads are sequenced on unfragmented cDNA libraries
Short reads are sequenced on fragmented cDNA libraries
Long reads capture the full-length (5’ UTR to 3’ UTR) open reading frames at the
transcript level
Measuring the transcriptome allows us to peer into the proteome
Validation can occur with peptides
This sample-specific transcriptome contains alternatively spliced transcripts
specific to the sample collected, altering the gene model for that sample
We need to capture the gene model in Data Commons
for future reuse
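A sample-specific gene model can be thought of as the set of transcripts observed in that sample, each described by its exon coordinates. The sketch below is a minimal illustration under stated assumptions: the `Transcript` class, the `novel_isoforms` helper, the identifiers and the coordinates are all hypothetical, and isoforms are compared by their chain of splice junctions, which is one common convention rather than the method used in this talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    transcript_id: str
    gene_id: str
    chrom: str
    strand: str
    exons: tuple[tuple[int, int], ...]  # 1-based inclusive (start, end), sorted by start

    def junctions(self) -> tuple[tuple[int, int], ...]:
        """Intron intervals implied by each pair of consecutive exons."""
        return tuple(
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(self.exons, self.exons[1:])
        )


def novel_isoforms(sample, reference):
    """Transcripts whose junction chain is not in the reference for that gene.
    Single-exon transcripts have an empty chain and are compared as such."""
    known = {(t.gene_id, t.junctions()) for t in reference}
    return [t for t in sample if (t.gene_id, t.junctions()) not in known]


# Hypothetical reference model with three exons.
ref = Transcript("ref_tx_1", "GENE1", "chr1", "+", ((100, 200), (300, 400), (500, 600)))
# Hypothetical long-read transcripts from one sample; the second skips exon 2.
sample = [
    Transcript("sample_iso_1", "GENE1", "chr1", "+", ((100, 200), (300, 400), (500, 600))),
    Transcript("sample_iso_2", "GENE1", "chr1", "+", ((100, 200), (500, 600))),
]
novel = novel_isoforms(sample, [ref])
print([t.transcript_id for t in novel])
```

Storing the transcripts themselves, rather than only read counts against a reference, is what keeps the sample-specific model available for reuse.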
31. FAIR Data Action Plan (Preliminary Steps)
Interim recommendations and actions from the European Commission Expert
Group on FAIR data
32. FAIR Data Action Plan (Preliminary Steps)
Interim recommendations and actions from the European Commission Expert
Group on FAIR data
  1. Define and apply FAIR appropriately
  2. Develop and support a sustainable FAIR data ecosystem
  3. Ensure FAIR data and certified services to support FAIR
34. A Typical Researcher’s Path
[Diagram. Labels: Grant Award; Genome Technologies; Imaging Services; Single Cell Services; Data Analysis; Paper Writing & Acceptance; Repeat; SRA; GEO; TIER 1; TIER 2; TIER 3; Google Cloud Platform; Docker; TCGA; ISB-CGC; JAX Pipelines API; Analysis Program; URL RESULTS; /mnt/input; /mnt/output]
35. Where is the metadata and where is it captured?
[Same diagram as slide 34, annotated with the questions below.]
BioProject: What was the question being asked?
Experimental Design: What tissue is being measured? How was the library constructed? At what time points were the data collected?
SRA / BioSample: Raw FASTQ files stored; controlled-access data?
Matrices: Junction Count by Sample
Instrument Details: Which version of the instrument? What chemistries?
Sample Collection Details (affect quality): When and where were the samples collected?
Library Construction Details: Fragmented or unfragmented libraries?
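The "Junction Count by Sample" matrix named on this slide can be sketched as follows. This is a minimal illustration, assuming each observation is one read supporting one splice junction in one sample; the function name, sample IDs and junction labels are hypothetical.

```python
from collections import Counter


def junction_count_matrix(observations):
    """Build a junction-by-sample count matrix.

    observations: iterable of (sample_id, junction_id) pairs,
    one pair per read that supports the junction.
    Returns (row labels, column labels, rows of counts).
    """
    counts = Counter(observations)
    samples = sorted({s for s, _ in counts})
    junctions = sorted({j for _, j in counts})
    # Counter returns 0 for pairs that were never observed.
    matrix = [[counts[(s, j)] for s in samples] for j in junctions]
    return junctions, samples, matrix


# Hypothetical observations for two samples.
observations = [
    ("S1", "chr1:201-299"),
    ("S1", "chr1:201-299"),
    ("S1", "chr1:401-499"),
    ("S2", "chr1:201-499"),
    ("S2", "chr1:201-299"),
]
junctions, samples, matrix = junction_count_matrix(observations)
print("junction", *samples, sep="\t")
for junction, row in zip(junctions, matrix):
    print(junction, *row, sep="\t")
```

The matrix alone does not say how the libraries were built or which instrument was used, which is why the surrounding metadata questions matter.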
38. Data Commons Data Management for Data Stewardship:
  1. Data management plans are needed for the data produced
  2. We need metadata (data about our data), including metadata about instruments
  3. We need to adhere to W3C standards: RDF, data catalogs, published data
  4. Ontologies should be used everywhere
  5. More metadata needs to be captured
  6. Data need to be FAIR for both humans and machines
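As an illustration of what a W3C-style data catalog entry might look like, the sketch below writes a dataset description as RDF triples in N-Triples syntax using the DCAT and Dublin Core Terms vocabularies. The helper functions are hypothetical, the dataset IRI under example.org is a placeholder, and the creator IRI uses the fictitious sample ORCID record that ORCID publishes for documentation purposes. The literal escaping covers only backslashes, quotes and newlines.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"


def iri(value: str) -> str:
    return f"<{value}>"


def literal(value: str) -> str:
    # Minimal escaping for an N-Triples string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def dataset_triples(dataset_iri, title, creator_iri, keywords):
    """Describe one dataset as (subject, predicate, object) triples."""
    subject = iri(dataset_iri)
    triples = [
        (subject, iri(RDF_TYPE), iri(DCAT + "Dataset")),
        (subject, iri(DCT + "title"), literal(title)),
        (subject, iri(DCT + "creator"), iri(creator_iri)),
    ]
    triples += [(subject, iri(DCAT + "keyword"), literal(k)) for k in keywords]
    return triples


def to_ntriples(triples) -> str:
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Placeholder identifiers for illustration only.
doc = to_ntriples(
    dataset_triples(
        "https://example.org/dataset/long-read-001",
        "Long-read transcriptome, sample 001",
        "https://orcid.org/0000-0002-1825-0097",
        ["long read", "transcriptome"],
    )
)
print(doc)
```

Because each statement is a self-contained triple with global identifiers, catalog entries produced by different groups can be merged without agreeing on a shared table layout first.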
Founded in 1929, The Jackson Laboratory (JAX) is an independent, nonprofit biomedical research institution with more than 2,200 employees who are passionate about our mission.
The Laboratory is a world leader in mammalian genetics and human genomics and is developing scientific breakthroughs and improved therapies with ever-greater precision and speed. We also educate current and future scientists and provide critical resources, data, tools, and services to researchers worldwide.
JAX has its mammalian genetics headquarters in Bar Harbor, Maine including a National Cancer Institute-designated Cancer Center; a genomic medicine facility in Farmington, Conn. enabling translation of fundamental research into the clinic; and facilities in Ellsworth, Maine and Sacramento, Calif.
Although both PacBio and Oxford Nanopore generate longer reads than short-read Illumina or Ion sequencing, the higher error rate of both the PacBio and Oxford Nanopore sequencers remains an issue that needs addressing. Whereas PacBio reads a molecule multiple times to generate high-quality consensus data, Oxford Nanopore can only sequence a molecule twice. As a result, PacBio generates data with lower error rates than Oxford Nanopore. PacBio has slightly better overall performance for applications such as the discovery of transcriptome complexity and the sensitive identification of isoforms. On the other hand, MinION provides higher throughput, as nanopores can sequence multiple molecules simultaneously. Hence, it is best suited for applications that require a larger amount of data.
The European Union FAIR data action plan, published in June 2018, outlines the core information that should be collected about data to make the data meaningful. This includes persistent and unique identifiers, such as digital object identifiers (DOIs) or unique resource identifiers, to enable stable links to objects and to support citation and reuse, together with open and documented formats for the transformation of those data. Authors should be identified with unique identifiers (such as ORCIDs), as should projects (RAiDs), funders, and associated research resources (RRIDs). The action plan goes on to state that open and documented formats for standards and code should be employed. Minimum metadata and documentation must accompany these core data to enable basic data discovery, but richer information and provenance are necessary to understand why, when and by whom the data were created, and the data should be accompanied by an appropriate data usage license.
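Identifiers of this kind can be checked by machine before a record is accepted. As one example, ORCID iDs end in a check character computed with the ISO 7064 MOD 11-2 algorithm described in ORCID's documentation. The sketch below implements that check and a simple completeness test; the `REQUIRED_FIELDS` tuple and the record layout are hypothetical and are not part of the action plan.

```python
def orcid_check_character(base_digits: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    compact = orcid.replace("-", "")
    if len(compact) != 16 or not compact[:15].isdigit():
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


# Hypothetical minimum set of fields for a data record.
REQUIRED_FIELDS = ("identifier", "creator_orcid", "format", "license")


def record_problems(record: dict) -> list[str]:
    """List missing fields and an invalid creator ORCID, if any."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not record.get(name)]
    orcid = record.get("creator_orcid")
    if orcid and not is_valid_orcid(orcid):
        problems.append("invalid creator_orcid")
    return problems


# 0000-0002-1825-0097 is the fictitious sample iD used in ORCID's documentation.
example = {
    "identifier": "https://example.org/dataset/long-read-001",
    "creator_orcid": "0000-0002-1825-0097",
    "format": "FASTQ",
}
print(record_problems(example))
```

A check like this catches mistyped identifiers early, but it does not confirm that the iD belongs to the stated author.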
One researcher's path
Get a grant, then order data-generating services (PDX, Single Cell, other Genome Technologies services) and do some data analysis, possibly in the cloud. The data are archived, likely with metadata embedded in the file structure of the data, usually as project, sample, sequencing data. Upon paper acceptance or ahead of it, the sequencing data, along with appropriate metadata, may need to be uploaded to the Sequence Read Archive (SRA) or to the Gene Expression Omnibus (GEO), which then loads it into the SRA. Then this process is repeated.
As this process repeats, different researchers arrange their work in different ways: the data and the metadata may be embedded in the directory structure, or there may be different MySQL databases that each contain an individual project.
Even when the data are arranged so that all of a researcher's project information is available to that researcher, sharing it with others is laborious and time-consuming.
To be data driven, we need to access data across silos
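When metadata lives only in the directory structure, a first step toward crossing silos is to lift it out into explicit records. The sketch below assumes a hypothetical layout of the form .../project/sample/file; the paths, function names and layout are illustrative, and real archives will vary between researchers, which is exactly the problem described above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def metadata_from_path(path: str) -> dict:
    """Read project, sample and file name from the last three path components."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        raise ValueError(f"expected .../project/sample/file, got {path!r}")
    project, sample, filename = parts[-3:]
    return {"project": project, "sample": sample, "file": filename}


def build_catalog(paths):
    """Group file names by (project, sample) so they can be queried together."""
    catalog = defaultdict(list)
    for path in paths:
        record = metadata_from_path(path)
        catalog[(record["project"], record["sample"])].append(record["file"])
    return dict(catalog)


# Hypothetical archive paths.
paths = [
    "archive/projectA/sample01/reads_R1.fastq.gz",
    "archive/projectA/sample01/reads_R2.fastq.gz",
    "archive/projectB/sample07/reads.fastq.gz",
]
catalog = build_catalog(paths)
for (project, sample), files in sorted(catalog.items()):
    print(project, sample, files)
```

Anything the path does not encode, such as tissue, library construction or instrument, still has to be captured elsewhere.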
RDF
Ontologies
Data Catalogs
Unique identifiers for protected namespaces
SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns.
We will build a Linked Data Layer - using tools where they make sense
SPARQL (pronounced "sparkle", a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language, that is, a semantic query language for databases, able to retrieve and manipulate data stored in Resource Description Framework (RDF) format.
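To make the ideas of triple patterns, conjunctions, disjunctions and optional patterns concrete, the sketch below evaluates them over a small in-memory list of triples. It is a toy matcher written for illustration, not a SPARQL engine, and it assumes that any term starting with "?" is a variable. All function names, prefixes and data values are hypothetical. The query it mimics is shown as a comment.

```python
# Roughly equivalent SPARQL (hypothetical ex: vocabulary):
#   SELECT ?s ?t ?i WHERE {
#     ?s ex:tissue ?t .
#     OPTIONAL { ?s ex:sequencedOn ?i }
#   }

def match_pattern(pattern, triple, binding):
    """Extend `binding` so `pattern` matches `triple`; return None on failure."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None
    return new


def solve(patterns, triples, bindings=None):
    """Conjunction: every pattern must match, sharing variable bindings."""
    bindings = [{}] if bindings is None else bindings
    for pattern in patterns:
        bindings = [
            m
            for b in bindings
            for t in triples
            if (m := match_pattern(pattern, t, b)) is not None
        ]
    return bindings


def optional(bindings, patterns, triples):
    """Optional pattern: keep a solution unchanged if the extra patterns fail."""
    out = []
    for b in bindings:
        out.extend(solve(patterns, triples, [b]) or [b])
    return out


def union(*solution_sets):
    """Disjunction: solutions from any of the alternatives."""
    return [b for solutions in solution_sets for b in solutions]


triples = [
    ("ex:sample1", "ex:tissue", "liver"),
    ("ex:sample1", "ex:sequencedOn", "ex:instrumentA"),
    ("ex:sample2", "ex:tissue", "brain"),
]
base = solve([("?s", "ex:tissue", "?t")], triples)
results = optional(base, [("?s", "ex:sequencedOn", "?i")], triples)
for row in results:
    print(row)
```

In this example the second sample has no instrument recorded, and the optional pattern keeps it in the results instead of dropping it, which is useful when metadata are incomplete.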