Describes how the Joint Genome Institute (JGI) is addressing the challenges it faces in storing and managing the rapidly growing volume of -omics data. Presented at the GlobusWorld 2021 conference by Kjiersten Fagnan.
GlobusWorld 2021: Managing Genomics Data at the DOE Joint Genome Institute
1. Managing the genomics data deluge
at the DOE Joint Genome Institute
Kjiersten Fagnan
CIO, JGI
2. The DOE Joint Genome Institute at a glance
JGI MISSION:
To provide the global research community
with free access to the most advanced
integrative genome science capabilities in
support of the DOE energy &
environmental research mission
Integrative Genomics Building
(IGB)
U.S. Department of Energy Office of Science User Facility
● JGI established in 1997, User facility from 2004
● Located at Lawrence Berkeley National Laboratory
● ~285 staff; ~$80M annual funding
● 2,038 Global Primary Users in FY20; >10,000 Data Users
5. Environmental genomics will enable the Bioeconomy
[Figure: Genetic “Circuit” (DNA, Gene, Enzyme, Microbial Factory), with ammonium (NH4) and carbonate (CO3) ions as chemical labels]
6. FY 2020 Users: 2,038 Worldwide
Users on the Map: 2,038
Sector                     Users   Share
Academic                   1,504   74%
Government                   183    9%
DOE (national labs only)     161    8%
Industry                      29    1%
Other                        161    8%
9. DOE Office of Science Public Reusable
Research Data (PuRe Data)
https://science.osti.gov/Initiatives/PuRe-Data/Resources-at-a-Glance
10. Deluge of Large, Complex Data Sets
JGI manages a 10+ PB data repository
11. Mega – Giga – Tera – Peta – Exa – Zetta – Yotta
5/19/2021 https://www.theatlantic.com/technology/archive/2011/05/infographic-how-big-is-a-yottabyte/239034/
The cost to store 1 Yottabyte of data - $100 trillion*
This is just genomics data… we also want
metabolomes, transcriptomes, proteomes, image
data
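As a back-of-the-envelope check on the yottabyte figure, the sketch below derives the cost per terabyte implied by $100 trillion per yottabyte and applies it to a 10 PB repository. It assumes decimal (SI) prefixes and a flat per-byte cost, neither of which the slide states, and the function names are illustrative only.

```python
from fractions import Fraction

# SI (decimal) prefixes. This is an assumption: the slide does not say
# whether it means decimal or binary units.
PREFIX_EXPONENT = {
    "mega": 6, "giga": 9, "tera": 12, "peta": 15,
    "exa": 18, "zetta": 21, "yotta": 24,
}

# Figure quoted on the slide: $100 trillion to store 1 yottabyte.
COST_PER_YOTTABYTE_USD = 100 * 10**12


def bytes_in(amount, prefix):
    """Return the number of bytes in `amount` units of the given SI prefix."""
    return amount * 10 ** PREFIX_EXPONENT[prefix]


def implied_cost_usd(num_bytes):
    """Scale the per-yottabyte cost linearly (flat-rate assumption)."""
    return Fraction(COST_PER_YOTTABYTE_USD * num_bytes, bytes_in(1, "yotta"))


cost_per_tb = implied_cost_usd(bytes_in(1, "tera"))
cost_10_pb = implied_cost_usd(bytes_in(10, "peta"))
print(f"Implied cost per TB: ${float(cost_per_tb):,.2f}")
print(f"Implied cost for 10 PB: ${float(cost_10_pb):,.0f}")
```

Under these assumptions the quoted figure works out to $100 per terabyte, or about $1 million for 10 PB of storage.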
12. The Immense Scale of Omics Data
Advances in sequencing and omics technologies have far outpaced data infrastructure
How do we remove the barriers to
data access and analysis at scale?
13. Data Management is Critical
[Diagram: PMO, SDM, QAQC/RQC, GAAG, Plant, MEP, RnD, Fungal, IMG, MGM, Genome Portal, External Collaborators, Web Services (Mycocosm, Phytozome, IMG/M ER)]
In 2013, JGI deployed a hierarchical data management
system to deal with the exponential growth in sequence
data and analysis products
14. JGI Archive and Metadata Organizer (JAMO)
[Diagram: GAAG, Plant, MEP, RnD, Fungal, IMG, MGM, SDM, QAQC/RQC, Web Services (Mycocosm, Phytozome, IMG/M ER), Genome Portal, External Collaborators, PMO]
16. JAMO Enabled Increased Automation Between Groups
• JGI’s core pipelines connect with JAMO and provide metadata through
templates
• Once data is available for processing, the workflows are triggered
automatically
• Data that fails QC is flagged for review
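The automation described in these bullets can be sketched as follows. This is not JAMO's actual interface or schema: the required template keys, the `Registry` class, and its `submit` method are all hypothetical, chosen only to illustrate template-validated metadata, automatic triggering, and QC flagging.

```python
from dataclasses import dataclass, field

# Hypothetical metadata template: these keys are illustrative,
# not JAMO's real schema.
REQUIRED_KEYS = {"file_path", "file_type", "pipeline", "sample_id"}


@dataclass
class Registry:
    """In-memory stand-in for a metadata store with workflow triggers."""
    records: list = field(default_factory=list)
    triggered: list = field(default_factory=list)
    needs_review: list = field(default_factory=list)

    def submit(self, metadata, passed_qc):
        """Validate metadata against the template, then route the record."""
        missing = REQUIRED_KEYS - set(metadata)
        if missing:
            raise ValueError(f"metadata missing keys: {sorted(missing)}")
        self.records.append(metadata)
        if passed_qc:
            # Data is available for processing: trigger the next workflow.
            self.triggered.append(metadata["file_path"])
        else:
            # Failed QC: flag for human review instead of processing.
            self.needs_review.append(metadata["file_path"])


registry = Registry()
registry.submit(
    {"file_path": "/data/run1.fastq", "file_type": "fastq",
     "pipeline": "qc", "sample_id": "S1"},
    passed_qc=True,
)
registry.submit(
    {"file_path": "/data/run2.fastq", "file_type": "fastq",
     "pipeline": "qc", "sample_id": "S2"},
    passed_qc=False,
)
print("triggered:", registry.triggered)
print("needs review:", registry.needs_review)
```

Rejecting records with incomplete metadata at submission time is one way to keep downstream search and automation reliable; whether JAMO enforces templates this strictly is not stated in the slides.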
17. JAMO is the Backbone of JGI’s Data Portal
All the metadata used to populate the Data Portal
comes from JAMO’s MongoDB database
18. Code for America Summit Talk on JGI’s New Data Portal
Aligning Data Across Siloed Departments
Many government sectors have been collecting data digitally for decades, often
in uncoordinated ways. In this talk we’ll explore how Truss and Joint Genome
Institute partnered to break down data silos and start conversations across
departments to align metadata across the organization. From establishing
baseline agreements, to finding common outcomes everyone could agree upon,
to bringing old data sets into the present, this talk will provide useful tools for
practitioners facing challenges of data misalignment across multiple
departments.
The talk is on Thursday, 2:00-3:00 pm PST
https://summit.codeforamerica.org/agenda/
19. Improving Search Across JGI
Metadata in one place makes search across all JGI programs possible
[Diagram: a User Query goes to the JGI-KBase RESTful Service, which draws on the JGI Data and Metadata system (including LIMS, GOLD, sequence, assemblies, annotations) and returns a Response with metadata, file types, and data sets]
21. Berkeley Lab is on a Major Fault Line
NERSC is
here!
Most samples used to generate data at JGI
are unique and irreplaceable
22. Backing up Irreplaceable Data
• Moved 1 PB of data to ORNL for safe-keeping
• Data migration completed in 5 days using Globus
• Enables access to the data – but only useful with the right metadata
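For a sense of scale, the average rate implied by moving 1 PB in 5 days can be worked out as below. This is a rough estimate that assumes a decimal petabyte (10^15 bytes) and a transfer running continuously for exactly five days; the slide gives neither detail.

```python
PETABYTE = 10**15              # decimal petabyte in bytes (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_bytes_per_s(total_bytes, days):
    """Average throughput for a transfer that runs continuously for `days`."""
    return total_bytes / (days * SECONDS_PER_DAY)


rate = average_rate_bytes_per_s(1 * PETABYTE, 5)
print(f"{rate / 1e9:.2f} GB/s, or {rate * 8 / 1e9:.1f} Gbit/s")
```

That is a sustained average of roughly 2.3 GB/s (about 18.5 Gbit/s) under the stated assumptions.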
[Diagram: Main JGI Data Repository (API, HPSS Archive) connected DTN to DTN with SUMMIT (JAMO light, API)]
23. What can you do with all that data and a supercomputer?
A Gordon Bell Prize (Supercomputing) winner in 2018 used all the well-characterized,
publicly available data to look at the genetic underpinnings of opioid addiction.
Wayne Joubert, et al. 2018. Attacking the opioid epidemic: determining the epistatic and pleiotropic genetic architectures
for chronic pain and opioid addiction. In Proceedings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis (SC ’18). IEEE Press, Article 57, 1–14.
Access to large amounts of ‘omics data
enables scientists to explore a broad range of
hypotheses!
24. CA has Earthquakes and Fires!
We need to distribute Data and Analysis to
maintain scientific productivity
25. JGI’s Centralized Workflow System
● JGI Analysis Workflow Service (JAWS)
● Need to be able to compute at multiple centers: NERSC, LBL IT, others
● Need to have more readily reusable and modifiable bioinformatics
pipelines
● Need workflows to support FAIR* guidelines
● Objective: Portable, Reusable, Traceable workflows on a Robust platform
*Findable, Accessible, Interoperable, Reusable
26. Distributed Computing is Hard
• Managing multiple user accounts
• Different facilities have different policies
– Batch schedulers
– File system availability and data retention
• Different architectures
– CPU vs GPU
– Local disk vs parallel file systems
– Memory size and footprint
• Portability is a lot of work
27. JGI is Running Analyses Across the West Coast
[Diagram: the JGI Centralized Workflow System uses the Cromwell Workflow Manager and Workflow Description Language as a common interface to access resources; some sites are in initial testing, and additional resources (cloud, ORNL, ANL, etc.) are marked as future]
28. JGI is Running Analyses Across the West Coast
[Diagram: JGI Centralized Workflow System; Workflow Description Language]
1. Find the data for analysis in the data management system
2. Authenticate with Globus and transfer the data to the remote computing resource
3. Work is executed, results are generated
4. Transfer data back to the home repository with Globus
5. Register the data and metadata with JAMO
Application tokens are accepted by the facilities we are using, making it possible to
transfer data on behalf of the user
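The five steps above can be sketched as a small orchestration function. This is an illustrative outline, not the JAWS implementation: `run_remote_analysis` and the stub callables are hypothetical, authentication is not modeled, and a real system would perform the transfers through Globus and the registration through JAMO.

```python
from typing import Callable, List


def run_remote_analysis(
    dataset_id: str,
    find: Callable[[str], List[str]],
    transfer: Callable[[List[str], str, str], List[str]],
    execute: Callable[[List[str]], List[str]],
    register: Callable[[List[str], dict], None],
    home: str = "home-repository",
    remote: str = "remote-site",
) -> List[str]:
    """Run the five steps in order; every callable is supplied by the caller."""
    inputs = find(dataset_id)                    # 1. locate the data
    staged = transfer(inputs, home, remote)      # 2. move it to the compute site
    results = execute(staged)                    # 3. run the analysis
    returned = transfer(results, remote, home)   # 4. bring the results home
    register(returned, {"source_dataset": dataset_id})  # 5. record data + metadata
    return returned


# In-memory stubs that only record what would happen.
log = []
registered = {}


def find(dataset_id):
    log.append("find")
    return [f"{dataset_id}/reads.fastq"]


def transfer(paths, src, dst):
    log.append(f"transfer {src}->{dst}")
    # Re-prefix each path with the destination site name.
    return [f"{dst}:{p.split(':')[-1]}" for p in paths]


def execute(paths):
    log.append("execute")
    return [p + ".result" for p in paths]


def register(paths, metadata):
    log.append("register")
    registered.update({p: metadata for p in paths})


outputs = run_remote_analysis("ds1", find, transfer, execute, register)
print(outputs)
print(log)
```

Passing the steps in as callables keeps the orchestration independent of any one facility, which matches the stated goal of a common interface across computing resources.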
29. Data Movement Between Resources – Globus!
• JGI has been using Globus since ~2012 to move data around
– One time we broke the service by trying to move millions of tiny files that were all in the same directory :D
• Globus enables JGI collaborators to download large amounts of data
– Biggest customers are the Bioenergy Research Centers – DOE-funded facilities investigating biofuels
– Some JGI Users are still willing to wait 9+ days for a download to complete via the browser – education opportunity!
• Globus is an integral part of JAWS
– Enables the application to move data between computing resources on behalf of the user
30. Summary
• JGI is a DOE User Facility that produces a lot of complex, unique data
for the scientific community
• As instruments improve, the data is higher quality – *metadata can still
be problematic
• We’d be lost without a good data management system
• JGI is turning to distributed computing for processing and large-scale
analyses
• Data movement made much easier and faster with Globus
31. Upcoming Virtual Annual Meeting/Resource Calls
● Aug 30 – Sept 1: 3 x 6-hour days, 2 sessions/day
– Exploring the Universe of Specialized Metabolites
– From Microbial Sequence to Environmental Function
– The Many Facets of Plant-Microbial Interactions
– Machine Learning and Artificial Intelligence for Biology
– Integrative Omics-Inspired Plant and Microbe Engineering
– Technology Innovations
● Community Science Program (CSP) Functional Genomics
proposal deadline: July 31
– Genes/Pathway synthesis
– Strain engineering
– Data mining
– Metabolomics
– RNA-seq
● New Investigator Call proposal deadline: Sept 15
– Bacterial and archaeal isolates and single cell draft genomes
– Metagenomes/metatranscriptomes
– DNA synthesis- and Metabolomics-based functional analysis
bit.ly/JGI-User-Programs
bit.ly/JGI-Meeting2021
jgi-comms@lbl.gov