Managing the genomics data deluge
at the DOE Joint Genome Institute
Kjiersten Fagnan
CIO, JGI
The DOE Joint Genome Institute at a glance
JGI MISSION:
To provide the global research community
with free access to the most advanced
integrative genome science capabilities in
support of the DOE energy &
environmental research mission
Integrative Genomics Building
(IGB)
U.S. Department of Energy Office of Science User Facility
● JGI established in 1997, User facility from 2004
● Located at Lawrence Berkeley National Laboratory
● ~285 staff; ~$80M annual funding
● 2,038 Global Primary Users in FY20; >10,000 Data
Users
JGI History
Environmental genomics will enable the Bioeconomy
[Figure: Genetic “Circuit” with labels DNA, Gene, Enzyme, Microbial Factory; chemical labels 2 NH4+ and CO3 2-]
FY 2020 Users: 2,038 Worldwide
Users on the Map: 2,038
Sector                      Users   Share
Academic                    1,504   74%
Government                    183    9%
DOE (national labs only)      161    8%
Industry                       29    1%
Other                         161    8%
Projects Completed/Scientific Publications
[Charts: Cumulative Number of Projects Completed; Cumulative Number of Scientific Publications]
Sequence Output
[Charts: Massively Parallel Short Read Sequencing, Basepairs (GB); Single Molecule Long Read Sequencing, Basepairs (GB)]
DOE Office of Science Public Reusable Research Data (PuRe Data)
https://science.osti.gov/Initiatives/PuRe-Data/Resources-at-a-Glance
Deluge of Large, Complex Data Sets
JGI manages a 10+ PB data repository
Mega – Giga – Tera – Peta – Exa – Zetta – Yotta
The cost to store 1 Yottabyte of data: $100 trillion*
This is just genomics data… we also want metabolomes, transcriptomes, proteomes, image data
* https://www.theatlantic.com/technology/archive/2011/05/infographic-how-big-is-a-yottabyte/239034/
5/19/2021
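To put the prefixes and the cost figure in proportion, here is a minimal arithmetic sketch. It assumes decimal (SI) prefixes and takes the 10 PB and $100 trillion figures from the slides at face value; the function and variable names are illustrative only.

```python
# Decimal (SI) byte prefixes named on the slide, as powers of ten.
PREFIX_EXPONENT = {
    "mega": 6, "giga": 9, "tera": 12, "peta": 15,
    "exa": 18, "zetta": 21, "yotta": 24,
}


def to_bytes(value, prefix):
    """Convert a value with an SI prefix (for example 10 'peta') to bytes."""
    return value * 10 ** PREFIX_EXPONENT[prefix]


repository_bytes = to_bytes(10, "peta")  # JGI repository: 10+ PB
yottabyte_bytes = to_bytes(1, "yotta")

# How many 10 PB repositories would fit in one yottabyte?
repositories_per_yottabyte = yottabyte_bytes / repository_bytes

# Unit cost implied by the slide's figure of $100 trillion per yottabyte.
cost_per_yottabyte_usd = 100e12
terabytes_per_yottabyte = yottabyte_bytes / to_bytes(1, "tera")
cost_per_terabyte_usd = cost_per_yottabyte_usd / terabytes_per_yottabyte

print(f"{repositories_per_yottabyte:.0e} repositories of 10 PB per yottabyte")
print(f"${cost_per_terabyte_usd:.0f} per terabyte implied by the slide")
```

Under these assumptions a yottabyte is one hundred million times the size of the current repository, and the quoted cost works out to $100 per terabyte.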
The Immense Scale of Omics Data
Advances in sequencing and omics technologies have far outpaced data infrastructure
How do we remove the barriers to
data access and analysis at scale?
Data Management is Critical
[Diagram labels: PMO, SDM, QAQC/RQC, GAAG, Plant, MEP, RnD, Fungal, Genome Portal, IMG, MGM, External Collaborators, Web Services (Mycocosm, Phytozome, IMGM/ER)]
In 2013, JGI deployed a hierarchical data management
system to deal with the exponential growth in sequence
data and analysis products
JGI Archive and Metadata Organizer (JAMO)
[Diagram labels: GAAG, Plant, MEP, RnD, Fungal, IMG, MGM, SDM, QAQC/RQC, Web Services (Mycocosm, Phytozome, IMGM/ER), Genome Portal, External Collaborators, PMO]
JAMO’s Back-end Infrastructure
JAMO Enabled Increased Automation Between Groups
• JGI’s core pipelines connect with JAMO and provide metadata through
templates
• Once data is available for processing, the workflows are triggered
automatically
• Data that fails QC is flagged for review
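The three bullets above can be sketched as a toy registry. This is not JAMO's actual interface: the `Registry` class, the `TEMPLATES` table, and every field name are hypothetical, and the sketch only illustrates the pattern of template-checked metadata, automatic triggering, and QC flagging.

```python
from dataclasses import dataclass, field

# Hypothetical template: the metadata keys a pipeline must supply per file type.
TEMPLATES = {
    "fastq": {"sample_id", "sequencer", "read_length"},
}


@dataclass
class Registry:
    """Toy stand-in for a metadata store; not JAMO's real interface."""
    records: list = field(default_factory=list)
    triggered: list = field(default_factory=list)
    flagged: list = field(default_factory=list)

    def register(self, file_type, path, metadata, passed_qc):
        # Reject submissions that do not fill in the whole template.
        missing = TEMPLATES[file_type] - set(metadata)
        if missing:
            raise ValueError(f"metadata missing required keys: {sorted(missing)}")
        self.records.append({"path": path, "file_type": file_type, **metadata})
        if passed_qc:
            # Data is available for processing, so the next workflow starts.
            self.triggered.append(path)
        else:
            # Failed QC: hold the file for a person to review.
            self.flagged.append(path)


registry = Registry()
registry.register(
    "fastq", "/data/run1.fastq",
    {"sample_id": "S1", "sequencer": "X", "read_length": 150}, passed_qc=True,
)
registry.register(
    "fastq", "/data/run2.fastq",
    {"sample_id": "S2", "sequencer": "X", "read_length": 150}, passed_qc=False,
)
print("triggered:", registry.triggered)
print("flagged:", registry.flagged)
```

Checking the template at registration time is one way to keep incomplete metadata out of the store, which matters because everything downstream relies on it.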
JAMO is the Backbone of JGI’s Data Portal
All the metadata used to populate the Data Portal
comes from JAMO’s MongoDB
Code for America Summit Talk on JGI’s New Data Portal
Aligning Data Across Siloed Departments
Many government sectors have been collecting data digitally for decades, often
in uncoordinated ways. In this talk we’ll explore how Truss and Joint Genome
Institute partnered to break down data silos and start conversations across
departments to align metadata across the organization. From establishing
baseline agreements, to finding common outcomes everyone could agree upon,
to bringing old data sets into the present, this talk will provide useful tools for
practitioners facing challenges of data misalignment across multiple
departments.
The talk is on Thursday, 2:00-3:00 pm PST
https://summit.codeforamerica.org/agenda/
Improving Search Across JGI
Metadata in one place makes search across all JGI programs possible
[Diagram labels: User Query; JGI-KBase RESTful Service; JGI Data and Metadata system including LIMS, GOLD, sequence, assemblies, annotations; Metadata and file types; Response; Data sets]
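A small sketch of why central metadata makes cross-program search straightforward. The records, field names, and `search` function below are hypothetical placeholders, not the JGI-KBase service or its schema; a real service would answer the same kind of query over HTTP.

```python
# Hypothetical, hand-written records standing in for a central metadata store.
RECORDS = [
    {"program": "Fungal", "organism": "organism-A", "file_type": "assembly"},
    {"program": "Plant", "organism": "organism-B", "file_type": "annotation"},
    {"program": "Fungal", "organism": "organism-A", "file_type": "annotation"},
]


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


# One query spans every program because the metadata lives in one place.
annotations = search(RECORDS, file_type="annotation")
fungal_assemblies = search(RECORDS, program="Fungal", file_type="assembly")

print(len(annotations), "annotation records")
print(len(fungal_assemblies), "fungal assembly records")
```

If each program kept its own store with its own field names, the same question would need one query per program plus a mapping between schemas.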
Most of JGI’s Infrastructure is @NERSC
Berkeley Lab is on a Major Fault Line
NERSC is
here!
Most samples used to generate data at JGI
are unique and irreplaceable
Backing up Irreplaceable Data
• Moved 1 PB of data to ORNL for safe-keeping
• Data migration completed in 5 days using Globus
• Enables access to the data – but only useful with the right metadata
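The average rate implied by moving 1 PB in 5 days can be worked out directly. The sketch assumes decimal units (1 PB = 10^15 bytes) and a continuous transfer over the full five days, neither of which the slide states.

```python
# Average rate implied by moving 1 PB in 5 days (decimal units assumed).
data_bytes = 1 * 10**15               # 1 PB
duration_seconds = 5 * 24 * 60 * 60   # 5 days

bytes_per_second = data_bytes / duration_seconds
gigabytes_per_second = bytes_per_second / 10**9
gigabits_per_second = gigabytes_per_second * 8

print(f"{gigabytes_per_second:.2f} GB/s sustained")
print(f"{gigabits_per_second:.1f} Gbit/s sustained")
```

Under those assumptions the migration averaged about 2.3 GB/s, or roughly 18.5 Gbit/s, sustained for five days.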
[Diagram labels: Main JGI Data Repository; API; HPSS Archive; JAMO light; DTN; DTN; SUMMIT; API]
What can you do with all that data and a supercomputer?
A Gordon Bell Prize (Supercomputing) winner in 2018 used all the well-
characterized publicly available data to look at genetic underpinnings of
opioid addiction.
Wayne Joubert, et al. 2018. Attacking the opioid epidemic: determining the epistatic and pleiotropic genetic architectures
for chronic pain and opioid addiction. In Proceedings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis (SC ’18). IEEE Press, Article 57, 1–14.
Access to large amounts of ‘omics data
enables scientists to explore a broad range of
hypotheses!
CA has Earthquakes and Fires!
We need to distribute Data and Analysis to
maintain scientific productivity
JGI’s Centralized Workflow System
● JGI Analysis Workflow Service (JAWS)
● Need to be able to compute at multiple centers: NERSC, LBL IT, others
● Need to have more readily reusable and modifiable bioinformatics
pipelines
● Need workflows to support FAIR* guidelines
● Objective: Portable, Reusable, Traceable workflows on a Robust platform
*Findable, Accessible, Interoperable, Reusable
Distributed Computing is Hard
• Managing multiple user accounts
• Different facilities have different policies
– Batch schedulers
– File system availability and data retention
• Different architectures
– CPU vs GPU
– Local disk vs parallel file systems
– Memory size and footprint
• Portability is a lot of work
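One way to picture the portability work is a per-site description that a workflow layer translates into scheduler-specific commands. The `Site` class, site names, and paths below are hypothetical; only the `sbatch` and `qsub` options shown are standard Slurm and PBS flags, and this is not how JAWS itself is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical description of one compute site; all values are illustrative."""
    name: str
    scheduler: str    # "slurm" or "pbs"
    scratch_dir: str  # where job data is staged at this site
    has_gpu: bool


def submit_command(site, script, hours, mem_gb):
    """Build the scheduler-specific command line for the same logical job."""
    walltime = f"{hours:02d}:00:00"
    if site.scheduler == "slurm":
        return ["sbatch", f"--time={walltime}", f"--mem={mem_gb}G", script]
    if site.scheduler == "pbs":
        return ["qsub", "-l", f"walltime={walltime}", "-l", f"mem={mem_gb}gb", script]
    raise ValueError(f"unsupported scheduler: {site.scheduler}")


site_a = Site("site_a", "slurm", "/scratch/site_a", has_gpu=False)
site_b = Site("site_b", "pbs", "/lustre/site_b", has_gpu=True)

for site in (site_a, site_b):
    print(site.name, " ".join(submit_command(site, "job.sh", hours=2, mem_gb=16)))
```

The job request stays the same while the command line changes per site, which is the part a central workflow service can hide from pipeline authors.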
JGI is Running Analyses Across the West Coast
[Diagram labels: JGI Centralized Workflow System; Cromwell Workflow Manager; Workflow Description Language; Common interface to access resources; initial testing; future; Additional resources (cloud, ORNL, ANL, etc)]
1. Find the data for analysis in the data management system
2. Authenticate with Globus and transfer the data to the remote computing resource
3. Work is executed, results are generated
4. Transfer data back to the home repository with Globus
5. Register the data and metadata with JAMO
Application tokens are accepted by the facilities we are using, making it possible to
transfer data on behalf of the user
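The five steps can be sketched as a single orchestration function. Every callable here is a placeholder supplied by the caller, not a JAMO, JAWS, or Globus API; authentication is left out, and in a real system the `transfer` callable is where a token-authorized Globus transfer would be submitted.

```python
def run_remote_analysis(query, find_data, transfer, execute, register,
                        home="home", remote="remote"):
    """Run the five steps in order; each callable stands in for a real service."""
    inputs = find_data(query)                   # 1. find data in the data management system
    staged = transfer(inputs, home, remote)     # 2. move inputs to the remote resource
    results = execute(staged)                   # 3. run the work, generate results
    returned = transfer(results, remote, home)  # 4. move results back to the home repository
    register(returned, query)                   # 5. record the data and metadata
    return returned


# Toy implementations that only record what happened.
log = []


def find_data(query):
    log.append("find")
    return ["reads.fastq"]


def transfer(paths, source, destination):
    log.append(f"transfer {source}->{destination}")
    # Tag each path with the location it now lives at.
    return [f"{destination}:{path.split(':')[-1]}" for path in paths]


def execute(paths):
    log.append("execute")
    return [path + ".out" for path in paths]


def register(paths, metadata):
    log.append("register")


outputs = run_remote_analysis({"project": "demo"}, find_data, transfer, execute, register)
print(outputs)
print(log)
```

Passing the services in as callables keeps the sequence testable without credentials or network access, and the same function works whichever remote site is chosen.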
Data Movement Between Resources – Globus!
• JGI has been using Globus since ~2012 to move data around
– One time we broke the service by trying to move millions of tiny files that were all in the same directory :D
• Globus enables JGI collaborators to download large amounts of data
– Biggest customers are the Bioenergy Research Centers – DOE funded facilities investigating biofuels
– Some JGI Users are still willing to wait 9+ days for a download to complete via the browser – education opportunity!
• Globus is an integral part of JAWS
– Enables the application to move data between computing resources on behalf of the user
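The tiny-files anecdote points at a general issue: per-file overhead dominates when a transfer contains millions of small files. One common mitigation, which the talk does not describe and which is offered here only as an assumption, is to bundle a directory into a single archive before moving it. A minimal standard-library sketch:

```python
import tarfile
import tempfile
from pathlib import Path


def bundle_directory(source_dir, archive_path):
    """Pack every regular file under source_dir into one tar archive.

    Returns the number of files added.
    """
    count = 0
    with tarfile.open(archive_path, "w") as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to source_dir so the archive unpacks cleanly.
                archive.add(path, arcname=str(path.relative_to(source_dir)))
                count += 1
    return count


# Demonstration on a throwaway directory of small files.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "tiny_files"
    source.mkdir()
    for i in range(1000):
        (source / f"part_{i:04d}.txt").write_text("x")
    archive_file = Path(tmp) / "tiny_files.tar"
    bundled = bundle_directory(source, archive_file)
    with tarfile.open(archive_file) as archive:
        member_count = len(archive.getnames())

print(bundled, "files bundled into one archive with", member_count, "members")
```

The transfer then handles one large file instead of a thousand small ones, at the cost of an unpack step on the receiving side.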
Summary
• JGI is a DOE User Facility that produces a lot of complex, unique data
for the scientific community
• As instruments improve, the data is higher quality, but metadata can still
be problematic
• We’d be lost without a good data management system
• JGI is turning to distributed computing for processing and large-scale
analyses
• Data movement made much easier and faster with Globus
Upcoming Virtual Annual Meeting/Resource Calls
● Aug 30 – Sept 1: 3 x 6-hour days, 2 sessions/day
– Exploring the Universe of Specialized Metabolites
– From Microbial Sequence to Environmental Function
– The Many Facets of Plant-Microbial Interactions
– Machine Learning and Artificial Intelligence for Biology
– Integrative Omics-Inspired Plant and Microbe Engineering
– Technology Innovations
● Community Science Program (CSP) Functional Genomics
proposal deadline: July 31
– Genes/Pathway synthesis
– Strain engineering
– Data mining
– Metabolomics
– RNA-seq
● New Investigator Call proposal deadline: Sept 15
– Bacterial and archaeal isolates and single cell draft genomes
– Metagenomes/metatranscriptomes
– DNA synthesis- and Metabolomics-based functional analysis
bit.ly/JGI-User-Programs
bit.ly/JGI-Meeting2021
jgi-comms@lbl.gov

GlobusWorld 2021: Managing Genomics Data at the DOE Joint Genomics Institute

  • 1. Managing the genomics data deluge at the DOE Joint Genome Institute Kjiersten Fagnan CIO, JGI
  • 2. The DOE Joint Genome Institute at a glance JGI MISSION: To provide the global research community with free access to the most advanced integrative genome science capabilities in support of the DOE energy & environmental research mission Integrative Genomics Building (IGB) U.S. Department of Energy Office of Science User Facility ● JGI established in 1997, User facility from 2004 ● Located at Lawrence Berkeley National Laboratory ● ~285 staff; ~$80M annual funding ● 2,038 Global Primary Users in FY20; >10,000 Data Users
  • 4.
  • 5. Environmental genomics will enable the Bioeconomy Genetic “Circuit” Gene Enzyme Microbial Factory DNA 2 NH4 2+ CO3 2-
  • 6. FY 2020 Users: 2,038 Worldwide 6 Users on the Map: 2,038 Academic 1,504 74% Government 183 9% DOE (national labs only) 161 8% Industry 29 1% Other 161 8%
  • 7. Projects Completed/Scientific Publications 7 Cumulative Number of Projects Completed Cumulative Number of Scientific Publications
  • 8. Sequence Output 8 Massively Parallel Short Read Sequencing Basepairs (GB) Single Molecule Long Read Sequencing Basepairs (GB)
  • 9. DOE Office of Science Public Reusable Research Data (PuRe Data) https://science.osti.gov/Initiatives/PuRe- Data/Resources-at-a-Glance
  • 10. Deluge of Large, Complex Data Sets 10 JGI manages a 10+ PB data repository
  • 11. Mega – Giga – Tera – Peta – Exa – Zetta – Yotta 5/19/2021 https://www.theatlantic.com/technology/archive/2011/05/infographic-how-big-is-a-yottabyte/239034/ 11 The cost to store 1 Yottabyte of data - $100 trillion* This is just genomics data… we also want metabolomes, transcriptomes, proteomes, image data
  • 12. The Immense Scale of Omics Data 5/19/2021 12 Advances in sequencing and omics technologies have far outpaced data infrastructure How do we remove the barriers to data access and analysis at scale?
  • 13. Data Management is Critical. In 2013, JGI deployed a hierarchical data management system to deal with the exponential growth in sequence data and analysis products. [Diagram labels: PMO; SDM; QAQC / RQC; GAAG; Plant; MEP; RnD; Fungal; Genome Portal; IMG; MGM; External Collaborators; Web Services (Mycocosm, Phytozome, IMGM/ER)]
  • 14. JGI Archive and Metadata Organizer (JAMO). [Diagram labels: GAAG; Plant; MEP; RnD; Fungal; IMG; MGM; SDM; QAQC / RQC; Web Services (Mycocosm, Phytozome, IMGM/ER); Genome Portal; External Collaborators; PMO]
  • 16. JAMO Enabled Increased Automation Between Groups • JGI’s core pipelines connect with JAMO and provide metadata through templates • Once data is available for processing, the workflows are triggered automatically • Data that fails QC is flagged for review
  • 17. JAMO is the Backbone of JGI’s Data Portal. All the metadata used to populate the Data Portal comes from JAMO’s MongoDB.
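The three bullets on this slide describe a register, trigger, and flag pattern. The following is a minimal sketch of that pattern only; it is not JAMO's API. The template keys, the `Record` class, and the status strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical template: metadata keys a pipeline must supply.
# These key names are assumptions, not JAMO's actual schema.
REQUIRED_KEYS = {"sample_id", "file_type", "pipeline"}


@dataclass
class Record:
    path: str
    metadata: dict
    status: str = "registered"
    notes: list = field(default_factory=list)


def register(path, metadata):
    """Validate metadata against the template before accepting a file."""
    missing = REQUIRED_KEYS - set(metadata)
    if missing:
        raise ValueError(f"missing metadata keys: {sorted(missing)}")
    return Record(path=path, metadata=dict(metadata))


def process(record, qc_check, workflow):
    """Run QC; trigger the workflow on a pass, flag for review on a fail."""
    if qc_check(record):
        record.status = "processing"
        workflow(record)  # triggered automatically, no manual hand-off
    else:
        record.status = "needs_review"
        record.notes.append("failed QC")
    return record
```

Passing the QC check and the workflow in as callables keeps the sketch independent of any particular pipeline: each group can supply its own check while the registration and flagging logic stays shared.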
  • 18. Code for America Summit Talk on JGI’s New Data Portal: Aligning Data Across Siloed Departments. Many government sectors have been collecting data digitally for decades, often in uncoordinated ways. In this talk we’ll explore how Truss and Joint Genome Institute partnered to break down data silos and start conversations across departments to align metadata across the organization. From establishing baseline agreements, to finding common outcomes everyone could agree upon, to bringing old data sets into the present, this talk will provide useful tools for practitioners facing challenges of data misalignment across multiple departments. The talk is on Thursday, later in the day, 2:00-3:00 pm PST. https://summit.codeforamerica.org/agenda/
  • 19. Improving Search Across JGI. Metadata in one place makes search across all JGI programs possible. [Diagram labels: User; Query; JGI-KBase RESTful Service; JGI Data and Metadata system including LIMS, GOLD, sequence, assemblies, annotations; Metadata and file types; Response; Data sets]
  • 20. Most of JGI’s Infrastructure is @NERSC
  • 21. Berkeley Lab is on a Major Fault Line. [Map label: NERSC is here!] Most samples used to generate data at JGI are unique and irreplaceable.
  • 22. Backing up Irreplaceable Data • Moved 1 PB of data to ORNL for safe-keeping • Data migration completed in 5 days using Globus • Enables access to the data, but it is only useful with the right metadata. [Diagram labels: Main JGI Data Repository; API; HPSS Archive; JAMO light; DTN; DTN; SUMMIT; API]
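The two figures on this slide (1 PB, 5 days) imply a sustained transfer rate. The sketch below works that out, assuming a decimal petabyte (10^15 bytes) and continuous transfer over the full five days; the function name is illustrative.

```python
def mean_throughput(total_bytes, days):
    """Return the average rate as (gigabytes/s, gigabits/s)."""
    seconds = days * 24 * 60 * 60
    bytes_per_second = total_bytes / seconds
    gigabytes_per_second = bytes_per_second / 1e9
    gigabits_per_second = bytes_per_second * 8 / 1e9
    return gigabytes_per_second, gigabits_per_second


# Assumption: 1 PB = 10**15 bytes, moved continuously for 5 days.
gb_per_s, gbit_per_s = mean_throughput(10**15, 5)
```

Under those assumptions the average is roughly 2.3 gigabytes per second, or about 18.5 gigabits per second. The real peak rate would have been higher if the transfer paused at any point.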
  • 23. What can you do with all that data and a supercomputer? A Gordon Bell Prize (Supercomputing) winner in 2018 used all the well-characterized publicly available data to look at the genetic underpinnings of opioid addiction. Wayne Joubert, et al. 2018. Attacking the opioid epidemic: determining the epistatic and pleiotropic genetic architectures for chronic pain and opioid addiction. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC ’18). IEEE Press, Article 57, 1–14. Access to large amounts of ’omics data enables scientists to explore a broad range of hypotheses!
  • 24. CA has Earthquakes and Fires! We need to distribute Data and Analysis to maintain scientific productivity.
  • 25. JGI’s Centralized Workflow System ● JGI Analysis Workflow Service (JAWS) ● Need to be able to compute at multiple centers: NERSC, LBL IT, others ● Need to have more readily reusable and modifiable bioinformatics pipelines ● Need workflows to support FAIR* guidelines ● Objective: Portable, Reusable, Traceable workflows on a Robust platform (*Findable, Accessible, Interoperable, Reusable)
  • 26. Distributed Computing is Hard • Managing multiple user accounts • Different facilities have different policies – Batch schedulers – File system availability and data retention • Different architectures – CPU vs GPU – Local disk vs parallel file systems – Memory size and footprint • Portability is a lot of work
  • 27. JGI is Running Analyses Across the West Coast. [Diagram labels: JGI Centralized Workflow System; Workflow Description Language; Cromwell Workflow Manager; Common interface to access resources; Additional resources (cloud, ORNL, ANL, etc); initial testing; future]
  • 28. JGI is Running Analyses Across the West Coast. JGI Centralized Workflow System, Workflow Description Language. 1. Find the data for analysis in the data management system. 2. Authenticate with Globus and transfer the data to the remote computing resource. 3. Work is executed, results are generated. 4. Transfer data back to the home repository with Globus. 5. Register the data and metadata with JAMO. Application tokens are accepted by the facilities we are using, making it possible to transfer data on behalf of the user.
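The five steps on this slide can be sketched as a single function that runs them in order. This is an illustration of the sequencing only, not JAWS code: the `find`, `transfer`, `execute`, and `register` callables and the endpoint names are hypothetical stand-ins for the data management lookup, the Globus transfers, the remote job, and the JAMO registration.

```python
def run_remote_analysis(query, find, transfer, execute, register,
                        home="home", remote="remote"):
    """Run the five steps in order and return whatever register() returns.

    All callables are injected, so the same sequence can be pointed at
    different computing sites without changing this function.
    """
    inputs = find(query)                         # 1. locate data
    staged = transfer(inputs, home, remote)      # 2. move to remote site
    results = execute(staged)                    # 3. run the work
    returned = transfer(results, remote, home)   # 4. bring results home
    return register(returned)                    # 5. record data + metadata


# Usage with stand-in callables that only record what happened.
log = []


def demo_find(query):
    log.append("find")
    return [query + ".fastq"]


def demo_transfer(files, src, dst):
    log.append(f"{src}->{dst}")
    return files


def demo_execute(files):
    log.append("execute")
    return [name + ".out" for name in files]


def demo_register(files):
    log.append("register")
    return files


registered = run_remote_analysis(
    "sample1", demo_find, demo_transfer, demo_execute, demo_register
)
```

Note that the same `transfer` callable is used in both directions, which mirrors the slide: one transfer service moves inputs out and results back.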
  • 29. Data Movement Between Resources – Globus! • JGI has been using Globus since ~2012 to move data around – One time we broke the service by trying to move millions of tiny files that were all in the same directory :D • Globus enables JGI collaborators to download large amounts of data – Biggest customers are the Bioenergy Research Centers, DOE-funded facilities investigating biofuels – Some JGI Users are still willing to wait 9+ days for a download to complete via the browser: an education opportunity! • Globus is an integral part of JAWS – Enables the application to move data between computing resources on behalf of the user
  • 30. Summary • JGI is a DOE User Facility that produces a lot of complex, unique data for the scientific community • As instruments improve, the data is higher quality, but metadata can still be problematic • We’d be lost without a good data management system • JGI is turning to distributed computing for processing and large-scale analyses • Data movement is made much easier and faster with Globus
  • 31. Upcoming Virtual Annual Meeting/Resource Calls ● Aug 30 – Sept 1: 3 x 6-hour days, 2 sessions/day – Exploring the Universe of Specialized Metabolites – From Microbial Sequence to Environmental Function – The Many Facets of Plant-Microbial Interactions – Machine Learning and Artificial Intelligence for Biology – Integrative Omics-Inspired Plant and Microbe Engineering – Technology Innovations ● Community Science Program (CSP) Functional Genomics proposal deadline: July 31 – Genes/Pathway synthesis – Strain engineering – Data mining – Metabolomics – RNA-seq ● New Investigator Call proposal deadline: Sept 15 – Bacterial and archaeal isolates and single cell draft genomes – Metagenomes/metatranscriptomes – DNA synthesis- and Metabolomics-based functional analysis. bit.ly/JGI-User-Programs bit.ly/JGI-Meeting2021 jgi-comms@lbl.gov