Data Automation
at Light Sources
Ian Foster
Argonne National Laboratory & The University of Chicago
Advanced Photon Source
Argonne Leadership
Computing Facility
1 km
5 μsec
I have been working with light sources for some time!
von Laszewski et al., Real-time analysis, visualization, and steering of microtomography experiments at photon sources, SIAM Parallel Processing, 1999
“the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
Work on high-speed analysis continues
Ptychography: Use GPU cluster for 360x speedup, from 7 hours to 72 s
[Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen]
→ Enable online analysis and use of fly scans
Microtomography: Use 32K Mira BG/Q nodes to reduce reconstruction time from days to 2 mins
[Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal]
→ Identify and correct experimental misconfiguration
High-energy diffraction microscopy: 10K BG/Q nodes to reconstruct in 10 minutes
[Sharma, Almer, Wozniak, Wilde, Foster]
→ Zoom in on crack locations (switch far field → near field)
Coherence
Brightness
High Energy
Micrometer porosity structure of shale samples
Microstructure of a copper wire, 0.2mm diameter
We face a data crisis (and opportunity)
New instrumentation means that data rates are growing much faster than Moore’s Law
→ Neither humans nor computers can cope by using current methods
We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software
→ “A knowledge-based society, connected by the Internet and powered by AI …” (Chen Chien-jen)
How industry deals with scale
https://bit.ly/2l4gfgu
Automate and outsource:
(1) Data distribution
Needs: Usable, efficient, reliable,
secure, sustainable
Outsource:
(a) Petrel data store to hold data
prior to/during distribution
Petrel online store
petrel.alcf.anl.gov
94 Gbit/s Petrel—Blue Waters
2 petabytes
100 Gbps
(b) Globus service for data
transfer and sharing
Globus APIs
globus.org
Automate:
(c) DMagic script uses Globus
APIs to transfer data and
configure permissions
http://dmagic.readthedocs.io
Francesco de Carlo
Given an experiment date:
• Retrieve user info from APS scheduler
• Create Globus “shared endpoint” and configure permissions
• Monitor directory at beamline and use Globus to copy new files to endpoint
• Email link to shared endpoint for data retrieval
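The "monitor directory and copy new files" step can be sketched as below. This is a minimal illustration and not DMagic's implementation: `sync_new_files` and the file names are hypothetical, and `shutil.copy2` between two local directories stands in for a Globus transfer to the shared endpoint.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Set


def sync_new_files(source: Path, dest: Path, seen: Set[str]) -> List[str]:
    """Copy files not yet recorded in `seen` from source to dest.

    Returns the names copied in this pass. shutil.copy2 is a local
    stand-in for a Globus transfer (assumption for illustration).
    """
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.name not in seen:
            shutil.copy2(path, dest / path.name)
            seen.add(path.name)
            copied.append(path.name)
    return copied


# Temporary directories stand in for the beamline directory and the endpoint.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    beamline, endpoint = Path(src), Path(dst)
    seen: Set[str] = set()

    (beamline / "scan_001.h5").write_bytes(b"first")
    first_pass = sync_new_files(beamline, endpoint, seen)

    (beamline / "scan_002.h5").write_bytes(b"second")
    second_pass = sync_new_files(beamline, endpoint, seen)

    endpoint_files = sorted(p.name for p in endpoint.iterdir())
```

Each pass copies only files that have appeared since the previous pass, so the function can be called on a timer without re-sending data.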
Automate and outsource:
(2) Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
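The publication steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Materials Data Facility workflow: `publish` and `search` are hypothetical helpers, a random UUID stands in for a persistent identifier, and a dictionary stands in for a search index.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path
from typing import Any, Dict, List


def sha256_of(path: Path) -> str:
    """Compute a SHA-256 checksum, reading the file in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(path: Path, metadata: Dict[str, Any],
            index: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Record checksum and metadata, assign an identifier, add to the index."""
    record = {
        "identifier": str(uuid.uuid4()),  # stand-in for a persistent identifier
        "location": str(path),
        "checksum_sha256": sha256_of(path),
        "size_bytes": path.stat().st_size,
        "metadata": metadata,
    }
    index[record["identifier"]] = record
    return record


def search(index: Dict[str, Dict[str, Any]], key: str, value: Any) -> List[Dict[str, Any]]:
    """Return records whose metadata field `key` equals `value`."""
    return [r for r in index.values() if r["metadata"].get(key) == value]


index: Dict[str, Dict[str, Any]] = {}
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "sample.h5"
    data_file.write_bytes(b"hello")
    record = publish(data_file, {"technique": "microtomography"}, index)

matches = search(index, "technique", "microtomography")
```

Recording the checksum at publication time lets a later reader verify that a retrieved copy matches what was published.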
Data Publication
Indexing
materialsdatafacility.org
Programmatic access (REST, Python, Jupyter)
Web browse and search
Automate and outsource:
(3) End-to-end data pipelines
For each dataset, we must apply quality control, assign identifiers, move to
compute, extract features, eventually publish to a public repository, …
Building a different custom pipeline for every situation is impractical
Automate: Trigger-action programming (“if this happens, then do that”)
Outsource: Cloud-based trigger-action service for reliability,
scalability, ease of use, security, sustainability
Automate and outsource:
(3) End-to-end pipelines with trigger-action programming
National Facility
Beamline Instrument
Local Storage and Compute
• Quality Control
• Assign Handle
• Email / SMS notification
Globus Transfer
Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format
Globus Transfer
Archive
• Set sharing ACLs
• Set timer for publication to Materials Data Facility
Data publication
Rules
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
• IF new files THEN run feature extraction
• IF feature detected THEN transfer data to archival storage
• IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
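The IF/THEN rules of this slide can be expressed as condition-action pairs evaluated against a dataset's state. The sketch below is an illustration only, not the cloud service described in the talk: `Rule`, `run_rules`, the state keys, and the placeholder actions are all hypothetical, and "6 months" is assumed to mean 183 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[State], bool]  # the IF part
    action: Callable[[State], None]     # the THEN part


def run_rules(rules: List[Rule], state: State) -> List[str]:
    """Fire each rule at most once per call, repeating until nothing new fires."""
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name not in fired and rule.condition(state):
                rule.action(state)
                fired.append(rule.name)
                progress = True
    return fired


SIX_MONTHS = timedelta(days=183)  # assumed length of "6 months"

rules = [
    Rule("quality control",
         lambda s: s["location"] == "local" and "quality" not in s,
         lambda s: s.update(quality="good")),       # placeholder for a QC script
    Rule("transfer to CSC",
         lambda s: s["location"] == "local" and s.get("quality") == "good",
         lambda s: s.update(location="csc", notified=True)),
    Rule("feature extraction",
         lambda s: s["location"] == "csc" and "features" not in s,
         lambda s: s.update(features=["crack"])),   # placeholder result
    Rule("archive",
         lambda s: s["location"] == "csc" and bool(s.get("features")),
         lambda s: s.update(location="archive")),
    Rule("publish",
         lambda s: s["location"] == "archive"
         and s["now"] - s["ingested"] > SIX_MONTHS,
         lambda s: s.update(published=True)),
]

ingested = datetime(2018, 1, 1)
state: State = {"location": "local", "ingested": ingested,
                "now": ingested + timedelta(days=1)}
fired_early = run_rules(rules, state)   # everything except publication

state["now"] = ingested + timedelta(days=200)
fired_late = run_rules(rules, state)    # the timer rule fires on a later pass
```

Keeping each rule as data rather than as a step in a script is what lets the same engine serve different pipelines: a new situation needs a new rule list, not a new program.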
Another example: Mosaic tomography for neurocartography
(N. Kasthuri, R. Chard, et al.)
Data Source: APS beamline 32-ID
• Capture dataset creation
• Review center position
Collector Storage and Compute: ALCF Cooley Cluster
• Generate preview and center images
• Reconstruct image
• Extract metadata
Archive: ALCF Petrel
• Ingest in Globus Search
• Set sharing ACLs
• Data publication
Visualize with Neuroglancer
Rules
• IF new HDF5 files THEN transfer to Cooley
• IF new center_pos THEN initiate reconstruction
• IF transfer complete THEN execute preview and center finding
• IF results THEN return data to APS
• IF reconstruction THEN transfer data to Petrel AND publish dataset
globus.org
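In this example the rules chain: the action of one rule produces the event that triggers the next. A minimal sketch of that chaining with an event queue follows. The event names, `step`, and `dispatch` are hypothetical, and which action emits which event is an assumed reading of the slide rather than a description of the deployed pipeline.

```python
from collections import deque
from typing import Callable, Dict, List

# A handler receives a dataset name and returns the events its action emits.
Handler = Callable[[str], List[str]]

log: List[str] = []


def step(action: str, *next_events: str) -> Handler:
    """Build a handler that records the action and emits follow-up events."""
    def handler(dataset: str) -> List[str]:
        log.append(f"{action}: {dataset}")
        return list(next_events)
    return handler


rules: Dict[str, Handler] = {
    "new_hdf5_file": step("transfer to Cooley", "transfer_complete"),
    "transfer_complete": step("preview and center finding", "results"),
    "results": step("return data to APS"),
    "new_center_pos": step("reconstruct", "reconstruction"),
    "reconstruction": step("transfer to Petrel and publish"),
}


def dispatch(event: str, dataset: str) -> None:
    """Process an event and every event it triggers, in arrival order."""
    queue = deque([event])
    while queue:
        handler = rules.get(queue.popleft())
        if handler is not None:
            queue.extend(handler(dataset))


# A new file arrives at the beamline; later a center position is supplied.
dispatch("new_hdf5_file", "sample.h5")
dispatch("new_center_pos", "sample.h5")
```

The chain pauses after data return to APS and resumes only when a center position arrives, which is where the "Review center position" step fits.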
Automate and outsource:
(4) Data transformation and analysis
Say you want to use a deep neural network for online identification of problems when running diffraction experiments
“beam misaligned”
“…”
https://doi.org/10.1109/NYSDS.2017.8085045
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
DLHub
[“beam off image”, …]
model/xray/batch_predict
Data and Learning Hub (DLHub): Overview
• Collect, publish, categorize models/code/weights/data from many sources
• Serve models via API to foster sharing, consumption, and access to data,
training sets, and models
• Automate training of models (using HPC as needed) as new data are available
• Enable new science through reuse and synthesis of existing models
DLHub: Collect, serve, train community models
1) Register a model
• Collect data
• Train model
• Register model (model / transform containers)
• Send to DLHub
• Receive DOI
2) Run a model
• Find model
• Call DLHub
• Send compositions
• Receive predicted properties
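The register and run steps can be sketched with an in-memory registry. This is an illustration and not the DLHub API: `ModelRegistry`, the model name `model/xray`, the threshold classifier, and the label "no problem detected" are hypothetical, and the returned identifier is a local placeholder rather than a DOI.

```python
import uuid
from typing import Callable, Dict, List, Sequence

BatchModel = Callable[[Sequence[float]], List[str]]


class ModelRegistry:
    """In-memory stand-in for a model-serving hub (not the DLHub API)."""

    def __init__(self) -> None:
        self._models: Dict[str, BatchModel] = {}
        self._identifiers: Dict[str, str] = {}

    def register(self, name: str, model: BatchModel) -> str:
        """Store a model under a name and return a placeholder identifier."""
        self._models[name] = model
        self._identifiers[name] = f"local:{uuid.uuid4()}"  # not a real DOI
        return self._identifiers[name]

    def find(self, keyword: str) -> List[str]:
        """Return the names of registered models containing the keyword."""
        return sorted(name for name in self._models if keyword in name)

    def run(self, name: str, inputs: Sequence[float]) -> List[str]:
        """Invoke a registered model on a batch of inputs."""
        return self._models[name](inputs)


def toy_xray_classifier(mean_intensities: Sequence[float]) -> List[str]:
    """Placeholder for a trained network: thresholds one number per image."""
    return ["beam off image" if value < 0.05 else "no problem detected"
            for value in mean_intensities]


registry = ModelRegistry()
identifier = registry.register("model/xray", toy_xray_classifier)
found = registry.find("xray")
labels = registry.run(found[0], [0.01, 0.80])
```

The caller needs only the model's name and a batch of inputs; where the weights live and how the model is executed stay behind the `run` call.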
Invoke model on data
I reported on the work of many talented people
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
Thanks also to:
• Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer,
Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source
• Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing
We are grateful to our sponsors
DLHub Globus
IMaD
Petrel
Argonne Leadership
Computing Facility
In summary
More data demands new methods for designing experiments,
managing data, analyzing data, and creating and delivering software
We must automate and outsource to manage data, run pipelines,
and train and run (machine learning) models
I presented examples that illustrate what can be done:
• High-speed storage services for data staging and distribution: Petrel
• Cloud-based services for data transfer and sharing: Globus Transfer
• Data publication and discovery services: Materials Data Facility
• Cloud-based automation services: Globus Automate
• Model and transformation services to encapsulate software: DLHub
There are many opportunities, and great need, for collaboration
To follow up: foster@anl.gov

Weitere ähnliche Inhalte

Was ist angesagt?

NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasIan Foster
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013Kirill Osipov
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryIan Foster
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010Ian Foster
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneIan Foster
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015Tanu Malik
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Robert Grossman
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesTanu Malik
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduriRavi Madduri
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitwarebigdataviz_bay
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirSpark Summit
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualizationbigdataviz_bay
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)Robert Grossman
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3Robert Grossman
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesLynn Langit
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Robert Grossman
 

Was ist angesagt? (20)

NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
 
Science cloud foster june 2013
Science cloud foster june 2013Science cloud foster june 2013
Science cloud foster june 2013
 
Science as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate DiscoveryScience as a Service: How On-Demand Computing can Accelerate Discovery
Science as a Service: How On-Demand Computing can Accelerate Discovery
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015
 
Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)Health & Status Monitoring (2010-v8)
Health & Status Monitoring (2010-v8)
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging Services
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Big data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at KitwareBig data visualization frameworks and applications at Kitware
Big data visualization frameworks and applications at Kitware
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
Big Data Visualization
Big Data VisualizationBig Data Visualization
Big Data Visualization
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)An Overview of Bionimbus (March 2010)
An Overview of Bionimbus (March 2010)
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3OCC Overview OMG Clouds Meeting 07-13-09 v3
OCC Overview OMG Clouds Meeting 07-13-09 v3
 
Genomic Scale Big Data Pipelines
Genomic Scale Big Data PipelinesGenomic Scale Big Data Pipelines
Genomic Scale Big Data Pipelines
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11Open Science Data Cloud - CCA 11
Open Science Data Cloud - CCA 11
 

Ähnlich wie Data Automation at Light Sources

Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Globus
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!Ian Foster
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Globus
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Databricks
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilitiesIan Foster
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Databricks
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationIan Foster
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASAIan Foster
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Ian Foster
 
Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011Ian Foster
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Ian Foster
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudAdianto Wibisono
 
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Amazon Web Services
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobus
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Ian Foster
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudAmazon Web Services
 
Accelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the CloudAccelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the CloudJamie Kinney
 
CPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the CloudCPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the CloudCameron Craddock
 

Ähnlich wie Data Automation at Light Sources (20)

Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
Gladier: The Globus Architecture for Data Intensive Experimental Research (AP...
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 
Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)Working with Instrument Data (GlobusWorld Tour - UMich)
Working with Instrument Data (GlobusWorld Tour - UMich)
 
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
Deep Learning on Apache Spark at CERN’s Large Hadron Collider with Intel Tech...
 
Big data at experimental facilities
Big data at experimental facilitiesBig data at experimental facilities
Big data at experimental facilities
 
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
Deep Learning Pipelines for High Energy Physics using Apache Spark with Distr...
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
Big Process for Big Data @ NASA
Big Process for Big Data @ NASABig Process for Big Data @ NASA
Big Process for Big Data @ NASA
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)Globus Integrations (CHPC 2019 - South Africa)
Globus Integrations (CHPC 2019 - South Africa)
 
Rpi talk foster september 2011
Rpi talk foster september 2011Rpi talk foster september 2011
Rpi talk foster september 2011
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Grid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the CloudGrid is Dead ? Nimrod on the Cloud
Grid is Dead ? Nimrod on the Cloud
 
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
Globus Genomics: How Science-as-a-Service is Accelerating Discovery (BDT310) ...
 
GlobusWorld 2020 Keynote
GlobusWorld 2020 KeynoteGlobusWorld 2020 Keynote
GlobusWorld 2020 Keynote
 
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
Science Services and Science Platforms: Using the Cloud to Accelerate and Dem...
 
Time to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the CloudTime to Science/Time to Results: Transforming Research in the Cloud
Time to Science/Time to Results: Transforming Research in the Cloud
 
Accelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the CloudAccelerating Time to Science: Transforming Research in the Cloud
Accelerating Time to Science: Transforming Research in the Cloud
 
CPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the CloudCPAC Connectome Analysis in the Cloud
CPAC Connectome Analysis in the Cloud
 

Mehr von Ian Foster

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxIan Foster
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionIan Foster
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumIan Foster
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsIan Foster
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryIan Foster
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptxIan Foster
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceIan Foster
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryIan Foster
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryIan Foster
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterIan Foster
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon SummaryIan Foster
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperabilityIan Foster
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFIan Foster
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...Ian Foster
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformIan Foster
 
Globus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformGlobus Auth: A Research Identity and Access Management Platform
Globus Auth: A Research Identity and Access Management PlatformIan Foster
 
Streamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchStreamlined data sharing and analysis to accelerate cancer research
Streamlined data sharing and analysis to accelerate cancer researchIan Foster
 
Accelerating Discovery via Science Services
Accelerating Discovery via Science ServicesAccelerating Discovery via Science Services
Accelerating Discovery via Science ServicesIan Foster
 
Accelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceAccelerating Data-driven Discovery in Energy Science
Accelerating Data-driven Discovery in Energy ScienceIan Foster
 
building global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudbuilding global software/earthcube->sciencecloud
building global software/earthcube->sciencecloudIan Foster
 

Data Automation at Light Sources

  • 1. Data Automation at Light Sources. Ian Foster, Argonne National Laboratory & The University of Chicago
  • 2. [Figure: Advanced Photon Source and Argonne Leadership Computing Facility, annotated "1 km" and "5 μsec"]
  • 3. I have been working with light sources for some time! von Laszewski et al., Real-time analysis, visualization, and steering of microtomography experiments at photon sources, SIAM Parallel Processing, 1999: “the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
  • 4. Work on high-speed analysis continues
      • Ptychography: Use GPU cluster for 360x speedup, from 7 hours to 72 s [Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen] → Enable online analysis and use of fly scans
      • Microtomography: Use 32K Mira BG/Q nodes to reduce reconstruction time from days to 2 mins [Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal] → Identify and correct experimental misconfiguration
      • High-energy diffraction microscopy: 10K BG/Q nodes to reconstruct in 10 minutes [Sharma, Almer, Wozniak, Wilde, Foster] → Zoom in on crack locations (switch far field ↔ near field)
      [Figure labels: Coherence; Brightness; High Energy. Captions: Micrometer porosity structure of shale samples; Microstructure of a copper wire, 0.2 mm diameter]
  • 5. We face a data crisis (and opportunity)
      • New instrumentation means that data rates are growing much faster than Moore’s Law → Neither humans nor computers can cope by using current methods
      • We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software → “A knowledge-based society, connected by the Internet and powered by AI …” (Chen Chien-jen)
  • 8. Automate and outsource: (1) Data distribution
      Needs: Usable, efficient, reliable, secure, sustainable
  • 9. Automate and outsource: (1) Data distribution
      Needs: Usable, efficient, reliable, secure, sustainable
      Outsource: (a) Petrel data store to hold data prior to/during distribution
      [Diagram labels: Petrel online store, petrel.alcf.anl.gov; 2 petabytes; 100 Gbps; 94 Gbit/s Petrel–Blue Waters]
  • 10. Automate and outsource: (1) Data distribution
      Needs: Usable, efficient, reliable, secure, sustainable
      Outsource: (a) Petrel data store to hold data prior to/during distribution; (b) Globus service for data transfer and sharing
      [Diagram labels: 2 petabytes; 100 Gbps; Globus APIs]
  • 12. Automate and outsource: (1) Data distribution
      Needs: Usable, efficient, reliable, secure, sustainable
      Outsource: (a) Petrel data store to hold data prior to/during distribution; (b) Globus service for data transfer and sharing
      Automate: (c) DMagic script uses Globus APIs to transfer data and configure permissions (http://dmagic.readthedocs.io, Francesco de Carlo)
      Given an experiment date:
      • Retrieve user info from APS scheduler
      • Create Globus “shared endpoint” and configure permissions
      • Monitor directory at beamline and use Globus to copy new files to endpoint
      • Email link to shared endpoint for data retrieval
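The "monitor directory and copy new files" step on slide 12 can be illustrated with a small polling loop. This is a minimal sketch, not DMagic's actual code: `find_new_files`, `sync_once`, and the `copy_to_endpoint` callback are hypothetical names, and the callback stands in for whatever transfer call (for example, one made through the Globus APIs) a real script would issue.

```python
import os
import tempfile
from typing import Callable, List, Set


def find_new_files(directory: str, seen: Set[str]) -> List[str]:
    """Return sorted names of regular files in `directory` that are not in `seen`."""
    current = {
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }
    return sorted(current - seen)


def sync_once(directory: str, seen: Set[str],
              copy_to_endpoint: Callable[[str], None]) -> List[str]:
    """One polling pass: hand each new file to the copy callback, then mark it seen."""
    new_files = find_new_files(directory, seen)
    for name in new_files:
        # Hypothetical hook: a real script would submit a transfer here.
        copy_to_endpoint(os.path.join(directory, name))
        seen.add(name)
    return new_files


def demo() -> List[str]:
    """Run two polling passes over a temporary directory; return copied basenames."""
    copied: List[str] = []
    seen: Set[str] = set()

    def fake_copy(path: str) -> None:
        copied.append(os.path.basename(path))

    with tempfile.TemporaryDirectory() as beamline_dir:
        open(os.path.join(beamline_dir, "scan_001.h5"), "w").close()
        sync_once(beamline_dir, seen, fake_copy)
        open(os.path.join(beamline_dir, "scan_002.h5"), "w").close()
        sync_once(beamline_dir, seen, fake_copy)
    return copied


if __name__ == "__main__":
    print(demo())
```

Keeping the copy step behind a callback means the polling logic can be exercised without any network access; a production script would also need to wait until a file has finished being written before copying it.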
  • 13. Automate and outsource: (2) Publication and discovery
      • Move to permanent location (or publish in place)
      • Compute and record checksums
      • Obtain and record metadata
      • Assign persistent identifier
      • Index for discovery
      [Diagram labels: 2 petabytes; 100 Gbps; Globus APIs]
  • 14. Automate and outsource: (2) Publication and discovery
      • Move to permanent location (or publish in place)
      • Compute and record checksums
      • Obtain and record metadata
      • Assign persistent identifier
      • Index for discovery
      [Diagram labels: Data Publication; Indexing; materialsdatafacility.org; 2 petabytes; 100 Gbps; Globus APIs]
  • 15. Automate and outsource: (2) Publication and discovery
      • Programmatic access (REST, Python, Jupyter)
      • Web browse and search
      [Diagram labels: Data Publication; Indexing; materialsdatafacility.org; 2 petabytes; 100 Gbps; Globus APIs]
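The publication steps listed on slides 13 and 14 (checksum, metadata, identifier, index) can be sketched with the Python standard library. This is an illustration under stated assumptions, not how the Materials Data Facility implements them: `build_record`, `publish`, and the `local:` identifier scheme are hypothetical, and a random UUID stands in for a real persistent identifier such as a DOI.

```python
import hashlib
import os
import tempfile
import uuid
from typing import Dict


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_record(path: str, metadata: dict) -> dict:
    """Assemble checksum, size, metadata, and a placeholder identifier for one file."""
    return {
        # Hypothetical placeholder: a real service would mint a persistent identifier.
        "identifier": f"local:{uuid.uuid4()}",
        "filename": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "sha256": sha256_of_file(path),
        "metadata": dict(metadata),
    }


def publish(path: str, metadata: dict, index: Dict[str, dict]) -> dict:
    """Build a record and add it to an in-memory index keyed by identifier."""
    record = build_record(path, metadata)
    index[record["identifier"]] = record
    return record


def demo_record() -> dict:
    """Publish a three-byte file from a temporary directory and return its record."""
    index: Dict[str, dict] = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.dat")
        with open(path, "wb") as handle:
            handle.write(b"abc")
        return publish(path, {"beamline": "32-ID"}, index)


if __name__ == "__main__":
    print(demo_record())
```

Recording the checksum at publication time lets anyone who later retrieves the file verify that it arrived intact.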
  • 16. Automate and outsource: (3) End-to-end data pipelines
      • For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to a public repository, …
      • Building a different custom pipeline for every situation is impractical
  • 17. Automate and outsource: (3) End-to-end data pipelines
      • For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to a public repository
      • Building a different custom pipeline for every situation is impractical
      • Automate: Trigger-action programming (“if this happens, then do that”)
      • Outsource: Cloud-based trigger-action service for reliability, scalability, ease of use, security, sustainability
  • 18. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
      • National Facility: Beamline Instrument; Local Storage and Compute (Quality Control, Assign Handle)
      • Globus Transfer
      • Central Storage and Compute (CSC): Feature extraction; Aggregate and convert format
      • Archive
  • 19. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
      • National Facility: Beamline Instrument; Local Storage and Compute (Quality Control, Assign Handle); Email / SMS notification
      • Globus Transfer
      • Central Storage and Compute (CSC): Feature extraction; Aggregate and convert format
      • Archive
      Rules (1):
      • IF new files THEN run quality control scripts
      • IF quality is good THEN send email and transfer data to CSC
  • 20. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
      • National Facility: Beamline Instrument; Local Storage and Compute (Quality Control, Assign Handle); Email / SMS notification
      • Globus Transfer
      • Central Storage and Compute (CSC): Feature extraction; Aggregate and convert format
      • Globus Transfer
      • Archive: Set sharing ACLs; Set timer for publication to Materials Data Facility
      • Data publication
      Rules (1):
      • IF new files THEN run quality control scripts
      • IF quality is good THEN send email and transfer data to CSC
      Rules (2):
      • IF new files THEN run feature extraction
      • IF feature detected THEN transfer data to archival storage
      • IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
  • 21. Another example: Mosaic tomography for neurocartography (N. Kasthuri, R. Chard, et al.)
      • Data Source (APS beamline 32-ID): Capture dataset creation; Review center position
      • Collector
      • Storage and Compute (ALCF Cooley Cluster): Generate preview and center images; Reconstruct image; Extract metadata
      • ALCF Petrel Archive: Ingest in Globus Search; Set sharing ACLs; Visualize with Neuroglancer
      • Data publication
      Rules:
      • IF new HDF5 files THEN transfer to Cooley
      • IF new center_pos THEN initiate reconstruction
      • IF transfer complete THEN execute preview and center finding
      • IF results THEN return data to APS
      • IF reconstruction THEN transfer data to Petrel AND publish dataset
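The IF/THEN rules on slides 19 to 21 follow the trigger-action pattern: each rule pairs a condition on an event with an action that may emit further events. The sketch below is a toy, in-process illustration of that pattern, not the cloud-based service the slides describe; `Rule`, `run_rules`, the event fields, and the "non-empty file passes" quality check are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]     # the IF part
    action: Callable[[Event], List[Event]]  # the THEN part; returns follow-on events


def run_rules(rules: List[Rule], initial: Event) -> List[str]:
    """Process events in FIFO order and return the names of the rules that fired."""
    fired: List[str] = []
    queue = deque([initial])
    while queue:
        event = queue.popleft()
        for rule in rules:
            if rule.condition(event):
                fired.append(rule.name)
                queue.extend(rule.action(event))
    return fired


def quality_control(event: Event) -> List[Event]:
    # Hypothetical check: a file passes quality control if it is non-empty.
    return [{"type": "quality", "good": event["size"] > 0, "path": event["path"]}]


def notify_and_transfer(event: Event) -> List[Event]:
    # Stand-in for "send email and transfer data to CSC".
    return [{"type": "transfer_complete", "path": event["path"]}]


RULES = [
    Rule("run_quality_control",
         lambda e: e["type"] == "new_file", quality_control),
    Rule("notify_and_transfer",
         lambda e: e["type"] == "quality" and e["good"], notify_and_transfer),
]

if __name__ == "__main__":
    print(run_rules(RULES, {"type": "new_file", "path": "scan.h5", "size": 10}))
```

Because rules only react to events, adding a pipeline stage means adding a rule rather than rewriting a custom script, which is the practical argument made on slide 17.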
  • 23. Automate and outsource: (4) Data transformation and analysis
      Say you want to use a deep neural network for online identification of problems when running diffraction experiments
      [Figure labels: “beam misaligned”, “…”]
  • 24. Automate and outsource: (4) Data transformation and analysis
      https://doi.org/10.1109/NYSDS.2017.8085045
  • 25. Automate and outsource: (4) Data transformation and analysis
      ▪ Where are the model and trained weights?
      ▪ How do I run the model on my data?
      ▪ Should I run the model on my data?
      ▪ How can I retrain the model on new data?
      https://doi.org/10.1109/NYSDS.2017.8085045
  • 26. Automate and outsource: (4) Data transformation and analysis
      ▪ Where are the model and trained weights?
      ▪ How do I run the model on my data?
      ▪ Should I run the model on my data?
      ▪ How can I retrain the model on new data?
      [Diagram labels: DLHub; model/xray/batch_predict; [“beam off image”, …]]
      https://doi.org/10.1109/NYSDS.2017.8085045
  • 28. Data and Learning Hub (DLHub): Overview
      • Collect, publish, categorize models/code/weights/data from many sources
      • Serve models via API to foster sharing, consumption, and access to data, training sets, and models
      • Automate training of models (using HPC as needed) as new data are available
      • Enable new science through reuse and synthesis of existing models
      [Diagram labels: Collect; Serve; Train]
  • 29. DLHub: Collect, serve, train community models
      1) Register a model: Collect Data → Train Model → Register Model → Send to DLHub → Receive DOI
      [Diagram labels: DLHub; Model / transform containers]
  • 30. DLHub: Collect, serve, train community models
      1) Register a model: Collect Data → Train Model → Register Model → Send to DLHub → Receive DOI
      2) Run a model: Find Model → Call DLHub → Send compositions → Receive predicted properties
      [Diagram labels: DLHub; Model / transform containers]
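The register-then-run flow on slides 29 and 30 can be pictured as a registry that maps a model name to a callable and returns an identifier on registration. The sketch below is an in-memory illustration only and does not use or reproduce the DLHub API: `ModelRegistry`, its methods, the `local:` identifier (standing in for a DOI), and the threshold rule that stands in for a trained neural network are all hypothetical. The name `model/xray/batch_predict` is taken from slide 26 and used here simply as a dictionary key.

```python
import uuid
from typing import Any, Callable, Dict, List

Predictor = Callable[[List[Any]], List[Any]]


class ModelRegistry:
    """Toy in-memory stand-in for a model-serving hub."""

    def __init__(self) -> None:
        self._models: Dict[str, Predictor] = {}
        self._identifiers: Dict[str, str] = {}

    def register(self, name: str, predict: Predictor) -> str:
        """Store a model under `name` and return a placeholder identifier."""
        identifier = f"local:{uuid.uuid4()}"  # hypothetical; a real hub would mint a DOI
        self._models[name] = predict
        self._identifiers[name] = identifier
        return identifier

    def find(self, name: str) -> str:
        """Return the identifier recorded for a registered model."""
        return self._identifiers[name]

    def run(self, name: str, inputs: List[Any]) -> List[Any]:
        """Apply the named model to a batch of inputs."""
        if name not in self._models:
            raise KeyError(f"no model registered under {name!r}")
        return self._models[name](inputs)


def classify_images(images: List[List[float]]) -> List[str]:
    # Hypothetical rule in place of a trained network:
    # an image with no pixel above 1.0 is labeled as a beam-off image.
    return ["beam off image" if max(image) < 1.0 else "ok" for image in images]


if __name__ == "__main__":
    registry = ModelRegistry()
    registry.register("model/xray/batch_predict", classify_images)
    print(registry.run("model/xray/batch_predict", [[0.0, 0.2], [5.0, 2.0]]))
```

Separating registration from invocation is what lets a caller answer "how do I run the model on my data?" without knowing where the model or its weights live.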
  • 34. I reported on the work of many talented people
      Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
      Thanks also to:
      • Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer, Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source
      • Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing
      We are grateful to our sponsors
      [Logos: DLHub; Globus; IMaD; Petrel; Argonne Leadership Computing Facility]
  • 35. In summary
      More data demands new methods for designing experiments, managing data, analyzing data, and creating and delivering software
      We must automate and outsource to manage data, run pipelines, and train and run (machine learning) models
      I presented examples that illustrate what can be done:
      • High-speed storage services for data staging and distribution: Petrel
      • Cloud-based services for data transfer and sharing: Globus Transfer
      • Data publication and discovery services: Materials Data Facility
      • Cloud-based automation services: Globus Automate
      • Model and transformation services to encapsulate software: DLHub
      There are many opportunities, and great need, for collaboration
      To follow up: foster@anl.gov