Plenary talk at the International Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for Science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
3. von Laszewski et al., "Real-time analysis, visualization, and steering of microtomography experiments at photon sources," SIAM Parallel Processing, 1999.
I have been working with light sources for some time!
“the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
4. Work on high-speed analysis continues
• Ptychography: a GPU cluster gives a 360x speedup, from 7 hours to 72 s; enables online analysis and use of fly scans [Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen]
• Microtomography: 32K Mira BG/Q nodes reduce reconstruction time from days to 2 minutes; identify and correct experimental misconfiguration [Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal]
• High-energy diffraction microscopy: 10K BG/Q nodes reconstruct in 10 minutes; zoom in on crack locations (switching from far field to near field) [Sharma, Almer, Wozniak, Wilde, Foster]
[Slide labels: Coherence, Brightness, High energy]
[Image captions: micrometer porosity structure of shale samples; microstructure of a copper wire, 0.2 mm diameter]
5. We face a data crisis (and opportunity)
New instrumentation means that data rates are growing much faster than Moore's Law
Neither humans nor computers can cope by using current methods
We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software
“A knowledge-based society, connected by the Internet and powered by AI …” — Chen Chien-jen
9. Automate and outsource: (1) Data distribution
Needs: Usable, efficient, reliable, secure, sustainable
Outsource:
(a) Petrel data store to hold data prior to/during distribution
[Diagram: Petrel online store (petrel.alcf.anl.gov), 2 petabytes, 100 Gbps; 94 Gbit/s Petrel—Blue Waters]
10. Automate and outsource: (1) Data distribution
Needs: Usable, efficient, reliable, secure, sustainable
Outsource:
(a) Petrel data store to hold data prior to/during distribution
(b) Globus service for data transfer and sharing
[Diagram: Petrel (2 petabytes, 100 Gbps), Globus APIs]
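To make the Globus piece concrete, here is a minimal sketch, using the Globus Python SDK (globus_sdk), of submitting a transfer from a beamline endpoint to Petrel; the client ID, endpoint UUIDs, and paths are placeholder assumptions, and the login flow is abbreviated.

# Minimal Globus transfer sketch; client ID, endpoint UUIDs, and paths are placeholders.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"            # register your own app with Globus Auth
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Paste auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

BEAMLINE_EP = "beamline-endpoint-uuid"             # placeholder UUIDs
PETREL_EP = "petrel-endpoint-uuid"
tdata = globus_sdk.TransferData(tc, BEAMLINE_EP, PETREL_EP,
                                label="Beamline to Petrel", sync_level="checksum")
tdata.add_item("/data/scan_001/", "/exp42/scan_001/", recursive=True)
task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])

The same TransferClient drives the sharing and automation steps on the following slides.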
12. Automate and outsource: (1) Data distribution
Needs: Usable, efficient, reliable, secure, sustainable
Outsource:
(a) Petrel data store to hold data prior to/during distribution
(b) Globus service for data transfer and sharing
Automate:
(c) DMagic script uses Globus APIs to transfer data and configure permissions (http://dmagic.readthedocs.io, Francesco de Carlo)
Given an experiment date, DMagic will (sketched in code below):
• Retrieve user info from the APS scheduler
• Create a Globus “shared endpoint” and configure permissions
• Monitor a directory at the beamline and use Globus to copy new files to the endpoint
• Email a link to the shared endpoint for data retrieval
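The real implementation is DMagic; the following is only an illustrative sketch of the last three steps with the Globus Python SDK, assuming placeholder tokens, endpoint UUIDs, user identity, paths, and email settings (the APS-scheduler lookup is omitted).

# Illustrative DMagic-style sketch: configure permissions, mirror new files, email a link.
import smtplib
from email.message import EmailMessage
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER-ACCESS-TOKEN"))  # placeholder token
SHARED_EP = "petrel-shared-endpoint-uuid"      # existing Globus shared endpoint (placeholder)
BEAMLINE_EP = "beamline-endpoint-uuid"         # placeholder
USER_IDENTITY = "globus-identity-uuid"         # DMagic looks this up from the APS scheduler
EXP_PATH = "/2018-06/smith/"                   # per-experiment folder (placeholder)

# 1) Give the visiting user read access to the experiment folder on the shared endpoint
tc.add_endpoint_acl_rule(SHARED_EP, {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": USER_IDENTITY,
    "path": EXP_PATH,
    "permissions": "r",
})

# 2) Mirror new beamline files to the shared endpoint (rerun periodically during the experiment)
tdata = globus_sdk.TransferData(tc, BEAMLINE_EP, SHARED_EP, sync_level="mtime")
tdata.add_item("/local/data" + EXP_PATH, EXP_PATH, recursive=True)
tc.submit_transfer(tdata)

# 3) Email the user a link to the shared endpoint for data retrieval
msg = EmailMessage()
msg["Subject"], msg["From"], msg["To"] = "Your APS data", "beamline@aps.anl.gov", "user@example.edu"
msg.set_content(f"https://app.globus.org/file-manager?origin_id={SHARED_EP}&origin_path={EXP_PATH}")
smtplib.SMTP("smtp.example.org").send_message(msg)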
13. Automate and outsource: (2) Publication and discovery
• Move to permanent location (or publish in place)
• Compute and record checksums
• Obtain and record metadata
• Assign persistent identifier
• Index for discovery
[Diagram: Petrel (2 petabytes, 100 Gbps), Globus APIs]
14. Automate and outsource: (2) Publication and discovery
• Move to permanent location (or publish in place)
• Compute and record checksums
• Obtain and record metadata
• Assign persistent identifier
• Index for discovery (a sketch of these steps follows below)
[Diagram: data publication and indexing via materialsdatafacility.org; Petrel (2 petabytes, 100 Gbps), Globus APIs]
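To show what these steps amount to in code, here is a minimal sketch using only the Python standard library; the directory, field names, and DOI are illustrative placeholders rather than the actual MDF schema or minting service.

# Minimal publication-prep sketch: checksums plus a metadata record with a placeholder PID.
import hashlib, json, pathlib

def sha256sum(path, block=1 << 20):
    """Compute a checksum for one file, reading in 1 MB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()

dataset_dir = pathlib.Path("/petrel/exp42/scan_001")   # placeholder permanent location
record = {
    "title": "Scan 001, copper wire microstructure",    # metadata obtained from the beamline
    "identifier": "doi:10.xxxx/placeholder",             # a real PID would be minted on publication
    "files": [
        {"path": str(p.relative_to(dataset_dir)), "sha256": sha256sum(p)}
        for p in sorted(dataset_dir.rglob("*")) if p.is_file()
    ],
}
print(json.dumps(record, indent=2))                      # this record is what gets indexed for discovery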
15. Automate and outsource: (2) Publication and discovery
• Programmatic access (REST, Python, Jupyter) (see the query sketch below)
• Web browse and search
[Diagram: data publication and indexing via materialsdatafacility.org; Petrel (2 petabytes, 100 Gbps), Globus APIs]
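As an example of the programmatic route, here is a minimal discovery sketch assuming the MDF mdf_forge ("Forge") Python client; the query terms are illustrative and method names may vary across client versions.

# Minimal MDF discovery sketch using the mdf_forge client; query terms are illustrative.
from mdf_forge import Forge

forge = Forge()                                   # prompts for a Globus login on first use
forge.match_field("material.elements", "Cu")      # restrict to copper-containing datasets
results = forge.search()                          # returns matching metadata records
for record in results[:5]:
    print(record.get("dc", {}).get("titles"))     # DataCite-style title block, if present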
16. Automate and outsource: (3) End-to-end data pipelines
For each dataset, we must apply quality control, assign identifiers, move data to compute, extract features, and eventually publish to a public repository, …
Building a different custom pipeline for every situation is impractical
17. Automate and outsource: (3) End-to-end data pipelines
For each dataset, we must apply quality control, assign identifiers, move data to compute, extract features, and eventually publish to a public repository
Building a different custom pipeline for every situation is impractical
Automate: trigger-action programming (“if this happens, then do that”)
Outsource: a cloud-based trigger-action service for reliability, scalability, ease of use, security, and sustainability
18. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
[Diagram: beamline instrument at a national facility → local storage and compute (quality control, assign handle) → Globus Transfer → central storage and compute (CSC: feature extraction, aggregate and convert format) → archive]
19. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
[Diagram: as above, plus email/SMS notification]
Rules:
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
20. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
[Diagram: as above, plus Globus Transfer to the archive, sharing ACLs, a publication timer, and data publication to the Materials Data Facility]
Rules (see the sketch below):
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
• IF new files THEN run feature extraction
• IF feature detected THEN transfer data to archival storage
• IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
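To make the trigger-action idea concrete, here is an illustrative sketch (not the actual Globus Automate API) of the rules above as predicate-action pairs dispatched over events; the event fields and actions are placeholders.

# Illustrative trigger-action sketch: each rule pairs an IF predicate with a THEN action.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: Callable[[dict], bool]    # IF ...
    action: Callable[[dict], None]     # THEN ...

def run_quality_control(event):    print("run QC scripts on", event["path"])
def notify_and_transfer(event):    print("email user; transfer", event["path"], "to CSC")
def run_feature_extraction(event): print("extract features from", event["path"])

rules = [
    Rule(lambda e: e["type"] == "new_files" and e["site"] == "beamline", run_quality_control),
    Rule(lambda e: e["type"] == "qc_passed",                             notify_and_transfer),
    Rule(lambda e: e["type"] == "new_files" and e["site"] == "csc",      run_feature_extraction),
]

def dispatch(event):
    """Fire every rule whose trigger matches the incoming event."""
    for rule in rules:
        if rule.trigger(event):
            rule.action(event)

dispatch({"type": "new_files", "site": "beamline", "path": "/data/scan_001.h5"})

In the outsourced version, the rules live in a cloud-hosted service, so that reliability, scaling, and security are handled by the service rather than by a script at the beamline.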
21. Another example: mosaic tomography for neurocartography (N. Kasthuri, R. Chard, et al.)
[Diagram: data source at APS beamline 32-ID (capture dataset creation, review center position) → ALCF Cooley cluster (generate preview and center images, reconstruct image, extract metadata) → archive on ALCF Petrel; ingest in Globus Search (see the sketch below); set sharing ACLs; data publication; visualize with Neuroglancer]
Rules:
• IF new HDF5 files THEN transfer to Cooley
• IF new center_pos THEN initiate reconstruction
• IF transfer complete THEN execute preview and center finding
• IF results THEN return data to APS
• IF reconstruction THEN transfer data to Petrel AND publish dataset
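As an illustration of the “ingest in Globus Search” step, here is a minimal sketch using globus_sdk.SearchClient; the access token, index UUID, subject, and metadata content are placeholder assumptions.

# Minimal Globus Search ingest sketch; token, index UUID, and metadata are placeholders.
import globus_sdk

authorizer = globus_sdk.AccessTokenAuthorizer("SEARCH-ACCESS-TOKEN")   # placeholder token
sc = globus_sdk.SearchClient(authorizer=authorizer)

INDEX_ID = "search-index-uuid"                      # placeholder index UUID
entry = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://petrel/exp42/scan_001",            # one subject per dataset
        "visible_to": ["public"],
        "content": {"beamline": "32-ID", "sample": "brain tissue section",
                    "center_pos": 1024, "preview": "scan_001_preview.png"},
    },
}
sc.ingest(INDEX_ID, entry)                          # record becomes searchable via Globus Search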
23. Automate and outsource: (4) Data transformation and analysis
Say you want to use a deep neural network for online identification of problems (e.g., “beam misaligned”) when running diffraction experiments
25. Automate and outsource:
(4) Data transformation and analysis
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
https://doi.org/10.1109/NYSDS.2017.8085045
26. Automate and outsource: (4) Data transformation and analysis
[Diagram: DLHub invocation; inputs such as [“beam off image”, …] are sent to model/xray/batch_predict]
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
https://doi.org/10.1109/NYSDS.2017.8085045
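To show what this looks like from the user's side, here is a minimal sketch of invoking a published model through the dlhub_sdk client; the servable name and input file names are hypothetical placeholders rather than the deployed model shown above.

# Minimal DLHub invocation sketch; servable name and inputs are hypothetical placeholders.
from dlhub_sdk import DLHubClient

dl = DLHubClient()                                  # handles Globus Auth login
servable = "aps_user/xray_beam_quality"             # hypothetical "owner/model" name
images = ["frame_0001.tif", "frame_0002.tif"]       # new diffraction frames to check
predictions = dl.run(servable, images)              # e.g. ["ok", "beam misaligned"]
print(predictions)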
28. Data and Learning Hub (DLHub): Overview
• Collect, publish, categorize models/code/weights/data from many sources
• Serve models via API to foster sharing, consumption, and access to data, training sets, and models
• Automate training of models (using HPC as needed) as new data are available
• Enable new science through reuse and synthesis of existing models
[Diagram: Collect / Train / Serve]
29. DLHub: Collect, serve, train community models
1) Register a model: collect data, train the model, register and send it to DLHub (which packages it into model/transform containers), and receive a DOI (a registration sketch follows below)
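Here is a minimal registration sketch using the dlhub_sdk model describers; the describer class, method names, file name, and labels are from memory and should be treated as assumptions that may differ across SDK versions.

# Minimal DLHub registration sketch; file name, labels, and exact describer API are assumptions.
from dlhub_sdk import DLHubClient
from dlhub_sdk.models.servables.keras import KerasModel

# Describe a trained Keras model saved to disk (placeholder file and output labels)
model_info = KerasModel.create_model("beam_classifier.h5", output_names=["ok", "misaligned"])
model_info.set_title("X-ray beam quality classifier")
model_info.set_name("xray_beam_quality")

dl = DLHubClient()
task_id = dl.publish_servable(model_info)   # DLHub containerizes the model and assigns a DOI
print("Publication task:", task_id)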
34. I reported on the work of many talented people
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
Thanks also to:
• Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer, Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source
• Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing
We are grateful to our sponsors
[Project logos: DLHub, Globus, IMaD, Petrel, Argonne Leadership Computing Facility]
35. In summary
More data demands new methods for designing experiments, managing data, analyzing data, and creating and delivering software
We must automate and outsource to manage data, run pipelines, and train and run (machine learning) models
I presented examples that illustrate what can be done:
• High-speed storage services for data staging and distribution: Petrel
• Cloud-based services for data transfer and sharing: Globus Transfer
• Data publication and discovery services: Materials Data Facility
• Cloud-based automation services: Globus Automate
• Model and transformation services to encapsulate software: DLHub
There are many opportunities, and great need, for collaboration
To follow up: foster@anl.gov