Plenary talk at the International Synchrotron Radiation Instrumentation conference in Taiwan, on work with great colleagues Ben Blaiszik, Ryan Chard, Logan Ward, and others.
Rapidly growing data volumes at light sources demand increasingly automated data collection, distribution, and analysis processes, in order to enable new scientific discoveries while not overwhelming finite human capabilities. I present here three projects that use cloud-hosted data automation and enrichment services, institutional computing resources, and high-performance computing facilities to provide cost-effective, scalable, and reliable implementations of such processes. In the first, Globus cloud-hosted data automation services are used to implement data capture, distribution, and analysis workflows for Advanced Photon Source and Advanced Light Source beamlines, leveraging institutional storage and computing. In the second, such services are combined with cloud-hosted data indexing and institutional storage to create a collaborative data publication, indexing, and discovery service, the Materials Data Facility (MDF), built to support a host of informatics applications in materials science. The third integrates components of the previous two projects with machine learning capabilities provided by the Data and Learning Hub for Science (DLHub) to enable on-demand access to machine learning models from light source data capture and analysis workflows, and provides simplified interfaces to train new models on data from sources such as MDF on leadership-scale computing resources. I draw conclusions about best practices for building next-generation data automation systems for future light sources.
3. von Laszewski et al., "Real-time analysis, visualization, and steering of microtomography experiments at photon sources," SIAM Parallel Processing, 1999.
I have been working with light sources for some time!
“the data rates and compute power required ... are prodigious, easily reaching one gigabit per second and a teraflop per second [respectively]”
4. Work on high-speed analysis continues
• Ptychography: a GPU cluster gives a 360x speedup, from 7 hours to 72 s; enables online analysis and use of fly scans [Deng, Vine, Chen, Nashed, Philips, Jin, Peterka, Ross, Jacobsen]
• Microtomography: 32K Mira BG/Q nodes reduce reconstruction time from days to 2 minutes; identify and correct experimental misconfiguration [Bicer, Gursoy, Kettimuthu, De Carlo, Agrawal]
• High-energy diffraction microscopy: 10K BG/Q nodes reconstruct in 10 minutes; zoom in on crack locations (switching from far field to near field) [Sharma, Almer, Wozniak, Wilde, Foster]
[Slide labels: Coherence, Brightness, High energy]
[Image captions: micrometer porosity structure of shale samples; microstructure of a copper wire, 0.2 mm diameter]
5. We face a data crisis (and opportunity)
New instrumentation means that data rates are growing much faster than Moore's Law
Neither humans nor computers can cope by using current methods
We need new methods for designing experiments, managing data, analyzing data, and creating and delivering software
“A knowledge-based society, connected by the Internet and powered by AI …” — Chen Chien-jen
9. Automate and outsource: (1) Data distribution
Needs: Usable, efficient, reliable, secure, sustainable
Outsource:
(a) Petrel data store to hold data prior to/during distribution
[Diagram: Petrel online store (petrel.alcf.anl.gov), 2 petabytes, 100 Gbps; 94 Gbit/s Petrel—Blue Waters]
10. Automate and outsource: (1) Data distribution
Needs: Usable, efficient, reliable, secure, sustainable
Outsource:
(a) Petrel data store to hold data prior to/during distribution
(b) Globus service for data transfer and sharing
[Diagram: Petrel (2 petabytes, 100 Gbps), Globus APIs]
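To make the Globus piece concrete, here is a minimal sketch, using the Globus Python SDK (globus_sdk), of submitting a transfer from a beamline endpoint to Petrel; the client ID, endpoint UUIDs, and paths are placeholder assumptions, and the login flow is abbreviated.

# Minimal Globus transfer sketch; client ID, endpoint UUIDs, and paths are placeholders.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"            # register your own app with Globus Auth
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Paste auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token))

BEAMLINE_EP = "beamline-endpoint-uuid"             # placeholder UUIDs
PETREL_EP = "petrel-endpoint-uuid"
tdata = globus_sdk.TransferData(tc, BEAMLINE_EP, PETREL_EP,
                                label="Beamline to Petrel", sync_level="checksum")
tdata.add_item("/data/scan_001/", "/exp42/scan_001/", recursive=True)
task = tc.submit_transfer(tdata)
print("Submitted transfer task:", task["task_id"])

The same TransferClient drives the sharing and automation steps on the following slides.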
12. Automate and outsource: (1) Data distribution
Needs: Usable, efficient, reliable, secure, sustainable
Outsource:
(a) Petrel data store to hold data prior to/during distribution
(b) Globus service for data transfer and sharing
Automate:
(c) DMagic script uses Globus APIs to transfer data and configure permissions (http://dmagic.readthedocs.io, Francesco de Carlo)
Given an experiment date, DMagic will (sketched in code below):
• Retrieve user info from the APS scheduler
• Create a Globus “shared endpoint” and configure permissions
• Monitor a directory at the beamline and use Globus to copy new files to the endpoint
• Email a link to the shared endpoint for data retrieval
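The real implementation is DMagic; the following is only an illustrative sketch of the last three steps with the Globus Python SDK, assuming placeholder tokens, endpoint UUIDs, user identity, paths, and email settings (the APS-scheduler lookup is omitted).

# Illustrative DMagic-style sketch: configure permissions, mirror new files, email a link.
import smtplib
from email.message import EmailMessage
import globus_sdk

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer("TRANSFER-ACCESS-TOKEN"))  # placeholder token
SHARED_EP = "petrel-shared-endpoint-uuid"      # existing Globus shared endpoint (placeholder)
BEAMLINE_EP = "beamline-endpoint-uuid"         # placeholder
USER_IDENTITY = "globus-identity-uuid"         # DMagic looks this up from the APS scheduler
EXP_PATH = "/2018-06/smith/"                   # per-experiment folder (placeholder)

# 1) Give the visiting user read access to the experiment folder on the shared endpoint
tc.add_endpoint_acl_rule(SHARED_EP, {
    "DATA_TYPE": "access",
    "principal_type": "identity",
    "principal": USER_IDENTITY,
    "path": EXP_PATH,
    "permissions": "r",
})

# 2) Mirror new beamline files to the shared endpoint (rerun periodically during the experiment)
tdata = globus_sdk.TransferData(tc, BEAMLINE_EP, SHARED_EP, sync_level="mtime")
tdata.add_item("/local/data" + EXP_PATH, EXP_PATH, recursive=True)
tc.submit_transfer(tdata)

# 3) Email the user a link to the shared endpoint for data retrieval
msg = EmailMessage()
msg["Subject"], msg["From"], msg["To"] = "Your APS data", "beamline@aps.anl.gov", "user@example.edu"
msg.set_content(f"https://app.globus.org/file-manager?origin_id={SHARED_EP}&origin_path={EXP_PATH}")
smtplib.SMTP("smtp.example.org").send_message(msg)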
13. Automate and outsource: (2) Publication and discovery
• Move to permanent location (or publish in place)
• Compute and record checksums
• Obtain and record metadata
• Assign persistent identifier
• Index for discovery
[Diagram: Petrel (2 petabytes, 100 Gbps), Globus APIs]
14. Automate and outsource: (2) Publication and discovery
• Move to permanent location (or publish in place)
• Compute and record checksums
• Obtain and record metadata
• Assign persistent identifier
• Index for discovery (a sketch of these steps follows below)
[Diagram: data publication and indexing via materialsdatafacility.org; Petrel (2 petabytes, 100 Gbps), Globus APIs]
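To show what these steps amount to in code, here is a minimal sketch using only the Python standard library; the directory, field names, and DOI are illustrative placeholders rather than the actual MDF schema or minting service.

# Minimal publication-prep sketch: checksums plus a metadata record with a placeholder PID.
import hashlib, json, pathlib

def sha256sum(path, block=1 << 20):
    """Compute a checksum for one file, reading in 1 MB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()

dataset_dir = pathlib.Path("/petrel/exp42/scan_001")   # placeholder permanent location
record = {
    "title": "Scan 001, copper wire microstructure",    # metadata obtained from the beamline
    "identifier": "doi:10.xxxx/placeholder",             # a real PID would be minted on publication
    "files": [
        {"path": str(p.relative_to(dataset_dir)), "sha256": sha256sum(p)}
        for p in sorted(dataset_dir.rglob("*")) if p.is_file()
    ],
}
print(json.dumps(record, indent=2))                      # this record is what gets indexed for discovery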
15. Automate and outsource: (2) Publication and discovery
• Programmatic access (REST, Python, Jupyter) (see the query sketch below)
• Web browse and search
[Diagram: data publication and indexing via materialsdatafacility.org; Petrel (2 petabytes, 100 Gbps), Globus APIs]
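As an example of the programmatic route, here is a minimal discovery sketch assuming the MDF mdf_forge ("Forge") Python client; the query terms are illustrative and method names may vary across client versions.

# Minimal MDF discovery sketch using the mdf_forge client; query terms are illustrative.
from mdf_forge import Forge

forge = Forge()                                   # prompts for a Globus login on first use
forge.match_field("material.elements", "Cu")      # restrict to copper-containing datasets
results = forge.search()                          # returns matching metadata records
for record in results[:5]:
    print(record.get("dc", {}).get("titles"))     # DataCite-style title block, if present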
16. Automate and outsource: (3) End-to-end data pipelines
For each dataset, we must apply quality control, assign identifiers, move data to compute, extract features, and eventually publish to a public repository, …
Building a different custom pipeline for every situation is impractical
17. Automate and outsource: (3) End-to-end data pipelines
For each dataset, we must apply quality control, assign identifiers, move data to compute, extract features, and eventually publish to a public repository
Building a different custom pipeline for every situation is impractical
Automate: trigger-action programming (“if this happens, then do that”)
Outsource: a cloud-based trigger-action service for reliability, scalability, ease of use, security, and sustainability
18. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
[Diagram: beamline instrument at a national facility → local storage and compute (quality control, assign handle) → Globus Transfer → central storage and compute (CSC: feature extraction, aggregate and convert format) → archive]
19. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
[Diagram: as above, plus email/SMS notification]
Rules:
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
20. Automate and outsource: (3) End-to-end pipelines with trigger-action programming
[Diagram: as above, plus Globus Transfer to the archive, sharing ACLs, a publication timer, and data publication to the Materials Data Facility]
Rules (see the sketch below):
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
• IF new files THEN run feature extraction
• IF feature detected THEN transfer data to archival storage
• IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
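To make the trigger-action idea concrete, here is an illustrative sketch (not the actual Globus Automate API) of the rules above as predicate-action pairs dispatched over events; the event fields and actions are placeholders.

# Illustrative trigger-action sketch: each rule pairs an IF predicate with a THEN action.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    trigger: Callable[[dict], bool]    # IF ...
    action: Callable[[dict], None]     # THEN ...

def run_quality_control(event):    print("run QC scripts on", event["path"])
def notify_and_transfer(event):    print("email user; transfer", event["path"], "to CSC")
def run_feature_extraction(event): print("extract features from", event["path"])

rules = [
    Rule(lambda e: e["type"] == "new_files" and e["site"] == "beamline", run_quality_control),
    Rule(lambda e: e["type"] == "qc_passed",                             notify_and_transfer),
    Rule(lambda e: e["type"] == "new_files" and e["site"] == "csc",      run_feature_extraction),
]

def dispatch(event):
    """Fire every rule whose trigger matches the incoming event."""
    for rule in rules:
        if rule.trigger(event):
            rule.action(event)

dispatch({"type": "new_files", "site": "beamline", "path": "/data/scan_001.h5"})

In the outsourced version, the rules live in a cloud-hosted service, so that reliability, scaling, and security are handled by the service rather than by a script at the beamline.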
21. Another example: mosaic tomography for neurocartography (N. Kasthuri, R. Chard, et al.)
[Diagram: data source at APS beamline 32-ID (capture dataset creation, review center position) → ALCF Cooley cluster (generate preview and center images, reconstruct image, extract metadata) → archive on ALCF Petrel; ingest in Globus Search (see the sketch below); set sharing ACLs; data publication; visualize with Neuroglancer]
Rules:
• IF new HDF5 files THEN transfer to Cooley
• IF new center_pos THEN initiate reconstruction
• IF transfer complete THEN execute preview and center finding
• IF results THEN return data to APS
• IF reconstruction THEN transfer data to Petrel AND publish dataset
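As an illustration of the “ingest in Globus Search” step, here is a minimal sketch using globus_sdk.SearchClient; the access token, index UUID, subject, and metadata content are placeholder assumptions.

# Minimal Globus Search ingest sketch; token, index UUID, and metadata are placeholders.
import globus_sdk

authorizer = globus_sdk.AccessTokenAuthorizer("SEARCH-ACCESS-TOKEN")   # placeholder token
sc = globus_sdk.SearchClient(authorizer=authorizer)

INDEX_ID = "search-index-uuid"                      # placeholder index UUID
entry = {
    "ingest_type": "GMetaEntry",
    "ingest_data": {
        "subject": "globus://petrel/exp42/scan_001",            # one subject per dataset
        "visible_to": ["public"],
        "content": {"beamline": "32-ID", "sample": "brain tissue section",
                    "center_pos": 1024, "preview": "scan_001_preview.png"},
    },
}
sc.ingest(INDEX_ID, entry)                          # record becomes searchable via Globus Search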
23. Automate and outsource: (4) Data transformation and analysis
Say you want to use a deep neural network for online identification of problems (e.g., “beam misaligned”) when running diffraction experiments
25. Automate and outsource:
(4) Data transformation and analysis
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
https://doi.org/10.1109/NYSDS.2017.8085045
26. Automate and outsource: (4) Data transformation and analysis
[Diagram: DLHub invocation; inputs such as [“beam off image”, …] are sent to model/xray/batch_predict]
▪ Where are the model and trained weights?
▪ How do I run the model on my data?
▪ Should I run the model on my data?
▪ How can I retrain the model on new data?
https://doi.org/10.1109/NYSDS.2017.8085045
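To show what this looks like from the user's side, here is a minimal sketch of invoking a published model through the dlhub_sdk client; the servable name and input file names are hypothetical placeholders rather than the deployed model shown above.

# Minimal DLHub invocation sketch; servable name and inputs are hypothetical placeholders.
from dlhub_sdk import DLHubClient

dl = DLHubClient()                                  # handles Globus Auth login
servable = "aps_user/xray_beam_quality"             # hypothetical "owner/model" name
images = ["frame_0001.tif", "frame_0002.tif"]       # new diffraction frames to check
predictions = dl.run(servable, images)              # e.g. ["ok", "beam misaligned"]
print(predictions)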
28. Data and Learning Hub (DLHub): Overview
• Collect, publish, categorize models/code/weights/data from many sources
• Serve models via API to foster sharing, consumption, and access to data, training sets, and models
• Automate training of models (using HPC as needed) as new data are available
• Enable new science through reuse and synthesis of existing models
[Diagram: Collect / Train / Serve]
29. DLHub: Collect, serve, train community models
1) Register a model: collect data, train the model, register and send it to DLHub (which packages it into model/transform containers), and receive a DOI (a registration sketch follows below)
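Here is a minimal registration sketch using the dlhub_sdk model describers; the describer class, method names, file name, and labels are from memory and should be treated as assumptions that may differ across SDK versions.

# Minimal DLHub registration sketch; file name, labels, and exact describer API are assumptions.
from dlhub_sdk import DLHubClient
from dlhub_sdk.models.servables.keras import KerasModel

# Describe a trained Keras model saved to disk (placeholder file and output labels)
model_info = KerasModel.create_model("beam_classifier.h5", output_names=["ok", "misaligned"])
model_info.set_title("X-ray beam quality classifier")
model_info.set_name("xray_beam_quality")

dl = DLHubClient()
task_id = dl.publish_servable(model_info)   # DLHub containerizes the model and assigns a DOI
print("Publication task:", task_id)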
34. I reported on the work of many talented people
Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner
Thanks also to:
• Jon Almer, Francesco de Carlo, Hemant Sharma, Brian Toby, Stefan Vogt, Stephen Streiffer, Nicholas Schwarz, Doga Gursoy, and others, Advanced Photon Source
• Tekin Bicer, Jonathan Gaff, Raj Kettimuthu, Justin Wozniak, and others, Argonne Computing
We are grateful to our sponsors
[Project logos: DLHub, Globus, IMaD, Petrel, Argonne Leadership Computing Facility]
35. In summary
More data demands new methods for designing experiments, managing data, analyzing data, and creating and delivering software
We must automate and outsource to manage data, run pipelines, and train and run (machine learning) models
I presented examples that illustrate what can be done:
• High-speed storage services for data staging and distribution: Petrel
• Cloud-based services for data transfer and sharing: Globus Transfer
• Data publication and discovery services: Materials Data Facility
• Cloud-based automation services: Globus Automate
• Model and transformation services to encapsulate software: DLHub
There are many opportunities, and great need, for collaboration
To follow up: foster@anl.gov