Research computing facilities, such as the national supercomputing centers, and shared instruments, such as cryo electron microscopes and advanced light sources, generate large volumes of data daily. These growing data volumes make it challenging for researchers to perform what should be mundane tasks: moving data reliably, describing data for subsequent discovery, and making data accessible to geographically distributed collaborators. Most researchers rely on ad hoc methods that do not scale; some level of automation is clearly required for these tasks.
Globus is an established service from the University of Chicago that is widely used for managing research data in national laboratories, campus computing centers, and HPC facilities. While its intuitive web app addresses simple file transfer and sharing scenarios, automation at scale requires integrating Globus data management platform services into custom science gateways, data portals, and other web applications in service of research. Such applications should enable automated ingest of data from diverse sources, launching of analysis runs on diverse computing resources, extraction and addition of metadata for creating search indexes, assignment of persistent identifiers, faceted search for rapid data discovery, and point-and-click downloading of datasets by authorized users, all protected by an authentication and authorization substrate that allows the implementation of flexible data access policies for metadata and data alike.
We describe current and emerging Globus services that facilitate these automated data flows while ensuring a streamlined user experience. We also demonstrate Petreldata.net, a data management portal and gateway to multiple computing resources that supports large-scale research at the Advanced Photon Source.
2. Globus is… a non-profit service developed and operated by the University of Chicago
3. Our mission is to… increase the efficiency and effectiveness of researchers engaged in data-driven science and scholarship through sustainable software
5. Fast, reliable file transfer …from any to any system
[Figure: (1) a user-initiated or automated transfer request names the source and destination systems (instrument, lab server, compute facility); (2) the globally accessible, multi-tenant Globus service transfers the files reliably and securely; (3) optional notifications report completion.]
• Fire-and-forget transfers
• Optimized speed
• Assured reliability
• Unified view of storage
• Browser, REST API, CLI
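The REST API mentioned above accepts a JSON task document describing the files to move. A minimal sketch of assembling such a document in Python follows; the submission ID and endpoint UUIDs are placeholders, and actually submitting the task (via the Globus Transfer API with a Globus Auth token) is omitted:

```python
import json

def make_transfer_document(submission_id, src_ep, dst_ep, items):
    # Shape of a Globus Transfer API "transfer" task document; each item
    # names a source and destination path, optionally a whole directory.
    return {
        "DATA_TYPE": "transfer",
        "submission_id": submission_id,  # obtained from the API beforehand
        "source_endpoint": src_ep,
        "destination_endpoint": dst_ep,
        "notify_on_succeeded": True,     # the "optional notifications" step
        "DATA": [
            {
                "DATA_TYPE": "transfer_item",
                "source_path": src,
                "destination_path": dst,
                "recursive": recursive,
            }
            for src, dst, recursive in items
        ],
    }

# Placeholder IDs, for illustration only:
doc = make_transfer_document(
    "submission-id-placeholder",
    "uuid-of-source-endpoint",
    "uuid-of-destination-endpoint",
    [("/data/run123/", "/ingest/run123/", True)],
)
print(json.dumps(doc, indent=2))
```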
6. Secure data sharing …from any storage
[Figure: (1) the data owner selects files to share on existing storage (laptop, server, compute facility; on-prem or public cloud), selects a user or group, and sets access permissions; (2) the collaborator logs into the globally accessible, multi-tenant Globus service and accesses the shared files, no local account required, downloading via Globus, which controls access to the shared files on the existing storage.]
• Fine-grained access control “overlay” on storage system
• Share with any identity, email, group
• No need to stage data just for sharing
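Programmatically, the "set access permissions" step corresponds to creating an access rule on a shared endpoint through the Transfer API. A sketch of that rule document, with a placeholder identity ID:

```python
def make_access_rule(identity_id, path, permissions="r"):
    # Globus Transfer API "access" rule: grants an identity (or group)
    # read ("r") or read-write ("rw") permission on a path of a
    # shared endpoint, without moving or staging any data.
    return {
        "DATA_TYPE": "access",
        "principal_type": "identity",  # or "group"
        "principal": identity_id,
        "path": path,
        "permissions": permissions,
        "notify_email": None,          # optionally email the collaborator
    }

# Grant a collaborator read access to one cohort directory:
rule = make_access_rule("collaborator-identity-uuid", "/cohort045/")
```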
13. Automated instrument data egress
[Figure: data egresses automatically from instruments (cryo EM, lightsheet, sequencer, ALS/APS, …) into per-cohort shared directories (--/cohort045, --/cohort096, --/cohort127) whose permissions are driven by a local policy store; the data is then available for local-system download and for remote analysis and visualization.]
• Reliable, near-real-time data access
• Automatically set policy-based permissions
• Self-service access control and management
• Federated login for frictionless data access
14. Repository data distribution
[Figure: (1) a user searches for and requests data of interest through the gateway; (2) data is delivered by browser-based download or by bulk data transfer via the globally accessible, multi-tenant Globus service.]
• Gateway/data portal/app enables faceted search
• Enforces fine-grained authorization
• HTTPS download for “small” data
• Asynchronous transfer for larger data sets
15. Data staging for compute
[Figure: (1) the user uploads input data to a staging folder (--/run123/input, rw); (2) the compute service stages output data with access control (--/run123/output, r); (3) the user accesses and downloads results.]
• User securely uploads data for analysis
• Results available with fine-grained permissions
• Automated setup/tear down of folders, permissions
16. Automation Examples
• Syncing a directory
– bash script; calls the Globus CLI
– Python module; run as script or import as module
• Staging data for distribution
– bash and Python variants
• Removing directories after files are transferred
– Python script
github.com/globus/automation-examples
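The directory-sync example can be sketched as a thin Python wrapper over the Globus CLI's `globus transfer` command, whose `--sync-level checksum` option re-copies only files whose contents changed. This is a sketch, not the repository's script; endpoint IDs and the label are placeholders, and running it requires the Globus CLI to be installed and logged in (`globus login`):

```python
import subprocess

def build_sync_command(src_ep, src_path, dst_ep, dst_path):
    # Assemble a Globus CLI invocation that mirrors a directory,
    # transferring only files whose checksums differ at the destination.
    return [
        "globus", "transfer",
        f"{src_ep}:{src_path}", f"{dst_ep}:{dst_path}",
        "--recursive",
        "--sync-level", "checksum",
        "--label", "directory-sync",  # placeholder label
    ]

def sync_directory(src_ep, src_path, dst_ep, dst_path):
    # Fire-and-forget: the CLI submits the task and returns immediately;
    # Globus retries and verifies the transfer server-side.
    subprocess.run(build_sync_command(src_ep, src_path, dst_ep, dst_path),
                   check=True)
```

A cron entry calling `sync_directory` is enough to keep a remote copy current without babysitting individual transfers.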
20. Globus Automate
A platform service for defining, applying, and sharing distributed research automation flows
• Triggers start flows based on subscribed events
• Flows call Action Providers to perform tasks
21. Globus automation architecture
• Built on AWS Step Functions
– JSON-based state machine language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating
custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
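A flow in this JSON-based state machine language might look like the following sketch. The ActionUrl follows the pattern of the hosted Globus Transfer action provider; the state name and the `$.`-prefixed input paths are illustrative:

```json
{
  "StartAt": "TransferData",
  "States": {
    "TransferData": {
      "Type": "Action",
      "ActionUrl": "https://actions.globus.org/transfer/transfer",
      "Parameters": {
        "source_endpoint_id.$": "$.source_endpoint",
        "destination_endpoint_id.$": "$.destination_endpoint",
        "transfer_items": [
          {
            "source_path.$": "$.source_path",
            "destination_path.$": "$.destination_path",
            "recursive": true
          }
        ]
      },
      "ResultPath": "$.TransferResult",
      "End": true
    }
  }
}
```

Because the action's result is written into the flow state (`$.TransferResult`), a later state can branch on it or pass it as input to the next action provider.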
22. Automation Action Providers
• Globus action providers: Transfer, Delete, ACLs, Search Ingest, Describe, Identifier, Notification, User Form, Web Form, Expression Evaluation
• Custom action providers: DLHub, Xtract, funcX
23. funcX action provider
funcX: FaaS platform for HPC
• funcX endpoints deployed at resources
• Service routes requests to endpoints
• Parsl acquires resources
• Singularity containers run functions
• Globus Auth secures communication
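The funcX usage pattern can be sketched as follows. The function body is a stand-in for a real analysis; the SDK calls are shown in comments because running them requires `pip install funcx`, a deployed funcX endpoint (its UUID is a placeholder), and an interactive Globus Auth login:

```python
def reconstruct(frame_count):
    # Stand-in analysis function: funcX serializes the function, ships it
    # to a remote endpoint, runs it (e.g., in a Singularity container),
    # and returns the result to the caller.
    return {"frames": frame_count, "status": "reconstructed"}

# Local execution, i.e., what the endpoint would compute:
result = reconstruct(128)

# Remote execution via the funcX SDK (sketch; requires endpoint + login):
#
#   from funcx.sdk.client import FuncXClient
#   fxc = FuncXClient()
#   func_id = fxc.register_function(reconstruct)
#   task_id = fxc.run(128, endpoint_id="<endpoint-uuid>", function_id=func_id)
#   result = fxc.get_result(task_id)
```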
26. UChicago Kasthuri Lab: Brain aging and disease
• Construct connectomes—mapping of neuron connections
• Use APS synchrotron to rapidly image brains
– Beam time available once every few months
– ~20GB/minute for large (cm) unsectioned brains
• Generate segmented datasets/visualizations for the community
• Perform semi-standard reconstruction on all data across HPC
resources
28. Automating neurocartography
[Figure: an Automate flow chaining action providers: Web form (user input), Auth (get credentials), Describe (get metadata), Transfer (transfer data), funcX (run job), Transfer (transfer data), funcX (run job), Search (ingest), Share (set policy), Identifier (mint DOI).]
31. Materials Data Facility
• >35 TB of data  • >320 authors  • >400 datasets
• Accept data from many
locations with flexible
interfaces
• Index dataset contents in
science-aware ways
• Dispatch data to the
community
• Using Automate to
simplify building
composable flows of
services
32. MDF Data Publication Automation
[Figure: an Automate flow chaining action providers: Web form (metadata), Auth (get credentials), Transfer (transfer dataset), Xtract (extract metadata), Share (set permissions), Transfer (move metadata), Search (bulk ingest), Transfer (transfer dataset), Identifier (mint DOI), Web form (curation), Notify (notify curator), Notify (notify user).]
33. Petrel data services
• Data service providing simple, self-managed, project-based data storage and sharing capabilities (https://petreldata.net)
• Flexible user-managed search index and discovery portal
34. PaaS: develop custom action providers
• Directly use the platform to build and run
extensible flows
• Develop action providers
– Fit for purpose
– Developed and deployed by the project
– Plugged into their flows
• Action Provider Development toolkit
35. SaaS: instrument data management
• Templated solution
• Configurable…
– Set transfer triggers
– Select destination(s)
– Define metadata
• Extensible…
– Add/remove actions
– Change action providers
• No development required
[Figure: automated egress from devices (cryo EM, lightsheet, sequencer, …) into per-cohort directories (--/cohort045, --/cohort096, --/cohort127), with Transfer, funcX, and Xtract actions feeding indexing for search and image reconstruction, analysis, and visualization.]
36. SaaS: Data Management Plans
• “Turnkey” DMP enablement
• Select dataset (collection)…
• …add metadata for indexing
• …generate persistent ID
(DOI, ARK, etc.)
[Figure: Transfer, Identifier, and Ingest actions chained: “point & click” to findable and accessible data.]