We present an overview of Globus services for automating research computing and data management tasks, with the goal of accelerating research throughput. This content is aimed at researchers who wish to automate repetitive data management tasks (such as backup and data distribution to collaborators), as well as those working with instruments (cryoEM, next-gen sequencers, fMRI, etc.) who wish to streamline data egress, downstream analysis, and sharing at scale.
This material was presented at the Research Computing and Data Management Workshop, hosted by Rensselaer Polytechnic Institute on February 27-28, 2024.
2. What do we mean by research “automation”?
Executing research tasks* reliably, at scale, with minimal (or no) human intervention when required.
*data management and computation
3. Stepping into automation using Globus
• Level 1: Use the web app; it’s manual, but it may be more automated than your current process 🙂
• Level 2: Semi-automated, recurring tasks
• Level 3: Automation using the Globus CLI
• Level 4: Automation using Globus Flows
• Level 5: “Lights-out” automation using Globus Flows with event triggers
4. A simple, and very common, use case
[Diagram: Transfer (move data to a system for sharing) → Share (set access controls for sharing data). We’ll use this use case to demonstrate the five levels of automation.]
9. Globus Command Line Interface
• Automation of simple data management tasks
• Integration with existing scripts (job submission, …)
• Open source; uses the Python SDK
$ globus
Usage: globus [OPTIONS] COMMAND [ARGS]...

  Interact with Globus from the command line

  All `globus` subcommands support `--help` documentation.

  Use `globus login` to get started!

  The documentation is also online at https://docs.globus.org/cli/

Options:
  -v, --verbose                  Control level of output
  -h, --help                     Show this message and exit.
  -F, --format [unix|json|text]  Output format for stdout. Defaults to text
  --jmespath, --jq TEXT          A JMESPath expression to apply to json
                                 output. Forces the format to be json
                                 processed by this expression
  --map-http-status TEXT         Map HTTP statuses to any of these exit
                                 codes: 0,1,50-99. e.g. "404=50,403=51"

Commands:
  api               Make API calls to Globus services
  bookmark          Manage endpoint bookmarks
  cli-profile-list  List all CLI profiles which have been used
  collection        Manage your Collections
  delete            Submit a delete task (asynchronous)
  endpoint          Manage Globus endpoint definitions
  flows             Interact with the Globus Flows service
10. Transfer and share CLI commands
$ globus transfer
> --recursive
> source_collection_uuid:source_path
> guest_collection_uuid:destination_path
Message: The transfer has been accepted and a task has been created and
queued for execution
Task ID: f5eb855c-4098-11ee-8ba2-2197ca2bfedc
$ globus endpoint permission create
> --group $group_uuid
> --permissions $permissions
> guest_collection_uuid:destination_path
Granting group, ............., read access to the destination directory
Message: Access rule created successfully.
Rule ID: 7fe723a4-413b-11ee-88f9-03dc0e0dcc45
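The two commands above can be chained into a script so that the share step runs only after the transfer has been submitted. A minimal sketch in Python (the CLI commands are the ones shown above; the collection UUIDs, paths, and group ID are placeholders you must replace):

```python
# Sketch: driving the `globus` CLI from Python with subprocess.
# Assumes an authenticated CLI session (run `globus login` first).
import subprocess

def transfer_cmd(src, src_path, dst, dst_path):
    """Build the argv for a recursive `globus transfer`."""
    return ["globus", "transfer", "--recursive",
            f"{src}:{src_path}", f"{dst}:{dst_path}"]

def permission_cmd(collection, path, group_id, perms="r"):
    """Build the argv for `globus endpoint permission create`."""
    return ["globus", "endpoint", "permission", "create",
            "--group", group_id, "--permissions", perms,
            f"{collection}:{path}"]

def run(argv):
    # check=True raises on a failed CLI call instead of passing silently
    return subprocess.run(argv, check=True, capture_output=True, text=True)

# Example wiring (placeholder UUIDs; requires an authenticated session):
#   run(transfer_cmd(SRC_UUID, "/data", DST_UUID, "/shared/run42"))
#   run(permission_cmd(DST_UUID, "/shared/run42", GROUP_UUID))
```

Capturing the task ID and rule ID from stdout (or using `--format json` with `--jmespath`) lets a script track the resulting tasks.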
11. Exercise: Run script using the Globus CLI
• Log into your instance
• Go to the ~/globus-tutorials directory
• Run the transfer_share.sh script
$ ./transfer_share.sh
> --source-collection a6f165fa-aee2-4fe5-95f3-97429c28bf82
> --source-path /cli
> --guest-collection fe2feb64-4ac0-4a40-ba90-94b99d06dd2c
> --sharing-path /rpi/YOUR_NAME
> --group-id 50b6a29c-63ac-11e4-8062-22000ab68755
12. Level 4: Using a Globus Flow
[Diagram: Transfer → Share. Example: moving data from an instrument to a campus cluster for analysis.]
13. Level 4: Automation with Globus Flows
• Flows service: a platform for managed, secure, reliable task orchestration
• Flows comprise actions → invoke Globus services; extensible to support your own services
• Run via web app, CLI, API, event-based triggers
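Starting a run from the CLI is what makes flows scriptable. A sketch of building such an invocation in Python (the `flows` command group appears in the CLI help earlier; the `start` subcommand, the `--input` flag, and the input field names are assumptions to verify against `globus flows --help` for your CLI version; each flow defines its own input schema):

```python
import json

def flow_input(source_id, source_path, dest_id, dest_path):
    """Illustrative input document; real flows define their own schemas."""
    return {
        "source": {"id": source_id, "path": source_path},
        "destination": {"id": dest_id, "path": dest_path},
    }

def flows_start_cmd(flow_id, input_doc):
    # Assumed CLI shape: globus flows start FLOW_ID --input '<json>'
    return ["globus", "flows", "start", flow_id,
            "--input", json.dumps(input_doc)]
```

Passed to `subprocess.run`, this submits a run of an already-deployed flow; the same run can equally be started from the web app or via the Flows API.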
14. Common tasks in most instrument scenarios
[Diagram: the Flows service orchestrates two actions: (1) Transfer: move raw images to the HPC cluster, via the :transfer action provider; (2) Share: set access controls to allow analysis, via the :set_permission action provider.]
17. Flow lifecycle: Write once, run many
• Define using JSON/YAML
• Deploy to Flows service
• Set access policy for visibility and execution
• Run (debug) and monitor
• …and run again!
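As a concrete starting point, here is a minimal sketch of a one-state flow definition in the JSON form described above. The action URL is the Globus-hosted Transfer action provider; the input references (`$.input.…`) and `ResultPath` follow the Amazon States Language conventions that Globus Flows uses, but treat the exact field names as assumptions to check against the Flows documentation:

```json
{
  "StartAt": "TransferFiles",
  "States": {
    "TransferFiles": {
      "Type": "Action",
      "ActionUrl": "https://actions.globus.org/transfer/transfer",
      "Parameters": {
        "source_endpoint_id.$": "$.input.source_id",
        "destination_endpoint_id.$": "$.input.destination_id",
        "transfer_items": [
          {
            "source_path.$": "$.input.source_path",
            "destination_path.$": "$.input.destination_path",
            "recursive": true
          }
        ]
      },
      "ResultPath": "$.TransferResult",
      "End": true
    }
  }
}
```

Deploying the definition to the Flows service and setting its access policy happen once; the flow can then be run many times with different inputs.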
24. Exercise: Run a Globus flow using the web app
• Find “Tutorial - Transfer and Share” in the Flows library
• Click “Start”
• Confirm the source and destination collections
• Change the target path to /rpi/YOUR_NAME
• Enter a label for the flow run
• Click “Start Run”
• Monitor flow progress on the “Event log” tab
26. Simulating an instrument flow
[Diagram: an EC2 instance running Globus Connect Server acts as the “instrument”. (0) A monitor script on the instance triggers a flow run; (1) the flow transfers files to ALCF Eagle (also running Globus Connect Server); (2) the flow sets permissions so collaborators can access the data and run analysis.]
28. A more interesting scenario: cryoEM
[Diagram: a Globus flow moves data from the microscope to the Carbon cluster: Transfer (move raw files), Compute (launch analysis job to correct, classify, …), Share (set access controls), Transfer (move final files to repo).]
29. End-to-end automation: serial crystallography
[Diagram: a Globus flow spans data capture, image processing (on Carbon), and data publication, with actions including: Transfer (move raw files), Compute (check threshold), Compute (analyze images), Compute (visualize), Compute (launch QA job), Compute (gather metadata), Share (set access controls), Transfer (move results to repo), Search (ingest to index).]
31. Extending the ecosystem: action providers
• An action provider is a service endpoint that supports:
  – Run
  – Status
  – Cancel
  – Release
  – Resume
• Action Provider Toolkit: action-provider-tools.readthedocs.io
• Existing action providers include: transfer, delete, mkdir, ls, ACLs, compute, search, ingest, describe, identifier, notify, web form, Xtract, and custom-developed providers
• Hosted action providers: docs.globus.org/api/flows/hosted-action-providers
32. Support resources
• Flows service in web app: app.globus.org/flows
• Flows documentation: docs.globus.org/api/flows
• Helpdesk: support@globus.org
• Customer engagement team can advise on flows
• Professional services team can help build flows