This document introduces the Globus platform for automating research data flows. It describes Globus capabilities for scheduled transfers, command line scripting, and comprehensive task orchestration. It also covers Globus Auth for identity and access management, securing apps with Globus Auth, and the Globus Timer Service. The Globus command line interface and Python SDK allow for programmatic access and automation of data transfers and other tasks.
Automating Research Data Flows and Introduction to the Globus Platform
1. Automating Research Data Flows and
Introduction to the Globus Platform
Greg Nawrocki
greg@globus.org
nawrocki@uchicago.edu
nawrocki@anl.gov
September 21, 2022
2. Globus Automation Capabilities
Timer Service
Scheduled and recurring transfers
(a.k.a. Globus cron)
Command Line Interface
Ad hoc scripting and integration
Globus Flows service
Comprehensive task (data and
compute) orchestration with human in
the loop interactions
3. Globus Auth: Foundational IAM service
• Brokers authentication and authorization among…
– End-users
– Identity providers: enterprise, external (federated identities)
– Services: resource servers with REST APIs
– Apps: web, mobile, desktop, command line clients
– Services acting as clients to other services
• Support high assurance service for use with
protected data (e.g. HIPAA protected data)
5
4. Securing Apps with Globus Auth
• Native App (with refresh tokens – extend expiration)
– Authentication as user identity
– Authentication URL / come back with a auth code – exchanged for tokens
– Clients can’t keep a secret - tokens in plain text
– Jupyter Notebook examples / CLI Based Timer Service
• Auth Code Grant – Templated App
– Authentication as user identity
– Browser redirect to Globus Auth, auth code returned (no manual copy)
– Tokens stored securely
– CLI / Jupyter Hub secured with Globus Auth
• Confidential Client:
– Authentication as application
– ClientID and Secret stored securely
– Custom apps
– developers.globus.org - Client ID / Secret / Client Identity Username
6
6. Use case: Data replication
• For backup: initiated by user or system back up
• Automated transfer of data from science instrument
12
Recurring transfers
with sync option
Copy /ingest
Daily @ 3:30am
7. The Globus Timer Service
• Scheduled/recurring file transfers
• Supports all Globus transfer and sync options
• Service accessible via web app and CLI
• Example: NIH – hpc.nih.gov/storage/globus_cron.html
13
11. Globus Command Line Interface
Open source, uses
the Python SDK
Because of this
correspondence the CLI is
an excellent tool for getting
the gist of how he SDK
functions.
Very scriptable!
12. Globus CLI
• It’s a stand alone application distributed by Globus
– https://docs.globus.org/cli/
– https://github.com/globus/globus-cli
• Easy install and updates
• Command “globus login” gets access tokens
• All interactions with the service use the tokens
– The CLI is acting as you (your identity)
• Command “globus logout” deletes those
13. CLI Basics – “globus” is the executable
$ globus endpoint search 'Globus Tutorial'
$ globus task list
$ globus get-identities greg@globus.org --verbose
• Getting help / list of commands
– globus list-commands
– globus –help
– https://docs.globus.org/cli/examples/
• UUIDs for endpoint, task, user identity, groups…
• Can query to discover the UUIDs
– Use search / list / get options
14. Use case: Sharing out data
Researcher initiates
transfer request
Instrument
Globus
controls
access to
shared files on
existing
storage; no
need to move
files to cloud
storage!
Researcher
selects files to
share, selects
user or group,
and sets access
permissions
Collaborator logs in to
Globus and accesses
shared files; no local
account required;
download via Globus
Personal Computer
Transfer
Share
Compute Facility
Globus transfers files
reliably, securely
15. Step 1: Transfer
• Using a Batch Transfer
– Transfer tasks have one source/destination, but can have any
number of files
– Provide input source-dest pairs via local file
– File may have embedded comments
$ export ep1=ddb59aef-6d04-11e5-ba46-22000b92c6ec
$ export ep2=af7bda53-6d04-11e5-ba46-22000b92c6ec
$ globus transfer $ep1:/share/godata/ $ep2:/~/automation-example-
data-share/outbound/Vas --label "Files to share" --batch
/Users/gregnawrocki/Greg/Globus/Demos/ornl20220623/files.txt
# this is the contents of files.txt:
# a list of source paths followed by destination paths
file1.txt file1.txt
# file2.txt file2.txt # inline-comments are also allowed
file3.txt file3.txt
16. Step 2: Share - Set permissions
• Set and manage permissions on guest collection
• Requires access manager role
$ globus endpoint search 'automation-example-data-share’
$ export share=bf4445c4-ecfd-11ec-aed7-6f7c2b57b05c
$ globus endpoint permission create --permissions r --identity
vas@uchicago.edu $share:/Vas/
17. Useful submission commands
• Safe resubmissions
– Applies to asynchronous tasks (transfer and delete)
– Get a task UUID, use that in submission
– $ globus task generate-submission-id
– --submission-id option in transfer
– Useful for lazy branching or when dealing with unreliable
networks
• Task wait
– useful for scripting conditionals on transfer task status
18. Parsing CLI output
$ globus endpoint search --filter-scope my-endpoints
$ globus endpoint search --filter-scope my-endpoints --format json
$ globus endpoint search --filter-scope my-endpoints --jmespath
'DATA[].[id, display_name]'
• Default output is text; for JSON output use --format
json
• Extract specific attributes using --jmespath
<expression>
19. Managing notifications
• Turn off emails sent for tasks
• Useful when an application manages tasks for a user
• Disable notifications with the --notify option
--notify off (all notifications)
--notify succeeded|failed|inactive (select notifications)
21. Managed automation of tasks
• Flows: A platform service for defining, applying, and
sharing distributed research automation flows
• Flows comprise Actions
• Action Providers: Called by Flows to perform tasks
• Triggers*: Start flows based on events
* In development
22. Automation services ecosystem
GET /provider_url/
POST /provider_url/run
GET /provider_url/action_id/status
GET /provider_url/action_id/cancel
GET /provider_url/action_id/status
Create Action
Providers
Define and
deploy flows
{ “StartAt”: ”ToProject”,
”States” : {
”ToProject” : { … },
”SetPermission” : { …},
“ProcessData” : { … } … }}
Run flows
23. Automation with Globus Flows
• Built on AWS Step Functions
– Simple JSON-based state machine
language
– Conditions, loops, fault tolerance, etc.
– Propagates state through the flow
• Standardized API for integrating
custom event and action services
– Actions: synchronous or asynchronous
– Custom Web forms prompt for user input
• Actions secured with Globus Auth
28. Extending the ecosystem: Action providers
35
• Action Provider is a
service endpoint
– Run
– Status
– Cancel
– Release
– Resume
• Action Provider Toolkit
action-provider-
tools.readthedocs.io/en/latest
Search
Transfer
Notification
ACLs Identifier
Delete
Ingest
User
Form
Describe Xtract
funcX Web
Form
Custom built
Globus Provided
30. 37
Custom portals? Science Gateways? Unique workflows? Our open
REST APIs and Python SDK empower you to create an integrated
ecosystem of research data services and applications.
32. App Access to Collections
• Globus Transfer – Authentication with access tokens
– Individual: Globus login (consents) to get tokens
– Application: Apps are people too!
o developers.globus.org - Client ID / Secret / Client Identity Username
• Collection access
– GCSv4 Mapped Collections – user consent
o Activating endpoint means binding a credential to an endpoint for login
– GCSv5 Mapped Collections (no user certificates, OAUTH tokens and consents)
o https://docs.globus.org/globus-connect-server/v5.4/use-client-credentials/
o Request the data_access scope (per collection) to be able to access the collection.
o The storage gateway must permit identities from the 'clients.auth.globus.org' identity domain
o Identity Mapping Policy that maps the ‘UUID@clients.auth.globus.org' identity to a valid local user
– Guest Collections
o Guest Collections auto-activate - need to do this before API calls to endpoints
o Use Guest Collections whenever possible
– Remember to set your ACLs (WebApp)
Automation
34. Globus APIs
• Auth
• Groups
• Transfer
• Search
• Timer
• Flows
• GCS Manager
• Globus Web App consumes public
Transfer API
• Resource named by URL (standard
REST approach)
• Globus APIs use JSON for documents
docs.globus.org/api/transfer
35. Globus Python SDK
• Python client library for the Globus REST APIs
• Largely direct mapping to REST API
• globus_sdk.TransferClient class handles
connection management, security, framing,
marshaling
globus-sdk-python.readthedocs.io/en/stable/
globus.github.io/globus-sdk-python
42
36. Synchronous Tasks
• Endpoint search (with scopes)
• List directory contents (ls)
• Make directory (mkdir)
• Rename
• Note:
– Path encoding & UTF gotchas
– Don’t forget to auto-activate first
46
37. Asynchronous Tasks
• Transfer
– Sync level option
• Delete
• Get submission_id, followed by submit
– Once and only once submission
• Use task id to “follow up”
47
38. The Globus API / SDK with a Jupyter Notebook in a
Jupyter Hub – Auth Code Grant
login
REST APIs
{ “tokens”:…
{“tokens”:…
REST APIs
REST APIs
Bearer a45cd…
39. Walkthrough API with our Jupyter Hub
• https://jupyter.demo.globus.org
– Sign in with Globus
– Verify the consents
– Start My Server (this will take about a minute)
– Open folder: globus-jupyter-notebooks
– Run Platform_Introduction_JupyterHub_Auth.ipynb
• If you mess it up and want to “go back to the beginning”
– Just stop and restart the server
• If you want to use the notebook outside of our hub
– https://github.com/globus/globus-jupyter-notebooks
– Authentication is a manual cut and paste of exchanging the
authorization code for an access token – Native App
49
41. Developer References
• Globus API / SDK Documentation
– Transfer API : docs.globus.org/api/transfer/
– SDK: globus-sdk-python.readthedocs.io/en/stable/
• Globus GitHub: github.com/globus/
– Jupyter Notebooks
o Stand alone notebooks and hub integrations that walk through much of the
functionality of our SDK
o https://github.com/globus/globus-jupyter-notebooks
– Automation Examples
o Shell scripted CLI and Python module examples of common research data
management use cases
o https://github.com/globus/automation-examples