Research Automation for Data-Driven Discovery

Ian Foster
Argonne National Laboratory &
The University of Chicago
foster@anl.gov

A productivity crisis in research
Data volumes are growing much faster than Moore’s law …
(10,000x more over 6 years for genome data)
Kahn, Science, 331 (6018): 728-729
But most labs have extremely limited resources
Heidorn: NSF grants in 2007 under $350,000: 80% of awards, 50% of grant dollars

"Well, in our country," said Alice …
"you'd generally get to somewhere else
— if you run very fast for a long time,
as we've been doing."
"A slow sort of country!" said the
Queen. "Now, here, you see, it
takes all the running you can do,
to keep in the same place. If you
want to get somewhere else, you
must run at least twice as fast as that!"
The challenge of staying competitive

https://bit.ly/2l4gfgu
How industry handles complexity

cloud4scieng.org
Industry software builds on powerful platform services

Cloud platforms have transformed how software is
developed and delivered
Can we do the same for science?
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted
implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission: thus
reduce costs, increase quality, promote interoperability

What capabilities?
• Auth: Manage identities, authentication, and authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …

globus.org
Science services
operated by UChicago for researchers worldwide

Monitor transfer
Monitor activities
Manage data

Automate and outsource with
REST APIs and Python SDK
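As a rough illustration of what "outsourcing" a task to a REST service can look like from a script, here is a minimal sketch using only the Python standard library. The base URL, the `/transfers` route, and the JSON field names are hypothetical placeholders invented for this example; they are not the Globus API or SDK. The request is only constructed, never sent.

```python
import json
import urllib.request

# Hypothetical service location, used for illustration only.
BASE_URL = "https://transfer.example.org/v1"


def build_transfer_request(token, source, destination, paths):
    """Build (but do not send) a POST request asking a service to move files."""
    body = {
        "source": source,            # hypothetical field names
        "destination": destination,
        "items": [{"path": p} for p in paths],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/transfers",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Example: describe a transfer of one file between two named locations.
req = build_transfer_request(
    "EXAMPLE_TOKEN", "lab-storage", "campus-cluster", ["/data/run42.h5"]
)
print(req.get_method(), req.full_url)
```

The point of the sketch is the division of labor: the script states what should happen, and a hosted service would be responsible for carrying it out reliably.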

[Figure labels: UK, NIST, NSF, NSF, NSF, DOE, NSF, Canada]

Automate and outsource:
Publication and discovery
Move to permanent location
(or publish in place)
Compute and record checksums
Obtain and record metadata
Assign persistent identifier
Index for discovery
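The publication steps above can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not the Materials Data Facility or Globus implementation: the `publish` and `search` helpers, the record layout, and the `example:` identifier scheme are all hypothetical, and the data is "published in place" rather than moved.

```python
import hashlib
import tempfile
import uuid
from pathlib import Path


def sha256_of(path):
    """Compute a file's SHA-256 checksum, reading in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def publish(files, metadata, index):
    """Record checksums and metadata, assign an identifier, and index the record."""
    record = {
        # Placeholder identifier; a real service would mint a persistent one.
        "identifier": f"example:{uuid.uuid4()}",
        "metadata": dict(metadata),
        "files": [
            {"name": Path(f).name, "size": Path(f).stat().st_size, "sha256": sha256_of(f)}
            for f in files
        ],
    }
    # Toy keyword index: each word of the title maps to matching identifiers.
    for word in metadata.get("title", "").lower().split():
        index.setdefault(word, []).append(record["identifier"])
    return record


def search(index, word):
    """Return identifiers of records whose title contains the given word."""
    return index.get(word.lower(), [])


# Example: publish one small file and find it again by keyword.
index = {}
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "sample.dat"
    data_file.write_bytes(b"hello")
    record = publish([data_file], {"title": "Alloy fatigue measurements"}, index)

print(record["identifier"], search(index, "fatigue"))
```

Recording the checksum at publication time lets a later reader verify that a downloaded copy matches what was published.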
Data Publication
Indexing
materialsdatafacility.org
2 petabytes
100 Gbps
Globus APIs

Automate and outsource:
Publication and discovery
Programmatic access (Python, Jupyter)
Web browse and search

Example: NCAR’s Research Data Archive
Globus is used for:
• Single sign-on via streamlined account provisioning
• Data sharing
• Data downloads

Beyond transfer
(Experimental)

Cloud platforms have transformed how software is
developed and delivered
We can do the same for science
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted
implementations
• Ensure persistence and evolution over time
In so doing, enable many scientists and tool developers to automate
and outsource tasks that are not central to their core mission, to
reduce costs, increase quality, promote interoperability

We have identified some needed capabilities
• Auth: Manage identities, authentication, authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …
Established: 12,000 endpoints, 100,000+ users
New: 100s of users
Experimental: 10s of users
globus.org — Ian Foster — foster@anl.gov
