Ian Foster
Accelerating
data-driven discovery
in energy science
Distinguished Fellow
Life Sciences
and Biology
Advanced
Materials
Condensed
Matter Physics
Chemistry and
Catalysis
Soft Materials
Environmental
and Geo
Sciences
Can we determine
pathways that lead
to novel states and
nonequilibrium
assemblies?
Can we observe –
and control –
nanoscale chemical
transformations in
macroscopic
systems?
Can we create new materials with
extraordinary properties – by engineering
defects at the atomic scale?
Can we map – and
ultimately harness –
dynamic heterogeneity
in complex correlated
systems?
Can we unravel the
secrets of biological
function – across
length scales?
Can we understand
physical and chemical
processes in the most
extreme environments?
New tools are needed to answer
the most pressing scientific Qs
The resulting data deluge
Spans biology, climate, cosmology, materials,
physics, urban sciences, …
Simulation data
Petascale → exascale simulations;
simulation datasets as laboratories;
high-throughput characterization; etc.
Experimental data
Light sources, genome sequencing,
next-gen ARM radar, sky surveys,
high-throughput experiments, etc.
New research methods that depend on coupling
1) Of computation and experiment
- inverse problems, computer control
2) Across data sources and types
- knowledge integration, analysis
Scientific progress requires
collaborative discovery engines
informatics
analysis
high-throughput
experiments
problem
specification
modeling and
simulation
analysis &
visualization
experimental
design
analysis &
visualization
Integrated
databases
Rick Stevens
Example: A discovery engine for
disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Sample
Experimental
scattering
Material
composition
Simulated
structure
Simulated
scattering
La
60%
Sr
40%
Detect errors
(secs—mins)
Knowledge base
Past experiments;
simulations; literature;
expert knowledge
Select experiments
(mins—hours)
Contribute to knowledge base
Simulations driven by
experiments (mins—days)
Knowledge-driven
decision making
Evolutionary optimization
Accelerating
data-driven discovery
in energy science
(1) Eliminate data friction
Eliminating data friction is
essential to modern science
Civilization advances
by extending the number
of important operations
which we can perform
without thinking about
them (Whitehead, 1912)
Obstacles to data access,
movement, discovery,
sharing, and analysis slow
research, distort research
directions, and waste time
(DOE reports, 2005-2015)
Software as a service (SaaS)
as lubricant
Customer relationship
management (CRM):
A knowledge-intensive process
Historically, handled manually
or via expensive, inflexible
on-premise software
SaaS has revolutionized
how CRM is consumed
• Outsource to provider who
runs software on cloud
• Access via simple interfaces
[Chart: SaaS vs. on-premise, compared on ease of use, cost, and flexibility]
Globus: Research data
management as a service
Essential research data
management services
• File transfer
• Data sharing
• Data publication
• Identity and groups
Builds on 15 years of DOE
research
Outsourced and automated
• High availability, reliability,
performance, scalability
• Convenient for
  - Casual users: Web interfaces
  - Power users: APIs
  - Administrators: Install, manage
globus.org
“I need to easily, quickly, & reliably
move data to other locations.”
Research Computing HPC Cluster
Lab Server
Campus Home Filesystem
Desktop Workstation
Personal Laptop
DOE supercomputer
Public Cloud
“I need to get data from a scientific
instrument to my analysis system.”
Next Gen Sequencer
Light Sheet Microscope
MRI
Advanced Light Source
“I need to easily and securely
share my data with my colleagues.”
Globus and the research data lifecycle
Transfer
1. Researcher initiates transfer request; or requested automatically by script, science gateway
2. Globus transfers files reliably, securely
Share
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
Publish
6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
Discover
8. Peers, collaborators search and discover datasets; transfer and share using Globus
[Diagram labels: Instrument; Compute Facility; Personal Computer; Publication Repository]
• SaaS → Only a web browser required
• Use storage system of your choice
• Access using your campus credentials
Globus at a glance
4
major services
13
national labs
use Globus
services
100 PB
transferred
8,000
active endpoints
20 billion
files processed
>300
users are active
daily
25,000
registered users
99.95%
uptime over the
past two years
>30
subscribers
The biggest
transfer to date is
1 petabyte
The longest-
running transfer to
date took
3 months
We’re eager to
learn what
you
want to do with
Globus services
One APS node
connects to
125 locations
thru mid 2014
Same node
(1 Gbps link)
Globus and DOE:
Terabytes per month
Globus and DOE:
Running total terabytes
Globus and DOE:
Active users per month
Response has been gratifying
"Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory
"Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates,
Professional Research Assistant, NOAA
"…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was
trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing
"... we just had a 153TB transfer that got 20Gb/s and another with 144TB at 25Gb/s! That's pretty insane!" -
Jason Alt, Systems Management and Development Lead at National Center for Supercomputing Applications
"We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service
was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory
"The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user
"I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley
National Lab
"We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator,
Oak Ridge National Laboratory
"Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user
"The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC
support team, NCSA
Globus service APIs serve
as a science platform
Identity, Group, and
Profile Management
…
Globus Toolkit
Globus APIs
Globus Connect
Data Publication & Discovery
File Sharing
File Transfer & Replication
Globus platform
services enable new
application capabilities
Publication as service for ACME
Globus platform
accelerates development
of new services
Operating a sustainable service
Globus is a not-for-profit
service for researchers
We adopt a subscription-
supported freemium model
Subscribers get extra
features, rapid support
We’re engaged in crossing
the chasm
Support from DOE will
contribute to long-term
success
Accelerating
data-driven discovery
in energy science
(2) Liberate scientific data
Q: What is the biggest obstacle
to data sharing in science?
A: The vast majority of data
is lost or not online;
if online, not described;
if described, not indexed
Not accessible
Not discoverable
Not used
Contrast with common practice
for consumer photos (iPhoto)
• Automated capture
• Publish then curate
• Processing to add value
• Outsourced storage
We must automate the capture,
linking, and indexing of all data
Globus publication service
encodes and automates data
publication pipelines
Example application: Materials
Data Facility for materials
simulation and experiment data
Proposed distributed virtual
collections index, organize,
tag, & manage distributed data
Think iPhoto on steroids –
backed by domain knowledge
and supercomputing power
We must automate the capture,
linking, and indexing of all data
chiDB: Human-computer
collaboration to extract Flory-
Huggins (𝞆) parameters from
polymers literature
R. Tchoua et al.
Plenario: Spatially and
temporally integrated, linked,
and searchable database of
urban data
C. Catlett, B. Goldstein, T. Malik et al.
“I need to publish my data so that
others can find it and use it.”
Scholarly
Publication
Reference
Dataset
Research
Community
Collaboration
Publish dashboard
Start a new submission
Describe submission: 1) Dublin Core
Describe submission: 2) Science metadata
Assemble the dataset
Transfer files to submission endpoint
Check dataset is assembled correctly
Submission now in curation workflow
Search published datasets
Search across collections
Discover a published dataset
Select a published dataset
View downloaded dataset
Configuring a publication
pipeline: Publication “facets”
identifier: URL, Handle, DOI
description: none, standard, custom, domain-specific
curation: none, acceptance, machine-validated, human-validated
access: anonymous, public, collaborators, embargoed
preservation: transient, project lifetime, “forever”, archive
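One way to picture a pipeline configured from these facets is as a small validated record. The sketch below is illustrative only: the facet names and option values are taken from the slide (the grouping of options under each facet is reconstructed from the slide layout), while `FACETS`, `PipelineConfig`, and the example values are hypothetical names and choices, not part of any Globus API.

```python
from dataclasses import dataclass

# Facet names and allowed values as read from the slide. The grouping is a
# reconstruction of the slide layout; nothing here is a real Globus interface.
FACETS = {
    "identifier": ("URL", "Handle", "DOI"),
    "description": ("none", "standard", "custom", "domain-specific"),
    "curation": ("none", "acceptance", "machine-validated", "human-validated"),
    "access": ("anonymous", "public", "collaborators", "embargoed"),
    "preservation": ("transient", "project lifetime", "forever", "archive"),
}


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical record holding one choice per publication facet."""
    identifier: str
    description: str
    curation: str
    access: str
    preservation: str

    def __post_init__(self):
        # Reject any value that is not listed for its facet.
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{facet}={value!r} is not one of {allowed}")


# Example: a collection that mints DOIs and requires human curation.
# These particular choices are made up for illustration.
config = PipelineConfig(
    identifier="DOI",
    description="domain-specific",
    curation="human-validated",
    access="public",
    preservation="archive",
)
print(config)
```

Validating at construction time means a misconfigured pipeline fails when it is defined rather than when a dataset is submitted.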
Accelerating
data-driven discovery
in energy science
(3) Create discovery engines
at DOE facilities
Recall: A discovery engine for
disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Sample
Experimental
scattering
Material
composition
Simulated
structure
Simulated
scattering
La
60%
Sr
40%
Detect errors
(secs—mins)
Knowledge base
Past experiments;
simulations; literature;
expert knowledge
Select experiments
(mins—hours)
Contribute to knowledge base
Simulations driven by
experiments (mins—days)
Knowledge-driven
decision making
Evolutionary optimization
Simulation
Characterize, Predict
Assimilate
Steer data acquisition
Data analysis
Reconstruct, detect features, auto-correlate, particle distributions, …
Science automation services
Scripting, security, storage, cataloging, transfer
~0.001-0.5 GB/s/flow
~2 GB/s total burst
~200 TB/month
~10 concurrent flows
(Today: x10 in 5 yrs)
Integration
Optimize, fit, …
Configure
Check
Guide
Batch
Immediate
0.001 1 100+ PFlops
Precompute material database
Reconstruct image
Auto-correlation
Feature detection
Scientific opportunities
• Probe material structure and
function at unprecedented scales
Technical challenges
• Many experimental modalities
• Data rates and computation
needs vary widely; increasing
• Knowledge management,
integration, synthesis
Towards discovery engines for
energy science (Argonne LDRD)
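The data rates quoted on this slide (~200 TB/month in total, ~2 GB/s total burst) can be compared with a little arithmetic. The sketch below is a rough back-of-the-envelope calculation, assuming a 30-day month and decimal units (1 TB = 1000 GB); both assumptions are mine, not the slide's.

```python
# Assumptions (not from the slide): 30-day month, decimal units.
SECONDS_PER_MONTH = 30 * 24 * 3600
GB_PER_TB = 1000


def average_rate_gb_per_s(tb_per_month: float) -> float:
    """Convert a monthly volume in TB to a sustained average rate in GB/s."""
    return tb_per_month * GB_PER_TB / SECONDS_PER_MONTH


monthly_volume_tb = 200   # "~200 TB/month" from the slide
burst_gb_per_s = 2.0      # "~2 GB/s total burst" from the slide

average = average_rate_gb_per_s(monthly_volume_tb)
print(f"sustained average: {average:.3f} GB/s")
print(f"burst / average:   {burst_gb_per_s / average:.1f}x")
```

Under these assumptions the burst rate is more than an order of magnitude above the sustained average, which is consistent with the slide's point that data rates vary widely.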
Linking experiment and
computation
Single-crystal diffuse scattering
Defect structure in disordered materials.
(Osborn, Wilde, Wozniak, et al.)
Estimate structure via inverse modeling:
many-simulation evolutionary optimization on
100K+ BG/Q cores (Swift+OpenMP).
Near-field high-energy X-ray diffraction microscopy
Microstructure in bulk materials (Almer, Sharma, et al.)
Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes,
vs. >5 hours on APS cluster or months if data taken home. Used to
detect errors in one run that would have resulted in total waste of beamtime.
X-ray nano/microtomography
Bio, geo, and material science imaging.
(Bicer, Gursoy, Kettimuthu, De Carlo, et al.).
Innovative in-slice parallelization method gives
reconstruction of 360x2048x1024 dataset in ~1
minute, using 32K BG/Q cores, vs. many days
on cluster: enables quasi-instant response
2-BM
1-ID
6-ID
Populate
Sim Sim
Select
Sim
Microstructure of a copper
wire, 0.2 mm diameter
Advanced
Photon Source
Experimental and simulated
scattering from manganite
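The inverse-modeling idea named above for single-crystal diffuse scattering (many simulations driven by evolutionary optimization) can be illustrated with a toy example. This is a minimal sketch and not the authors' method: `simulate_scattering` is a made-up one-parameter forward model, the "experiment" is synthetic, and the 40% Sr value is borrowed from the La 60% / Sr 40% labels on the discovery-engine slide. A real study would replace the forward model with a structure simulation and run candidates in parallel.

```python
import random


def simulate_scattering(sr_fraction, n_points=16):
    """Toy forward model (hypothetical): composition -> 1-D intensity profile."""
    profile = []
    for q in range(n_points):
        t = q / n_points
        profile.append((1.0 - sr_fraction) * t + sr_fraction * t * t)
    return profile


def misfit(candidate, observed):
    """Sum of squared differences between simulated and observed profiles."""
    simulated = simulate_scattering(candidate, len(observed))
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def evolve(observed, pop_size=20, n_parents=5, generations=30, sigma=0.05, seed=0):
    """Elitist evolutionary search over a single parameter in [0, 1]."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(pop_size)]
    history = []  # best misfit seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda c: misfit(c, observed))
        parents = population[:n_parents]          # selection: keep the best
        history.append(misfit(parents[0], observed))
        children = []
        for _ in range(pop_size - n_parents):     # mutation: Gaussian perturbation
            child = rng.choice(parents) + rng.gauss(0.0, sigma)
            children.append(min(1.0, max(0.0, child)))
        population = parents + children
    best = min(population, key=lambda c: misfit(c, observed))
    return best, history


observed = simulate_scattering(0.4)   # synthetic "experiment" with 40% Sr
best, history = evolve(observed)
print(f"estimated Sr fraction: {best:.3f}")
```

Each generation evaluates every candidate independently, which is why this kind of search maps naturally onto many cores.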
0: Develop or reuse script
1: Run script (EL1.layer)
2: Lookup file (name=EL1.layer, user=Anton, type=reconstruction)
3: Transfer inputs
4: Run app
5: Transfer results
6: Update catalogs
[Diagram labels: Researchers; Script libraries; Storage locations; Compute facilities; External collaborators; Collaboration catalogs (Provenance; Files & Metadata)]
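The numbered steps in this diagram can be sketched as a small script. This is a simplified illustration under stated assumptions: `Catalog`, `run_workflow`, and `reconstruct` are hypothetical stand-ins, transfers are simulated by changing a `location` field, and nothing here calls a real Globus or Swift interface. Only the file attributes (`name=EL1.layer`, `user=Anton`, `type=reconstruction`) come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """In-memory stand-in for a collaboration catalog (hypothetical)."""
    entries: list = field(default_factory=list)

    def add(self, **attrs):
        self.entries.append(dict(attrs))

    def lookup(self, **attrs):
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in attrs.items())]


def run_workflow(catalog, app, **query):
    """Steps 2-6 of the diagram; step 1 is the call to this function."""
    log = []
    inputs = catalog.lookup(**query)                            # 2: lookup file
    log.append("lookup")
    staged = [dict(e, location="compute") for e in inputs]      # 3: transfer inputs
    log.append("transfer inputs")
    results = [app(e) for e in staged]                          # 4: run app
    log.append("run app")
    returned = [dict(r, location="storage") for r in results]   # 5: transfer results
    log.append("transfer results")
    for r in returned:                                          # 6: update catalogs
        catalog.add(**r)
    log.append("update catalogs")
    return returned, log


def reconstruct(entry):
    """Placeholder application: records which input the result came from."""
    return {"name": entry["name"] + ".out", "user": entry["user"],
            "type": "result", "derived_from": entry["name"]}


catalog = Catalog()
catalog.add(name="EL1.layer", user="Anton", type="reconstruction",
            location="storage")
results, log = run_workflow(catalog, reconstruct,
                            name="EL1.layer", user="Anton", type="reconstruction")
print(log)
print(results)
```

Recording `derived_from` on each result is one simple way to keep the provenance link that the catalog in the diagram is meant to hold.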
Tying it all together: An energy
sciences infrastructure
informatics
analysis
high-throughput
experiments
problem
specification
modeling and
simulation
analysis &
visualization
experimental
design
analysis &
visualization
Integrated
databases
Summary: Big opportunities
and challenges for energy data
Immediate opportunities
• Reduce data friction and
accelerate discovery by
deploying Globus services
across all DOE facilities
• Develop new services to
capture, link energy data
Important research agenda
• Discovery engines to answer
major scientific questions
• New research modalities
linking computation and data
• Organization and analysis of
massive science data
Thank you to our sponsors!
U.S. DEPARTMENT OF
ENERGY
For more information: foster@anl.gov
Thanks to co-authors and Globus team
Globus services (globus.org)
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014.
• Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014.
Publication (globus.org/data-publication)
• Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.
Discovery engines
• Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.

Weitere ähnliche Inhalte

Was ist angesagt?

Assessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformaticsAssessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformatics
Peter van Heusden
 
PTU: Using Provenance for Repeatability
PTU: Using Provenance for RepeatabilityPTU: Using Provenance for Repeatability
PTU: Using Provenance for Repeatability
Tanu Malik
 

Was ist angesagt? (20)

SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDE...
 
Cloud com foster december 2010
Cloud com foster december 2010Cloud com foster december 2010
Cloud com foster december 2010
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
WoSC19: Serverless Workflows for Indexing Large Scientific Data
WoSC19: Serverless Workflows for Indexing Large Scientific DataWoSC19: Serverless Workflows for Indexing Large Scientific Data
WoSC19: Serverless Workflows for Indexing Large Scientific Data
 
Toward a National Research Platform
Toward a National Research PlatformToward a National Research Platform
Toward a National Research Platform
 
Assessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformaticsAssessing Galaxy's ability to express scientific workflows in bioinformatics
Assessing Galaxy's ability to express scientific workflows in bioinformatics
 
Cloud Accelerated Genomics
Cloud Accelerated GenomicsCloud Accelerated Genomics
Cloud Accelerated Genomics
 
Sgg crest-presentation-final
Sgg crest-presentation-finalSgg crest-presentation-final
Sgg crest-presentation-final
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Benchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging ServicesBenchmarking Cloud-based Tagging Services
Benchmarking Cloud-based Tagging Services
 
GlobusWorld 2019 Opening Keynote
GlobusWorld 2019 Opening KeynoteGlobusWorld 2019 Opening Keynote
GlobusWorld 2019 Opening Keynote
 
GlobusWorld 2015
GlobusWorld 2015GlobusWorld 2015
GlobusWorld 2015
 
CHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning PlatformCHASE-CI: A Distributed Big Data Machine Learning Platform
CHASE-CI: A Distributed Big Data Machine Learning Platform
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
Apache Spark NLP for Healthcare: Lessons Learned Building Real-World Healthca...
 
Managing data in computational edge clouds
Managing data in computational edge cloudsManaging data in computational edge clouds
Managing data in computational edge clouds
 
PTU: Using Provenance for Repeatability
PTU: Using Provenance for RepeatabilityPTU: Using Provenance for Repeatability
PTU: Using Provenance for Repeatability
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Sciunits: Reusable Research Objects
Sciunits: Reusable Research Objects Sciunits: Reusable Research Objects
Sciunits: Reusable Research Objects
 
GeoDataspace: Simplifying Data Management Tasks with Globus
GeoDataspace: Simplifying Data Management Tasks with GlobusGeoDataspace: Simplifying Data Management Tasks with Globus
GeoDataspace: Simplifying Data Management Tasks with Globus
 

Ähnlich wie Accelerating Data-driven Discovery in Energy Science

Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
HostedbyConfluent
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
Ian Foster
 
Belak_ICME_June02015
Belak_ICME_June02015Belak_ICME_June02015
Belak_ICME_June02015
Jim Belak
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
Ian Foster
 

Ähnlich wie Accelerating Data-driven Discovery in Energy Science (20)

The Pacific Research Platform
The Pacific Research PlatformThe Pacific Research Platform
The Pacific Research Platform
 
Shifting the Burden from the User to the Data Provider
Shifting the Burden from the User to the Data ProviderShifting the Burden from the User to the Data Provider
Shifting the Burden from the User to the Data Provider
 
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and AutomationThe Discovery Cloud: Accelerating Science via Outsourcing and Automation
The Discovery Cloud: Accelerating Science via Outsourcing and Automation
 
The Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, EvolutionThe Earth System Grid Federation: Origins, Current State, Evolution
The Earth System Grid Federation: Origins, Current State, Evolution
 
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus PosterNIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
NIH NCI Childhood Cancer Data Initiative (CCDI) Symposium Globus Poster
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
IEEE_BigData2014-Lee.pdf
IEEE_BigData2014-Lee.pdfIEEE_BigData2014-Lee.pdf
IEEE_BigData2014-Lee.pdf
 
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
Monitoring Exascale Supercomputers With Tim Osborne | Current 2022
 
re:Invent 2013-foster-madduri
re:Invent 2013-foster-maddurire:Invent 2013-foster-madduri
re:Invent 2013-foster-madduri
 
Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013Big Process for Big Data @ PNNL, May 2013
Big Process for Big Data @ PNNL, May 2013
 
Accelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundaneAccelerating data-intensive science by outsourcing the mundane
Accelerating data-intensive science by outsourcing the mundane
 
big_data_casestudies_2.ppt
big_data_casestudies_2.pptbig_data_casestudies_2.ppt
big_data_casestudies_2.ppt
 
Peering The Pacific Research Platform With The Great Plains Network
Peering The Pacific Research Platform With The Great Plains NetworkPeering The Pacific Research Platform With The Great Plains Network
Peering The Pacific Research Platform With The Great Plains Network
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
Belak_ICME_June02015
Belak_ICME_June02015Belak_ICME_June02015
Belak_ICME_June02015
 
Big Data
Big Data Big Data
Big Data
 
Data Automation at Light Sources
Data Automation at Light SourcesData Automation at Light Sources
Data Automation at Light Sources
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Next-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information RetrievalNext-Generation Search Engines for Information Retrieval
Next-Generation Search Engines for Information Retrieval
 
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway SystemThe Pacific Research Platform: A Science-Driven Big-Data Freeway System
The Pacific Research Platform: A Science-Driven Big-Data Freeway System
 

Mehr von Ian Foster

Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
Ian Foster
 

Mehr von Ian Foster (20)

Global Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptxGlobal Services for Global Science March 2023.pptx
Global Services for Global Science March 2023.pptx
 
Better Information Faster: Programming the Continuum
Better Information Faster: Programming the ContinuumBetter Information Faster: Programming the Continuum
Better Information Faster: Programming the Continuum
 
ESnet6 and Smart Instruments
ESnet6 and Smart InstrumentsESnet6 and Smart Instruments
ESnet6 and Smart Instruments
 
Linking Scientific Instruments and Computation
Linking Scientific Instruments and ComputationLinking Scientific Instruments and Computation
Linking Scientific Instruments and Computation
 
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific DiscoveryA Global Research Data Platform: How Globus Services Enable Scientific Discovery
A Global Research Data Platform: How Globus Services Enable Scientific Discovery
 
Foster CRA March 2022.pptx
Foster CRA March 2022.pptxFoster CRA March 2022.pptx
Foster CRA March 2022.pptx
 
Big Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental ScienceBig Data, Big Computing, AI, and Environmental Science
Big Data, Big Computing, AI, and Environmental Science
 
AI at Scale for Materials and Chemistry
AI at Scale for Materials and ChemistryAI at Scale for Materials and Chemistry
AI at Scale for Materials and Chemistry
 
Coding the Continuum
Coding the ContinuumCoding the Continuum
Coding the Continuum
 
Data Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud AutomationData Tribology: Overcoming Data Friction with Cloud Automation
Data Tribology: Overcoming Data Friction with Cloud Automation
 
Research Automation for Data-Driven Discovery
Research Automation for Data-Driven DiscoveryResearch Automation for Data-Driven Discovery
Research Automation for Data-Driven Discovery
 
Scaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and JupyterScaling collaborative data science with Globus and Jupyter
Scaling collaborative data science with Globus and Jupyter
 
Team Argon Summary
Team Argon SummaryTeam Argon Summary
Team Argon Summary
 
Thoughts on interoperability
Thoughts on interoperabilityThoughts on interoperability
Thoughts on interoperability
 
NIH Data Commons Architecture Ideas
NIH Data Commons Architecture IdeasNIH Data Commons Architecture Ideas
NIH Data Commons Architecture Ideas
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...Computing Just What You Need: Online Data Analysis and Reduction  at Extreme ...
Computing Just What You Need: Online Data Analysis and Reduction at Extreme ...
 
Software Infrastructure for a National Research Platform
Software Infrastructure for a National Research PlatformSoftware Infrastructure for a National Research Platform
Software Infrastructure for a National Research Platform
 
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
 
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
Materials Data Facility: Streamlined and automated data sharing,  discovery, ...Materials Data Facility: Streamlined and automated data sharing,  discovery, ...
Materials Data Facility: Streamlined and automated data sharing, discovery, ...
 

Kürzlich hochgeladen

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 

Accelerating Data-driven Discovery in Energy Science

  • 1. Accelerating data-driven discovery in energy science. Ian Foster, Distinguished Fellow
  • 2. New tools are needed to answer the most pressing scientific questions. Domains: Life Sciences and Biology; Advanced Materials; Condensed Matter Physics; Chemistry and Catalysis; Soft Materials; Environmental and Geo Sciences. Questions: Can we determine pathways that lead to novel states and nonequilibrium assemblies? Can we observe – and control – nanoscale chemical transformations in macroscopic systems? Can we create new materials with extraordinary properties – by engineering defects at the atomic scale? Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems? Can we unravel the secrets of biological function – across length scales? Can we understand physical and chemical processes in the most extreme environments?
  • 3. The resulting data deluge spans biology, climate, cosmology, materials, physics, urban sciences, … Simulation data: petascale to exascale simulations; simulation datasets as laboratories; high-throughput characterization; etc. Experimental data: light sources, genome sequencing, next-gen ARM radar, sky surveys, high-throughput experiments, etc. New research methods depend on coupling 1) of computation and experiment (inverse problems, computer control) and 2) across data sources and types (knowledge integration, analysis).
  • 4. Scientific progress requires collaborative discovery engines. [Diagram (Rick Stevens): problem specification; modeling and simulation; analysis & visualization; experimental design; high-throughput experiments; informatics analysis; integrated databases]
  • 5. Example: A discovery engine for disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. [Diagram labels: sample; experimental scattering; material composition (La 60%, Sr 40%); simulated structure; simulated scattering; knowledge base (past experiments; simulations; literature; expert knowledge). Loop: detect errors (secs to mins); select experiments (mins to hours); simulations driven by experiments (mins to days); contribute to knowledge base; knowledge-driven decision making; evolutionary optimization]
  • 6. Accelerating data-driven discovery in energy science (1) Eliminate data friction
  • 7. Eliminating data friction is essential to modern science. “Civilization advances by extending the number of important operations which we can perform without thinking about them” (Whitehead, 1912). Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015).
  • 8. Software as a service (SaaS) as lubricant. Customer relationship management (CRM) is a knowledge-intensive process. Historically, it was handled manually or via expensive, inflexible on-premise software. SaaS has revolutionized how CRM is consumed: outsource to a provider who runs the software on the cloud; access via simple interfaces. [Chart: SaaS vs. on-premise, compared on ease of use, cost, and flexibility]
  • 9. Globus: Research data management as a service (globus.org). Essential research data management services: file transfer; data sharing; data publication; identity and groups. Builds on 15 years of DOE research. Outsourced and automated: high availability, reliability, performance, scalability. Convenient for casual users (web interfaces), power users (APIs), and administrators (install, manage).
  • 10. “I need to easily, quickly, & reliably move data to other locations.” [Diagram locations: research computing HPC cluster; lab server; campus home filesystem; desktop workstation; personal laptop; DOE supercomputer; public cloud]
  • 11. “I need to get data from a scientific instrument to my analysis system.” [Instruments shown: next-gen sequencer; light sheet microscope; MRI; Advanced Light Source]
  • 12. “I need to easily and securely share my data with my colleagues.”
  • 13. Globus and the research data lifecycle (Transfer, Share, Publish, Discover). (1) Researcher initiates transfer request; or requested automatically by script, science gateway. (2) Globus transfers files reliably, securely. (3) Researcher selects files to share, selects user or group, and sets access permissions. (4) Globus controls access to shared files on existing storage; no need to move files to cloud storage! (5) Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus. (6) Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific). (7) Curator reviews and approves; data set published on campus or other system. (8) Peers, collaborators search and discover datasets; transfer and share using Globus. [Diagram locations: instrument; compute facility; personal computer; publication repository] SaaS: only a web browser required. Use storage system of your choice. Access using your campus credentials.
  • 14. Globus at a glance: 4 major services; 13 national labs use Globus services; 100 petabytes transferred; 8,000 active endpoints; 20 billion files processed; >300 users are active daily; 25,000 registered users; 99.95% uptime over the past two years; >30 subscribers. The biggest transfer to date is 1 petabyte. The longest-running transfer to date took 3 months. We’re eager to learn what you want to do with Globus services.
  • 15. One APS node connects to 125 locations through mid-2014
  • 18. Globus and DOE: Running total terabytes
  • 19. Globus and DOE: Active users per month
  • 20. Response has been gratifying: "Really great software." - Benjamin Mayer, Research Associate, Climate Change Science Institute, Oak Ridge National Laboratory "Whoa! Transfer from NERSC to BNOC (data transfer node) using Globus is screaming!" - Gary Bates, Professional Research Assistant, NOAA "…Now my users have a fast, easy way to get their data wherever it needs to go, and the setup process was trivial." - Brock Palen, Associate Director, University of Michigan Advanced Research Computing "... we just had a 153TB transfer that got 20Gb/s and another with 144TB at 25Gb/s! That's pretty insane!" - Jason Alt, Systems Management and Development Lead at National Center for Supercomputing Applications "We were thrilled by how well Globus worked. We've never seen such high transfer rates, and the service was trivial to install and use." - Dale Land, IT Chief Engineer, Los Alamos National Laboratory "The system is reliable and secure - and also amazingly easy to use. …It just works." - David Skinner, NERSC user "I moved 400 GB of files and didn’t even have to think about it." - Jeff Porter, STAR Experiment, Lawrence Berkeley National Lab "We have been extremely impressed with Globus and how easy it is to use." - Pete Eby, Linux System Administrator, Oak Ridge National Laboratory "Drag and drop archiving is an incredibly useful feature." - Shreyas Cholia, NERSC user "The time before Globus now seems like the dark ages!" - Galen Arnold, Systems Engineer, NCSA and Blue Waters PRAC support team, NCSA
  • 21. Globus service APIs serve as a science platform. [Diagram layers: Identity, Group, and Profile Management; File Transfer & Replication; File Sharing; Data Publication & Discovery; …; Globus APIs; Globus Connect; Globus Toolkit]
  • 22. Globus platform services enable new application capabilities
  • 25. Operating a sustainable service. Globus is a not-for-profit service for researchers. We adopt a subscription-supported freemium model: subscribers get extra features and rapid support. We’re engaged in crossing the chasm. Support from DOE will contribute to long-term success.
  • 26. Accelerating data-driven discovery in energy science (2) Liberate scientific data
  • 27. Q: What is the biggest obstacle to data sharing in science? A: The vast majority of data is lost or not online; if online, not described; if described, not indexed. So it is not accessible, not discoverable, not used. Contrast with common practice for consumer photos (iPhoto): automated capture; publish then curate; processing to add value; outsourced storage.
  • 28. We must automate the capture, linking, and indexing of all data. The Globus publication service encodes and automates data publication pipelines. Example application: Materials Data Facility for materials simulation and experiment data. Proposed distributed virtual collections index, organize, tag, & manage distributed data. Think iPhoto on steroids, backed by domain knowledge and supercomputing power.
  • 29. We must automate the capture, linking, and indexing of all data. chiDB: human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature (R. Tchoua et al.). Plenario: spatially and temporally integrated, linked, and searchable database of urban data (C. Catlett, B. Goldstein, T. Malik et al.).
  • 30. “I need to publish my data so that others can find it and use it.” [Diagram labels: scholarly publication; reference dataset; research community collaboration]
  • 32. Start a new submission
  • 38. Submission now in curation workflow
  • 41. Discover a published dataset
  • 42. Select a published dataset
  • 44. Configuring a publication pipeline: Publication “facets”. Identifier: URL, Handle, DOI. Description: none, standard, custom, domain-specific. Curation: none, acceptance, machine-validated, human-validated. Access: anonymous, public, collaborators, embargoed. Preservation: transient, project lifetime, “forever”, archive.
  • 45. Accelerating data-driven discovery in energy science (3) Create discovery engines at DOE facilities
  • 46. Recall: A discovery engine for disordered structures. Diffuse scattering images from Ray Osborn et al., Argonne. [Diagram labels: sample; experimental scattering; material composition (La 60%, Sr 40%); simulated structure; simulated scattering; knowledge base (past experiments; simulations; literature; expert knowledge). Loop: detect errors (secs to mins); select experiments (mins to hours); simulations driven by experiments (mins to days); contribute to knowledge base; knowledge-driven decision making; evolutionary optimization]
  • 47. Towards discovery engines for energy science (Argonne LDRD). Scientific opportunities: probe material structure and function at unprecedented scales. Technical challenges: many experimental modalities; data rates and computation needs vary widely and are increasing; knowledge management, integration, synthesis. [Diagram labels: Simulation (characterize, predict); assimilate; steer data acquisition; Data analysis (reconstruct, detect features, auto-correlate, particle distributions, …); Science automation services (scripting, security, storage, cataloging, transfer); Integration (optimize, fit, …); configure; check; guide. Data rates: ~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (Today: x10 in 5 yrs). Compute axis from batch to immediate, 0.001 / 1 / 100+ PFlops: precompute material database; reconstruct image; auto-correlation; feature detection]
  • 48. Linking experiment and computation (Advanced Photon Source; beamlines 2-BM, 1-ID, 6-ID). Single-crystal diffuse scattering: defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP). Near-field high-energy X-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime. X-ray nano/microtomography: bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response. [Figure captions: microstructure of a copper wire, 0.2 mm diameter; experimental and simulated scattering from manganite. Diagram labels: Populate; Sim; Select]
  • 49. Tying it all together: An energy sciences infrastructure. Steps: 0: Develop or reuse script; 1: Run script (EL1.layer); 2: Lookup file (name=EL1.layer, user=Anton, type=reconstruction); 3: Transfer inputs; 4: Run app; 5: Transfer results; 6: Update catalogs. [Diagram components: researchers; script libraries; storage locations; compute facilities; collaboration catalogs (provenance; files & metadata); external collaborators]
  • 50. Summary: Big opportunities and challenges for energy data. Immediate opportunities: reduce data friction and accelerate discovery by deploying Globus services across all DOE facilities; develop new services to capture, link energy data. Important research agenda: discovery engines to answer major scientific questions; new research modalities linking computation and data; organization and analysis of massive science data. [Diagram labels: problem specification; modeling and simulation; analysis & visualization; experimental design; high-throughput experiments; informatics analysis; integrated databases]
  • 51. Thank you to our sponsors! U.S. DEPARTMENT OF ENERGY
  • 52. For more information: foster@anl.gov. Thanks to co-authors and Globus team. Globus services (globus.org): Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011. Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014. Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014. Publication (globus.org/data-publication): Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015. Discovery engines: Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
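
The publication “facets” on slide 44 read like a small configuration schema, so a rough sketch may help make them concrete. The Python below assumes one choice per facet and assumes the grouping of option labels as recovered from the flattened slide text; the class names (`PublicationPipeline`, `Identifier`, and so on) are hypothetical and are not the Globus publication API.

```python
from dataclasses import dataclass, fields
from enum import Enum


# Option labels are taken from slide 44; their grouping per facet is an assumption.
class Identifier(Enum):
    URL = "URL"
    HANDLE = "Handle"
    DOI = "DOI"


class Description(Enum):
    NONE = "none"
    STANDARD = "standard"
    CUSTOM = "custom"
    DOMAIN_SPECIFIC = "domain-specific"


class Curation(Enum):
    NONE = "none"
    ACCEPTANCE = "acceptance"
    MACHINE_VALIDATED = "machine-validated"
    HUMAN_VALIDATED = "human-validated"


class Access(Enum):
    ANONYMOUS = "anonymous"
    PUBLIC = "public"
    COLLABORATORS = "collaborators"
    EMBARGOED = "embargoed"


class Preservation(Enum):
    TRANSIENT = "transient"
    PROJECT_LIFETIME = "project lifetime"
    FOREVER = "forever"
    ARCHIVE = "archive"


@dataclass(frozen=True)
class PublicationPipeline:
    """Hypothetical pipeline configuration: exactly one option per facet."""
    identifier: Identifier
    description: Description
    curation: Curation
    access: Access
    preservation: Preservation

    def summary(self) -> str:
        """Render the chosen option for each facet, in declaration order."""
        return "; ".join(
            f"{f.name}={getattr(self, f.name).value}" for f in fields(self)
        )


# An illustrative configuration, not one documented in the slides.
example = PublicationPipeline(
    identifier=Identifier.DOI,
    description=Description.DOMAIN_SPECIFIC,
    curation=Curation.HUMAN_VALIDATED,
    access=Access.COLLABORATORS,
    preservation=Preservation.ARCHIVE,
)
print(example.summary())
```

Modeling each facet as an enum means an unsupported option fails at construction time rather than partway through a submission, which seems in keeping with the idea of a collection-defined pipeline.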

Editor's notes

  1. One useful thing. One exciting initiative.
  2. New tools are needed to answer the most pressing scientific questions
  3. Accelerate “knowledge turns.” Unleash the 99% of not-easily-accessible data. Integrate data and computation.
  4. Fix IMAGE “Most of materials science is bottlenecked by disordered structures”—Littlewood. Solve inverse problem. How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base. Challenge: takes months to do a single loop through cycle. Just as important, it is an incredibly labor intensive and expensive process.
  5. Change to OLCF, NERSC
  6. And sys admin response? Other potential quotes: "Thanks to Globus it's easy for our users to move big files. ...Globus is an awesome tool that is really helping our user community." - John Hanks, Senior HPC Analyst, University of Colorado "I am very impressed - Globus is the most beneficial grid technology I have ever seen." - Steven Gottlieb, Indiana University "With Globus, I’m averaging 40 Mb/s and can even reach 400 Mb/s on occasion – that’s insanely fast!" - Luke Van Roekel, University of Colorado "I have been very impressed with Globus - both the speed and ease of use." - William Daughton, Los Alamos National Laboratory – User of the Month, May 2012 "Without a service such as Globus it would have been basically impossible to move this large amount of data." - Katrin Heitmann, LANL and Argonne National Laboratory "The biggest benefit to Globus by far is the auto performance tuning. … Globus is an invaluable tool to me." - Luke Van Roekel, University of Colorado
  7. Highlight XSEDE’s planned adoption of user, group and profile management
  8. RDA: outsource data sharing and transfer
  9. kBase: Outsource identity and group management
  10. The publish dashboard shows all current submissions at any stage of the submission workflow. Here users can view accepted submissions, see a list of all submissions currently in the curation process, view/edit their unfinished submissions, and start a new submission. "The Scientist" will now start a new submission.
  11. The first step of submission is to select a collection. In this case "The Scientist" selects the “Center for Nanoscale Materials”, as this is the department through which he conducted his research. Note: "The Scientist" can only see collections he is allowed to publish to.
  12. "The Scientist" must first describe the dataset he is publishing. There are two types of metadata required for submission to the CNM collection: 1) Dublin core and 2) scientific metadata. These metadata requirements are defined by the collection and can be configured depending on the domain. Additional pages can also be defined. Here, "The Scientist" enters information about the Authors, their ORCID (a unique researcher identity), the submission title, the date of publication, the accompanying publication to which this dataset is related, and the DOI for that publication. Note: "The Scientist" has missed an ORCID for one of his co-authors.
  13. The second type of metadata required by the CNM relates to the materials science research at the Advanced Photon Source. Here, "The Scientist" enters information such as keywords describing the dataset, information about the sponsors who funded this research, a description of the dataset, the experiment name, the materials analyzed in this dataset, the energy density of the materials (this is important for research into battery development) and the Argonne General User Proposal (GUP) number. The GUP number is a unique identifier for all beam time allocations at the APS and is used by administrators to associate researchers, experiments, and allocations. All of this entered information can be subsequently used by other researchers with appropriate access to discover this dataset.
  14. Having described the dataset, "The Scientist" must now assemble the dataset. To do so, he first chooses to select the files to be published.
  15. Using the familiar Globus interface, "The Scientist" is able to select files from multiple sources and transfer them to his unique submission endpoint (publish#submission_11). This submission endpoint is created on shared Argonne storage resources, but is initially accessible only to "The Scientist". The dataset may be assembled over any period of time. "The Scientist" can create new files and folders on the endpoint and he can arrange these files in any hierarchy. At the completion of the submission the permissions on the endpoint will be changed such that the dataset is immutable. "The Scientist" will be given read access to the dataset; collection curators will also be given read access to the data so that they can view the contents.
  16. When "The Scientist" is happy with his assembled dataset, he can return to the publication workflow. Here, he sees a summary of the dataset and may confirm the correct file sizes and names are associated. The system attempts to determine the file types for each of the dataset’s files. "The Scientist" can choose to edit, remove or add files if necessary.
  17. When submitted, the dataset enters a pre-determined curation workflow. "The Scientist" can check the progress of the submission through his dashboard. If any further attention is required, it will be displayed through his dashboard.
  18. “The Researcher” chooses to search for all published data in the CNM collection. The results show a brief summary of each published dataset including information about the publication time, collection, summary of number of files, name, authors, description and a set of keyword tags as well as key-value tags. Each of these fields can be used to search for a particular dataset.
  19. Knowing that other collections may well have datasets of interest , “The Researcher” may broaden the search context to all accessible collections and search for datasets related to “Li-ion” and “autonomic”. Here, the results show datasets from 2 collections: the CNM and the Chemical Sciences and Engineering collection (red boxes). Results are ranked according to their relevance to the search.
  20. Going further, “The Researcher” can use different queries such as key-value and ranges. In this case, “The Researcher” searchers for energy density > 1500 and microcapsules, and finds the dataset previously published in this demo with an associated key-value pair of energy-density:2000 that fits the range query criteria.
  21. Having found the desired published dataset, “The Researcher” can navigate to the summary page.
  22. Finally, “The Researcher” can view the downloaded dataset on their desktop PC.
  23. Description: another aspect - general metadata (Dublin Core) and scientific metadata Curation: another aspect – self, project owner, librarian
  24. Fix IMAGE “Most of materials science is bottlenecked by disordered structures”—Littlewood. Solve inverse problem. How do we make this sort of application routine? Allow thousands—millions?—to contribute to the knowledge base. Challenge: takes months to do a single loop through cycle. Just as important, it is an incredibly labor intensive and expensive process.
  25. Add CNM - Innovative in-slice parallelization method permits reconstruction of 720x2160x2560 dataset (7-BM) in less than 3 minutes (for each iteration), using 34K BG/Q cores, vs. many days on typical cluster. Innovative in-slice parallelization for iterative algorithms permits large-scale image reconstruction. Execution times are reduced to minutes for many large datasets and algorithms using 32K BG/Q cores, vs. many days on typical cluster.
  26. This diagram shows the CMTS project’s vision for a cyberinfrastructure that we believe will further enhance our scientific productivity. It shows the major components of the CMTS cyberinfrastructure we are integrating. Here’s how CMTS will use it and how it will help them. Step 0 (Develop script): A CMTS researcher will go to the project script library to locate existing scripts to run, or to find components from which to compose or adapt a new script. The library will point to script codes managed by revision control systems like Git or Subversion, and the many public and private servers that host these services. The Swift parallel scripting language is central to our approach, because it imparts a uniform, high-level interface to script components, and it’s implicitly parallel. Step 1 (Run script): When Swift runs your script, it automatically handles several previously difficult aspects of script development and execution for the scientist: 1) automatically manages parallel execution (dataflow, throttling, etc.); 2) abstracts the interfaces to diverse and distributed clusters; 3) automates data transfer; 4) automatically records provenance; 5) retries failing application runs. All these would otherwise have to be programmed manually, if they were to be done at all. Step 2 (Locate input file locations via dataset catalog): CMTS scripts will locate major datasets through a networked catalog that enables scripts to be written with no dependencies on where datasets are located or replicated. Step 3 (Transfer inputs): Swift will automatically transport input datasets to the selected computational resource for an application run (if needed). Step 4 (Run app): Swift will then run the application, retrying failures (if requested) and recording a “provenance log” that traces where the app ran, with what runtime and memory usage, and with what arguments and environmental settings. Step 5 (Transfer results): Swift will automatically transport output datasets to the selected archival or temporary storage resource (if needed). Step 6 (Update catalogs): …and it will update CMTS collaboration catalogs with new dataset locations, derived metadata annotations on those datasets, and the provenance of the data and runs. Collaborate! (2 clicks): All of this facilitates collaboration, both by project team members and by external collaborators, whether across the hall or across the world.
  27. Talk about the Globus as being part of UChicago + ANL, as well as other context setting about how this work came about and is funded
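
The combined keyword and key-value range query described in note 20 can be illustrated with a small in-memory sketch. This is only an illustration under stated assumptions: the `Dataset` record, the `search` function, and the dataset titles are hypothetical, and the notes do not show the real service's query syntax. Only the keyword "microcapsules", the threshold 1500, and the pair energy-density:2000 come from the notes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Dataset:
    """Hypothetical record: a title, keyword tags, and numeric key-value tags."""
    title: str
    keywords: Set[str] = field(default_factory=set)
    key_values: Dict[str, float] = field(default_factory=dict)


def search(datasets: List[Dataset], keyword: str, key: str,
           greater_than: float) -> List[Dataset]:
    """Return datasets tagged with `keyword` whose `key` value exceeds `greater_than`."""
    wanted = keyword.lower()
    hits = []
    for ds in datasets:
        has_keyword = wanted in {k.lower() for k in ds.keywords}
        value = ds.key_values.get(key)  # None when the dataset lacks this key
        if has_keyword and value is not None and value > greater_than:
            hits.append(ds)
    return hits


# Illustrative records only; titles and the two non-matching values are made up.
catalog = [
    Dataset("demo-submission", {"microcapsules", "Li-ion"}, {"energy-density": 2000}),
    Dataset("low-density-sample", {"microcapsules"}, {"energy-density": 1200}),
    Dataset("unrelated-sample", {"autonomic"}, {"energy-density": 1800}),
]

matches = search(catalog, keyword="microcapsules", key="energy-density",
                 greater_than=1500)
print([ds.title for ds in matches])  # prints ['demo-submission']
```

The comparison is strict, matching the "> 1500" in the note, so a dataset tagged with exactly the threshold value would not be returned.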