Slide 2 (computationinstitute.org)
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
• Kerstin Kleese van Dam, Carina Lansing, and others at PNNL
Slide 20
Automation and outsourcing are key
• Automation is required to apply more sophisticated methods to far more data
• Outsourcing is needed to achieve economies of scale in the use of automated methods
Slide 21
Building a discovery cloud: Research strategy
• Identify a time-consuming activity that appears amenable to automation and outsourcing
• Implement the activity as a high-quality, low-touch SaaS (software as a service) solution, leveraging commercial IaaS (infrastructure as a service) for high reliability and economies of scale
• Evaluate
• Extract common elements as a research automation platform (platform as a service)
• Repeat
Bonus question: identify methods for delivering SaaS solutions sustainably
Slide 25
Sharing data from a data source:
1. User A selects file(s) to share, selects a user/group, and sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses the shared file
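The three-step flow can be sketched in code. This is an illustrative in-memory model with hypothetical names (`SharingService`, `share`, `access`), not the actual Globus Online implementation; the property it illustrates is that only *references* to shared files are tracked, while the bytes stay at the data source.

```python
# Illustrative in-memory model of the three-step sharing flow.
# All names are hypothetical; the real service mediates this remotely.

class SharingService:
    def __init__(self):
        # path -> {grantee: permission}; a reference table, not a file store
        self.shares = {}

    def share(self, owner, path, grantee, permission="read"):
        # Step 1: User A selects file(s), a user/group, and permissions.
        self.shares.setdefault(path, {})[grantee] = permission

    def access(self, user, path):
        # Step 3: User B logs in and accesses the shared file.
        # Step 2 is implicit: we look up a tracked reference, not a copy.
        perm = self.shares.get(path, {}).get(user)
        if perm is None:
            raise PermissionError(f"{user} has no access to {path}")
        return f"read {path} directly from the data source ({perm})"

svc = SharingService()
svc.share("userA", "/data/results.h5", "userB")
print(svc.access("userB", "/data/results.h5"))
```

Because only the reference table is stored, sharing a petabyte-scale file costs no more than sharing a small one; access control is enforced at lookup time.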
Slide 26
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi-User install
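As one illustration, “reliability via transfer retries” can be shown with a small generic retry wrapper. This is a sketch of the general technique, not the service’s actual retry policy; the function and parameter names are made up.

```python
import time

def with_retries(operation, max_attempts=3, backoff_s=0.0):
    """Retry a flaky operation, sleeping between attempts.
    Generic sketch of 'reliability via transfer retries'."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the error
            time.sleep(backoff_s * attempt)  # linear backoff

# Simulated transfer that fails twice before succeeding.
attempts = {"n": 0}

def flaky_transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("connection reset")
    return "transfer complete"

print(with_retries(flaky_transfer))  # prints "transfer complete"
```

A production service would add exponential backoff, checksum verification, and restart from the last verified byte rather than from scratch.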
Slide 30
We benefit greatly from ESnet’s “Science DMZ”
Three key components, all required:
• “Friction free” network path
– Highly capable network devices (wire-speed, deep queues)
– Virtual circuit connectivity option
– Security policy and enforcement specific to science workflows
– Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
– Hardware, operating system, libraries optimized for transfer
– Optimized data transfer tools: Globus Online, GridFTP
• Performance measurement/test node
– perfSONAR
Details at http://fasterdata.es.net/science-dmz/
Slide 42
We are also adding capabilities
[Stack diagram: Globus Connect and the Globus Online APIs sit atop the Transfer Service, the Sharing Service, and Globus Nexus (identity, group, profile), all built on the Globus Toolkit]
Slide 44
Expanding Globus Online services
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, and converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
– Associate computational procedures, orchestrate applications, catalog results, record provenance
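The ingest-and-publication idea (replicate, then extract metadata, catalog, convert) can be sketched as a pipeline of pluggable stages. Every name below is a hypothetical stand-in for a real extractor or converter.

```python
# Hypothetical ingest pipeline: each stage receives a record describing
# an ingested file and enriches it, mirroring extract/catalog/convert.

def extract_metadata(record):
    # Stand-in for real extractors (e.g. reading HDF or CDF headers).
    record["metadata"] = {"size_bytes": len(record["raw"])}
    return record

def catalog(record, index):
    # Register the extracted metadata in a searchable index.
    index[record["name"]] = record["metadata"]
    return record

def convert(record):
    # Stand-in conversion: normalize raw text to upper case.
    record["converted"] = record["raw"].upper()
    return record

def ingest(record, index):
    record = extract_metadata(record)
    record = catalog(record, index)
    return convert(record)

index = {}
out = ingest({"name": "run42.txt", "raw": "ok"}, index)
print(index)              # {'run42.txt': {'size_bytes': 2}}
print(out["converted"])   # OK
```

The point of the stage structure is that extractors and converters for new formats can be added without changing the pipeline itself.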
Slide 45
Looking deeply at how researchers use data
• A single research question often requires the integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways
• The best grouping can vary during an investigation
– Longitudinal, vertical, cross-cutting
• But the data always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …
Slide 46
How do we manage data today?
• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks
• Time-consuming, complex, error-prone
Why can’t we manage our data like we manage our pictures and music?
Slide 47
Slide 48
Introducing the dataset
• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content…
– Capture as much existing information as we can
• …or that reflect current status in the investigation
– Stage of processing, provenance, validation, …
• Share datasets for collaboration
– Control access to data and metadata
• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
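A minimal sketch of the dataset abstraction described above, assuming a simple (location, path) reference model and a flat tag dictionary; both are hypothetical simplifications, not the actual Globus design.

```python
class Dataset:
    """Logical grouping of data by use, not location (illustrative).
    Members are (location, path) references; the bytes stay where they
    are. Tags capture content characteristics or investigation status."""

    def __init__(self, name):
        self.name = name
        self.members = []   # (location, path) references
        self.tags = {}      # tag name -> value

    def add(self, location, path):
        self.members.append((location, path))

    def tag(self, name, value):
        self.tags[name] = value

    def apply(self, operation):
        # Operate on the dataset as a unit: share, copy, archive, ...
        return [operation(loc, path) for loc, path in self.members]

ds = Dataset("tomography-study")
ds.add("anl#aps", "/scans/s1.h5")        # hypothetical endpoint names
ds.add("pnnl#emsl", "/sim/model.cdf")    # different location and format
ds.tag("stage", "reconstructed")
print(ds.apply(lambda loc, p: f"archive {loc}:{p}"))
```

Because members are references, regrouping data mid-investigation (longitudinal, vertical, cross-cutting) is just building new `Dataset` objects over the same underlying files.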
Slide 49
Builds on catalog as a service
Approach:
• Hosted, user-defined catalogs
• Based on the tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs:
• /query/ – retrieve subjects
• /tags/ – create, delete, retrieve tags
• /tagdef/ – create, delete, retrieve tag definitions
Builds on the USC Tagfiler project (C. Kesselman et al.)
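The <subject, name, value> tag model and the three REST APIs can be mirrored by a small in-memory sketch. The class and method names are hypothetical; the real service exposes the corresponding operations over HTTP.

```python
# In-memory model of the catalog's tag triples <subject, name, value>,
# mirroring the three REST APIs: /tagdef/ defines tags (with an optional
# type constraint), /tags/ creates them, /query/ retrieves subjects.

class Catalog:
    def __init__(self):
        self.tagdefs = {}   # tag name -> required value type (or None)
        self.tags = []      # list of (subject, name, value) triples

    def tagdef(self, name, value_type=None):
        # /tagdef/: create a tag definition, optionally constrained.
        self.tagdefs[name] = value_type

    def tag(self, subject, name, value):
        # /tags/: create a tag, enforcing any schema constraint.
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag '{name}' expects {expected.__name__}")
        self.tags.append((subject, name, value))

    def query(self, name, value):
        # /query/: retrieve subjects matching a name/value pair.
        return [s for s, n, v in self.tags if n == name and v == value]

cat = Catalog()
cat.tagdef("beamline", str)          # optional schema constraint
cat.tag("scan-001", "beamline", "2-BM-B")
cat.tag("scan-002", "beamline", "32-ID-C")
print(cat.query("beamline", "2-BM-B"))   # ['scan-001']
```

The triple model keeps catalogs schema-light: a constraint exists only where a `tagdef` declares one, so users can tag first and formalize later.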
Slide 50
[Figure: Multi-scale imaging at the APS. Two parallel pipelines feed storage and a common fusion step:
• Beamline 2-BM-B (~1.5 um resolution; up to 100 fps; 2K x 2K, 16 bits; 11 GB raw data) → image processing (noise removal, etc.) → tomographic reconstruction → visual inspection → selection
• Beamline 32-ID-C (20-50 nm resolution; 1,500 fps; 2K x 2K, 16 bits; 1 min readout; 11 GB raw data) → image processing (noise removal, etc.) → tomographic reconstruction → visual inspection → selection
Selected regions from both scales undergo multi-scale image fusion and a final visual inspection.]
Slide 58
Building a discovery cloud: Research strategy
• Identify a time-consuming activity that appears amenable to automation and outsourcing
• Implement the activity as a high-quality, low-touch SaaS (software as a service) solution, leveraging commercial IaaS (infrastructure as a service) for high reliability and economies of scale
• Evaluate
• Extract common elements as a research automation platform (platform as a service)
• Repeat
Bonus question: identify methods for delivering SaaS solutions sustainably
Slide 61
Provider Plans offer… (starting at $20k per year)
• Provider endpoints with sharing
• Multiple GridFTP servers per endpoint
• Branded web sites
• Alternate identity providers
• Usage reporting
• MSS optimizations
• Operations monitoring and management
• Input into and access to the product roadmap
Slide 62
“Science as a service”: our vision for a 21st-century discovery infrastructure
To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources
Slide 63
It’s a time of great opportunity … to develop and apply Science aaS
[Stack diagram: Globus Connect and the Globus Online APIs sit atop the Transfer Service, the Sharing Service, the Dataset Services, Globus Nexus (identity, group, profile), and other services (…), all built on the Globus Toolkit]
Slide 64
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
• Kerstin Kleese van Dam, Carina Lansing, and others at PNNL
Speaker notes:
The Computation Institute (or CI) is a joint initiative between UChicago and Argonne National Lab: a place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently, we’ve been talking about it as the home of the research cloud, and I’ll describe what we mean by that throughout this talk.
Here are some of the areas where we have active projects. Focus on areas of particular interest to I2/ESnet, namely HEP, climate change, and genomics (up and coming).
And the reason is pretty obvious. This chart and others like it are becoming a cliché in next-gen sequencing and big-data presentations, but the point is that while Moore’s law translates to roughly a 10x increase in processor power, data volumes are growing many orders of magnitude faster. Meanwhile, other necessary resources [money, people] are staying pretty flat. So we have a crisis, and we hear that the magic bullet of “the cloud” is going to solve it. Well, as far as cost goes, clouds are helping, but many issues remain.
173 TB/day
Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what’s notable about these examples? It’s the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10 PB of spinning disk last year, and that it’s not a big deal. To a select few, these numbers are routine. And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense: sustained, multi-year efforts, and application-specific solutions built mostly on common/homogeneous technology platforms.
The point is, the 1% of projects are in good shape
But what about the 99%? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don’t have the resources to deal with these challenges, so their research suffers, and over time many may become irrelevant. So at the CI we asked ourselves a question (many questions, actually) about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
We can’t just expect to throw more people and money at the problem; we’re already seeing the limits.
Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data: not just the 2 GB of PowerPoints or the 100 GB of family photos and videos, but the petabytes and exabytes of data that will soon be the norm for many.
So how would such a Dropbox for science be used? Let’s look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI, or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user, so the data is typically moved from a staging area to some type of ingest store. Et cetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
We started with the seemingly simple/mundane task of transferring files, etc.
And when we spoke with IT folks at various research communities, they insisted that some things were not up for negotiation.
This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF). The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.