Slide 2 (computationinstitute.org)
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
• Kerstin Kleese van Dam, Carina Lansing, and others at PNNL
Slide 20
Automation and outsourcing are key
• Automation is required to apply more sophisticated methods to far more data
• Outsourcing is needed to achieve economies of scale in the use of automated methods
Slide 21
Building a discovery cloud: Research strategy
• Identify a time-consuming activity that appears amenable to automation and outsourcing
• Implement the activity as a high-quality, low-touch SaaS (software as a service) solution, leveraging commercial IaaS (infrastructure as a service) for high reliability and economies of scale
• Evaluate
• Extract common elements as a research automation platform (platform as a service)
• Repeat
Bonus question: identify methods for delivering SaaS solutions sustainably
Slide 25
Sharing data from a data source:
1. User A selects file(s) to share, selects a user/group, and sets share permissions
2. Globus Online tracks shared files; no need to move files to cloud storage!
3. User B logs in to Globus Online and accesses the shared file
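The three-step flow can be sketched in code. This is an illustrative in-memory model with hypothetical names (`SharingService`, `share`, `access`), not the actual Globus Online implementation; the property it illustrates is that only *references* to shared files are tracked, while the bytes stay at the data source.

```python
# Illustrative in-memory model of the three-step sharing flow.
# All names are hypothetical; the real service mediates this remotely.

class SharingService:
    def __init__(self):
        # path -> {grantee: permission}; a reference table, not a file store
        self.shares = {}

    def share(self, owner, path, grantee, permission="read"):
        # Step 1: User A selects file(s), a user/group, and permissions.
        self.shares.setdefault(path, {})[grantee] = permission

    def access(self, user, path):
        # Step 3: User B logs in and accesses the shared file.
        # Step 2 is implicit: we look up a tracked reference, not a copy.
        perm = self.shares.get(path, {}).get(user)
        if perm is None:
            raise PermissionError(f"{user} has no access to {path}")
        return f"read {path} directly from the data source ({perm})"

svc = SharingService()
svc.share("userA", "/data/results.h5", "userB")
print(svc.access("userB", "/data/results.h5"))
```

Because only the reference table is stored, sharing a petabyte-scale file costs no more than sharing a small one; access control is enforced at lookup time.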
Slide 26
Extreme ease of use
• InCommon, OAuth, OpenID, X.509, …
• Credential management
• Group definition and management
• Transfer management and optimization
• Reliability via transfer retries
• Web interface, REST API, command line
• One-click “Globus Connect” install
• 5-minute Globus Connect Multi-User install
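As one illustration, “reliability via transfer retries” can be shown with a small generic retry wrapper. This is a sketch of the general technique, not the service’s actual retry policy; the function and parameter names are made up.

```python
import time

def with_retries(operation, max_attempts=3, backoff_s=0.0):
    """Retry a flaky operation, sleeping between attempts.
    Generic sketch of 'reliability via transfer retries'."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == max_attempts:
                raise  # retries exhausted; surface the error
            time.sleep(backoff_s * attempt)  # linear backoff

# Simulated transfer that fails twice before succeeding.
attempts = {"n": 0}

def flaky_transfer():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise OSError("connection reset")
    return "transfer complete"

print(with_retries(flaky_transfer))  # prints "transfer complete"
```

A production service would add exponential backoff, checksum verification, and restart from the last verified byte rather than from scratch.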
Slide 30
We benefit greatly from ESnet’s “Science DMZ”
Three key components, all required:
• “Friction free” network path
– Highly capable network devices (wire-speed, deep queues)
– Virtual circuit connectivity option
– Security policy and enforcement specific to science workflows
– Located at or near site perimeter if possible
• Dedicated, high-performance Data Transfer Nodes (DTNs)
– Hardware, operating system, libraries optimized for transfer
– Optimized data transfer tools: Globus Online, GridFTP
• Performance measurement/test node
– perfSONAR
Details at http://fasterdata.es.net/science-dmz/
Slide 42
We are also adding capabilities
[Stack diagram: Globus Connect and the Globus Online APIs sit atop the Transfer Service, the Sharing Service, and Globus Nexus (identity, group, profile), all built on the Globus Toolkit]
Slide 44
Expanding Globus Online services
• Ingest and publication
– Imagine a Dropbox that not only replicates, but also extracts metadata, catalogs, and converts
• Cataloging
– Virtual views of data based on user-defined and/or automatically extracted metadata
• Computation
– Associate computational procedures, orchestrate applications, catalog results, record provenance
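The ingest-and-publication idea (replicate, then extract metadata, catalog, convert) can be sketched as a pipeline of pluggable stages. Every name below is a hypothetical stand-in for a real extractor or converter.

```python
# Hypothetical ingest pipeline: each stage receives a record describing
# an ingested file and enriches it, mirroring extract/catalog/convert.

def extract_metadata(record):
    # Stand-in for real extractors (e.g. reading HDF or CDF headers).
    record["metadata"] = {"size_bytes": len(record["raw"])}
    return record

def catalog(record, index):
    # Register the extracted metadata in a searchable index.
    index[record["name"]] = record["metadata"]
    return record

def convert(record):
    # Stand-in conversion: normalize raw text to upper case.
    record["converted"] = record["raw"].upper()
    return record

def ingest(record, index):
    record = extract_metadata(record)
    record = catalog(record, index)
    return convert(record)

index = {}
out = ingest({"name": "run42.txt", "raw": "ok"}, index)
print(index)              # {'run42.txt': {'size_bytes': 2}}
print(out["converted"])   # OK
```

The point of the stage structure is that extractors and converters for new formats can be added without changing the pipeline itself.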
Slide 45
Looking deeply at how researchers use data
• A single research question often requires the integration of many data elements that are:
– In different locations
– In different formats (Excel, text, CDF, HDF, …)
– Described in different ways
• The best grouping can vary during an investigation
– Longitudinal, vertical, cross-cutting
• But the data always needs to be operated on as a unit
– Share, annotate, process, copy, archive, …
Slide 46
How do we manage data today?
• Often, a curious mix of ad hoc methods
– Organize in directories using file and directory naming conventions
– Capture status in README files, spreadsheets, notebooks
• Time-consuming, complex, error-prone
Why can’t we manage our data like we manage our pictures and music?
Slide 47
Slide 48
Introducing the dataset
• Group data based on use, not location
– Logical grouping to organize, reorganize, search, and describe usage
• Tag with characteristics that reflect content…
– Capture as much existing information as we can
• …or that reflect current status in the investigation
– Stage of processing, provenance, validation, …
• Share datasets for collaboration
– Control access to data and metadata
• Operate on datasets as units
– Copy, export, analyze, tag, archive, …
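A minimal sketch of the dataset abstraction described above, assuming a simple (location, path) reference model and a flat tag dictionary; both are hypothetical simplifications, not the actual Globus design.

```python
class Dataset:
    """Logical grouping of data by use, not location (illustrative).
    Members are (location, path) references; the bytes stay where they
    are. Tags capture content characteristics or investigation status."""

    def __init__(self, name):
        self.name = name
        self.members = []   # (location, path) references
        self.tags = {}      # tag name -> value

    def add(self, location, path):
        self.members.append((location, path))

    def tag(self, name, value):
        self.tags[name] = value

    def apply(self, operation):
        # Operate on the dataset as a unit: share, copy, archive, ...
        return [operation(loc, path) for loc, path in self.members]

ds = Dataset("tomography-study")
ds.add("anl#aps", "/scans/s1.h5")        # hypothetical endpoint names
ds.add("pnnl#emsl", "/sim/model.cdf")    # different location and format
ds.tag("stage", "reconstructed")
print(ds.apply(lambda loc, p: f"archive {loc}:{p}"))
```

Because members are references, regrouping data mid-investigation (longitudinal, vertical, cross-cutting) is just building new `Dataset` objects over the same underlying files.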
Slide 49
Builds on catalog as a service
Approach:
• Hosted, user-defined catalogs
• Based on the tag model <subject, name, value>
• Optional schema constraints
• Integrated with other Globus services
Three REST APIs:
• /query/ – retrieve subjects
• /tags/ – create, delete, retrieve tags
• /tagdef/ – create, delete, retrieve tag definitions
Builds on the USC Tagfiler project (C. Kesselman et al.)
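The <subject, name, value> tag model and the three REST APIs can be mirrored by a small in-memory sketch. The class and method names are hypothetical; the real service exposes the corresponding operations over HTTP.

```python
# In-memory model of the catalog's tag triples <subject, name, value>,
# mirroring the three REST APIs: /tagdef/ defines tags (with an optional
# type constraint), /tags/ creates them, /query/ retrieves subjects.

class Catalog:
    def __init__(self):
        self.tagdefs = {}   # tag name -> required value type (or None)
        self.tags = []      # list of (subject, name, value) triples

    def tagdef(self, name, value_type=None):
        # /tagdef/: create a tag definition, optionally constrained.
        self.tagdefs[name] = value_type

    def tag(self, subject, name, value):
        # /tags/: create a tag, enforcing any schema constraint.
        expected = self.tagdefs.get(name)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"tag '{name}' expects {expected.__name__}")
        self.tags.append((subject, name, value))

    def query(self, name, value):
        # /query/: retrieve subjects matching a name/value pair.
        return [s for s, n, v in self.tags if n == name and v == value]

cat = Catalog()
cat.tagdef("beamline", str)          # optional schema constraint
cat.tag("scan-001", "beamline", "2-BM-B")
cat.tag("scan-002", "beamline", "32-ID-C")
print(cat.query("beamline", "2-BM-B"))   # ['scan-001']
```

The triple model keeps catalogs schema-light: a constraint exists only where a `tagdef` declares one, so users can tag first and formalize later.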
Slide 50
[Figure: Multi-scale imaging at the APS. Two parallel pipelines feed storage and a common fusion step:
• Beamline 2-BM-B (~1.5 um resolution; up to 100 fps; 2K x 2K, 16 bits; 11 GB raw data) → image processing (noise removal, etc.) → tomographic reconstruction → visual inspection → selection
• Beamline 32-ID-C (20-50 nm resolution; 1,500 fps; 2K x 2K, 16 bits; 1 min readout; 11 GB raw data) → image processing (noise removal, etc.) → tomographic reconstruction → visual inspection → selection
Selected regions from both scales undergo multi-scale image fusion and a final visual inspection.]
Slide 58
Building a discovery cloud: Research strategy
• Identify a time-consuming activity that appears amenable to automation and outsourcing
• Implement the activity as a high-quality, low-touch SaaS (software as a service) solution, leveraging commercial IaaS (infrastructure as a service) for high reliability and economies of scale
• Evaluate
• Extract common elements as a research automation platform (platform as a service)
• Repeat
Bonus question: identify methods for delivering SaaS solutions sustainably
Slide 61
Provider Plans offer… (starting at $20k per year)
• Provider endpoints with sharing
• Multiple GridFTP servers per endpoint
• Branded web sites
• Alternate identity providers
• Usage reporting
• MSS optimizations
• Operations monitoring and management
• Input into and access to the product roadmap
Slide 62
“Science as a service”: our vision for a 21st-century discovery infrastructure
To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources
Slide 63
It’s a time of great opportunity … to develop and apply Science aaS
[Stack diagram: Globus Connect and the Globus Online APIs sit atop the Transfer Service, the Sharing Service, the Dataset Services, Globus Nexus (identity, group, profile), and other services (…), all built on the Globus Toolkit]
Slide 64
Thanks to great colleagues and collaborators
• Steve Tuecke, Rachana Ananthakrishnan, Kyle Chard, Raj Kettimuthu, Ravi Madduri, Tanu Malik, and many others at Argonne & UChicago
• Carl Kesselman, Karl Czajkowski, Rob Schuler, and others at USC/ISI
• Francesco de Carlo, Chris Jacobsen, and others at Argonne
• Kerstin Kleese van Dam, Carina Lansing, and others at PNNL
Speaker notes:
The Computation Institute (or CI) is a joint initiative between UChicago and Argonne National Lab: a place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently, we’ve been talking about it as the home of the research cloud, and I’ll describe what we mean by that throughout this talk.
Here are some of the areas where we have active projects. Focus on areas of particular interest to I2/ESnet, namely HEP, climate change, and genomics (up and coming).
And the reason is pretty obvious. This chart and others like it are becoming a cliché in next-gen sequencing and big-data presentations, but the point is that while Moore’s law translates to roughly a 10x increase in processor power, data volumes are growing many orders of magnitude faster. Meanwhile, other necessary resources [money, people] are staying pretty flat. So we have a crisis, and we hear that the magic bullet of “the cloud” is going to solve it. Well, as far as cost goes, clouds are helping, but many issues remain.
173 TB/day
Another example is the Earth System Grid, which provides data and tools to over 20,000 climate scientists around the world. So what’s notable about these examples? It’s the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us that the Broad Institute hit 10 PB of spinning disk last year, and that it’s not a big deal. To a select few, these numbers are routine. And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense: sustained, multi-year efforts, and application-specific solutions built mostly on common/homogeneous technology platforms.
The point is, the 1% of projects are in good shape
But what about the 99%? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don’t have the resources to deal with these challenges, so their research suffers, and over time many may become irrelevant. So at the CI we asked ourselves a question (many questions, actually) about how we can help avert this crisis. And one question that kind of sums up a lot of our thinking is…
We can’t just expect to throw more people and money at the problem; we’re already seeing the limits.
Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. Well, the scientific research equivalent is a little different.
We figured it needs to allow a group of collaborating researchers to do many or all of these things with their data: not just the 2 GB of PowerPoints or the 100 GB of family photos and videos, but the petabytes and exabytes of data that will soon be the norm for many.
So how would such a Dropbox for science be used? Let’s look at a very typical scientific data workflow. Data is generated by some instrument (a sequencer at JGI, or a light source like APS/ALS). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user, so the data is typically moved from a staging area to some type of ingest store. Et cetera for analysis, sharing of results with collaborators, annotation with metadata for future search, backup/sync/archival, …
We started with the seemingly simple/mundane task of transferring files, etc.
And when we spoke with IT folks at various research communities, they insisted that some things were not up for negotiation.
This image shows a 3D rendering of a Shewanella biofilm grown on a flat plastic substrate in a Constant Depth bioFilm Fermenter (CDFF). The image was generated using x-ray microtomography at the Advanced Photon Source, Argonne National Laboratory.