Hybrid Strategies for
Research Data Management
  Vas Vasiliadis, Computation Institute
         vas@ci.uchicago.edu




                                    computationinstitute.org
The Computation Institute
= UChicago + Argonne
= Cross-disciplinary nexus
= Home of the Research Cloud

x10 in 6 years
x10^5 in 6 years




1 PB data in last experiment
Accessed by 800 scientists worldwide

1.2 PB of climate data
Delivered to 23,000 users

We have exceptional infrastructure for the 1%

How can the 99% manage this?

What would a “dropbox for science” look like?




• Collect     • Catalog
• Move        • Publish
• Replicate   • Search
• Share       • Archive
• Analyze     • Backup
…among distributed research groups

[Diagram, shown in three animation builds: a research data workflow showing a Staging Store, Ingest Store, Analysis Store, Community Store, Registry, Archive and Mirror]

• Collect     • Catalog
• Move        • Publish
• Replicate   • Search      …-as-a-Service
• Share       • Archive
• Analyze     • Backup



Security
Privacy
Reliability
Scalability
Control

A great user experience




Research Data Management-as-a-Service
SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog
PaaS: Globus Integrate (Globus Nexus, Globus Connect)
[Overlaid on the workflow diagram: Registry, Staging Store, Ingest Store, Analysis Store, Community Store, Archive, Mirror]




Communities using Globus




What does it mean for us as
IT resource managers?




installers → brokers




developers → integrators




         GSI-OpenSSH

administrators → curators
(of the user experience)

Cloud? What cloud?
1  : 1   : 0
UX : Dev : Ops

Other innovative science
SaaS projects




Our vision for a 21st century cyberinfrastructure

To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources in a hybrid world

Thank you to our sponsors





Hybrid Strategies for Research Data Management

  • 1. Hybrid Strategies for Research Data Management. Vas Vasiliadis, Computation Institute, vas@ci.uchicago.edu, computationinstitute.org
  • 2. The Computation Institute = UChicago + Argonne = Cross-disciplinary nexus = Home of the Research Cloud
  • 4. x10 in 6 years; x10^5 in 6 years
  • 5. 1 PB data in last experiment. Accessed by 800 scientists worldwide
  • 6. 1.2 PB of climate data. Delivered to 23,000 users
  • 8. We have exceptional infrastructure for the 1%. How can the 99% manage this?
  • 9. What would a “dropbox for science” look like?
  • 10. Collect • Catalog • Move • Publish • Replicate • Search • Share • Archive • Analyze • Backup …among distributed research groups
  • 11. Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, Mirror
  • 12. Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, Mirror
  • 13. Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, Mirror
  • 14. Collect • Catalog • Move • Publish • Replicate • Search • Share • Archive • Analyze • Backup …-as-a-Service
  • 15. Security, Privacy, Reliability, Scalability, Control
  • 16. A great user experience
  • 17. Research Data Management-as-a-Service. SaaS: Globus Transfer, Globus Storage, Globus Collaborate, Globus Catalog. PaaS: Globus Integrate (Globus Nexus, Globus Connect). Overlaid on the workflow diagram: Registry, Staging Store, Ingest Store, Community Store, Analysis Store, Archive, Mirror
  • 18. Communities using Globus
  • 19. What does it mean for us as IT resource managers?
  • 20. installers → brokers
  • 21. developers → integrators. GSI-OpenSSH
  • 22. administrators → curators (of the user experience). Cloud? What cloud? UX : Dev : Ops = 1 : 1 : 0
  • 25. Other innovative science SaaS projects
  • 26. Our vision for a 21st century cyberinfrastructure: To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources in a hybrid world
  • 27. Thank you to our sponsors
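
The two growth figures on slide 4 are easier to compare once each is converted to an annual growth factor and a doubling time. The sketch below is an illustration, not part of the talk. The function names are hypothetical, the second figure is read as 10^5 (the exponent appears to have been flattened in the slide text), the labels follow the speaker's note for that slide, and growth is assumed to compound at a constant rate, which real processor and data-volume curves only approximate.

```python
import math


def annual_growth_factor(total_factor: float, years: float) -> float:
    """Yearly multiplier that compounds to `total_factor` over `years`."""
    return total_factor ** (1.0 / years)


def doubling_time_years(total_factor: float, years: float) -> float:
    """Time to double, assuming the same constant compound growth."""
    return years * math.log(2.0) / math.log(total_factor)


if __name__ == "__main__":
    # Figures from slide 4; the 10^5 reading is an assumption (see lead-in).
    for label, factor in (("processor power", 10.0), ("data volume", 1e5)):
        yearly = annual_growth_factor(factor, 6.0)
        doubling_months = 12.0 * doubling_time_years(factor, 6.0)
        print(f"{label}: x{yearly:.2f} per year, "
              f"doubles every {doubling_months:.1f} months")
```

Under these assumptions, a tenfold increase over six years is roughly 1.5x per year (doubling about every 22 months), while a 10^5-fold increase is roughly 6.8x per year (doubling about every 4 months).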

Editor's notes

  1. Share some thoughts with you. Ask you to think critically about managing research data in what is rapidly becoming a hybrid IT world.
  2. A place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently we’ve been talking about it as the home of the “research cloud”… I’ll describe what we mean by that throughout this talk.
  3. Examples of areas where we have active projects. Much of our legacy is in the physical sciences. But increasingly we are finding ourselves working in the life sciences…
  4. And the reason is pretty obvious… This chart and others like it are becoming a cliché in next-gen sequencing and big data presentations. >>>> ANIMATE …but the point I want to make is that while Moore’s law translates to roughly a 10x increase in processor power >>>> ANIMATE …data volumes are growing many orders of magnitude faster. AND MEANWHILE, other resources [money, people] are staying pretty flat. So we have a looming crisis… …and we hear that the magic bullet of “the cloud” is going to solve it. As far as cost goes, clouds are helping …but many issues remain.
  5. Two examples to illustrate some of these issues… LIGO searches for gravitational waves to explore fundamental physics concepts. It runs three observatories around the world and generated over a petabyte of data in its most recent experiment. It’s not just the volume of data – arguably 1 PB is becoming commonplace… …the real complexity is that this data has to be made available to almost a thousand researchers all over the world …and it has to be actively managed for many years while experiments and analyses are run against it. A very complex undertaking. And by the way, their next experiment, Advanced LIGO, will generate a couple of orders of magnitude more data.
  6. Earth System Grid Federation provides data and tools to over 20,000 climate scientists around the world. So what’s notable about these examples? Again, it’s the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us in his keynote that the Broad Institute hit 10 PB of spinning disk last year -- and that it’s not a big deal. To a select few, these numbers are routine… And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense. Sustained, multi-year efforts. Application-specific solutions, built mostly on common/homogeneous technology platforms.
  7. The obligatory data deluge slide… >>>> ANIMATE So this fellow here is well prepared for the data deluge …but what about the rest of us?
  8. The point is, the 1% of projects are in good shape. >>>> ANIMATE But what about the 99% set? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don’t have the resources to deal with these challenges… So their research suffers …and over time they may become irrelevant. So at the CI we asked ourselves questions about how we can help avert this crisis. And one question that sums up our thinking is…
  9. Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. >>>> ASK FOR SHOW OF HANDS …confirm majority. Well, the scientific research equivalent is a little different…
  10. We figured it needs to allow researchers to do many or all of these things with their data… …and not just with the 2 GB of PowerPoint decks or the 100 GB of family photos and videos …but the petabytes and exabytes of data that will soon be the norm for many. >>> ANIMATE Again, it’s the large distributed group of collaborating researchers that’s key here.
  11. So how would such a dropbox for science be used? Let’s look at a very typical scientific data workflow… Data is generated by some instrument (an NGS core in China, or a large telescope in Chile). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user… …so the data is typically moved from a staging area to some type of ingest store. This is usually pretty raw data …so some of it may need to be run through one or more analysis pipelines. At this point we’ve not only distributed the data, we’ve also multiplied it in size. Then we may need to do some post-processing and apply some metadata… …before publishing it in a Community Store where other collaborators can access it securely. Perhaps also place a subset of the data in a national Registry for public access. And we’d also like to keep Mirrors of the data for performance and various other reasons. And over time we will end up moving data to an Archive, perhaps a hierarchical storage system. In practice the various stores are probably owned and managed by different organizations: >>>>> ANIMATE …Ingest is on my campus at University of Chicago >>>>> ANIMATE …Analysis may be on a public cloud provider because I can’t get enough cycles on demand on campus >>>>> ANIMATE …The Registry is in some vault in Virginia >>>>> ANIMATE …The Community Store is on a private cloud at one of the national labs. And so on… we have to deal with a hybrid storage world.
  12. Beyond the hybrid storage environments, we also have to deal with moving the data reliably -- something that sounds pretty mundane …and it is mundane when you’re moving 50 pictures of Fluffy to Picasa …but it’s a little more challenging when you’re moving a petabyte to half a dozen locations around the world. You end up having to become familiar with many tools and techniques. >>> ANIMATE …some systems will force you to use arcane commands like SCP that require extensive configuration and tuning – and yet still deliver only modest performance and reliability >>> ANIMATE …in other cases you’ll find that a hard drive and a FedEx account are the way to go >>> ANIMATE …or some custom portal with a convoluted workflow. So we have to deal with a hybrid (and generally poor) user experience.
  13. And if that wasn’t enough, each of these systems is going to be in a different security domain >>>> ANIMATE …and you’ll have to deal with multiple identities and security protocols to get the job done. So we have to deal with a hybrid security world. Realization: building a solution is really only feasible for very few among us -- certainly not for the typical research lab. So we looked at what’s worked in a number of business application areas like CRM and ERP and decided that…
  14. …for small research groups, the only feasible way to provide all of these capabilities is… >>>> ANIMATE …using a software-as-a-service approach. And what’s interesting is that much of this also applies to larger groups who are starting to question the level of investment they are making in building their own. It’s similar to the debate that many large companies have had about using SaaS vs. in-house software …and we’ve seen that pendulum swing strongly in favor of SaaS.
  15. And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation.
  16. We can deal with that complexity technically but the key is to deliver a great user experience. We’re trying to serve the needs of the vast majority of researchers who cannot hope to… …navigate Amazon’s API …or figure out how to configure an Isilon storage node for their internal cloud.
  17. So a couple of years ago we started building such a solution… Transfer: move big data reliably; >4,000 users in just over a year, approaching the 4 PB mark… Storage: enabling any number of object stores to be used in a consistent manner to replicate, version, and share data. Collaborate: allow the group to manage their workflows and publish data for internal and external consumption. Catalog: make metadata part and parcel of the data, not an afterthought. Integrate: enable groups to access the various services programmatically. Nexus: provide a federated identity infrastructure which allows users to access the services with their existing accounts at their primary institution …plus a group management service that serves as the basis for sharing of data across all other Globus services. In developing this we started with the user experience …service + multiple UIs for different types of users …a very small, no-maintenance footprint on the endpoints -- a drag-and-drop or single-command packaged installation that makes the resource part of the Globus service.
  18. So SaaS is one strategy for dealing with the hybrid world coming our way …but we also need strategies for dealing with our organization. For many years we built up a fairly traditional software development organization: lots of devs, some QA, some ops. We realized that we would need to change our view of what the organization should look like.
  19. The first shift we are experiencing is from being installers to capability brokers. We are less concerned with building a data center or installing and configuring software. There is absolutely still a role for that but there are a few that have the skills and experience …so we take advantage of that experience and focus instead on selecting various components and spend our time making them easy to use -- again it’s focusing on the user experience. An example of this is the Globus Storage service. We are working with multiple providers. >>> talk to UC IT Services deployment et al. Cloud storage providers will keep driving the unit cost of storage down. We believe the value lies in making it trivial to use that storage in the normal course of their work. Other components for Globus Collaborate: Drupal, JIRA, Confluence. And we eat our own dog food …Zendesk for support …using Globus Integrate and Globus Nexus …from the user’s perspective they only have a single account on Globus and can access external services like Zendesk to track their support tickets, post to forums, etc.
  20. We’re also moving from being developers to playing more of an integrator role. Again, there are lots of smart people out there that have figured out the hard bits, for example in identity management and security. We’ve taken that knowledge and packaged it in such a way that shields the user from all of this complexity …they just need to remember their single username/password or campus login or Google account or whatever. >>>> TALK TO FEDERATED IDENTITY
  21. If you truly want to focus on the user experience then you need to build the team as such. We’ve shifted the makeup of our team from dev-heavy to more balanced with respect to UX …and quite a shift away from traditional ops (the devs run their own stuff using simple software like Chef).
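
The note for slide 12 stresses that moving data "reliably" is harder than it sounds. As a rough illustration of what that word implies even at small scale, the sketch below copies a file, verifies it against a checksum, and retries on failure. It is a minimal sketch under stated assumptions, not a description of how Globus Transfer works: a local filesystem copy stands in for a wide-area transfer, and `sha256_of`, `reliable_copy`, and their parameters are hypothetical names introduced here.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reliable_copy(source: Path, destination: Path,
                  max_attempts: int = 3, retry_delay_s: float = 1.0) -> str:
    """Copy `source` to `destination`, verify the checksum, retry on failure.

    Returns the verified SHA-256 hex digest. A local copy stands in for a
    real network transfer (an assumption made for illustration only).
    """
    expected = sha256_of(source)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(source, destination)
            if sha256_of(destination) == expected:
                return expected
            last_error = OSError("checksum mismatch after copy")
        except OSError as error:
            last_error = error
        if attempt < max_attempts:
            time.sleep(retry_delay_s)  # fixed pause before the next attempt
    raise RuntimeError(f"copy failed after {max_attempts} attempts") from last_error
```

A call such as `reliable_copy(Path("run42.dat"), Path("/mnt/ingest/run42.dat"))` (both paths hypothetical) returns the digest only once the destination matches the source. A production service would also need restartable partial transfers, parallel streams, and credential handling across security domains, which is the complexity the talk argues should be hidden from researchers.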