While some early adopters have realized benefits by incorporating clouds into their analysis pipelines, many challenges remain. In this presentation we will highlight the critical issues associated with research data management, and describe alternative approaches for addressing these challenges by optimizing the use of local, distributed and cloud-hosted resources.
26. Our vision for a 21st century cyberinfrastructure
To provide more capability for more people at substantially lower cost by creatively aggregating (“cloud”) and federating (“grid”) resources in a hybrid world
computationinstitute.org
27. Thank you to our sponsors
Editor's notes
Share some thoughts with you. Ask you to think critically about managing research data in what is rapidly becoming a hybrid IT world.
A place where researchers from multiple disciplines come together and engage in research that is fundamentally enabled by computation. More recently we’ve been talking about it as the home of the “research cloud”… I’ll describe what we mean by that throughout this talk.
Examples of areas where we have active projects. Much of our legacy is in the physical sciences. But increasingly we are finding ourselves working in the life sciences…
And the reason is pretty obvious… This chart and others like it are becoming a cliché in next-gen sequencing and big data presentations. >>>> ANIMATE …but the point I want to make is that while Moore’s law translates to a roughly 10x increase in processor power >>>> ANIMATE …data volumes are growing many orders of magnitude faster. AND MEANWHILE, other resources [money, people] are staying pretty flat. So we have a looming crisis… and we hear that the magic bullet of “the cloud” is going to solve it. As far as cost goes, clouds are helping… but many issues remain.
Two examples to illustrate some of these issues… LIGO searches for gravitational waves to explore fundamental physics concepts. It runs three observatories around the world and generated over a petabyte of data in its most recent experiment. It’s not just the volume of data -- arguably 1PB is becoming commonplace… the real complexity is that this data has to be made available to almost a thousand researchers all over the world… and it has to be actively managed for many years while experiments and analyses are run against it. A very complex undertaking. And by the way, their next experiment, Advanced LIGO, will generate a couple of orders of magnitude more data.
Earth System Grid Federation provides data and tools to over 20,000 climate scientists around the world. So what’s notable about these examples? Again, it’s the combination of the amount of data being managed and the number of people that need access to that data. We heard Martin Leach tell us in his keynote that the Broad Institute hit 10PB of spinning disk last year -- and that it’s not a big deal. To a select few, these numbers are routine… And for the projects I just talked about, the IT infrastructure is in place. They have robust production solutions, built by substantial teams at great expense. Sustained, multi-year efforts. Application-specific solutions, built mostly on common/homogeneous technology platforms.
The obligatory data deluge slide… >>>> ANIMATE So this fellow here is well prepared for the data deluge… but what about the rest of us?
The point is, the 1% of projects are in good shape. >>>> ANIMATE But what about the 99% set? There are hundreds of thousands of small and medium labs around the world that are faced with similar data management challenges. They don’t have the resources to deal with these challenges… so their research suffers… and over time they may become irrelevant. So at the CI we asked ourselves questions about how we can help avert this crisis. And one question that sums up our thinking is…
Many in this room are probably users of Dropbox or similar services for keeping their files synced across multiple machines. >>>> ASK FOR SHOW OF HANDS …confirm majority. Well, the scientific research equivalent is a little different…
We figured it needs to allow researchers to do many or all of these things with their data… and not just with the 2GB of PowerPoint decks or the 100GB of family photos and videos… but the petabytes and exabytes of data that will soon be the norm for many. >>>> ANIMATE Again, it’s the large distributed group of collaborating researchers that’s key here.
So how would such a drop box for science be used? Let’s look at a very typical scientific data workflow… Data is generated by some instrument (an NGS core in China, or a large telescope in Chile). Since these instruments are in high demand, users have to get their data off the instrument to make way for the next user… so the data is typically moved from a staging area to some type of ingest store. This is usually pretty raw data… so some of it may need to be run through one or more analysis pipelines. At this point we’ve not only distributed the data, we’ve also multiplied it in size. Then we may need to do some post-processing and apply some metadata… before publishing it in a Community Store where other collaborators can access it securely. Perhaps also place a subset of the data in a national Registry for public access. And we’d also like to keep Mirrors of the data for performance and various other reasons. And over time we will end up moving data to an Archive, perhaps a hierarchical storage system. In practice the various stores are probably owned and managed by different organizations: >>>> ANIMATE …Ingest is on my campus at University of Chicago. >>>> ANIMATE …Analysis may be on a public cloud provider because I can’t get enough cycles on demand on campus. >>>> ANIMATE …The Registry is in some vault in Virginia. >>>> ANIMATE …The Community Store is on a private cloud at one of the national labs. And so on… we have to deal with a hybrid storage world.
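To make the "hybrid storage world" concrete, here is a minimal sketch that models the four stores whose operators are named in the note above and counts how often the data changes hands. The `Store` class, the `boundary_crossings` function, and the ordering of the list are illustrative assumptions, not part of any Globus service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Store:
    """One storage location in the workflow (hypothetical model)."""
    role: str
    operator: str


# Operators are taken from the speaker notes; the order follows the
# described flow from ingest to public registry (an assumption).
WORKFLOW: List[Store] = [
    Store("ingest", "University of Chicago campus"),
    Store("analysis", "public cloud provider"),
    Store("community store", "private cloud at a national lab"),
    Store("registry", "vault in Virginia"),
]


def boundary_crossings(stores: List[Store]) -> int:
    """Count consecutive hops where the operating organization changes."""
    return sum(
        1 for src, dst in zip(stores, stores[1:]) if src.operator != dst.operator
    )


if __name__ == "__main__":
    for src, dst in zip(WORKFLOW, WORKFLOW[1:]):
        print(f"{src.role} ({src.operator}) -> {dst.role} ({dst.operator})")
    print("organizational boundaries crossed:", boundary_crossings(WORKFLOW))
```

In this toy model every hop crosses an organizational boundary, which is the point of the slide: each move implies a different owner, and usually a different tool and login.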
Beyond the hybrid storage environments, we also have to deal with moving the data reliably -- something that sounds pretty mundane… and it is mundane when you’re moving 50 pictures of Fluffy to Picasa… but it’s a little more challenging when you’re moving a petabyte to half a dozen locations around the world. You end up having to become familiar with many tools and techniques. >>>> ANIMATE …some systems will force you to use arcane commands like SCP that require extensive configuration and tuning, and yet still deliver only modest performance and reliability. >>>> ANIMATE …in other cases you’ll find that a hard drive and a FedEx account are the way to go. >>>> ANIMATE …or some custom portal with a convoluted workflow. So we have to deal with a hybrid (and generally poor) user experience.
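As a rough illustration of what "moving the data reliably" involves even in the simplest case, the sketch below copies a file, verifies it with a SHA-256 checksum, and retries on failure. It uses only the Python standard library and local paths; the function names and the retry count are assumptions for illustration, and it does not represent how Globus or any other transfer service is implemented.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reliable_copy(source: Path, destination: Path, max_attempts: int = 3) -> int:
    """Copy source to destination, verify the checksum, retry on failure.

    Returns the attempt number that succeeded; raises RuntimeError if
    every attempt fails or produces a mismatched copy.
    """
    expected = sha256_of(source)
    for attempt in range(1, max_attempts + 1):
        try:
            shutil.copyfile(source, destination)
        except OSError:
            continue  # transient I/O error: try again
        if sha256_of(destination) == expected:
            return attempt
    raise RuntimeError(f"could not verify copy after {max_attempts} attempts")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / "raw.dat"
        dst = Path(workdir) / "mirror.dat"
        src.write_bytes(b"instrument data" * 1000)
        print("verified on attempt", reliable_copy(src, dst))
```

Scaling this idea to a petabyte across half a dozen sites adds parallel streams, restart from partial transfers, and credentials for each endpoint, which is where the tooling burden described above comes from.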
And if that wasn’t enough, each of these systems is going to be in a different security domain. >>>> ANIMATE …and you’ll have to deal with multiple identities and security protocols to get the job done. So we have to deal with a hybrid security world. Realization: building a solution is really only feasible for very few among us -- certainly not for the typical research lab. So we looked at what’s worked in a number of business application areas like CRM and ERP and decided that…
…for small research groups, the only feasible way to provide all of these capabilities is… >>>> ANIMATE …using a software-as-a-service approach. And what’s interesting is that much of this also applies to larger groups who are starting to question the level of investment they are making in building their own. It’s similar to the debate that many large companies have had about using SaaS vs. in-house software… and we’ve seen that pendulum swing strongly in favor of SaaS.
And when we spoke with IT folks at various research communities they insisted that some things were not up for negotiation
We can deal with that complexity technically, but the key is to deliver a great user experience. We’re trying to serve the needs of the vast majority of researchers who cannot hope to… navigate Amazon’s API… or figure out how to configure an Isilon storage node for their internal cloud.
So a couple of years ago we started building such a solution… Transfer: move big data reliably; more than 4,000 users in just over a year, approaching the 4PB mark. Storage: enable any number of object stores to be used in a consistent manner to replicate, version, and share data. Collaborate: allow the group to manage their workflows and publish data for internal and external consumption. Catalog: make metadata part and parcel of the data, not an afterthought. Integrate: enable groups to access the various services programmatically. Nexus: provide a federated identity infrastructure which allows users to access the services with their existing accounts at their primary institution… plus a group management service that serves as the basis for sharing of data across all other Globus services. In developing this we started with the user experience… a service plus multiple UIs for different types of users… and a very small, no-maintenance footprint on the endpoints -- a drag-and-drop or single-command packaged installation that makes the resource part of the Globus service.
So SaaS is one strategy for dealing with the hybrid world coming our way… but we also need strategies for dealing with our organization. For many years we built up a fairly traditional software development organization: lots of devs, some QA, some ops. We realized that we would need to change our view of what the organization should look like.
The first shift we are experiencing is from being installers to capability brokers. We are less concerned with building a data center or installing and configuring software. There is absolutely still a role for that, but there are a few that have the skills and experience… so we take advantage of that experience and focus instead on selecting various components and spend our time making them easy to use -- again, it’s focusing on the user experience. An example of this is the Globus Storage service. We are working with multiple providers. >>>> talk to UC IT Services deployment et al. Cloud storage providers will keep driving the unit cost of storage down. We believe the value lies in making it trivial for researchers to use that storage in the normal course of their work. Other components for Globus Collaborate: Drupal, JIRA, Confluence. And we eat our own dog food… Zendesk for support… using Globus Integrate and Globus Nexus… from the user’s perspective they only have a single account on Globus and can access external services like Zendesk to track their support tickets, post to forums, etc.
We’re also moving from being developers to playing more of an integrator role. Again, there are lots of smart people out there that have figured out the hard bits, for example in identity management and security. We’ve taken that knowledge and packaged it in such a way that shields the user from all of this complexity… they just need to remember their single username/password or campus login or Google account or whatever. >>>> TALK TO FEDERATED IDENTITY
If you truly want to focus on the user experience then you need to build the team as such. We’ve shifted the makeup of our team from dev-heavy to more balanced with respect to UX… and quite a shift away from traditional ops (the devs run their own stuff using simple software like Chef).