Making Connections, Elliot Metsger, Johns Hopkins University; Policy-based Data Management; RDAP11 Summit
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011 Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
1. Data Conservancy — Data Conservancy embraces a shared vision: scientific data curation is a means to collect, organize, validate, and preserve data so that scientists can find new ways to address the grand research challenges that face society. ASIS&T RDAP Summit, April 1, 2011. Elliot Metsger (emetsger@jhu.edu)
3. Architecture — [Figures: Open Archival Information System (OAIS) functional entities; Data Conservancy Service architecture block diagram]
4. Policy Framework
- Policy management and enforcement must be properly modeled
- Understand the policy framework's interactions with other components of the system
- Build proper abstractions
- Support inclusion of associated policies when transferring objects among archives
- Support services over data which apply policies
5. (Some) Motivating Use Cases
- Embargo
- Logging
- Authentication and authorization
- Privacy controls
- Obfuscating certain data: geo-locations of endangered species, personally identifiable information
- Issues: granularity of policy application; obfuscation without reducing data utility ("fuzzing" algorithms)
6. Implementation
- Design and implementation in Year 3 (August 2011 – July 2012)
- In collaboration with: other DataNets; DC partners (e.g. NSIDC); existing organizations (Federation of Earth Science Information Partners)
Speaker notes
I don't intend this to be a talk about the Data Conservancy, but because it has been my life for the past 18 months, it really sets the context of this talk. So I'll provide some brief context about the Data Conservancy and its architecture and design, and then move on to the policy aspects. The Data Conservancy is funded by the NSF through the DataNet program out of OCI, and we are in our 19th month. The Data Conservancy is building infrastructure that will provide curation, preservation, and access to scientific data. The Data Conservancy Service, or DCS, is the technical manifestation of this infrastructure. We do not envision the DCS as a singular instance of a monolithic system, but a blueprint for a modular system that can be followed by those who choose to do so.
Because we are simultaneously developing a system, exploring research problems, and managing a user requirements process, the DCS infrastructure needs to be flexible to accommodate user needs and research outcomes. Modularity has been a focal point of DCS design. Each element of the DCS infrastructure must be abstracted at the proper level. This ensures the correctness and completeness of the system, and allows concrete implementations to be adapted to changing user needs and research outcomes. Modularity also supports interoperability with other infrastructure, technical sustainability (e.g. a storage plugin leveraging more cost-effective storage), and the evolution of DCS modules (such as adding ingest pipeline components). By providing public APIs and minimizing dependencies between system components, we provide an open technical environment where the DCS can adapt and evolve in desirable ways. Where possible we intend to "prove" abstractions by providing multiple concrete implementations. For example, we have two different implementations of our archival storage API: one file-system based, the other object-based using Fedora.
We have also been careful to differentiate between archival storage and bit storage. In addition to the technical benefits, this design principle has facilitated collaboration with other DataNet awardees. Finally, the DCS infrastructure must be open: a closed system is a non-starter, at odds with providing long-term preservation and access to data. These principles are not only applied immediately; they are forward-thinking and ensure the technical sustainability of the DCS and the data managed within it for years to come.
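The "prove abstractions with multiple implementations" idea above can be sketched as a minimal storage interface with two interchangeable backends. This is an illustrative sketch only: the class and method names here are hypothetical and are not the actual DCS archival storage API.

```python
import os
from abc import ABC, abstractmethod


class ArchivalStore(ABC):
    """Hypothetical archival-storage abstraction (illustrative names,
    not the real DCS API)."""

    @abstractmethod
    def put(self, object_id: str, content: bytes) -> None:
        """Store content under the given identifier."""

    @abstractmethod
    def get(self, object_id: str) -> bytes:
        """Retrieve previously stored content."""


class FileSystemStore(ArchivalStore):
    """File-system-backed implementation."""

    def __init__(self, root: str):
        self._root = root
        os.makedirs(root, exist_ok=True)

    def put(self, object_id: str, content: bytes) -> None:
        with open(os.path.join(self._root, object_id), "wb") as f:
            f.write(content)

    def get(self, object_id: str) -> bytes:
        with open(os.path.join(self._root, object_id), "rb") as f:
            return f.read()


class InMemoryStore(ArchivalStore):
    """In-memory stand-in for an object-store-backed implementation
    (e.g. one built on Fedora)."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id: str, content: bytes) -> None:
        self._objects[object_id] = content

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]
```

Because callers depend only on the abstract interface, a backend can be swapped (say, for more cost-effective storage) without touching the rest of the system.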
The DCS architecture has been influenced and guided by the OAIS reference model. As you can see in this figure, OAIS functional concepts are realized in various DCS modules. Not every DCS module directly maps to an OAIS functional concept.
Adhering to our principles, policy management and enforcement must be properly modeled: we must understand its interactions with other system components and build the proper abstractions. We believe it will be a requirement to transfer data between archival systems, including "policy-encumbered" data, so we plan to support the inclusion of associated policies both when transferring objects among archives and when storing objects in our local archive. Auditing will be one mechanism used to ensure that remote archives are able to enforce the policy. Of course, we also plan to support services over data which apply policies.
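One way to picture a policy that travels with an object among archives is an embargo: a machine-readable record the receiving archive can continue to enforce. The following is a hedged sketch under assumed names; it is not the DCS policy model.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class EmbargoPolicy:
    """Illustrative embargo policy attached to an archived object
    (hypothetical structure, not the actual DCS design)."""
    object_id: str
    release_date: date

    def permits_access(self, on: date) -> bool:
        # Access is denied until the embargo lifts.
        return on >= self.release_date


def package_for_transfer(policy: EmbargoPolicy) -> dict:
    """Serialize the policy so it can accompany its object when
    transferred to another archive, which then enforces it locally."""
    return {
        "object_id": policy.object_id,
        "policy_type": "embargo",
        "release_date": policy.release_date.isoformat(),
    }
```

An audit process, as mentioned above, could periodically check that a remote archive's access decisions match what `permits_access` would return for the packaged policy.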
Motivating use cases include embargoes; logging (access, downloads); authentication and authorization; privacy controls (e.g. a user must contact the producer for a copy of the data); and deliberate obfuscation of certain data, such as geo-locations of endangered species or personally identifiable information. The issues here are the granularity of policy application and obfuscating data without reducing its utility ("fuzzing" algorithms).
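The "fuzzing without destroying utility" tension can be illustrated with a toy coordinate-obfuscation function: shift a point by a random offset within a fixed radius, so regional analyses still work but the exact location (say, of an endangered species) is hidden. This is a sketch under simplifying assumptions; real obfuscation algorithms must also weigh re-identification risk across repeated releases.

```python
import math
import random


def fuzz_location(lat, lon, radius_km=10.0, rng=None):
    """Return (lat, lon) displaced by a random bearing and a random
    distance up to radius_km. Illustrative only: uses a flat-earth
    approximation (~111 km per degree of latitude)."""
    rng = rng or random.Random()
    bearing = rng.uniform(0, 2 * math.pi)
    dist = rng.uniform(0, radius_km)
    dlat = (dist * math.cos(bearing)) / 111.0
    # Longitude degrees shrink with the cosine of the latitude.
    dlon = (dist * math.sin(bearing)) / (111.0 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```

The `radius_km` parameter is exactly the granularity knob the slide alludes to: a larger radius protects privacy better but degrades the data's utility for fine-grained analysis.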
Policy framework implementation has not yet begun; it is scheduled for Year 3, which starts in August 2011. We plan to design and implement our policy framework in collaboration with other DataNets (we feel the need for broad interoperability beyond just the DataNets, in both a disciplinary and an interdisciplinary sense), DC partners like NSIDC, and existing and evolving frameworks in the earth sciences.