Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery; Steve Hughes, NASA; Data Publication Repositories
The 2nd Research Data Access and Preservation (RDAP) Summit
An ASIS&T Summit
March 31-April 1, 2011 Denver, CO
In cooperation with the Coalition for Networked Information
http://asist.org/Conferences/RDAP11/index.html
1. Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery
  Research Data Access & Preservation, Denver, Colorado, March 31 - April 1, 2011
  Steve Hughes, Dan Crichton, Chris Mattmann, Sean Kelly
2. Topics E-Science Trends Software Architectures Open Source Object-Oriented Data Technology Use Case Data Driven 2 Leveraging Open Source Technologies to Enable Scientific Discovery
3. “eScience” Trends
  - Highly distributed, multi-organizational systems
    - Systems are moving towards loosely coupled systems or federations in order to solve science problems which span center and institutional environments
  - Sharing of data and services which allow for the discovery, access, and transformation of data
    - Systems are moving towards publishing of services and data in order to address data- and computationally-intensive problems
  - Infrastructures which are being built to handle future demand
    - Use of commodity services to address elasticity
  - Address complex modeling, inter-disciplinary science and decision support needs
    - Need a dynamic environment where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions
    - Need to ensure the information architecture supports the varying science needs
  - Changing the way in which data analysis is performed
    - Moving towards analysis of distributed data to increase the study power
    - Enabling greater collaboration across centers
    - Systematizing, where possible
5. Why Software Architecture?
  - Software Architecture: The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)
  - Architecture is about strategy to address key architectural concerns…
    - How can we exploit common patterns to improve reuse?
    - Can we develop software product lines?
    - Can we improve interoperability?
    - Can we reduce dependencies?
  - What are the architectural principles? Loosely-coupled, data-driven, highly distributed, commodity services, service oriented, collaborative/multi-institutional
17. What does this have to do with open source?
  - Identifying core software product lines and tools that can be reused is an excellent opportunity to create open source projects
    - Across a federation of organizations, systems and users, what can be developed and shared?
    - How can software components be developed in generic ways, but allow for extensions?
  - Open source itself is a strategy
    - Can improve collaborations
    - Can drive a robust set of reusable software components and tools
    - Can push standards development
    - Can encourage use of common architectural patterns
18. Open Source Models
  - Software sharing with an open source license (e.g., BSD-style license)
  - Software distribution through open source organizations (e.g., SourceForge)
  - Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
  - Ad hoc open source project communities with their own governance
19. Open Source Models: Our Opinion
  - Software sharing with an open source license (e.g., BSD-style license)
    - It’s a great start
    - Limited community involvement
  - Software distribution through open source organizations (e.g., SourceForge)
    - Provides good software distribution support
  - Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
    - This moves from just distribution support to collaboration and governance over the development
  - Ad hoc open source project communities with their own governance
    - This can make a lot of sense for larger federations…
35.
  - Heterogeneous sites with common interfaces allowing access to distributed portals
  - Integrated based on common data standards
  - Secure (e.g., encryption, authentication, authorization)
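The idea of heterogeneous sites behind a common interface can be sketched as follows. This is a minimal illustration, not OODT code: the `Site` protocol, the two site classes, the `Record` type, and `federated_search` are all hypothetical names, and the security concerns listed on the slide (encryption, authentication, authorization) are deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Record:
    """Common result shape shared by every site (hypothetical)."""
    site: str
    identifier: str
    title: str


class Site(Protocol):
    """The common interface: each site only has to offer search()."""
    name: str

    def search(self, keyword: str) -> List[Record]: ...


class CsvSite:
    """Hypothetical site whose catalog is a list of 'id,title' strings."""

    def __init__(self, name: str, rows: List[str]) -> None:
        self.name = name
        self._rows = rows

    def search(self, keyword: str) -> List[Record]:
        hits = []
        for row in self._rows:
            identifier, title = row.split(",", 1)
            if keyword.lower() in title.lower():
                hits.append(Record(self.name, identifier, title))
        return hits


class DictSite:
    """Hypothetical site whose catalog is a dict of id -> title."""

    def __init__(self, name: str, catalog: Dict[str, str]) -> None:
        self.name = name
        self._catalog = catalog

    def search(self, keyword: str) -> List[Record]:
        return [
            Record(self.name, identifier, title)
            for identifier, title in self._catalog.items()
            if keyword.lower() in title.lower()
        ]


def federated_search(sites: List[Site], keyword: str) -> List[Record]:
    """Query every site through the common interface and merge the results."""
    results: List[Record] = []
    for site in sites:
        results.extend(site.search(keyword))
    return results


if __name__ == "__main__":
    sites = [
        CsvSite("site_a", ["a1,Ocean wind vectors", "a2,Soil moisture map"]),
        DictSite("site_b", {"b1": "Soil moisture time series"}),
    ]
    for record in federated_search(sites, "soil"):
        print(record)
```

Each site keeps its own internal storage; only the `search` signature and the `Record` shape are shared, which is the part a portal would depend on.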
48. SMAP; SeaWinds on ADEOS II (Launched Dec 2002)
  Credit: D. Freeborn, C. Mattmann, D. Woollard
49. Conceptual Capabilities
  - OODT Apache Suite (oodt.apache.org)
    - File Management
    - Workflow Management (for jobs/processing)
    - Data Transformation
    - Data Access
    - Metadata Query
  - Registry (future addition to OODT)
    - Metadata management based on the ebXML registry specification
    - Used to manage different types of “extrinsic” objects (metadata descriptions of data, services, etc.)
      - “targets”, “science data products”, “documents”, “services”, etc.
    - Product identification, versioning, tracking, and subscription/notification
    - Indexing, classification, and associations
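The registry capabilities listed above (identification, versioning, tracking, and subscription/notification) can be sketched roughly as follows. This is an illustrative toy, not the OODT registry or the ebXML registry API; `Registry`, `RegisteredObject`, and every method name here are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class RegisteredObject:
    """One version of an "extrinsic" object (hypothetical structure)."""
    object_type: str   # e.g. "document", "service", "target"
    logical_id: str    # identifier that stays stable across versions
    version: int       # 1 for the first registration, then 2, 3, ...
    metadata: dict


class Registry:
    """Toy in-memory registry with versioning and notifications."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[RegisteredObject]] = {}
        self._subscribers: Dict[str, List[Callable[[RegisteredObject], None]]] = {}

    def subscribe(self, object_type: str,
                  callback: Callable[[RegisteredObject], None]) -> None:
        """Ask to be notified whenever an object of this type is registered."""
        self._subscribers.setdefault(object_type, []).append(callback)

    def register(self, object_type: str, logical_id: str,
                 metadata: dict) -> RegisteredObject:
        """Store a new version under logical_id and notify subscribers."""
        history = self._versions.setdefault(logical_id, [])
        obj = RegisteredObject(object_type, logical_id,
                               len(history) + 1, dict(metadata))
        history.append(obj)
        for callback in self._subscribers.get(object_type, []):
            callback(obj)
        return obj

    def latest(self, logical_id: str) -> RegisteredObject:
        """Return the most recent version; raises KeyError if unknown."""
        return self._versions[logical_id][-1]

    def history(self, logical_id: str) -> List[RegisteredObject]:
        """Return every tracked version, oldest first."""
        return list(self._versions.get(logical_id, []))


if __name__ == "__main__":
    registry = Registry()
    registry.subscribe("document", lambda obj: print("registered:", obj))
    registry.register("document", "doc-1", {"title": "Draft"})
    registry.register("document", "doc-1", {"title": "Final"})
    print(registry.latest("doc-1").version)
```

Registering the same `logical_id` twice keeps both versions, so tracking falls out of the stored history rather than needing a separate mechanism.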
50. Information Architecture
  - OODT + Registry contains two different types of “models”
    - Core infrastructure model
    - Discipline model
  - Core infrastructure model is intrinsic (integrated with the software)
    - It is built in and used by the software; this never changes and you don’t need to worry about it
    - Services are part of the core infrastructure (“intrinsic”) but all other metadata objects are “extrinsic”
  - Discipline model is extrinsic (defined outside the software)
    - It is dynamically configured
    - For example, the registry can be configured to use whatever “extrinsic” metadata objects are important to manage
    - This allows for the registry to be used for tracking artifacts, managing services, etc.
    - This is what projects need to define
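The split between a built-in core model and an externally configured discipline model might look something like the sketch below. All names (`ConfigurableRegistry`, `CORE_ATTRIBUTES`, `CORE_OBJECT_TYPES`, the `endpoint` attribute, and the example planetary model) are assumptions for illustration only and are not taken from OODT.

```python
from typing import Dict, Iterable, Tuple

# Core infrastructure model: fixed in the software (intrinsic).
# Every object needs these attributes, and "service" is a built-in type.
CORE_ATTRIBUTES: Tuple[str, ...] = ("identifier", "object_type")
CORE_OBJECT_TYPES: Dict[str, Tuple[str, ...]] = {"service": ("endpoint",)}


class ConfigurableRegistry:
    """Toy registry whose extrinsic object types come from configuration."""

    def __init__(self, discipline_model: Dict[str, Iterable[str]]) -> None:
        # Start from the built-in types, then add the project-defined ones.
        self._model: Dict[str, Tuple[str, ...]] = dict(CORE_OBJECT_TYPES)
        for object_type, attributes in discipline_model.items():
            if object_type in CORE_OBJECT_TYPES:
                raise ValueError(f"{object_type!r} is part of the core model")
            self._model[object_type] = tuple(attributes)
        self._objects: Dict[str, dict] = {}

    def register(self, record: dict) -> None:
        """Validate a record against the core and discipline models, then store it."""
        missing_core = [a for a in CORE_ATTRIBUTES if a not in record]
        if missing_core:
            raise ValueError(f"missing core attributes: {missing_core}")
        object_type = record["object_type"]
        if object_type not in self._model:
            raise ValueError(f"unknown object type: {object_type!r}")
        missing = [a for a in self._model[object_type] if a not in record]
        if missing:
            raise ValueError(f"missing {object_type} attributes: {missing}")
        self._objects[record["identifier"]] = dict(record)

    def get(self, identifier: str) -> dict:
        """Return a copy of a stored record; raises KeyError if unknown."""
        return dict(self._objects[identifier])


if __name__ == "__main__":
    # Discipline model: defined by the project, outside the software.
    planetary_model = {
        "target": ["name"],
        "science data product": ["target", "instrument"],
    }
    registry = ConfigurableRegistry(planetary_model)
    registry.register({"identifier": "t1", "object_type": "target", "name": "Mars"})
    print(registry.get("t1"))
```

A different project would pass a different dictionary and reuse the same class unchanged, which is the point of keeping the discipline model outside the software.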
54. External Data Standards
  - Open Archival Information System (OAIS) Reference Model: defines the “Information Object”, a key component of the model.
  - ISO/IEC 11179-3, Registry Metamodel and Basic Attributes: provides the schema for the data dictionary and defines the concepts of registration authority and steward for governance.
  - Object-Oriented Data Modeling: used as a standard modeling methodology.
  - XML/XML Schema: provides the label syntax and validation mechanism.
  - OASIS/ebXML Registry Information Model: provides attributes for object registration within a federated registry/repository.
  - ISO 15836:2009, The Dublin Core Metadata Element Set: provides standard web resource identification attributes.
  - Semantics (RDF, RDFS, OWL): provides W3C standards for knowledge representation.
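As one small illustration of how two of these standards combine, the sketch below writes a label using Dublin Core elements in XML syntax. It uses only Python's standard `xml.etree.ElementTree` module and the Dublin Core elements namespace; the `metadata` wrapper element and the `dublin_core_record` function are assumptions chosen for the example, not part of any standard or of OODT.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The fifteen elements defined by the Dublin Core element set.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def dublin_core_record(fields: Dict[str, str]) -> str:
    """Serialize fields as dc:* elements inside a hypothetical wrapper element."""
    unknown = sorted(set(fields) - DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    root = ET.Element("metadata")  # wrapper name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(dublin_core_record({
        "title": "Leveraging Open Source Technologies to Enable "
                 "Scientific Archiving and Discovery",
        "creator": "Steve Hughes",
        "date": "2011",
    }))
```

Checking element names against the fixed set is a crude stand-in for the validation that an XML Schema would normally provide.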
55. A perspective to leave you with…
  Agency science federations, based on an open source/collaborative model, are very attractive for the following reasons:
  - Science benefits: can drive a growing enterprise of shared science services and software infrastructure support
  - Technology benefits: can drive innovation through its peer review and collaboration process
  - Infusion benefits: creates a defined process for contributing new ideas and capabilities
  - Architecture benefits: helps you build towards a common architectural vision and drive community standards
  - Cost benefits: can enable better leveraging and reuse of skills and capabilities across institutions
  - Tech transfer benefits: may benefit other science (and non-science) disciplines
56. Questions? Thank You!!!
  Steve Hughes, Steve.Hughes@jpl.nasa.gov
  Chris Mattmann, Chris.Mattmann@jpl.nasa.gov
  Dan Crichton, Dan.Crichton@jpl.nasa.gov
  Sean Kelly, Sean.Kelly@jpl.nasa.gov
  Note: we have several papers and book chapters on data-intensive systems, etc., that we’d be happy to share! A few key ones:
  - D. Crichton, C. Mattmann, J. S. Hughes, S. Kelly, and A. Hart. “A Multi-Disciplinary, Model-Driven, Distributed Science Data System Architecture.” Guide to e-Science: Next Generation Scientific Research and Discovery. X. Yang, L. L. Wang, W. Jie, eds. Springer Verlag, 2010. To appear.
  - D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. “A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer.” Accepted for publication at the 2nd IEEE International Conference on e-Science and Grid Computing, Amsterdam, the Netherlands, December 4th-6th, 2006.
  - C. Mattmann, D. Crichton, N. Medvidovic, and S. Hughes. “A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications.” In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.