Hughes RDAP11 Data Publication Repositories

1. Leveraging Open Source Technologies to Enable Scientific Archiving and Discovery. Research Data Access & Preservation, Denver, Colorado, March 31 - April 1, 2011. Steve Hughes, Dan Crichton, Chris Mattmann, Sean Kelly

2. Topics: E-Science Trends; Software Architectures; Open Source; Object-Oriented Data Technology; Use Case; Data Driven
   Leveraging Open Source Technologies to Enable Scientific Discovery

3. “eScience” Trends
   Highly distributed, multi-organizational systems: systems are moving towards loosely coupled systems or federations in order to solve science problems that span center and institutional environments.
   Sharing of data and services that allow for the discovery, access, and transformation of data: systems are moving towards publishing services and data in order to address data- and computationally intensive problems; infrastructures are being built to handle future demand; commodity services are used to address elasticity.
   Addressing complex modeling, inter-disciplinary science, and decision support needs: a dynamic environment is needed where data and services can be used quickly as the building blocks for constructing predictive models and answering critical science questions; the information architecture must support the varying science needs.
   Changing the way in which data analysis is performed: moving towards analysis of distributed data to increase the study power; enabling greater collaboration across centers; systematizing, where possible.

4. Highly Distributed Science Environments: highly distributed/federated; collaborative; information-centric; discipline-specific; growing/evolving; heterogeneous (implementations)

5. Why Software Architecture?
   Software Architecture: The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution. (ANSI/IEEE Std. 1471-2000)
   Architecture is about strategy to address key architectural concerns: How can we exploit common patterns to improve reuse? Can we develop software product lines? Can we improve interoperability? Can we reduce dependencies?
   What are the architectural principles? Loosely coupled, data-driven, highly distributed, commodity services, service-oriented, collaborative/multi-institutional.

7. Loosely coupled

8. Elasticity (e.g. Commodity-based)

9. Multi-organizational

10. etc

11. At an enterprise scale, architectures don’t need to prescribe what’s inside services, just their interfaces, function, behavior, etc.

12. Services might include:

13. Data discovery

14. Data access

15. Security

16. Transformation
   C2 Architectural Style
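The point that an enterprise architecture prescribes a service's interface rather than its internals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `DiscoveryService`, `Hit`, `InMemoryNode`, `TaggedNode`, and `federated_search` are hypothetical names made up for this sketch and are not part of Apache OODT or any other system named in the slides.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    """One discovery result: the node holding a product and the product id."""
    node: str
    product_id: str


class DiscoveryService(Protocol):
    """Hypothetical interface; the only thing the federation prescribes."""

    def search(self, keyword: str) -> list[Hit]: ...


class InMemoryNode:
    """A node that keeps a catalog of product id -> free-text description."""

    def __init__(self, name: str, catalog: dict[str, str]) -> None:
        self.name = name
        self._catalog = catalog

    def search(self, keyword: str) -> list[Hit]:
        return [Hit(self.name, pid) for pid, desc in self._catalog.items()
                if keyword.lower() in desc.lower()]


class TaggedNode:
    """A node with a different internal layout: tag -> list of product ids."""

    def __init__(self, name: str, tags: dict[str, list[str]]) -> None:
        self.name = name
        self._tags = tags

    def search(self, keyword: str) -> list[Hit]:
        return [Hit(self.name, pid) for pid in self._tags.get(keyword.lower(), [])]


def federated_search(nodes: list[DiscoveryService], keyword: str) -> list[Hit]:
    """Query every node through the shared interface and merge the results."""
    results: list[Hit] = []
    for node in nodes:
        results.extend(node.search(keyword))
    return results


# Two heterogeneous nodes behind one interface (made-up example data).
nodes = [
    InMemoryNode("node-a", {"p1": "Ocean wind vectors", "p2": "Soil moisture"}),
    TaggedNode("node-b", {"wind": ["q7"]}),
]
hits = federated_search(nodes, "wind")
```

The caller never inspects how a node stores its catalog, so a new center could plug in with any internal implementation as long as it offers the same `search` method.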

17. What does this have to do with open source?
   The identification of core software product lines and tools that can be reused is an excellent opportunity to create open source projects. Across a federation of organizations, systems, and users, what can be developed and shared? How can software components be developed in generic ways, but allow for extensions?
   Open source itself is a strategy: it can improve collaborations; drive a robust set of reusable software components and tools; push standards development; and encourage use of common architectural patterns.

18. Open Source Models
   Software sharing with an open source license (e.g., BSD-style license)
   Software distribution through open source organizations (e.g., SourceForge)
   Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation)
   Ad hoc open source project communities with their own governance

19. Open Source Models: Our Opinion
   Software sharing with an open source license (e.g., BSD-style license): it’s a great start, but community involvement is limited.
   Software distribution through open source organizations (e.g., SourceForge): provides good software distribution support.
   Software projects under the governance of an open source community/foundation (e.g., Apache Software Foundation): this moves from just distribution support to collaboration and governance over the development.
   Ad hoc open source project communities with their own governance: this can make a lot of sense for larger federations.

22. distributed environments

23. science data generation

24. data capture, end-to-end

25. access to science data by the community

26. A set of building blocks/services to exploit common system patterns for reuse

27. 04-FEB-2011 - Apache OODT v0.2 Released

28. Used for a number of science data system activities. http://oodt.apache.org/

30. New centers plugging in (i.e. data nodes)

31. Multi-center data system infrastructure

35. Heterogeneous sites with common interfaces allowing access to distributed portals. Integrated based on common data standards. Secure (e.g. encryption, authentication, authorization).

37. Used OODT Catalog and Archive Service software

38. Focus is on “process management”

39. Constructed “workflows”

40. Execution of “processors” based on a set of rules

41. Explicit separation of workflow management from management of computational resources

42. Provided “lights out” operations
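The ideas listed above (workflows built from “processors”, execution driven by rules, and workflow management kept separate from the management of computational resources) can be sketched as follows. This is a minimal illustration, not the OODT Catalog and Archive Service API: `Task`, `WorkflowManager`, `LocalResourceManager`, and the metadata keys are all hypothetical names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

Metadata = dict  # plain dict of metadata keys to values


@dataclass
class Task:
    """A processor plus the rule (precondition) that must hold before it runs."""
    name: str
    rule: Callable[[Metadata], bool]
    processor: Callable[[Metadata], Metadata]  # returns metadata to merge in


class LocalResourceManager:
    """Decides where a job runs. In this sketch: in-process, immediately."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def submit(self, job_name: str, job: Callable[[Metadata], Metadata],
               metadata: Metadata) -> Metadata:
        self.log.append(job_name)
        return job(metadata)


class WorkflowManager:
    """Decides what runs and when; delegates execution to a resource manager."""

    def __init__(self, tasks: list[Task], resources: LocalResourceManager) -> None:
        self.tasks = tasks
        self.resources = resources

    def run(self, metadata: Metadata) -> Metadata:
        metadata = dict(metadata)
        pending = list(self.tasks)
        progressed = True
        # Keep sweeping until no pending task's rule is satisfied.
        while pending and progressed:
            progressed = False
            for task in list(pending):
                if task.rule(metadata):
                    update = self.resources.submit(task.name, task.processor, metadata)
                    metadata.update(update)
                    pending.remove(task)
                    progressed = True
        return metadata


# Declared out of order on purpose: the rules, not list order, drive execution.
grid = Task("grid", lambda m: "calibrated" in m, lambda m: {"gridded": True})
calibrate = Task("calibrate", lambda m: "raw_file" in m, lambda m: {"calibrated": True})

resources = LocalResourceManager()
result = WorkflowManager([grid, calibrate], resources).run({"raw_file": "pass_001.dat"})
```

Because the workflow manager only calls `submit`, the in-process resource manager could be swapped for one that dispatches to a batch queue or cluster without touching the workflow logic.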

43. Multiple Missions

44. SeaWinds

45. QuikSCAT

46. Orbiting Carbon Observatory (OCO), OCO-2…

47. NP Sounder PEATE

48. SMAP
   SeaWinds on ADEOS II (launched Dec 2002). Credit: D. Freeborn, C. Mattmann, D. Woollard

49. Conceptual Capabilities
   OODT Apache Suite (oodt.apache.org): File Management; Workflow Management (for jobs/processing); Data Transformation; Data Access; Metadata Query.
   Registry (future addition to OODT): metadata management based on the ebXML registry specification; used to manage different types of “extrinsic” objects (metadata descriptions of data, services, etc.) such as “targets”, “science data products”, “documents”, and “services”; product identification, versioning, tracking, and subscription/notification; indexing, classification, and associations.

50. Information Architecture
   OODT + Registry contains two different types of “models”: the core infrastructure model and the discipline model.
   The core infrastructure model is intrinsic (integrated with the software). It is built in and used by the software; this never changes and you don’t need to worry about it. Services are part of the core infrastructure (“intrinsic”), but all other metadata objects are “extrinsic”.
   The discipline model is extrinsic (defined outside the software). It is dynamically configured; for example, the registry can be configured to use whatever “extrinsic” metadata objects are important to manage. This allows the registry to be used for tracking artifacts, managing services, etc. This is what projects need to define.
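The split between an intrinsic core model and a project-configured discipline model can be sketched as a small in-memory registry. This is an assumption-laden illustration, not the ebXML registry interface or any OODT component: `Registry`, `RegistryObject`, the type names, and the `urn:example:` identifiers are hypothetical. Following this slide, "service" is treated as the one built-in type and every other type is supplied by configuration; simple integer versioning stands in for the product versioning mentioned on the previous slide.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryObject:
    """A registered metadata description with a logical id and a version."""
    object_type: str
    logical_id: str
    version: int
    metadata: dict


class Registry:
    # Core infrastructure model: built in, never changes (hypothetical choice).
    INTRINSIC_TYPES = frozenset({"service"})

    def __init__(self, extrinsic_types: set[str]) -> None:
        # Discipline model: defined outside the software, supplied by the project.
        self.extrinsic_types = set(extrinsic_types)
        self._objects: dict[str, list[RegistryObject]] = {}

    def register(self, object_type: str, logical_id: str, metadata: dict) -> RegistryObject:
        """Store a new version of an object whose type the registry knows about."""
        if object_type not in self.INTRINSIC_TYPES | self.extrinsic_types:
            raise ValueError(f"unknown object type: {object_type!r}")
        versions = self._objects.setdefault(logical_id, [])
        obj = RegistryObject(object_type, logical_id, len(versions) + 1, dict(metadata))
        versions.append(obj)
        return obj

    def latest(self, logical_id: str) -> RegistryObject:
        """Return the most recently registered version of an object."""
        return self._objects[logical_id][-1]


# A project configures the extrinsic object types it wants to manage.
registry = Registry({"target", "science_data_product", "document"})
first = registry.register("science_data_product", "urn:example:product-1", {"status": "raw"})
second = registry.register("science_data_product", "urn:example:product-1", {"status": "calibrated"})
service = registry.register("service", "urn:example:search", {"endpoint": "hypothetical"})
```

Registering an object of a type that was never configured, such as "instrument" here, raises `ValueError`, which is the sense in which the discipline model is something each project has to define.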

51. Observational Product – Concept Map

52. PDS4 High Level Concept Map

53. Defining Extrinsic Objects and their Context (Ontology)

54. External Data Standards
   Open Archival Information System (OAIS) Reference Model: defines the “Information Object”, a key component of the model.
   ISO/IEC 11179-3: Registry Metamodel and Basic Attributes: provides the schema for the data dictionary; defines the concepts of registration authority and steward for governance.
   Object-Oriented Data Modeling: used as a standard modeling methodology.
   XML/XML Schema: provides the label syntax and validation mechanism.
   OASIS/ebXML Registry Information Model: provides attributes for object registration within a federated registry/repository.
   ISO 15836:2009 The Dublin Core Metadata Element Set: provides standard web resource identification attributes.
   Semantics (RDF, RDFS, OWL): provides W3C standards for knowledge representation.

55. A perspective to leave you with…
   Agency science federations, based on an open source/collaborative model, are very attractive for the following reasons:
   Science benefits: can drive a growing enterprise of shared science services and software infrastructure support.
   Technology benefits: can drive innovation through its peer review and collaboration process.
   Infusion benefits: creates a defined process for contributing new ideas and capabilities.
   Architecture benefits: helps you build towards a common architectural vision and drive community standards.
   Cost benefits: can enable better leveraging and reuse of skills and capabilities across institutions.
   Tech transfer benefits: may benefit other science (and non-science) disciplines.

56. Questions? Thank You!!!
   Steve Hughes, Steve.Hughes@jpl.nasa.gov; Chris Mattmann, Chris.Mattmann@jpl.nasa.gov; Dan Crichton, Dan.Crichton@jpl.nasa.gov; Sean Kelly, Sean.Kelly@jpl.nasa.gov
   Note: we have several papers and book chapters on data intensive systems, etc., that we’d be happy to share! A few key ones:
   D. Crichton, C. Mattmann, J. S. Hughes, S. Kelly, and A. Hart. “A Multi-Disciplinary, Model-Driven, Distributed Science Data System Architecture.” Guide to e-Science: Next Generation Scientific Research and Discovery. X. Yang, L. L. Wang, W. Jie, eds. Springer Verlag, 2010, to appear.
   D. Crichton, S. Kelly, C. Mattmann, Q. Xiao, J. S. Hughes, J. Oh, M. Thornquist, D. Johnsey, S. Srivastava, L. Esserman, and B. Bigbee. “A Distributed Information Services Architecture to Support Biomarker Discovery in Early Detection of Cancer.” Accepted for publication at the 2nd IEEE International Conference on e-Science and Grid Computing, Amsterdam, the Netherlands, December 4th-6th, 2006.
   C. Mattmann, D. Crichton, N. Medvidovic, and S. Hughes. “A Software Architecture-Based Framework for Highly Distributed and Data Intensive Scientific Applications.” In Proceedings of the 28th International Conference on Software Engineering (ICSE06), pp. 721-730, Shanghai, China, May 20th-28th, 2006.

57. Backup
