Rdap12 wrap up reagan moore
1. RDAP Summary
Topics that drive future digital libraries
Reagan Moore
4/4/2012 ASIST RDAP 2012 1
2. Topics
• Data Management Plans and Policies
  – Scientific research data support
  – Planning for NSF Data Management Plans
• Data Citation Panel
  – Digital identifiers
  – Data representation (context)
• Curation Service Models
  – Institution-based repositories
• SIG-DL Sustainability Panel
  – Cost model
  – Business model
• Training Data Management Practitioners
  – Theory for information and knowledge, but not digital data
  – Teaching eScience librarians how to manage data for researchers
3. Data Management Plans
• Enforcement of regulations:
  – IRB, FERPA, HIPAA
• Enforcement of agency policies:
  – NSF Data management plans
• Enforcement of institutional policies:
  – Trustworthiness
• Compliance with community consensus on collection properties
  – Compliance with standards for discovery and access
• Enforcement of management policies:
  – Integrity, authenticity, retention, disposition, replication
• Automation of administrative tasks
  – Migration
• Validation of assessment criteria
4. Data Identifiers
• Generate identifiers that are location independent
  – Handle system, hash function
  – Data management system updates link from identifier to representation of location (replicas)
• Given an identifier, what does it represent
  – Landing page that provides context for the data
  – Data model that approximates data in space and time
  – Direct access to the data
  – Access to procedure that generates the data
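
A minimal Python sketch of the identifier scheme above, assuming the hash-function option (the Handle System instead assigns identifiers through a registry). The `ReplicaCatalog` class, its method names, and the site paths are hypothetical; they only illustrate that the identifier stays fixed while the link to replica locations is updated.

```python
import hashlib


def make_identifier(content: bytes) -> str:
    """Derive a location-independent identifier from the bytes themselves."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class ReplicaCatalog:
    """Hypothetical catalog: identifier -> set of current replica locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        self._locations.setdefault(identifier, set()).add(location)

    def move(self, identifier, old, new):
        # The identifier never changes; only its link to a location does.
        locations = self._locations[identifier]
        locations.discard(old)
        locations.add(new)

    def resolve(self, identifier):
        return sorted(self._locations.get(identifier, ()))


data = b"temperature,pressure\n21.5,1013\n"
pid = make_identifier(data)

catalog = ReplicaCatalog()
catalog.register(pid, "siteA:/vault/obs.csv")
catalog.register(pid, "siteB:/archive/obs.csv")
catalog.move(pid, "siteA:/vault/obs.csv", "siteC:/vault/obs.csv")
print(pid, catalog.resolve(pid))
```

Anyone holding `pid` can still resolve it after the move, because citations refer to the identifier and never to a storage path.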
5. Data Identifiers
• For derived data
  – NASA Level 0 – raw data
  – NASA Level 1 – Calibrated
  – NASA Level 2 – Transformed to physical quantities
  – NASA Level 3 – Functional transformations, projections
• Can we identify the process that created the data
  – Generalization of workflow provenance
  – Re-execute the workflow to re-create the data
• Create identifier for the workflow
  – Need workflow virtualization
• Reproducible science
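
One way to picture an identifier for a workflow, sketched in Python under stated assumptions: the workflow is described as a list of named steps with parameters, and its identifier is a hash of that description. The step functions `calibrate` and `to_physical`, the `STEPS` registry, and all parameter values are hypothetical stand-ins for real processing levels.

```python
import hashlib
import json


# Hypothetical processing steps, loosely mirroring the derived-data levels.
def calibrate(values, gain, offset):
    return [gain * v + offset for v in values]


def to_physical(values, scale):
    return [scale * v for v in values]


STEPS = {"calibrate": calibrate, "to_physical": to_physical}


def workflow_id(workflow):
    """Hash a canonical JSON form of the workflow description."""
    canonical = json.dumps(workflow, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run(workflow, raw):
    """Re-execute every step in order to re-create the derived data."""
    data = raw
    for step in workflow["steps"]:
        data = STEPS[step["name"]](data, **step["params"])
    return data


workflow = {"steps": [
    {"name": "calibrate", "params": {"gain": 2.0, "offset": 1.0}},
    {"name": "to_physical", "params": {"scale": 0.5}},
]}
raw = [0.0, 1.0, 2.0]
derived = run(workflow, raw)
print(workflow_id(workflow)[:12], derived)
```

Changing any parameter yields a different identifier, so each derived version can be traced to the exact procedure that produced it.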
6. Curation Service Models
• Driven by user requirements
  – Unique services for each science and engineering domain
  – Different data formats, data analyses, semantics
• Can generic software support each unique collection?
  – View curation as a continuum with varying policies and procedures for each stage of the data life cycle
  – Characterize domains by access methods, policies, and procedures
• Are there standard best practices for a data center?
  – Data colocation – minimize administrative costs
  – Evolution of center to broaden range of supported communities
7. Standard Services
• Data discovery
• Data access
• Data manipulation
  – Re-creation of derived data products
  – Transformation
  – Feature detection
  – Indexing
  – Representation – fit polynomial in space and time
    • Manipulate data based on polynomial
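
The polynomial representation can be illustrated with a small pure-Python sketch. It assumes a one-dimensional series in time only (the slide speaks of space and time), uses an ordinary least-squares fit via the normal equations, and the sample values are invented for illustration.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares fit; returns c with y ~ c[0] + c[1]*x + c[2]*x**2 + ..."""
    n = degree + 1
    # Normal equations A c = b: A[i][j] = sum(x**(i+j)), b[i] = sum(y * x**i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    result = 0.0
    for c in reversed(coeffs):  # Horner's rule
        result = result * x + c
    return result


def derivative(coeffs):
    """Coefficients of the derivative polynomial."""
    return [i * c for i, c in enumerate(coeffs)][1:]


times = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [1.0, 3.5, 7.0, 11.5, 17.0]  # invented samples of 1 + 2t + 0.5t^2
coeffs = fit_polynomial(times, values, 2)
print(evaluate(coeffs, 2.5), evaluate(derivative(coeffs), 2.5))
```

Once the data are represented by `coeffs`, interpolation and rates of change are computed from the polynomial rather than from the stored samples.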
8. Sustainability
• Business models
  – Identification of a sustaining community
  – Quantification of benefit
• Cost model
  – Distribution of cost across entire community
  – Membership fee
  – Pro-rated per item cost
• Minimizing cost
  – Automate curation
  – Transfer curation tasks to submitter
  – FITS file (astronomy)
    • Metadata for project/observatory
    • Metadata for each image
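
The two cost-distribution options can be compared with a short Python sketch. The annual cost, community names, and item counts below are invented for illustration and do not come from the talk.

```python
def flat_membership_fee(annual_cost, members):
    """Every member pays the same share of the annual cost."""
    return {m: annual_cost / len(members) for m in members}


def pro_rated_cost(annual_cost, items_by_member):
    """Each member pays in proportion to the number of items it stores."""
    total_items = sum(items_by_member.values())
    return {m: annual_cost * n / total_items
            for m, n in items_by_member.items()}


# Hypothetical communities and holdings.
holdings = {"astronomy": 6000, "hydrology": 3000, "genomics": 1000}
annual_cost = 120000.0

flat = flat_membership_fee(annual_cost, holdings)
pro_rated = pro_rated_cost(annual_cost, holdings)
for member in holdings:
    print(member, flat[member], pro_rated[member])
```

Both schemes recover the full cost; they differ only in whether small contributors subsidize large ones.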
9. Creating a Repository
• Identify a support community
  – Tie to requirements of researchers
  – Tie to new science and research initiatives
  – Tie to intellectual capital of the university
• Identify cost benefit
  – Co-location of services
  – Benefit of scale
• Demonstrate responsiveness
  – Support for users
10. Educating Next Generation
• Identify a motivating challenge
• Curriculum development
  – Coupling of research to education
  – Competency in scientific data management and technology
• Data intensive science
  – Interest driven by a domain
  – Multi-disciplinary problems
  – Treat as a skill
• Work with live data
  – Enable students to make a discovery
11. Data – Information – Knowledge (iRODS)
• Data – instantiation of an approximation to reality
  – Form of representation of reality
  – Requires description of the physical approximation (context)
• Information – application of label to data
  – Requires identification of the relationships that must be satisfied for the label to be applied
  – Reification of knowledge (extraction of features)
• Knowledge – relationships between labels
  – Requires procedures to parse data to see if relationships are present
• Data science – transformation of data into knowledge
  – Use case driven
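
The three definitions above can be made concrete with a small Python sketch. The weather records, the label names, and the rule thresholds are all hypothetical; the point is only that a label is applied when its relationship holds, and that knowledge is expressed as a relationship between labels.

```python
# Data: records that approximate reality, with context (units) in the keys.
observations = [
    {"station": "A", "temp_c": -3.0, "precip_mm": 4.0},
    {"station": "B", "temp_c": 12.0, "precip_mm": 0.0},
    {"station": "C", "temp_c": 1.5, "precip_mm": 7.5},
]

# Information: each label comes with the relationship that must be satisfied.
LABEL_RULES = {
    "freezing": lambda r: r["temp_c"] <= 0.0,
    "wet": lambda r: r["precip_mm"] > 0.0,
}

# Knowledge: a relationship between labels (hypothetical rule).
LABEL_RELATIONS = {"snow_likely": ("freezing", "wet")}


def apply_labels(record):
    """Parse one record and return every label whose relationship holds."""
    labels = {name for name, holds in LABEL_RULES.items() if holds(record)}
    for derived, required in LABEL_RELATIONS.items():
        if labels.issuperset(required):
            labels.add(derived)
    return labels


labelled = {r["station"]: apply_labels(r) for r in observations}
print(labelled)
```

Adding a new use case means adding a rule or a relation, not changing the stored data.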
12. Digital Library Evolution
• Witnessing rapid evolution of digital libraries
  – Item level indexing
  – Item level searching
  – Data manipulation services
• Driven by scale
  – Completeness of semantics
    • Represent every word in the English language (15 million)
    • Represent cultural knowledge (~1 Tbyte)
  – Types of reified relationships
    • Index based on more than 100 relationships present within documents (IBM-Watson)
    • Spatial, temporal, organizational, familial, …
  – Ability to couple indexing to data within storage
13. Vision
• Dynamic digital library
  – Continually extract features from data
  – Generate index based on features within the data
• Create knowledge base
  – Link local index to community index
• Support evolution of the library
  – Define new relationships
  – Analyze contents
  – Generate new index
14. Implications
• Characterize scientific data by the workflow that creates the published version
  – Transform from a library of data files into a library of workflows
• Support re-execution of workflows
  – Modify input parameters, generate new version
• Generate discovery semantics (features) through reification of relationships
  – Must be able to parse each file
  – Create algorithm that tests for the desired relationship
  – Apply algorithms within storage systems
  – Build terabyte index of reified relationships for each storage system
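
A toy Python sketch of the last four steps: parse each file, test for a relationship, and build an index of what was found. It runs in memory rather than inside a storage system, and the two regular-expression tests, the file names, and the file contents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical relationship tests; each pattern captures (subject, object).
RELATIONSHIP_TESTS = {
    "spatial": re.compile(r"(\w+) is located in (\w+)"),
    "temporal": re.compile(r"(\w+) was recorded in (\d{4})"),
}


def build_index(files):
    """Map (relationship, subject, object) -> sorted list of file names."""
    index = defaultdict(set)
    for name, text in files.items():
        for relationship, pattern in RELATIONSHIP_TESTS.items():
            for subject, obj in pattern.findall(text):
                index[(relationship, subject, obj)].add(name)
    return {key: sorted(names) for key, names in index.items()}


def find(index, relationship, obj):
    """Discovery query: files where some subject relates to obj."""
    hits = set()
    for (rel, _subject, target), names in index.items():
        if rel == relationship and target == obj:
            hits.update(names)
    return sorted(hits)


files = {
    "site.txt": "Gauge7 is located in Durham. Gauge7 was recorded in 2011.",
    "survey.txt": "Gauge9 is located in Durham.",
}
index = build_index(files)
print(find(index, "spatial", "Durham"))
```

Supporting a new kind of discovery then amounts to adding a test to `RELATIONSHIP_TESTS` and regenerating the index.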
15. Virtualization
• Digital library represents data as searchable metadata
• Collection virtualization defines and manages the properties of the collection
  – Assertions about each file in the collection
  – Location independent naming and access
  – Management of state information
• Workflow virtualization defines the properties of procedures
  – Provenance information for each procedure
  – Location independent naming and execution
  – Management of state information
16. Digital Library in 2050
• Links contents to cultural knowledge
  – Terabyte indices
• Enables analysis of library contents
  – Feature detection services
• Provides workspace in which research is conducted
  – Coupling of processing to data storage
• Validates assertions about collection properties
  – Published policies
• Scalable infrastructure