SlideShare a Scribd company logo
1 of 18
SEAD Virtual Archive:
   Building a Federation of Institutional
Repositories for Long-Term Data Preservation
          in Sustainability Science
             Beth Plale, Indiana University, Bloomington, Indiana, USA
       Robert H. McDonald, Indiana University, Bloomington, Indiana, USA
       Kavitha Chandrasekar, Indiana University, Bloomington, Indiana, USA
            Inna Kouper, Indiana University, Bloomington, Indiana, USA
           Stacy Konkiel, Indiana University, Bloomington, Indiana, USA
      Margaret L. Hedstrom, University of Michigan, Ann Arbor, Michigan, USA
         Jim Myers, Rensselaer Polytechnic Institute, Troy, New York, USA
             Praveen Kumar, University of Illinois, Urbana, Illinois, USA


                                                                        Cooperative agreement
                                                                        #OCI0940824
                       IDCC 2013 – Amsterdam – Jan. 16, 2013                               1
SEAD TEAMS
              Margaret Hedstrom-PI, Marietta Van Buhler, Karen Woollams,
 Michigan     George Alter (ICPSR), Bryan Beecher (ICPSR)

              Beth Plale-Co-PI, Katy Börner, Robert H. McDonald, Robert Light,
              Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Stacy Konkiel,
  Indiana     Robert Ping, Ryan Cobine

              James Myers-Co-PI, Ram Prasanna Govind Krishnan, Lindsay Todd
Rensselaear

              Praveen Kumar-Co-PI, Terry McLaren (NCSA), Rob Kooper (NCSA),
  Illinois    Luigi Marini (NCSA)




                               IDCC 2013 – Amsterdam – Jan. 16, 2013               2
Challenge: The Data Deluge




1. Scientific data ingestion must be quick and minimally intrusive on a scientist’s time.
2. Ingesting must be flexible enough to handle the varied kinds of data.
           sizes // formats // composition
3. Tools for advertising and serving data from an institutional repository need to be
consistent with tools and processes of the scientific community.



                                 IDCC 2013 – Amsterdam – Jan. 16, 2013                      3
Challenge: Long Tail Scientific Research
• Many research niches
  – customized methods
    & toolsets
  – localized storage


• Less consideration for long-term availability
  and data reuse


                 IDCC 2013 – Amsterdam – Jan. 16, 2013   4
Requirements of Virtual Archive for
        Sustainability Science
• Must connect multiple IRs
• Must be minimally intrusive on a scientist’s time
• Must handle varied data:
  –   multi-GB collection,
  –   vastly heterogeneous collection of files,
  –   small complex database of a thousand variables, or
  –   set of files in formats that are unique to the
      subdiscipline
• Must be consistent with tools and processes of
  the community
                    IDCC 2013 – Amsterdam – Jan. 16, 2013   5
SEAD
                               discover
   ingest




    publish                                                    associate

                   SEAD Virtual Archive (SVA)
                   -- manage sustainability science
                       window to multiple IRs
                           --OAIS model


 IU Scholarworks   UIUC IDEALS                    UMich Deep
 IR                IR                             Blue IR

                     IDCC 2013 – Amsterdam – Jan. 16, 2013                 6
SEAD Virtual Archive (SVA)
                                                                   Design

                                                                  Policy
                                                                 Decisions

                                                                 Progress to
SEAD Virtual Archive (SVA)
-- manage sustainability science                                    Date
    window to multiple IRs
        --OAIS model



[Single view into data] [Easy deposit]
                         IDCC 2013 – Amsterdam – Jan. 16, 2013                 7
Accept
             SEAD Virtual Archive Workflow
Repository
Agreement




             Upload        Run            File              Mint            Deposit    Update
Preview                                                                                          Index
             Data to      Virus        Charact-                             to IR (&    DOI
 Data                                                       DOI                                 Metadata
               VA        Checking      erization                             cloud)    target



                                                    IR                 Large                      Index
                                                                                                   Index
                       Version                                                                  Scientific
                                                                                                 Scientific
                                                   Match-             Dataset
                        Data                                                                    Metadata
                                                   maker              Decision                   Metadata




                                                Ongoing work


                                    IDCC 2013 – Amsterdam – Jan. 16, 2013                            8
Architecture: SEAD VA Matchmaker


                                                IR
                                               Match-
                                               maker




                     Query for
             data contributor metadata              IR Matchmaker
                                                        Client

VIVO              Return data contributor’s
                    affiliation information
                                              Query             Get
                                              Match            Match
                                                                       Return all
                                          Query
                                                                       IRs’ details
                                          VA load
                     VA Load Monitor                 IR Matchmaker
                                                                                      Repository Agent
                          Agent                          Service
                                                                        Query
                                        Return
                                                                        for IRs’
                                       VA load
                                                                        details
                                      constraints
                    IDCC 2013 – Amsterdam – Jan. 16, 2013                                          9
Policy: Licensing Agreements

• Right to store and re-format files
  (preservation)
• Allow editing to protect human
  subjects, sensitive data (protection)                      Repository
• Make metadata public
  (discoverability)
                                                               rights
• Ensure sponsor compliance
  (liability)



                     IDCC 2013 – Amsterdam – Jan. 16, 2013                10
Policy: Licensing Agreements


               • Retain copyright/moral
                 rights
Depositor      • Deposits will not be
 rights          changed from original
                 intent
               • Embargoes will be honored


            IDCC 2013 – Amsterdam – Jan. 16, 2013   11
Policy: Licensing Agreements
  Single-license                          Matchmaking
     solution                               solution

                                                 Connect
                                                 requirements of:
     Satisfy all repository
        requirements                             • End users
                                                 • Repositories
                                                 • SEAD Virtual Archive




      Mitigate rights on
     behalf of depositor



                 IDCC 2013 – Amsterdam – Jan. 16, 2013                    12
Policy: Permanent Identifiers

  Author IDs                                  Dataset IDs

•VIVO                                    •Digital
 identifiers                              Object
                                          Identifiers
                                          (DOIs)
               IDCC 2013 – Amsterdam – Jan. 16, 2013        13
Policy: Author IDs
                                                                  •   Global system
                                                                  •   Buy-in from and
•   Used primarily at                                                 integration with
    domain/institution                 ORCID                          major publishers
    al level                                                          and institutions
•   Supports many
    researcher ID
    systems,                                        ResearcherID
    including                                          Scopus
                       VIVO ID
    ORCID                                            Author ID
                                                      Pivot ID



                          IDCC 2013 – Amsterdam – Jan. 16, 2013                      14
Policy: Dataset IDs

 Handles                          DOIs




    IDCC 2013 – Amsterdam – Jan. 16, 2013   15
Progress to Date
• Ingested all NCED data
  – Small-sized collection (overall < 150 Mb)
  – File organization for heterogeneous collection of
    related files with flat or hierarchical structure
• Tested deposit between the VA, UIUC IDEALS,
  and IUScholarWorks




                  IDCC 2013 – Amsterdam – Jan. 16, 2013   16
Future Work
• Address other use cases
  – Large size collections (overall > 1 Gb)
  – Relational database / interconnected variables
  – Unique formats (to project, discipline, community)
• Interoperability with other DataNets
• Support for API access
• Determine how prototype fits researcher
  workflows
                  IDCC 2013 – Amsterdam – Jan. 16, 2013   17
Thank you


http://www.sead-data.net
@SEADdatanet



                                            Cooperative agreement
                                            #OCI0940824
    IDCC 2013 – Amsterdam – Jan. 16, 2013                     18

More Related Content

What's hot

Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identificationguest453b14
 
Metadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesMetadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesKerstin Forsberg
 
Challenges in setting up an RDM Support Service
Challenges in setting up an RDM Support ServiceChallenges in setting up an RDM Support Service
Challenges in setting up an RDM Support ServiceGarethKnight
 
Datashare cni spring2013
Datashare cni spring2013Datashare cni spring2013
Datashare cni spring2013rizkjackson
 
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...TERN Australia
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?LIBER Europe
 
On Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a ServiceOn Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a ServiceHong-Linh Truong
 
THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009Fiona Salvage
 
Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...librarianrafia
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and LibariesRob Grim
 
Graham Pryor
Graham PryorGraham Pryor
Graham PryorEduserv
 
RDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research DataRDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research DataGudmundur Thorisson
 

What's hot (17)

NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
NISO Forum, Denver, Sept. 24, 2012: Scientific discovery and innovation in an...
 
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence
NISO Forum, Denver, Sept. 24, 2012: Data EquivalenceNISO Forum, Denver, Sept. 24, 2012: Data Equivalence
NISO Forum, Denver, Sept. 24, 2012: Data Equivalence
 
Dataset Citation and Identification
Dataset Citation and IdentificationDataset Citation and Identification
Dataset Citation and Identification
 
Metadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiencesMetadata in general and Dublin Core in specific; some experiences
Metadata in general and Dublin Core in specific; some experiences
 
Challenges in setting up an RDM Support Service
Challenges in setting up an RDM Support ServiceChallenges in setting up an RDM Support Service
Challenges in setting up an RDM Support Service
 
Datashare cni spring2013
Datashare cni spring2013Datashare cni spring2013
Datashare cni spring2013
 
DMP in 5 minutes
DMP in 5 minutesDMP in 5 minutes
DMP in 5 minutes
 
Hawaii Pacific GIS Conference 2012: GIS in Education: K-12 and University - H...
Hawaii Pacific GIS Conference 2012: GIS in Education: K-12 and University - H...Hawaii Pacific GIS Conference 2012: GIS in Education: K-12 and University - H...
Hawaii Pacific GIS Conference 2012: GIS in Education: K-12 and University - H...
 
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
 
Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?Where is the opportunity for libraries in the collaborative data infrastructure?
Where is the opportunity for libraries in the collaborative data infrastructure?
 
Whither Small Data?
Whither Small Data?Whither Small Data?
Whither Small Data?
 
On Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a ServiceOn Evaluating and Publishing Data Concerns for Data as a Service
On Evaluating and Publishing Data Concerns for Data as a Service
 
THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009THE Jisc Supplement 25 Nov 2009
THE Jisc Supplement 25 Nov 2009
 
Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...Open Access: Open Access Looking for ways to increase the reach and impact of...
Open Access: Open Access Looking for ways to increase the reach and impact of...
 
e-Science, Research Data and Libaries
e-Science, Research Data and Libariese-Science, Research Data and Libaries
e-Science, Research Data and Libaries
 
Graham Pryor
Graham PryorGraham Pryor
Graham Pryor
 
RDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research DataRDFC2012 Open Access to Research Data
RDFC2012 Open Access to Research Data
 

Similar to SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science

Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceRobert H. McDonald
 
Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)
Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)
Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)SEAD
 
Data-intensive profile for the VAMDC
Data-intensive profile for the VAMDCData-intensive profile for the VAMDC
Data-intensive profile for the VAMDCAstroAtom
 
100615 htap network_brussels
100615 htap network_brussels100615 htap network_brussels
100615 htap network_brusselsRudolf Husar
 
Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTERN Australia
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyHitachi Vantara
 
110823 data fed_solta11
110823 data fed_solta11110823 data fed_solta11
110823 data fed_solta11Rudolf Husar
 
Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10keirdo1
 
An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...
An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...
An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...HMO Research Network
 
TUW - 184.742 Evaluating Data Concerns for DaaS
TUW - 184.742 Evaluating Data Concerns for DaaSTUW - 184.742 Evaluating Data Concerns for DaaS
TUW - 184.742 Evaluating Data Concerns for DaaSHong-Linh Truong
 
Solving Compliance for Big Data
Solving Compliance for Big DataSolving Compliance for Big Data
Solving Compliance for Big Datafbeckett1
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET Journal
 
Multidimensional Data in the VO
Multidimensional Data in the VOMultidimensional Data in the VO
Multidimensional Data in the VOJose Enrique Ruiz
 

Similar to SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science (20)

Building a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability ScienceBuilding a Data Discovery Network for Sustainability Science
Building a Data Discovery Network for Sustainability Science
 
Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)
Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)
Digital Library Federation - DataNets Panel presentation (Nov. 1st, 2011)
 
Introducing Splunk – The Big Data Engine
Introducing Splunk – The Big Data EngineIntroducing Splunk – The Big Data Engine
Introducing Splunk – The Big Data Engine
 
Data-intensive profile for the VAMDC
Data-intensive profile for the VAMDCData-intensive profile for the VAMDC
Data-intensive profile for the VAMDC
 
iRODS
iRODSiRODS
iRODS
 
Hawaii Pacific GIS Conference 2012: Natural Resource Management - Coupling Cy...
Hawaii Pacific GIS Conference 2012: Natural Resource Management - Coupling Cy...Hawaii Pacific GIS Conference 2012: Natural Resource Management - Coupling Cy...
Hawaii Pacific GIS Conference 2012: Natural Resource Management - Coupling Cy...
 
Exposing Real World Information for the Web of Things
Exposing Real World Information for the Web of ThingsExposing Real World Information for the Web of Things
Exposing Real World Information for the Web of Things
 
100615 htap network_brussels
100615 htap network_brussels100615 htap network_brussels
100615 htap network_brussels
 
Tim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasetsTim Malthus_Towards standards for the exchange of field spectral datasets
Tim Malthus_Towards standards for the exchange of field spectral datasets
 
Big Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage StrategyBig Data, Big Content, and Aligning Your Storage Strategy
Big Data, Big Content, and Aligning Your Storage Strategy
 
110823 data fed_solta11
110823 data fed_solta11110823 data fed_solta11
110823 data fed_solta11
 
Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10Accel Partners New Data Workshop 7-14-10
Accel Partners New Data Workshop 7-14-10
 
An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...
An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...
An Eye on the Future A Review of Data Virtualization Techniques to Improve Re...
 
What is SDMX-RDF?
What is SDMX-RDF?What is SDMX-RDF?
What is SDMX-RDF?
 
TUW - 184.742 Evaluating Data Concerns for DaaS
TUW - 184.742 Evaluating Data Concerns for DaaSTUW - 184.742 Evaluating Data Concerns for DaaS
TUW - 184.742 Evaluating Data Concerns for DaaS
 
Solving Compliance for Big Data
Solving Compliance for Big DataSolving Compliance for Big Data
Solving Compliance for Big Data
 
Shaman Project Hemmje
Shaman Project  HemmjeShaman Project  Hemmje
Shaman Project Hemmje
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
IRJET- Deduplication Detection for Similarity in Document Analysis Via Vector...
 
Multidimensional Data in the VO
Multidimensional Data in the VOMultidimensional Data in the VO
Multidimensional Data in the VO
 

SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science

  • 1. SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science Beth Plale, Indiana University, Bloomington, Indiana, USA Robert H. McDonald, Indiana University, Bloomington, Indiana, USA Kavitha Chandrasekar, Indiana University, Bloomington, Indiana, USA Inna Kouper, Indiana University, Bloomington, Indiana, USA Stacy Konkiel, Indiana University, Bloomington, Indiana, USA Margaret L. Hedstrom, University of Michigan, Ann Arbor, Michigan, USA Jim Myers, Rensselaer Polytechnic Institute, Troy, New York, USA Praveen Kumar, University of Illinois, Urbana, Illinois, USA Cooperative agreement #OCI0940824 IDCC 2013 – Amsterdam – Jan. 16, 2013 1
  • 2. SEAD TEAMS Margaret Hedstrom-PI, Marietta Van Buhler, Karen Woollams, Michigan George Alter (ICPSR), Bryan Beecher (ICPSR) Beth Plale-Co-PI, Katy Börner, Robert H. McDonald, Robert Light, Kavitha Chandrasekar, Stacy Kowalczyk, Inna Kouper, Stacy Konkiel, Indiana Robert Ping, Ryan Cobine James Myers-Co-PI, Ram Prasanna Govind Krishnan, Lindsay Todd Rensselaear Praveen Kumar-Co-PI, Terry McLaren (NCSA), Rob Kooper (NCSA), Illinois Luigi Marini (NCSA) IDCC 2013 – Amsterdam – Jan. 16, 2013 2
  • 3. Challenge: The Data Deluge 1. Scientific data ingestion must be quick and minimally intrusive on a scientist’s time. 2. Ingesting must be flexible enough to handle the varied kinds of data. sizes // formats // composition 3. Tools for advertising and serving data from an institutional repository need to be consistent with tools and processes of the scientific community. IDCC 2013 – Amsterdam – Jan. 16, 2013 3
  • 4. Challenge: Long Tail Scientific Research • Many research niches – customized methods & toolsets – localized storage • Less consideration for long-term availability and data reuse IDCC 2013 – Amsterdam – Jan. 16, 2013 4
  • 5. Requirements of Virtual Archive for Sustainability Science • Must connect multiple IRs • Must be minimally intrusive on a scientist’s time • Must handle varied data: – multi-GB collection, – vastly heterogeneous collection of files, – small complex database of a thousand variables, or – set of files in formats that are unique to the subdiscipline • Must be consistent with tools and processes of the community IDCC 2013 – Amsterdam – Jan. 16, 2013 5
  • 6. SEAD discover ingest publish associate SEAD Virtual Archive (SVA) -- manage sustainability science window to multiple IRs --OAIS model IU Scholarworks UIUC IDEALS UMich Deep IR IR Blue IR IDCC 2013 – Amsterdam – Jan. 16, 2013 6
  • 7. SEAD Virtual Archive (SVA) Design Policy Decisions Progress to SEAD Virtual Archive (SVA) -- manage sustainability science Date window to multiple IRs --OAIS model [Single view into data] [Easy deposit] IDCC 2013 – Amsterdam – Jan. 16, 2013 7
  • 8. Accept SEAD Virtual Archive Workflow Repository Agreement Upload Run File Mint Deposit Update Preview Index Data to Virus Charact- to IR (& DOI Data DOI Metadata VA Checking erization cloud) target IR Large Index Index Version Scientific Scientific Match- Dataset Data Metadata maker Decision Metadata Ongoing work IDCC 2013 – Amsterdam – Jan. 16, 2013 8
  • 9. Architecture: SEAD VA Matchmaker IR Match- maker Query for data contributor metadata IR Matchmaker Client VIVO Return data contributor’s affiliation information Query Get Match Match Return all Query IRs’ details VA load VA Load Monitor IR Matchmaker Repository Agent Agent Service Query Return for IRs’ VA load details constraints IDCC 2013 – Amsterdam – Jan. 16, 2013 9
  • 10. Policy: Licensing Agreements • Right to store and re-format files (preservation) • Allow editing to protect human subjects, sensitive data (protection) Repository • Make metadata public (discoverability) rights • Ensure sponsor compliance (liability) IDCC 2013 – Amsterdam – Jan. 16, 2013 10
  • 11. Policy: Licensing Agreements • Retain copyright/moral rights Depositor • Deposits will not be rights changed from original intent • Embargoes will be honored IDCC 2013 – Amsterdam – Jan. 16, 2013 11
  • 12. Policy: Licensing Agreements Single-license Matchmaking solution solution Connect requirements of: Satisfy all repository requirements • End users • Repositories • SEAD Virtual Archive Mitigate rights on behalf of depositor IDCC 2013 – Amsterdam – Jan. 16, 2013 12
  • 13. Policy: Permanent Identifiers Author IDs Dataset IDs •VIVO •Digital identifiers Object Identifiers (DOIs) IDCC 2013 – Amsterdam – Jan. 16, 2013 13
  • 14. Policy: Author IDs • Global system • Buy-in from and • Used primarily at integration with domain/institution ORCID major publishers al level and institutions • Supports many researcher ID systems, ResearcherID including Scopus VIVO ID ORCID Author ID Pivot ID IDCC 2013 – Amsterdam – Jan. 16, 2013 14
  • 15. Policy: Dataset IDs Handles DOIs IDCC 2013 – Amsterdam – Jan. 16, 2013 15
  • 16. Progress to Date • Ingested all NCED data – Small-sized collection (overall < 150 Mb) – File organization for heterogeneous collection of related files with flat or hierarchical structure • Tested deposit between the VA, UIUC IDEALS, and IUScholarWorks IDCC 2013 – Amsterdam – Jan. 16, 2013 16
  • 17. Future Work • Address other use cases – Large size collections (overall > 1 Gb) – Relational database / interconnected variables – Unique formats (to project, discipline, community) • Interoperability with other DataNets • Support for API access • Determine how prototype fits researcher workflows IDCC 2013 – Amsterdam – Jan. 16, 2013 17
  • 18. Thank you http://www.sead-data.net @SEADdatanet Cooperative agreement #OCI0940824 IDCC 2013 – Amsterdam – Jan. 16, 2013 18

Editor's Notes

  1. Good morning, my name is Stacy Konkiel and I’m an E-Science Librarian at Indiana University in the United States. SEAD—which stands for Sustainable Environment, Actionable Data--is an NSF DataNet project that is aiming to create a federation of institutional repositories for long-term data preservation in sustainability science.
  2. Myself and my colleague Robert McDonald who is also here today are part of a larger team of scientists and research data specialists, led by PIs Margaret Hedstrom, Beth Plale, Jim Myers, and Praveen Kumar, that are working to build the SEAD Virtual Archive.Before I go into too much detail about the project, I’m going to set up our research problem, which should help explain why this federation is both unique and essential.
  3. It would come as no surprise to many of you that many universities are currently grappling with their response to the deluge of scientific data emerging from their faculty’s research. Many are looking to their libraries and institutional repositories as a solution. Research libraries have considerable expertise with preservation of the scholarly record--in cooperation with the university’s IT organization, a research library can play a significant role in preservation of scientific data, and in connecting the it to the published record of scholarship. However, the general consensus is that institutional repositories are difficult to use. Often, it takes a long time and manual intervention to deposit data. If university IRs are to compete with commercial and specialized scientific repositories, they need to provide services that are easy, fast, reliable, and community friendly. Geyser photo source: http://www.flickr.com/photos/worldtrek/5782815999/sizes/m/Usability photo source: http://www.localwin.com/julie/system/files/lu10/Usability_Testing.jpg
  4. Another challenge is how to deal with Long tail scientific research.Long tail science, such as sustainability science, often has many research niches, which rely on customized methods and toolsets and on localized storage The data are often designed to satisfy the immediate needs of researchers, with less consideration for long-term availability and data reuse. Photo credit: http://www.pdx.edu/sustainability/sites/www.pdx.edu.sustainability/files/styles/pdx_collage_medium/public/kj8-07PSU%20DJ00425x7.jpg
  5. How can institutions—using their IRs—address these problems?One approach is to create a virtual archive over IR interactions.Any virtual archive for sustainability science would require the following features:Must mediate over multiple Institutional Repositories – no single IR will hold enough sustainability science data to be meaningfulPolicies and tools for ingest of scientific data must be minimally intrusive on a scientist’s time.Processes for ingest of data must handle varied data:  multi-GB collection, vastly heterogeneous collection of files,  small complex database of a thousand variables, or  set of files in formats that are unique to the subdiscipline.Tools for serving data from an institutional repository must be consistent with tools and processes of the sustainability science community
  6. The SEAD Virtual Archive (VA) supports not only the preservation of sustainability science data in the institutional repository today, but also aims to facilitate rich access and use into the future. In our specific prototype, we worked with the NSF National Center National Center for Earth-Surface Dynamics (NCED), one of the original NSF Science and Technology Centers, to see how data would work in such an archive.The community needs described in the previous slide have heavily shaped the creation of the SEAD project as a whole. There are three major components to SEAD, which I’ll describe by way of explaining how data moves within the framework.A user “ingests” his or her her data to ACR where metadata are harvested and add’l annotation by the user and community can take place. Data sets that are considered “fixed” are then “published” to SVA, a long-term preservation layer which manages a sustainability science window to multiple IR’s and prepares data sets for deposit. The data sets published to the SVA are registered with SEAD VIVO where associations are made between researchers, publications, and data sets. This relationship information is fed back to ACR as discovery information. Users of ACR (not just the ingester) can then use the discovery information to work with the data to improve it further.
  7. The Virtual Archive offers users a single view into data, with easy deposit.The software extends the open source software code developed by the Data Conservancy. Currently, the SEAD VA provides mechanisms to automatically deposit data into the Indiana University repository, IUScholarWorks1, and the University of Illinois repository, IDEALS2, with growth to other repositories. I’m going to first discuss the SEAD Virtual Archive’s design, which is influenced heavily by policies related to both the IRs that support the federation, and the specific needs of our user group (discussed previously). I’ll then talk a bit about our policy decisions such as deposit and permanent IDs, then finish out with a brief update of our progress to date on the project.
  8. Now, let’s drill down into what the Virtual Archive workflow looks like. When a data set arrives in the Virtual Archive, it is presumed to be “ready to publish.” At this point, data and metadata are considered ready to be versioned/fixed .We also assume that the metadata also contains identified terms of access and use (i.e., datasets that were marked and selected for preservation, repository licensing agreements are accepted). Once data is accepted into the virtual archive, there is… an automated check of dataset integrity, A DOI is minted, A matchmaking service automates decisions on to which repository to deposit the data (more on that in a moment)Data is deposited to the IR, and thenQuery(ies) are then used to assemble metadata that’s been either entered manually, or harvested automatically by earlier services such as the ACR. The metadata object also contains provenance about the file(s). SEAD VA then transforms the RDF package into the SIP (Submission Information Package), an OAIS-compliant xml-encoded store for metadata.
  9. Next I’d like to discuss an interesting step in the data publishing process that helps to provide a single, seamless IR deposit interface for researchers: the matchmaking processMatchmaking is a technical solution for automated deposit that reconciles the needs of institutional repositories with the needs of end users and the SEAD VA. For instance, a repository may only be able to take small files of less than 150 MB, or only want data from its affiliated researchers. These policy statements are encoded in the matchmaker, which is a seamless query process that evaluates the rules when deciding where to deposit a package. For obvious reasons, the matchmaker is (for the most part) hidden to the researcher—all researchers encounter is a single deposit interface, which might include some questions that the matchmaker might use to create rulesAs an easy example, the question “To which university are you affiliated?” is a means of matching authors from Indiana, Illinois, or Michigan with the appropriate repository. For authors unaffiliated with any of those universities, a second question could be posed, “To which research centers are you affilliated?” If they answer something like NCED, the data could be deposited without a problem to an NCED-affiliated repository.Of course, in practice, affiliation information would be gleaned from VIVO—as you can see here, the matchmaker client queries VIVO and the repositories themselves to gather enough information to create appropriate rules.Because the SEAD VA interacts with several repositories that all have different scopes, missions, service orientations and depositor requirements, we need to make sure that such reconciliation satisfies all partners. Once the matchmaking service has made a decision on where to deposit, many processes, policies, and tools need to be in place in order to make the ingest of scientific data into an institutional repository easy and smooth. The SEAD Virtual Archive has implemented a SWORD API client for automated data deposit into the repositories. Our plan is to extend the SWORD API deposit to non-DSpace repositories (for example, repositories based on Fedora software). It’s also worth noting that we are exploring mapping current IR author data to VIVO, using BibApp
  10. Many of the rules for the matchmaking service is built upon uselogic created by the various DEPOSIT licensing agreements. When depositing to a single repository, you will often be required to accept a university-sanctioned deposit licensing agreement—a sort of Terms of Service for the repository that clearly explains the rights of the repository and also the rights the depositor retains.These DEPOSIT Licensing agreements always set terms that allow repositories certain rights, such as basic rights to store and re-format files for preservation purposes, orChanging data to protect human subjects, or That ensure the depositor has complied with research sponsors, to limit liabilityDetermining the rights that authors are willing to grant repositories is one step in reconciling researchers’ needs with those of the IRs, and also those of SEAD.
  11. Licensing agreements also often ensure rights for depositors, such asdeposits will not be changed in any way that differs from the depositor’s original intent; data under embargo will not be distributed until embargo has ended; the data creator retains copyright and/or moral rights to any depositsThese depositor rights are pretty similar across the repositories. We are continuing to track them (and attempting to reconcile them, where necessary) because we feel that a more elegant solution than the matchmaking service might be to work with each of the institutions to create a single site license for SEAD.
  12. Ultimately, a single-license solution that would satisfy all repository (and institutional) legal requirements is our goal. Such a license would also need to mitigate rights on behalf of the depositorUntil then, it’s important that the project Aims to develop clear instructions displayed during the deposit process that:explain the rights and responsibilities associated with the use of datasetsdirect users to relevant information regarding licensing It’s also important that our programmatic matchmaking service can identify when there is a match between the independently stated requirements of specific end users, repositories, and SEAD Virtual Archive.
  13. The second major policy decision that shaped the architecture of our project is that of which permanent identifier schema to use for assigning IDs to datasets, and also for identifying authors.The SEAD project applies persistent identifiers (PIDs) to datasets and authors to track usage and citations. For author identification, we currently use internal VIVO identifiers, which are unique links to each researcher profile. For dataset identification the SEAD project uses DOIs, created by Datacite, that are being resolved to the institutional repositories where data is stored.
  14. Currently, the SEAD project uses the built-in VIVO unique ID for all researcher profile information stored in our Active/Social Content Repository. This VIVO ID can be effectively used primarily at the domain or institutional level4. However, the VIVO ontology currently supports many different researcher PID system IDs, including ORCID5, and systems such as Thomson Reuter’s ResearcherID.When the SEAD project began, ORCID was still under development as an author ID system.Now that the ORCID technical infrastructure has been established, SEAD hopes to take advantage of that service by enabling ORCID ID registration as the main unique identifier system. This would enable researcher identification at a more global level.We know that more is coming from ORCID, and we look forward to helping implement resolution services between ORCID and VIVO systems.
  15. For data, we decided fairly early on to adopt DOIs as our permanent id scheme. However, DSpace primarily uses Handles as a permanent ID, so a challenge has been reconciling these two identifier schemes within our system.The Handle system is an established standard for PID resolution supported by the DSpace institutional repository software, and often managed locally. Many repository support staff are already familiar with Handles and there are over 1300 installations world-wide. Handle IDs do not store associated metadata, while DOIs do—an important concern for a project developing rich metadata and an additional preservation measure. Further, while Handles are generated primarily for use in DSpace institutional repositories, DOIs are supported by a wide range of publishers and vendors as the citation standard and are used widely for data identification. We wanted to implement a service that had the right international scope, and we felt that DOIs were that service. Using DOIs therefore allows SEAD VA to go beyond DSpace-based repositories, to federate across a larger number and variety of repositories, and to integrate with other data preservation and citation technologies. One of the limitations for the DOI standard is that even though it allows for multiple destination resolution3, most implementations do not support it. Ideally, the DOI would resolve to the data set’s locations in both the VA and ACR components of SEAD, so end users could access both the active and the fixed data version. Currently, DataCite does not allow this action, so we overcome this limitation by maintaining a DOI and a separate link to the ACR in both the SEAD VA and SEAD-VIVO. In any event, given that two permanent IDs exist for any given dataset by virtue of being deposited to a DSpace repository, we need to find a good way to cross-reference Handles and DOIs. It’s a challenge we’re working to address.
  16. Finally, I’d like to briefly let you know of some of our progress to date with the Virtual Archive.As I pointed out, sustainability science is broad, and the data are diverse. As our starting point, we have ingested all NCED data, in order to test the use case of how SEAD would handle a small-sized collection. We’ve also addressed another use case: organization for heterogeneous collection of related files. Our file organization is based on how we received the files from NCED; another file organization format be more optimal—we’re looking into finding the best way to structure organization.Tested deposit between the VA, UIUC IDEALS, and IUScholarWorks successfully. We’ve been able to demonstrate the end-to-end prototype between tri-part functionality of SEAD, including the social network, virtual archive, and repositories.
  17. Work does remain on the prototype. We have to address the other use cases for data. including the heterogenous file structure I mentioned, and also how to deal with relational databases and unique formats.Our other major task is to determine how the prototype fits researcher workflows.
  18. That is a very basic overview of the SEAD virtual archive and some of the policy work we’ve done to date. I encourage you to check out the paper, which should be included on the conference USB drives, for more information. I have a few minutes left for questions.