This slide set describes a set of best practices for the archival processing of Research Objects. It is part of the Research Object Knowledge Hub (http://researchobject.org/), which was created in the context of the Wf4Ever project (http://www.wf4ever-project.eu/).
Best practices for Archival Processing of Research Objects (a librarian view)
1. BEST PRACTICES FOR ARCHIVAL PROCESSING OF RESEARCH OBJECTS (A LIBRARIAN VIEW)
Sara Pérez (UPM), Oscar Corcho (UPM), Raúl Palma (PSNC), Piotr Holubowicz (PSNC)
2. LICENSE
This work is licensed under the Creative Commons Attribution 3.0 Unported License.
You are free:
to Share: to copy, distribute and transmit the work
to Remix: to adapt the work
Under the following conditions:
Attribution: You must attribute the work by inserting
“[source http://researchobject.org/]” at the end of each reused slide
a credits slide stating: These slides are partially based on “Best practices for archival processing of research objects (a librarian view)” by S. Pérez, O. Corcho, R. Palma, P. Holubowicz http://researchobject.org/
Research Object Knowledge Hub
January 2014
3. OUTLINE
Introduction: An overview of Research Objects
Research Object Model
LifeCycle of Research Objects
Storage architecture for ROs
Archival processing of ROs
Conclusions
4. FROM ELECTRONIC PAPERS TO RESEARCH OBJECTS
[Figure labels: Scientists; Electronic paper; Research Object; Hypothesis; Experiments; Datasets; Results; Annotations; Provenance]
5. WORKFLOW: EXAMPLE
Example of a scientific workflow that implements a proteomic experiment.
The analysis operations that compose the workflow are provided in this case by third-party web services.
6. BENEFITS OF RESEARCH OBJECTS
A research object aggregates all elements that are necessary to understand research investigations
Methods (experiments) are viewed as first class citizens
Promote reuse
Enable the verification of reproducibility of the results
They can be identified by a URI (like the DOI), so they can be shared and cited.
10. GROUNDING WORKFLOW-CENTRIC RESEARCH OBJECTS USING SEMANTIC TECHNOLOGIES
Workflow-centric research objects are encoded using RDF, according to a set of ontologies that are publicly available.
Research objects use the Object Reuse and Exchange (ORE) model to represent aggregation.
11. GROUNDING WORKFLOW-CENTRIC RESEARCH OBJECTS USING SEMANTIC TECHNOLOGIES (CONT.)
We use the Annotation Ontology (AO) to annotate research object resources and their relationships.
12. GROUNDING WORKFLOW-CENTRIC RESEARCH OBJECTS USING SEMANTIC TECHNOLOGIES (CONT.)
The Research Object Model provides a number of basic ontologies that are used within this aggregation/annotation framework to describe specifics of the Workflow-centric Research Objects. These are the ro, roevo, wfdesc and wfprov ontologies.
15. RESEARCH OBJECTS AS INFORMATION PACKAGES IN OAIS
19. ARCHIVAL PROCESSING OF RESEARCH OBJECTS
1.0 Ingest/Upload/Submit files
1.1 Create an RO
1.2 Add workflow, hypothesis, datasets, wf-runs, etc.
1.3 Create/Verify metadata
1.4 Identify/Approve the file formats
1.5 Run checklist for completeness
1.6 Submit SIP
1.7 Receive acknowledgement of submitted files
2.0 Quality Assurance on SIP (received files)
2.1 Run virus check
2.2 Check the format and link relations
2.3 Run checklist for completeness
2.4 Validate naming conventions
2.5 Verify that workflow runs belong to the wf in the RO
2.6 Validate RO metadata is completed to Archive’s specifications
2.7 Send acknowledgement to RO contributor
3.0 Generate AIP
3.1 Checklists/Monitoring activities
3.2 Create a safe copy of RO metadata file
3.3 Replace and format transformation (if necessary)
3.4 Establish fixity value/mechanism
3.5 Create Archival Metadata Record
3.6 Create AIP and add record to Preservation Repository
3.7 Create security copy of Preservation Master
4.0 Generate DIP
4.1 Create the Access Copy
4.2 Create access aid derivatives
4.3 Create Catalog Entry
4.4 Add to Digital Collection
4.5 Verify Catalog, and Digital Collection access
4.6 Update Archival Metadata Record
20. Key questions for the Repository with regard to the Submission process:
What mechanism will content Producers use to create an RO and aggregate resources to it?
How must the RO components be prepared and packaged for ingest?
How should RO components be named for transfer to the Repository?
How should RO component files be organized for transfer to the Repository?
What RO metadata are Producers required to include?
How will ROs be validated for transfer?
21. CREATE A RESEARCH OBJECT
Scientists will appreciate guidelines and Best Practices for producing high quality ROs.
Example:
1) Make a sketch workflow;
2) Use modules;
3) Think about the output;
4) Provide input and output examples;
5) Annotate;
6) Make it executable from outside the local environment;
7) Choose services carefully;
8) Reuse existing workflows;
9) Test and validate;
10) Advertise and Maintain.
23. ANNOTATIONS
Scientific domain: dct:subject
Description: dct:description
Institution: dct:publisher
Country: pav:createdAt
Author: dct:creator
Contributor: dct:contributor
RO snapshot creator: roevo:wasSnapshotedBy
RO release/archive creator: roevo:wasArchivedBy
Notes: rdfs:comment
Keywords: roterms:keywords
Title: dct:title
Creation date: dct:created
Modification date: dct:modified
Purpose: roterms:performsTask
Dependencies (e.g., execution environment): dct:requires
Subproperties: roterms:requiresHardware, roterms:requiresSoftware, roterms:requiresDataset
License: dct:license
Citations: dct:isReferencedBy
Technical contact: roterms:technicalContact
General documentation: cito:isDocumentedBy
24. VERIFY METADATA
The basic metadata may be:
The author of the changes
The date
Evolution information
Automatic relations between resources
25. IDENTIFY/APPROVE FOLDER STRUCTURE
Verify that the SIP conforms to the folder structure required for processing in the Live Repository.
For instance:
- biblio
- produced
- used
- config
- scripts
- setup
- software
- web services
- datasets
- inputs
- results
- workflows
- components
- main
- nested
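The folder check above could be automated along the following lines. This is only a minimal sketch: the function name `missing_folders` and the `REQUIRED` list are hypothetical, and it assumes that biblio, config, datasets and workflows are top-level folders, which the slide does not state explicitly.

```python
import tempfile
from pathlib import Path


def missing_folders(sip_root, required):
    """Return the required folder names that are absent from the SIP root."""
    root = Path(sip_root)
    return [name for name in required if not (root / name).is_dir()]


# Hypothetical requirement list; a real Repository would define its own.
REQUIRED = ["biblio", "config", "datasets", "workflows"]

# Demonstration on a throwaway directory that lacks the "config" folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["biblio", "datasets", "workflows"]:
        (Path(tmp) / name).mkdir()
    report = missing_folders(tmp, REQUIRED)

print(report)
```

An empty list would mean the SIP conforms to the assumed layout; any names returned can be reported back to the contributor.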
26. RUN CHECKLIST FOR COMPLETENESS (OPTIONAL)
27. SUBMIT SIP
Administrative metadata that should be recorded as part of the ingest process includes:
Contact information of the contributor who submitted the files (e.g. Name, Agency/Institution, email)
Description of data files being ingested
Date files ingested
Number of (data) files ingested
Names of the (data) files ingested
Size of the collection of files ingested
Validity value for the data files
Format
Access restrictions (default: public record, no restrictions)
Submission Agreement identification (if one is established)
28. RECEIVE ACKNOWLEDGMENT OF SUBMITTED FILES
On completion of the ingest quality assurance tasks, the contributor should receive an acknowledgement stating whether the submitted data passed or failed the Quality Assurance process and, if it failed, suggestions on what needs to be corrected to provide an acceptable submission.
30. Key questions for the Repository with regard to the Quality Assurance process:
How should received ROs that are not named according to Repository submission guidelines be handled?
How should received ROs that are not organized according to Repository submission guidelines be handled?
What quality features of an RO must be validated or evaluated, and how?
How will archives validate or evaluate the completeness of the RO metadata?
How will the Repository deal with missing metadata?
What security measures are taken to ensure the safety of the submission and the submitted data before their storage in the Preservation Repository?
31. RUN VIRUS CHECK
First, to ensure that the files are clean and free from viruses, run a virus check on the submitted files. When infected files are identified, they need to be logged and the contributor notified so that the problem can be remediated.
CHECK THE FORMAT AND LINK RELATIONS
Verify that all files have been received, ideally against the manifest that was created when the files were packaged; verify that no extra files have been added to the package, and that the files have arrived intact and unaltered through the transfer process.
RUN CHECKLIST FOR COMPLETENESS
There should be a service/function for testing completeness, executability, repeatability and other desired quality features of a Research Object.
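The manifest comparison described above (all files received, no extra files, nothing altered in transfer) might be sketched as follows. The manifest is assumed here to be a plain mapping from relative file name to SHA-256 digest; that layout, the function names and the sample file names are illustrative assumptions, not a format defined by the slides.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_against_manifest(package_dir, manifest):
    """Compare a received package with its manifest (assumed: name -> SHA-256)."""
    root = Path(package_dir)
    found = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    expected = set(manifest)
    altered = sorted(
        name for name in found & expected if sha256_of(root / name) != manifest[name]
    )
    return {
        "missing": sorted(expected - found),
        "extra": sorted(found - expected),
        "altered": altered,
    }


# Demonstration with made-up files: one missing, one extra, one altered.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "workflow.t2flow").write_bytes(b"workflow")
    (root / "input.csv").write_bytes(b"a,b\n1,2\n")
    manifest = {name: sha256_of(root / name) for name in ["workflow.t2flow", "input.csv"]}
    manifest["results.csv"] = "0" * 64             # listed but never delivered
    (root / "input.csv").write_bytes(b"tampered")  # changed after packaging
    (root / "notes.txt").write_bytes(b"unexpected")  # not in the manifest
    report = check_against_manifest(root, manifest)

print(report)
```

A package passes this check only when all three lists are empty.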
32. VALIDATE RO METADATA
It would be convenient to identify a subset of required metadata elements that serve as a common guideline for all RO infrastructure developers (descriptive, technical and administrative metadata).
VALIDATE NAMING CONVENTIONS AND FOLDER STRUCTURE
The Wf4Ever team also recommends file naming conventions for the ROs and their resources (data sources/workflow names/ID names, etc.). The Repository will need to define, as part of its submission policies, the actions it will take when received ROs do not conform to the naming and/or organizational guidelines.
SEND ACKNOWLEDGEMENT
An acknowledgement should be sent to the RO contributor stating whether the submitted data passed or failed the Quality Assurance process and, if it failed, suggestions on what needs to be corrected to provide an acceptable submission.
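As an illustration of validating a required metadata subset and producing the acknowledgement, consider the sketch below. The choice of required properties is hypothetical (the slides do not fix one), and the function names and message wording are assumptions made for the example.

```python
# Hypothetical required subset; each Repository would define its own.
REQUIRED_PROPERTIES = ("dct:title", "dct:creator", "dct:created", "dct:license")


def missing_metadata(annotations, required=REQUIRED_PROPERTIES):
    """Return the required properties that are absent or empty."""
    missing = []
    for prop in required:
        value = annotations.get(prop)
        if value is None or not str(value).strip():
            missing.append(prop)
    return missing


def acknowledgement(annotations):
    """Build a pass/fail message with suggestions for the contributor."""
    missing = missing_metadata(annotations)
    if not missing:
        return "PASSED: submission meets the metadata requirements."
    return "FAILED: please supply " + ", ".join(missing) + "."


# Made-up annotations: no creation date and an empty license value.
ro_annotations = {
    "dct:title": "Proteomics workflow RO",
    "dct:creator": "A. Researcher",
    "dct:license": "",
}
message = acknowledgement(ro_annotations)
print(message)
```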
34. Key questions for the Repository with regard to the Generate AIP process:
What are the mappings between expected submitted formats and their companion
Preservation Master formats?
What tools and processes will be employed to transform an RO component received in
one format to another format to produce the Preservation Master?
What tools, techniques, or strategies will the Repository employ to populate the
missing technical fields?
What method will the Repository use to compute, store, re-compute, and compare
fixity values for both the Preservation Master and Preservation Master Security
Copy(ies)?
How would someone perform an audit check to view and/or validate that a file’s
integrity is intact?
What metadata elements comprise the archival metadata record? / What metadata, (if
any), are being created by the archive for inclusion in the AIP and DIP?
Will the Repository use a special packaging mechanism (e.g. METS)?
Who will have access to the Archival Information Packages? What type of access (read,
write)?
What mechanism(s) will be used to create the security copy(ies), and on what type of media
format(s) will the security copy(ies) be stored?
By what mechanism, and how frequently, will the security copy(ies) be validated?
What is the process for restoring the Preservation Master in the event that it is lost or
corrupted?
What is the process for restoring a Security Copy in the event that it is lost or
corrupted?
35. MONITORING AND NOTIFICATION ACTIVITIES
One of the most important aspects to be monitored is the stability of the resources.
36. CREATE A SAFE COPY OF RO METADATA FILE
It is recommended to create a safe copy of the metadata file (manifest.rdf), which is embedded within the RO.
REPLACE AND FORMAT TRANSFORMATION (IF NECESSARY)
37. ESTABLISH FIXITY VALUE/MECHANISM
The Preservation Repository should offer checksum verification at file level. This function provides a means to verify the bit-level integrity of the files managed by this storage.
CREATE ARCHIVAL METADATA RECORD
In brief, we recommend that the template should comprise:
Unique ID
Fixity info
Format
Extracted RO metadata
Created descriptive metadata
Created technical metadata
Created administrative metadata
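The record template listed above can be pictured as a small data structure. The following is a sketch under stated assumptions: the class and field names are invented for illustration, SHA-256 is chosen as the fixity algorithm only as an example, and a random UUID stands in for whatever identifier scheme the Repository actually uses.

```python
import hashlib
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class ArchivalMetadataRecord:
    """Hypothetical record mirroring the recommended template fields."""
    fixity_algorithm: str
    fixity_value: str
    file_format: str
    extracted_ro_metadata: dict
    descriptive: dict = field(default_factory=dict)
    technical: dict = field(default_factory=dict)
    administrative: dict = field(default_factory=dict)
    unique_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def fixity_of(data):
    """Compute the fixity value (SHA-256, as an example) of a byte string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(record, data):
    """Re-compute the fixity value and compare it with the stored one."""
    return fixity_of(data) == record.fixity_value


payload = b"example RO payload"  # stands in for the packaged RO
record = ArchivalMetadataRecord(
    fixity_algorithm="sha256",
    fixity_value=fixity_of(payload),
    file_format="application/zip",
    extracted_ro_metadata={"dct:title": "Example RO"},
)

print(sorted(asdict(record)))
print(is_intact(record, payload), is_intact(record, payload + b"!"))
```

Storing the algorithm name next to the value lets an auditor re-compute and compare the checksum later without guessing how it was produced.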
38. CREATE AIP AND ADD RECORD TO PRESERVATION REPOSITORY
What constitutes the Archival “Record” will be defined by each Repository. However, at a minimum, it should include the original (and reformatted, if applicable) RO, including the original collection of data files and the originally received RO metadata file.
CREATE SECURITY COPY OF THE AIP
Security copies of the AIP record should be created and retained. In the event that the AIP is corrupted or lost, it may be recovered from a security copy.
40. Key questions for the Repository with regard to the Generate DIP process:
What format will be used for the Access Copy? The same format as the Preservation Master?
An alternate, compressed format?
What tool(s) are necessary to create the Access/Use Copy?
What access derivatives seem appropriate for your ROs?
What tools do you need to produce these derivatives?
How will ROs be arranged / organized within your archival collection? What constitutes
a Collection?
Will individual ROs be entered into your Repository Catalog? If not, how will users find
out about the individual ROs? If yes, how will the ROs be entered into your Repository
Catalog?
What type of finding aid will you create to describe your RO datasets?
What technology platform and software will be the basis for your RO digital collection?
How will ROs be added to the digital collection?
What metadata is necessary to describe and/or manage the RO record in the digital
collection?
How will the RO be organized and arranged/categorized in your digital collection?
How will users search and access ROs from your digital collection?
How will the ROs be presented to your users through the digital collection interface?
What test cases will you use to validate the access to your newly added RO?
41. CREATE THE ACCESS COPY
Minimum access metadata recommended: URI; Creator; Date; Size; Number of annotations; Relations (was output from; etc.)
CREATE ACCESS AID DERIVATIVES
Some potential access derivatives:
• Create an HTML version of the Repository’s customized RO metadata file.
• Create a zip file containing all of the files in the RO.
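The zip derivative mentioned above can be produced with the Python standard library. This is a minimal sketch: the function name `zip_research_object` and the sample file names are made up for the demonstration.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_research_object(ro_dir, zip_path):
    """Write every file under ro_dir into a zip, keeping relative paths."""
    root = Path(ro_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file():
                archive.write(path, path.relative_to(root).as_posix())
    return zip_path


# Demonstration on a throwaway RO folder with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    ro_dir = Path(tmp) / "ro"
    (ro_dir / "workflows").mkdir(parents=True)
    (ro_dir / "workflows" / "main.t2flow").write_text("workflow")
    (ro_dir / "manifest.rdf").write_text("<rdf/>")
    zip_path = zip_research_object(ro_dir, Path(tmp) / "ro.zip")
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())

print(names)
```

The archive is written outside the RO folder so that it does not end up containing itself.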
42. ADD TO DIGITAL COLLECTION
Some key points that we recommend offering the user through the search interface:
Different views of the RO structure
A visualization of the evolution of a research object
A panel showing annotations related to the currently selected resource
43. VERIFY CATALOG AND DIGITAL COLLECTION ACCESS
After the RO entries and/or datasets have been added to the access systems, they should be tested to verify that they are accessible. This can be as simple as navigating through the catalog interface, issuing some test searches, and verifying that the new items are returned in the search results lists. If the research objects are being made available online, the dataset display interface and dataset download mechanisms should also be tested to verify that the research objects can be downloaded.
UPDATE ARCHIVAL METADATA RECORD
The archival metadata may be updated with metadata regarding the use of resources (information on user registration and possible annotations made by users with respect to the resource) and the derivative resources produced during the DIP generation.
45. CONCLUSIONS
Research object infrastructure design decisions should include these considerations in terms of functionality and policy:
Modular system implementation (supports flexible maintenance, extension, integration and interoperability).
Native RO support. It should be implemented alongside the RO model.
Implementation of well-defined and well-documented interfaces (a suitable basis for interoperability).
Well-defined migration plans to perform a batch update of resources from one format to another, if necessary.
46. RESEARCH OBJECT SPECIFICATIONS ARE PUBLICLY AVAILABLE
The Research Object Specification is available at http://wf4ever.github.com/ro/
The Research Object Vocabularies and Ontologies Primer is available at http://wf4ever.github.com/ro-primer/
Editor's notes
Research is increasingly digital. Most research results are disseminated as electronic papers through traditional communication channels, such as conferences and journals, or through newer media such as microblogging. While electronic papers have played, and continue to play, a central role in the dissemination of research results, researchers now recognize that they are by no means sufficient to communicate and share information about research investigations. Indeed, the hypothesis investigated during the research, the experiment designed to assess the validity of the hypothesis, the process (workflow) used to run the experiment, the datasets used and the results produced by the experiment, and the conclusions drawn by the scientist are all elements that may be needed to understand and assess a claim, or to reuse the results of previous research investigations. Research objects are a new abstraction that aggregates all these elements. Research objects provide an entry point to the elements that are necessary or useful to understand and reuse research results. A particular feature of research objects is that they contain workflows that specify and implement data-intensive scientific experiments.
Workflows are now widely used in modern sciences, notably life sciences, as a means of specifying and enacting scientific experiments. The following slide illustrates an example of a workflow that implements an in silico experiment. The workflow consists of three operations. The IdentifyProtein operation takes as input peptide masses obtained from the digestion of a protein, together with an identification error, and outputs the accession number of the protein with the "best" match. Given a protein accession, the operation GetHomologous performs a homology search and returns the list of similar proteins. The accessions of the homologous proteins are then used to feed the execution of the GetGOTerm operation to obtain their corresponding gene ontology terms.
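The data flow between the three operations can be sketched as a simple function composition. In the original workflow the operations are third-party web services; here they are passed in as plain callables, and the `fake_*` stand-ins return made-up placeholder values rather than real accessions or gene ontology terms.

```python
def run_workflow(peptide_masses, identification_error,
                 identify_protein, get_homologous, get_go_term):
    """Chain the three operations: identify, find homologs, fetch GO terms."""
    accession = identify_protein(peptide_masses, identification_error)
    homologs = get_homologous(accession)
    return {homolog: get_go_term(homolog) for homolog in homologs}


# Toy stand-ins with placeholder outputs (not real biological data).
def fake_identify_protein(masses, error):
    return "ACC-0"


def fake_get_homologous(accession):
    return [accession + "-h1", accession + "-h2"]


def fake_get_go_term(accession):
    return "GO-term-for-" + accession


result = run_workflow([1000.5, 1500.7], 0.1,
                      fake_identify_protein, fake_get_homologous, fake_get_go_term)
print(result)
```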
A Research Object can also be identified by a Uniform Resource Identifier (URI), like the Digital Object Identifier (DOI). In this way, an RO can be shared and cited in scholarly articles or any form of online communication, providing an essential collection of contextual information for re-users, reviewers or collaborators to understand, reuse, validate and reproduce research findings.
This figure provides an overview of the RO Model, which includes the following constructs:
ResearchObject: Represents an aggregation of resources. It acts as an entry point to the research object.
Resource: Represents a resource that can be aggregated within a research object. A resource can be a workflow, web service, document, data item, data set, workflow run, software or a research object. Typically, a Research Object aggregates multiple Resources.
Annotation: Used for describing research objects, their aggregated resources, as well as the relationship between resources. Three kinds of elements are used to specify annotations:
Annotation: Represents the annotation itself.
Target: Used to specify the resource or research object subject to annotation.
Body: Comprises a description of the target in the form of a set of RDF statements, which can specify, for example, the date of creation of the target or its relationship with other resources or research objects.
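To make the relationship between these constructs concrete, here is a minimal in-memory sketch. It is not the RO ontology itself: the class layout, the method names and the example URIs are assumptions made for illustration, and annotation bodies are represented as plain (subject, predicate, object) tuples instead of RDF graphs.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    uri: str
    kind: str  # e.g. "workflow", "data set", "workflow run"


@dataclass
class Annotation:
    target: str  # URI of the annotated resource or research object
    body: list   # statements describing the target, as (s, p, o) tuples


@dataclass
class ResearchObject:
    uri: str
    resources: list = field(default_factory=list)
    annotations: list = field(default_factory=list)

    def aggregate(self, resource):
        """Add a resource to the aggregation."""
        self.resources.append(resource)

    def annotate(self, target_uri, statements):
        """Attach an annotation to the RO itself or to an aggregated resource."""
        known = {self.uri} | {r.uri for r in self.resources}
        if target_uri not in known:
            raise ValueError("target is not part of this research object: " + target_uri)
        annotation = Annotation(target_uri, list(statements))
        self.annotations.append(annotation)
        return annotation


# Hypothetical URIs, for illustration only.
ro = ResearchObject("http://example.org/ro/1")
ro.aggregate(Resource("http://example.org/ro/1/wf", "workflow"))
ro.aggregate(Resource("http://example.org/ro/1/input", "data set"))
note = ro.annotate(
    "http://example.org/ro/1/wf",
    [("http://example.org/ro/1/wf", "dct:created", "2014-01-01")],
)
print(len(ro.resources), len(ro.annotations), note.target)
```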
A special kind of RO is the workflow-centric RO, that is, a research object that aggregates workflows, or more specifically workflow templates:
A workflow template is a network in which the nodes are processes and the edges represent data links that connect the output of a given process to the input of another process, specifying that the artefacts produced by the former are used to feed the latter.
A process is used to describe a class of actions that, when enacted, give rise to process runs. Processes specify the software component (e.g., web service) responsible for undertaking those actions.
Using ORE, a workflow-centric research object is defined as a resource that aggregates other resources, i.e., workflow(s), provenance, other objects and annotations. For example, the RDF turtle snippet illustrated below specifies that a research object identified by :wro aggregates a workflow template :pathway wf sp, a workflow run :pathway wf run, and an annotation :wfannot.
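As a rough illustration of the aggregation described above, the sketch below builds the corresponding statements as plain tuples and prints them in N-Triples form. Several things are assumptions: the base IRI is hypothetical, the resource names are written with underscores, and the namespace IRIs for ORE and the RO vocabulary are taken from their public vocabularies rather than from these slides.

```python
# Namespace IRIs assumed from the public ORE and Wf4Ever RO vocabularies.
ORE = "http://www.openarchives.org/ore/terms/"
RO = "http://purl.org/wf4ever/ro#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/ro/"  # hypothetical base IRI for the example


def aggregation_triples(ro_name, resource_names):
    """Build the statements saying that an RO aggregates the given resources."""
    ro_iri = BASE + ro_name
    triples = [(ro_iri, RDF_TYPE, RO + "ResearchObject")]
    for name in resource_names:
        triples.append((ro_iri, ORE + "aggregates", BASE + name))
    return triples


def to_ntriples(triples):
    """Serialize IRI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


triples = aggregation_triples("wro", ["pathway_wf_sp", "pathway_wf_run", "wfannot"])
print(to_ntriples(triples))
```

Only the Python standard library is used here; an RDF library would normally handle the serialization.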
We also use the Annotation Ontology (AO), which provides a common model for annotating resources. This differs from myExperiment packs, which use a vocabulary that is mapped to Open Annotation Collaboration (OAC). Several types of annotations are supported by the Annotation Ontology, e.g., comments, textual annotations (classic tags) and semantic annotations, which relate elements of the research objects to concepts from underlying domain ontologies.
The ro ontology: Provides basic structure for the description of aggregated resources and the annotations that are made on those resources.
The roevo ontology: Enables the representation of the different stages of the Research Objects life-cycle, their dependencies, as well as the corresponding versions of Research Objects and their aggregated resources, with the associated changes in these resources.
The wfdesc ontology: A vocabulary for the description of workflows. It is targeted at providing an abstraction that can be mapped to different particular workflow systems.
The wfprov ontology: A vocabulary for the description of provenance information about actual executions of workflows.
The RO ontologies contain extension points that allow third parties to customise them in order to define specific kinds of research objects or to describe more detailed information that is specific to their tools and application scenarios.