This document provides an overview of various information packaging techniques and metadata schemes used for digital archiving and preservation. It discusses simple archive container formats like ZIP and TAR, structured packaging methods like BagIt and compound document formats. It also covers several metadata standards for describing and linking digital objects and files, such as METS, PREMIS, ORE, and LMER. The document serves as an introduction to common methods and standards for technical information management.
1. GRANT AGREEMENT: 601138 | SCHEME FP7 ICT 2011.4.3
Promoting and Enhancing Reuse of Information throughout the Content Lifecycle taking account of Evolving Semantics [Digital
Preservation]
“This project has received funding from the European Union’s Seventh Framework
Programme for research, technological development and demonstration under
grant agreement no. 601138”.
Information Packaging Techniques
An overview of methods and standards
Anna-Grit Eggers (University of Goettingen)
4. ● Sole purpose: packing files together.
● File containers are often combined with a compression
option to reduce the needed disk space for storage.
● All files in a container are stored equally as
payload, and the container has to be unpacked to
fully access the packed files.
Simple Archive Container Formats
5. ● The ZIP format was introduced by PKWARE in 1989:
https://www.pkware.com/support/zip-app-note
◦ Public domain
◦ Supports various compression algorithms
◦ Each file in the archive container is compressed individually => single files can be
extracted from the container
◦ Option to aggregate files without using compression
◦ Preserves the original file paths and offers optional encryption
ZIP
6. ◦ Size reduction: Archive containers can be divided into parts
◦ Flexibility: add or extract single files from a zip archive without
touching the other stored files
• + advantage: possibility to frequently change packages
• - disadvantage: causes overhead in the form of an additional file list
which is stored together with the content.
◦ Loss prevention: ZIP stores a cyclic redundancy check (CRC) for each file.
If one file becomes corrupt, the other files remain
accessible.
ZIP (cont.)
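The per-file properties listed above can be tried out with Python's standard `zipfile` module. The following is a minimal sketch, not a reference implementation: the helper names (`build_zip`, `read_zip_member`, `first_corrupt_member`) and the sample file names are made up for illustration, and the archive is kept in memory for brevity.

```python
import io
import zipfile


def build_zip(files):
    """Pack a {name: bytes} mapping into an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            # Each member is compressed on its own and keeps its path.
            archive.writestr(name, payload)
    return buffer.getvalue()


def read_zip_member(zip_bytes, name):
    """Read one member without unpacking the other members."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(name)


def first_corrupt_member(zip_bytes):
    """Return the name of the first member failing its CRC check, else None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.testzip()


if __name__ == "__main__":
    blob = build_zip({"docs/report.txt": b"report " * 100, "images/logo.bin": b"\x00\x01"})
    print(read_zip_member(blob, "images/logo.bin"))
    print(first_corrupt_member(blob))
```

`ZipFile.read` uses the archive's file list to locate the requested member, which is the overhead mentioned above paying off.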
7. ● A widespread container format in UNIX (ustar, pax), Linux (GNU tar), and BSD
(bsdtar) environments
● Can be used on Windows operating systems, for example via software
libraries such as libarchive
● Writes files sequentially into one file, called ‘tarball’
● Was originally used for tape drives
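● TAR is usually combined with a compression algorithm like gzip or bzip2
● In contrast to ZIP, TAR has no file index: to extract a single file, the container has to be read sequentially up to that file, and a compressed tarball has to be decompressed on the way.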
TAR
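The sequential layout of a tarball can be illustrated with Python's standard `tarfile` module. This is a sketch under assumptions: the helper names and file names are hypothetical, and the stream mode `"r|gz"` is chosen deliberately because it forbids seeking, much like a tape drive.

```python
import io
import tarfile


def build_tarball(files):
    """Write a {name: bytes} mapping sequentially into a gzip-compressed tarball."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_tar_member(tar_bytes, wanted):
    """Scan the tarball from the start until `wanted` is reached.

    Returns the member's content and the names visited on the way.
    """
    visited = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r|gz") as tar:
        for member in tar:
            visited.append(member.name)
            if member.name == wanted:
                return tar.extractfile(member).read(), visited
    raise KeyError(wanted)
```

Every member stored before the wanted one has to be passed over first, so the cost of reaching a file grows with its position in the archive.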
9. Creation of an information container, in which the packed information can be stored in a well-defined and structured way.
Structured Packaging for Archiving
10. ● A standard for storing files and their metadata in a well-defined directory structure
● Developed by the California Digital Library Digital Preservation Group and the Library of
Congress
● Often used for preservation purposes, e.g. by Tate (UK).
● Data files are stored in a data directory
● Their checksums are saved in a manifest file
● The metadata files, or tag files, are listed together with their checksums in a tag-manifest file.
● A further file, bagit.txt, records the BagIt version and the character encoding of the tag files.
● BagIt is often combined with a simple archiving format, such as TAR or ZIP, for the
serialisation of the bag directory, or used only as a directory structure technique for
sensitive content.
● See: http://www.cdlib.org/cdlinfo/2008/07/02/bagit-transferring-digital-content/
BagIt
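The directory layout described above can be sketched with the Python standard library alone. The helper names below are hypothetical, and the choices of SHA-256 and of the version string `1.0` are assumptions made for illustration; a production system would more likely rely on an established BagIt library.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex digest of a file's content."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(bag_dir, manifest_name, paths):
    """Write one '<checksum>  <relative path>' line per file."""
    lines = [f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}\n" for p in sorted(paths)]
    (bag_dir / manifest_name).write_text("".join(lines), encoding="utf-8")


def make_bag(bag_dir, payload):
    """Create a minimal bag from a {relative name: bytes} mapping."""
    bag_dir = Path(bag_dir)
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    payload_paths = []
    for name, content in payload.items():
        target = data_dir / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        payload_paths.append(target)
    # Bag declaration: BagIt version and tag file encoding (values assumed here).
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    write_manifest(bag_dir, "manifest-sha256.txt", payload_paths)
    tag_files = [bag_dir / "bagit.txt", bag_dir / "manifest-sha256.txt"]
    write_manifest(bag_dir, "tagmanifest-sha256.txt", tag_files)
    return bag_dir


def verify_bag(bag_dir):
    """Recompute the payload checksums and compare them with the manifest."""
    bag_dir = Path(bag_dir)
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, relative = line.split(maxsplit=1)
        if sha256_of(bag_dir / relative) != digest:
            return False
    return True
```

Calling `make_bag("my-bag", {"report.txt": b"..."})` yields a directory that can then be serialised with TAR or ZIP; `verify_bag` only checks the payload manifest, not the tag manifest.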
11. ● Container files, which contain file aggregations serving a specific purpose
● Often used to store all files belonging to a video, and to group them as a single
self-describing video file.
● Popular examples for video containers are AVI and Ogg Media.
Compound Documents
12. ● The source code of a computer program is often stored together with other project-related
resources, such as images, in a package with a well-defined directory structure.
● Structured source code packages are often executable (=> run the computer program).
● Examples: Java’s JARs, Ruby Gems and Python Eggs.
◦ The JAR format is derived from the ZIP format.
◦ A JAR can be seen as a compound document similar to the video containers,
because the Java program which is represented by the JAR can be
executed by running the JAR.
◦ It contains a well-defined path structure and an optional manifest file,
which can be regarded as a metadata file.
◦ Hence the boundary to the next category, metadata schemes, is fluid.
Structured Source Code Packages
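Because the JAR format is derived from ZIP, its path structure and manifest can be inspected with ordinary ZIP tooling. The sketch below uses Python's `zipfile` module; the helper names, the entry `demo/Main.class` and the class name `demo.Main` are invented for illustration, and the placeholder bytes are not a valid class file, so the result is JAR-shaped rather than runnable by a JVM.

```python
import io
import zipfile

MANIFEST_PATH = "META-INF/MANIFEST.MF"


def build_jar_like(entries, main_class):
    """Build a ZIP archive with a JAR-style manifest at the conventional path."""
    manifest = f"Manifest-Version: 1.0\r\nMain-Class: {main_class}\r\n\r\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as jar:
        jar.writestr(MANIFEST_PATH, manifest)
        for name, payload in entries.items():
            jar.writestr(name, payload)
    return buffer.getvalue()


def read_manifest(jar_bytes):
    """Parse 'Key: value' manifest lines; continuation lines are ignored here."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        text = jar.read(MANIFEST_PATH).decode("utf-8")
    attributes = {}
    for line in text.splitlines():
        if ": " in line:
            key, value = line.split(": ", 1)
            attributes[key] = value
    return attributes
```

Reading the manifest this way treats it as plain packaging metadata, which is the sense in which such packages border on metadata schemes.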
14. ● Mostly used in combination with packaging
● But can also be kept beside the described content and linked to it
● Or embedded in the content
● XML is most commonly used to define a schema for an application domain.
Metadata schemes
15. ● METS stands for Metadata Encoding and Transmission Standard; it is maintained by the METS
Editorial Board
● It provides an XML schema for encoding different types of metadata
● It simplifies the administration and exchange of digital objects between data collections.
● A METS file serves as a hub file that links the digital object (DO) with all files belonging to it
and with the metadata, to create a digital entity.
● A METS XML-file consists of:
◦ Header: Contains metadata of the METS file itself, like the creation date and the authors.
◦ Descriptive metadata: Provides links to external metadata documents.
◦ Administrative metadata: Stores the data concerning storage, rights and creation.
◦ File section: Manages a list of all files belonging to the digital object (DO).
◦ Structural map: Describes the inner structure of the DO and provides the linkage between data and metadata.
◦ Structural links: Provides hyperlinks and is useful for the archiving of websites.
◦ Behaviour: Associates executable behaviours with the content of the DO.
● See: http://www.loc.gov/standards/mets/
METS
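The sections listed above map onto top-level elements of a METS XML file. The following sketch builds a skeleton with Python's `xml.etree.ElementTree`; the function name, the identifiers (`DMD1`, `AMD1`, `FILE1`, ...) and the parameters are hypothetical, the optional structural links and behaviour sections are left out, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(object_id, created, file_urls):
    """Build a METS skeleton that links each file into the structural map."""
    root = ET.Element(q("mets"), {"OBJID": object_id})
    ET.SubElement(root, q("metsHdr"), {"CREATEDATE": created})  # header
    ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})  # descriptive metadata
    ET.SubElement(root, q("amdSec"), {"ID": "AMD1"})  # administrative metadata
    file_sec = ET.SubElement(root, q("fileSec"))  # file section
    file_grp = ET.SubElement(file_sec, q("fileGrp"))
    struct_map = ET.SubElement(root, q("structMap"))  # structural map
    div = ET.SubElement(struct_map, q("div"), {"DMDID": "DMD1"})
    for index, url in enumerate(file_urls, start=1):
        file_id = f"FILE{index}"
        file_el = ET.SubElement(file_grp, q("file"), {"ID": file_id})
        ET.SubElement(file_el, q("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url})
        # The file pointer is the link between structure and data.
        ET.SubElement(div, q("fptr"), {"FILEID": file_id})
    return root


if __name__ == "__main__":
    mets = build_mets("obj-1", "2015-01-01T00:00:00", ["data/page1.tif"])
    print(ET.tostring(mets, encoding="unicode"))
```

The `DMDID` and `FILEID` attributes are what make the file a hub: they tie the structural map to the metadata section and to the file list by identifier.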
16. ● ORE is a standard for Object Reuse and Exchange by the Open Archives Initiative (OAI).
● It introduces two new types of resources: Aggregations and Resource Maps.
● An Aggregation is a representation of a set of associated web resources.
◦ It is like a Semantic Web resource, hence has no representation by itself.
● A Resource Map belongs to an Aggregation.
◦ It holds a machine-readable description of the Aggregation and a list of associated resources. In
addition, it describes the relationships and properties relevant to all resources and has some
metadata for itself.
● Both resources are addressed by an HTTP URI in the Web.
◦ Aggregations can be used by applications to visualise all associated resources processing them
as a collection.
◦ This simplifies the exchange and archiving of resource sets.
◦ Various formats for the Resource Map are available: Atom XML, RDF/XML, and RDFa.
◦ All of these formats support serialisation.
● See: http://www.openarchives.org/ore/
OAI-ORE
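The relationship between a Resource Map and its Aggregation can be written down as a handful of RDF statements. The sketch below represents them as plain Python tuples and prints them in N-Triples syntax, which is used here only for readability and is not one of the serialisations named above; the function names and the `example.org` URIs are hypothetical.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def resource_map_triples(rem_uri, aggregation_uri, resource_uris):
    """Return (subject, predicate, object) statements for a minimal Resource Map."""
    triples = [
        (rem_uri, RDF_TYPE, ORE + "ResourceMap"),
        # The Resource Map describes exactly one Aggregation.
        (rem_uri, ORE + "describes", aggregation_uri),
        (aggregation_uri, RDF_TYPE, ORE + "Aggregation"),
    ]
    # The Aggregation lists its associated resources.
    triples += [(aggregation_uri, ORE + "aggregates", uri) for uri in resource_uris]
    return triples


def to_ntriples(triples):
    """Serialise URI-only triples, one statement per line."""
    return "".join(f"<{s}> <{p}> <{o}> .\n" for s, p, o in triples)


if __name__ == "__main__":
    statements = resource_map_triples(
        "http://example.org/rem",
        "http://example.org/aggregation",
        ["http://example.org/paper.pdf", "http://example.org/data.csv"],
    )
    print(to_ntriples(statements))
```

The Resource Map and the Aggregation have distinct URIs, which reflects the point that the Aggregation has no representation of its own.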
17. ● Developed by the PREservation Metadata: Implementation Strategies
(PREMIS) group of the Library of Congress
● It supports the preservation and long-term usability of digital objects and
their metadata
● The Data Dictionary is a specification for metadata handling in digital
archiving systems.
● The data model provides five entities: intellectual entity, object, event, agent and
rights.
● See: https://www.era.lib.ed.ac.uk/bitstream/handle/1842/3339/Higgins PREMIS_V-2-1-2009-03.pdf?sequence=1&isAllowed=y
PREMIS Data Dictionary
18. ● Used to describe and bundle research data in a way that supports citation and
sharing in a machine-readable fashion.
● The initiative includes a number of techniques that have a set of principles in
common:
◦ Identity
◦ Aggregation
◦ Annotation
● The metadata is described in the RO ontology.
● Bundling can be done using different techniques, including the RO bundling and
BagIt.
● See: http://www.researchobject.org/
Research Object (RO)
20. ● The Long-term preservation Metadata for Electronic Resources project provides an XML schema
particularly for long-term preservation purposes, based on the preservation implementation
schema by the National Library of New Zealand.
● The schema was developed by the DNB (Deutsche Nationalbibliothek) as a schema for technical
metadata.
● It is used, in combination with METS, for defining the packaging format UOF.
● It is designed for cooperating with standard exchange formats, and can be integrated in METS.
● The LMER-schema consists of the following sections:
◦ Object: The object with a URN as persistent identifier.
◦ Process: Protocol of technical changes.
◦ Metadata: Metadata for each file that belongs to the digital object.
◦ Metadata modifications: Protocol of changes of the metadata.
● See: http://www.dnb.de/DE/Standardisierung/LMER/lmer_node.html
LMER
21. ● Timothy DiLauro and Jonathan Petters introduced the Data Conservancy
Package Tool at the International Digital Curation Conference (IDCC) 2015
(http://www.dcc.ac.uk/sites/default/files/documents/IDCC15/196.pdf).
● The tool facilitates the creation of packages for research data objects in the
conservation domain
● It provides a user interface for the definition of packages.
● It focusses on curation activities.
● See: http://dataconservancy.org/wp-content/uploads/2014/10/DCSDOCPKG-PackageToolsDocumentationHome-Full.pdf
The Data Conservancy Package Tool