1. Pairtrees for object storage
John Kunze and Stephen Abrams, California Digital Library (CDL)
Summary
The deadly embrace
• Digital repositories tend to require a surrender of storage transparency that creates unhealthy system dependency
• Internally objects are often broken up so that they can be difficult to piece together in case of trouble

Fig. 1. Object storage should not need a fearful entanglement with software. Since objects have to be parked in a filesystem before repository software upgrade, what if we left them in there and built our repositories around them?
Jim B L

A pairtree maps ids to paths, two characters at a time
A pairtree is a filesystem hierarchy that uses an identifier string to derive an object directory (or folder) location
• The derivation takes successive pairs of characters and creates a succession of directories, called a pairpath
    ab2def3 ⇒ ab/2d/ef/3/
• A pairpath ends at directory containing an object’s files; most systems do variation of this (is variation needed?)
• Reverse the mapping to find all ids/objects in a pairtree; pairpath termination rules permit variable length ids

Pre-converting problematic characters
Some identifier characters are inconvenient or illegal in filenames and must be hex-encoded (e.g., *→^2a)
    id: what-the-*@?#!
    → what-the-^2a@^3f#!
    ⇒ wh/at/-t/he/-^/2a/@^/3f/#!
But to keep paths short, 3 common chars are converted to 3 rare chars (at cost of complexity): /→= :→+ .→,
    id: ark:/13030/xt12t3
    → ark+=13030=xt12t3
    ⇒ ar/k+/=1/30/30/=x/t1/2t/3/

Objects in a pairtree
A pairtree is especially useful if, for each contained object, all of the object’s parts, and nothing but its parts, are enclosed in the object’s directory
Import such a pairtree and, knowing nothing about the objects’ structure and semantics, you can reliably
• Enumerate all objects and their identifiers
• Produce any object by requested id
• Maintain and back it up with ordinary OS tools
• Rebuild the collection in case of database corruption simply by walking the pairtree
To walk a pairtree requires knowing path termination rules
• A pairpath terminates when you reach a file or reach a directory name with 1 char or more than 2 chars

    ab/
    \--- cd/
         |--- foo/
         |    |   README.txt
         |    |   thumbnail.gif
         |    |--- master_images/
         |    |    |   ...
         |    |
         |    \--- gh/
         \--- e/
              \--- bar/
                   |   metadata
                   |   54321.wav
                   |   index.html

Fig. 2. Example pairtree containing two objects: abcd and abcde. The first object is enclosed in directory foo/, the second in bar/. While foo/ does not subsume e/ at the same level, by enclosure, it does subsume the gh/ underneath it.

Sample software implementation
http://search.cpan.org/~jak/Pairtree-0.2/lib/File/Pairtree.pm
A Perl module that implements two mappings: id2ppath() takes an id into a pairpath and ppath2id() performs the inverse mapping.

Pairtree is the thinnest smear we can add to our very well-understood filesystems and their universal tools (the universal “API”) to create a very well-understood, platform-independent object storage substrate
Pairtree is not a complete repository system, but it is complete for object storage and makes it easier to build systems and to share objects between institutions

Why pairs of characters?
Taking two chars at a time balances path depth and fanout (number of possible entries in any directory)
• Example: ab2def3 ⇒ ab/2d/ef/3/
• Each pair, letters+digits, has 36x36 possibilities
Compared to taking one char at a time
• Only 36 possibilities, but path depth grows rapidly
• Example: ab2def3 ⇒ a/b/2/d/e/f/3/
At another extreme, taking seven characters at a time
• Short paths, but 78 billion (36^7) possible items
• Example: ab2def3 ⇒ ab2def3/

Pairtree credits and details
Pairtree specification:
www.ietf.org/internet-drafts/draft-kunze-pairtree-01.txt
www.cdlib.org/inside/diglib/pairtree/pairtreespec.html
Authors from CDL and University of Michigan (UM): Martin Haye, Erik Hetzner, John Kunze, Mark Reyes, Cory Snavely; many thanks to Stephen Abrams, Sebastien Korner, Brian Tingle, et al
Pairtree origins include
• Prototype: UCSF tobacco control documents and CDL digitized books
• Early production: digitized books for UM and Hathi Trust
cyocum

For further information
Please contact jak@ucop.edu or stephen.abrams@ucop.edu
For information on CDL’s Preservation Program, see http://www.cdlib.org/programs/digital_preservation.html
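
Illustrative sketch (not part of the original poster)
The two mappings and the termination rule described on this poster can be sketched in Python roughly as follows. This is a minimal illustration, not the File::Pairtree Perl module: the names id2ppath and ppath2id are borrowed from that module's description, while clean_id, unclean_id, and list_ids are hypothetical helpers. The poster shows only * and ? being hex-encoded, so the full set of hex-encoded characters used here is an assumption that should be checked against the Pairtree specification.

```python
import os

# Assumption: characters that get hex-encoded as ^hh. The poster shows only
# '*' and '?'; verify the rest of this set against the Pairtree specification.
HEX_ENCODE = set('"*+,<=>?\\^|')
# Single-character swaps named on the poster: / -> =, : -> +, . -> ,
SWAP = str.maketrans("/:.", "=+,")
UNSWAP = str.maketrans("=+,", "/:.")


def clean_id(identifier):
    """Pre-convert problematic characters (hex-encode first, then swap)."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in HEX_ENCODE:
            out.append("^%02x" % byte)
        else:
            out.append(ch)
    return "".join(out).translate(SWAP)


def unclean_id(cleaned):
    """Invert clean_id: undo the swaps, then decode each ^hh escape."""
    swapped = cleaned.translate(UNSWAP)
    out = bytearray()
    i = 0
    while i < len(swapped):
        if swapped[i] == "^":
            out.append(int(swapped[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(swapped[i]))
            i += 1
    return out.decode("utf-8")


def id2ppath(identifier):
    """Map an id to a pairpath, two characters per directory level."""
    cleaned = clean_id(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs) + "/"


def ppath2id(ppath):
    """Map a pairpath back to the id it was derived from."""
    return unclean_id(ppath.replace("/", ""))


def list_ids(root):
    """Enumerate ids by walking the tree and applying the termination rule."""
    found = set()

    def walk(path, prefix):
        for name in os.listdir(path):
            full = os.path.join(path, name)
            if os.path.isdir(full) and len(name) == 2:
                walk(full, prefix + name)   # still inside a pairpath
            elif os.path.isdir(full) and len(name) == 1:
                found.add(prefix + name)    # a final single character ends the path
            elif prefix:
                found.add(prefix)           # a file or longer name: path ended here

    walk(root, "")
    return sorted(unclean_id(p) for p in found)


print(id2ppath("ab2def3"))
print(id2ppath("ark:/13030/xt12t3"))
print(ppath2id("ar/k+/=1/30/30/=x/t1/2t/3/"))
```

In this sketch, list_ids never descends into a directory whose name is longer than two characters, which matches the enclosure idea in Fig. 2: anything under foo/ belongs to the object and is not read as part of a pairpath. Ignoring files directly under the root directory is a further assumption of the sketch.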