Moving an Archive from Tape to Disk: A Case-Study at ICPSR
1. Moving an Archive from Tape to Disk
A Case Study at ICPSR
IASSIST 2008
Stanford University
Bryan Beecher
IT Director
ICPSR
2. Overview of today’s talk
• Where we were
Background info
Digital Preservation @ ICPSR in 2006
• Where we went
Digital objects
Physical objects
• Where we want to go
Fedora
3. What is ICPSR?
• Collect digital objects – primarily
social science data
• Add value to the objects
• Preserve and disseminate
• Other programs too
Summer Program in Quantitative
Methods
Digital Preservation workshop
• Clients
Higher-education
Data producers who don’t want to
preserve or disseminate
4. A peek inside ICPSR
• Computer & Network Services
ICPSR’s technology shop
System and network management
Software, service, and database development
• Data Library
Manage off-line storage of digital objects
Manage off-site collection of paper records
Service staff requests for digital and physical objects
• Historically had little interaction
5. DigiPres at ICPSR in 2006
• The Good
Two copies of each digital object; one off-site
Metadata stored in a relational database
Stable processes
Large collection of “old stuff” (paper records and media)
• The Bad
Using low-density tape for archival storage
Metadata not stored with the objects
Manual processes
Large collection of “old stuff” (paper records and media)
6. DigiPres at ICPSR in 2006
• September 2006
ICPSR hires its first Digital
Preservation Officer
• Nancy McGovern
Data Library team joins
Computer & Network
Services
• DPO sets policies
• The newly expanded CNS
implements those policies
and operates the technology
7. Policy changes
• Do NOT need to preserve original media
Preservation commitment is to the intellectual content
Media is only a container holding that content
• Do NOT need to preserve paper records except
where there is value
• Do need a digital copy outside of Ann Arbor
• Do need to collect key metadata about deposits
Provenance
Digital fingerprints
8. The Plan
• Track service requests via help desk software
Who’s asking for materials?
How many requests for digital materials v. paper v. both?
How many requests each month?
• Wherever possible automate digital preservation
operations
Completeness and correctness increase
Staff become available for retrospective projects
Also automate ICPSR staff access to materials
9. The Plan (more)
• Transition ALL digital content from tape to disk
A copy on tape is OK too, but not as the primary copy
• Expensive to access
• Difficult to tell if copy A and copy B are in sync
• Discard extraneous administrative documents
Just the “low hanging fruit”
• Turn over remaining documents to records
management professionals
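Once both copies live on disk, one way to tell whether copy A and copy B are in sync is to compare digital fingerprints file by file. The following is a minimal sketch, assuming SHA-256 as the fingerprint and two directory trees as the copies; the function names are hypothetical and are not taken from ICPSR's actual tooling.

```python
import hashlib
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_fingerprints(root):
    """Map each file's path (relative to root) to its fingerprint."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(copy_a, copy_b):
    """Report files missing from either copy and files whose content differs."""
    a = tree_fingerprints(copy_a)
    b = tree_fingerprints(copy_b)
    return {
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
        "mismatched": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

An empty report (all three lists empty) would indicate the two copies agree. The same comparison against tape requires mounting and reading every tape, which is the "expensive to access" problem noted above.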
10. Interlude - Comcast
• An Internet connection at the Warehouse would be
very helpful
Access to databases, Intranet
• Thought we might purchase a broadband connection
• We started with Comcast….
Comcast: “We’ll need to include an installation surcharge
to cover a few extra installation costs.”
ICPSR: “How much?”
11. Our reaction
• Comcast: “Thirty-two
thousand dollars.”
• ICPSR: “Uh, no.”
• The Warehouse now has an
AT&T DSL connection
12. Execution – moving to disk
• DLT tape – bulk of our content – approx 275 unique tapes
Two copies of each tape
• ICPSR HQ
• The Warehouse
Each tape holds 20GB to 40GB
• During Feb – Jun 2007 ICPSR moved the content of
these tapes to spinning disk
• Starting in Jan 2007 ICPSR stopped using DLT tape
for archival storage
13. Execution – moving to disk
• Approx 5TB of unique content across all tapes
• How many copies?
(1) ICPSR – on-line
(1) ICPSR – off-line
(1-3) Chronopolis (SDSC, NCAR, UMd)
(2) IU HPSS
(0-5) LOCKSS-based, NDIIPP-funded syndicated storage
More?
• Intending to destroy the DLT media at end of 2008
14. Execution – moving to disk
• Also have 2000 cartridge (3480) and 9-track tapes
• Have been reading 50/week for many months now;
will finish these before the end of 2008
High success rate for reading (> 80%)
• Also had a stash of over 10k tapes that had already
been migrated, but not discarded
For this we used extra special, extra gentle treatment……
19. Costs - media
Numbers are in thousands
[Bar chart comparing “Were” and “Now” costs for three categories: master copy per TB, backup copy per TB, and media management. Vertical axis runs from 0 to 40.]
20. Costs – media (notes)
• Were spending approx
$2000/TB/copy on DLT tape
$65k/year staff to read, write, migrate and manage tapes
• Now spending approx
$2000/TB/copy for “expensive” SATA disk in our EMC
$100/TB/copy for LTO-3 tape
$0/TB/copy for off-site, on-line copies with our friends
Staff cost for plain old file and tape management can live on
the margins
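To put the per-TB figures above in context, here is a rough worked example. It assumes the approx 5TB of unique content mentioned earlier, two DLT copies before the move, and one SATA copy, one LTO-3 copy, and one off-site copy afterwards. That mix of copies is an assumption made for illustration, not a statement of ICPSR's actual budget, and the $65k/year staff cost is left out.

```python
TB = 5  # approx unique content, from the "How many copies?" slide

# Per-TB, per-copy media costs in dollars, taken from the notes above.
DLT_PER_TB = 2000
SATA_PER_TB = 2000
LTO3_PER_TB = 100
OFFSITE_PER_TB = 0


def media_cost(tb, per_tb_costs):
    """Total media cost when keeping one copy per entry in per_tb_costs."""
    return tb * sum(per_tb_costs)


# Assumption: two DLT copies before (ICPSR HQ and the Warehouse).
were = media_cost(TB, [DLT_PER_TB, DLT_PER_TB])

# Assumption: one SATA copy, one LTO-3 copy, one off-site copy now.
now = media_cost(TB, [SATA_PER_TB, LTO3_PER_TB, OFFSITE_PER_TB])

print(f"Were: ${were:,}")
print(f"Now:  ${now:,}")
```

Under these assumptions the disk copy costs the same per TB as a DLT copy did, so the media savings come from the much cheaper second and third copies.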
21. Execution – paper documents
• Stored at the Warehouse
• 3200 sq ft facility located near Ann Arbor airport
2500 sq ft manufacturing space
600 sq ft of office space (the three “Front Rooms”)
100 sq ft of kitchenette, restroom
• $35k/year for rent; $5k/year for utilities
25. Execution – paper documents
• Phase I (“clean up”)
Identify, gather and recycle paper with no archival value
• File listings
• Census 2000
Completed in 2007; recycled 40 cubic yards
• Phase II (“clean out”)
Consolidate Administrative and Archival materials into an
acid-free folder stored in an archival quality box
In progress; expect to complete by the end of August 2008
26. Costs – paper documents
Numbers are in thousands
[Bar chart comparing “Current” and “Planned” costs for three categories: Storage & Management, Retrieval & Returns, and Supplies & Misc. Vertical axis runs from $0 to $200.]
27. Execution – automation
• Digital Object Database
Database of metadata about every identified file in the
archives
• Digital fingerprint
• Location
• Source
• Plugged into our ingest system and our
dissemination system
• Powers some really useful tools…
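To make the three metadata fields concrete, here is a minimal sketch of such a table using SQLite. The table name, column names, and the choice of SHA-256 for the fingerprint are all assumptions for illustration; the slides do not describe the real schema or database engine.

```python
import hashlib
import sqlite3


def open_database(path=":memory:"):
    """Create the (hypothetical) digital_object table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS digital_object (
            location    TEXT PRIMARY KEY,  -- where the file lives
            fingerprint TEXT NOT NULL,     -- SHA-256 hex digest (assumed)
            source      TEXT NOT NULL      -- where the file came from
        )
        """
    )
    return conn


def register(conn, location, content, source):
    """Record a file's fingerprint, location, and source."""
    fingerprint = hashlib.sha256(content).hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO digital_object VALUES (?, ?, ?)",
        (location, fingerprint, source),
    )
    conn.commit()
    return fingerprint


def verify(conn, location, content):
    """Return True if content still matches the stored fingerprint."""
    row = conn.execute(
        "SELECT fingerprint FROM digital_object WHERE location = ?",
        (location,),
    ).fetchone()
    return row is not None and row[0] == hashlib.sha256(content).hexdigest()
```

In a design like this, an ingest system would call something like `register` when a file arrives, and a dissemination system could call something like `verify` before handing the file out.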
28. Execution – automation
• Goodies for ICPSR staff
Download page has extra knob to view ALL files
Intranet tools that link
• Internal Study Tracking System
• Public-facing study download system
• Private-facing digital preservation system
• Immediate and direct access to all digital objects
29. Looking forward
• Lots of good progress so far…
Better access for ICPSR staff
More robust preservation
Reduced costs
• But does the IT guy ever give up $ once he gets it?
• But not done yet
Still need a “proper” digital preservation system
• Fedora
30. Looking forward (continued)
• Long-term, off-site, on-line copies
Heavily subsidized today
What about the future costs?
• What if we start preserving and disseminating much
larger digital objects?
• Restricted-access materials
Balancing good preservation v. securing sensitive data