2. how’s and what’s of a
digital archive / library
• what is a (good) digital library ?
• how should a digital library be designed ?
• how should a digital library be created ?
• how is a digital library measured ?
• how should a digital project be executed ?
• how should a digital library or a digital project be
managed ?
3. why a digital project?
• to enhance accessibility of the content in libraries
and archives
• to increase collaboration and cooperation between
libraries and archives around the world
• to promote research
• to provide opportunities for entrepreneurs
8. digital projects overview
• collections: organized groups of digital
objects
• objects: digital materials
• metadata: information about objects and
collections
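The three terms above can be sketched as a small data model. This is only an illustration: the class names, field names, and the sample newspaper page are assumptions made for the example, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """One digital item: its files plus the metadata that describes it."""
    identifier: str
    files: list[str]
    metadata: dict[str, str] = field(default_factory=dict)


@dataclass
class Collection:
    """An organized group of digital objects with collection-level metadata."""
    title: str
    metadata: dict[str, str] = field(default_factory=dict)
    objects: list[DigitalObject] = field(default_factory=list)

    def find(self, key: str, value: str) -> list[DigitalObject]:
        """Return the objects whose metadata field `key` equals `value`."""
        return [o for o in self.objects if o.metadata.get(key) == value]


# Hypothetical sample data: one scanned newspaper page with image and text files.
page = DigitalObject(
    "news-1901-01-01-p1",
    ["p1.tif", "p1.xml"],
    {"title": "Example Gazette", "date": "1901-01-01"},
)
gazette = Collection("Example newspapers", {"language": "en"}, [page])
matches = gazette.find("date", "1901-01-01")
```

Discovery here is a plain metadata lookup, which is why the quality of the metadata largely decides how findable the objects are.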
11. assess
• select the collection or content
• define the goals
• identify the users
• identify ownership and legal risks
• identify applicable standards
• evaluate capabilities
12. design: standards
• METS XML for descriptive, structural, technical,
and administrative metadata
• descriptive metadata
• Metadata Object Description Schema
(MODS): selected metadata from MARC
• Dublin Core: fundamental group of text
elements for describing and cataloging
• technical metadata
• ALTO for OCR text
• PREMIS for digital preservation
• MIX for images
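As a rough illustration of descriptive metadata, the sketch below builds a simple Dublin Core record with Python's standard library. The namespace URI is the Dublin Core element set namespace; the `record` wrapper element and the sample values are assumptions made for the example, and a real project would normally embed such a record in a METS document rather than use it alone.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> ET.Element:
    """Build a flat record whose children are dc:* elements."""
    record = ET.Element("record")  # wrapper element name is an assumption
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical values for one digitized newspaper issue.
record = dublin_core_record({
    "title": "Example Gazette, 1 January 1901",
    "date": "1901-01-01",
    "type": "Text",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
```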
14. design: access
• user community
• user interface (UI)
• search
• authentication and user
management
• digital object presentation
• portability
• administration
15. implement: pilot
create requirements and acceptance criteria
repeat
{
digitize (small) pilot batch
test data against acceptance criteria
adjust requirements and acceptance criteria
}
until (no more adjustments are necessary)
digitize more data
NB: pilot batches are VERY VERY important!!
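The pilot loop above can be sketched in Python. Everything here is hypothetical: `run_pilot`, its callbacks, and the toy dpi rule are stand-ins chosen for the example, and a real pilot involves people reviewing data rather than three small functions.

```python
from typing import Callable


def run_pilot(digitize_batch: Callable[[dict], list],
              find_problems: Callable[[list, dict], list],
              adjust: Callable[[dict, list], dict],
              criteria: dict,
              max_rounds: int = 10) -> tuple[dict, int]:
    """Repeat small pilot batches until one passes without adjustments."""
    for round_no in range(1, max_rounds + 1):
        batch = digitize_batch(criteria)          # digitize (small) pilot batch
        problems = find_problems(batch, criteria) # test data against criteria
        if not problems:                          # no more adjustments needed
            return criteria, round_no
        criteria = adjust(criteria, problems)     # adjust requirements/criteria
    raise RuntimeError("pilot did not converge; revisit the requirements")


# Toy stand-ins: the scanner honors the requested dpi, reviewers reject < 300.
def scan(criteria: dict) -> list:
    return [{"page": n, "dpi": criteria["dpi"]} for n in range(3)]


def review(batch: list, criteria: dict) -> list:
    return [page for page in batch if page["dpi"] < 300]


def raise_dpi(criteria: dict, problems: list) -> dict:
    return {**criteria, "dpi": criteria["dpi"] + 100}


final_criteria, rounds = run_pilot(scan, review, raise_dpi, {"dpi": 200})
```

Only after the loop returns would the project move on to digitizing more data against the settled criteria.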
16. implement: in-house
reasons for in-house production
• collection cannot be moved
• collection is badly organized
• digitization must be done slowly over a long
period
• digitization is very simple
17. implement: outsource
reasons for outsourced production
• originals can’t be scanned in-house because…
• equipment is too expensive
• output data is beyond staff experience
• labor is too expensive
• large volume of work in a short time
• insufficient space, infrastructure, or staff
19. implement: crowd sourcing
• FamilySearch.org
• National Library of Australia
Newspapers Digitisation Program
• Library and Archives Canada
• Wikipedia
20. measure: acceptance criteria
• automatic quality checks
• is the digital object complete?
• is the digital object verifiable?
• is the digital object uncorrupted?
• manual quality checks
• does the metadata meet accuracy
specifications?
• does the text meet accuracy
specifications?
• is the image quality satisfactory?
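One way to make "does the text meet accuracy specifications?" measurable is character accuracy against a manually keyed sample. The sketch below assumes accuracy is defined as 1 minus edit distance divided by the length of the ground truth; the function names and the 98% threshold are assumptions for the example, and a real specification should state its own definition and number.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Share of ground-truth characters reproduced correctly (0.0 to 1.0)."""
    if not ground_truth:
        return 1.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 1.0 - errors / len(ground_truth))


def meets_spec(ocr_text: str, ground_truth: str, threshold: float = 0.98) -> bool:
    """Compare a sample against an assumed accuracy threshold."""
    return character_accuracy(ocr_text, ground_truth) >= threshold
```

For example, `character_accuracy("digital 1ibrary", "digital library")` counts one wrong character out of fifteen, so that sample would fail a 98% threshold.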
21. measure: image quality
“…images which are ultimately to be viewed by human
beings, the only “correct” method of quantifying visual image
quality is through subjective evaluation. in practice,
however, subjective evaluation is usually too inconvenient,
time-consuming and expensive…”
“…best way to assess the quality of an image is to look at it
because human eyes are the ultimate viewers of most
images…”
Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image Quality Assessment: From Error Visibility to Structural Similarity.
IEEE Transactions on Image Processing. April 2004
Zhou Wang, Alan C. Bovik, and Ligang Lu. Why is image quality assessment so difficult? Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2002
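Because subjective evaluation is expensive, objective measures such as the structural similarity (SSIM) index from the first paper cited above are often used as a stand-in. The sketch below is a deliberately simplified version: it computes a single global SSIM value over two equal-length lists of grayscale pixel values using population statistics, whereas the published method averages over local, weighted windows. The function name is an assumption; the constants K1 = 0.01 and K2 = 0.03 follow the paper's defaults.

```python
def global_ssim(x: list[float], y: list[float],
                dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two flattened grayscale images."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical images score 1; the score drops as structure is lost. A number like this can flag suspect scans for review but does not replace looking at the images.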
22. measure: use
• who is using the collection?
• what is the collection being used for?
• how many page views per day / week /
month?
• how long do visitors to the collection stay?
• how many repeat visitors to the collection?
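Some of these questions can be answered from access logs. The sketch below assumes the logs have already been reduced to (visitor id, day, pages viewed) tuples; that record shape, the function name, and the sample data are assumptions for the example. Visit duration and purpose of use need other sources, such as analytics tools or user surveys.

```python
from collections import Counter
from datetime import date


def usage_summary(visits: list[tuple[str, date, int]]) -> dict:
    """Summarize page views per day, unique visitors, and repeat visitors."""
    page_views_per_day: Counter = Counter()
    days_seen: dict[str, set] = {}
    for visitor, day, pages in visits:
        page_views_per_day[day] += pages
        days_seen.setdefault(visitor, set()).add(day)
    # A repeat visitor is counted here as one seen on more than one day.
    repeat = sorted(v for v, days in days_seen.items() if len(days) > 1)
    return {
        "page_views_per_day": dict(page_views_per_day),
        "unique_visitors": len(days_seen),
        "repeat_visitors": repeat,
    }


# Hypothetical sample: visitor "a" returns on a second day.
sample = [
    ("a", date(2011, 5, 1), 3),
    ("b", date(2011, 5, 1), 1),
    ("a", date(2011, 5, 2), 2),
]
summary = usage_summary(sample)
```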
23. preserve
• bit rot
• format obsolescence
• media obsolescence / decay
• migration to new media or hardware
• standards obsolescence
24. preserve: bit rot
gradual decay of …
• storage media because of media quality
• storage media because of improper storage
• data due to random events (bit-flip, …)
• software due to interface changes
• software due to non-obvious or inadvertent
configuration changes
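A common defense against silent data decay is a periodic fixity check: store a checksum for every file at ingest, then recompute and compare later. The sketch below assumes a manifest that maps relative file names to SHA-256 hex digests; the manifest shape and function names are assumptions for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large image files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_failures(manifest: dict[str, str], root: Path) -> list[str]:
    """Return names of files that are missing or no longer match the manifest."""
    failures = []
    for relative_name, expected in sorted(manifest.items()):
        target = root / relative_name
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(relative_name)
    return failures
```

An empty result means every listed file is present and unchanged. A check like this detects damage but cannot repair it, so it only helps when independent copies exist to restore from.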
25. preserve: media decay
a report by NIST and the Library of Congress says
that
• virtually all CD-Rs tested indicated an
estimated life expectancy beyond 15 years
• only 47 percent of recordable DVDs indicated
an estimated life expectancy beyond 15 years,
some had a life expectancy as short as 1.9 years
• in practice actual lifetimes may be considerably
shorter
26. preserve: media obsolescence
• 5 ¼” floppy disks
• 8-track tapes
• 3 ½” floppy disks
• ZIP drives
• CD-R, CD-RW, Blu-ray
• microfilm
27. preserve: migration
• file format changes
• file name differences: case sensitive /
insensitive
• extended file attributes
• file permissions
• soft links / hard links
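The file name issue is easy to check before a migration. The sketch below lists names that are distinct on a case-sensitive file system but would collide on a case-insensitive one; the function name and sample file names are assumptions for the example, and it does not cover Unicode normalization differences between file systems.

```python
from collections import defaultdict


def case_collisions(names: list[str]) -> list[list[str]]:
    """Group file names that differ only by letter case."""
    groups: dict[str, set] = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    # Sort for stable output; only groups with 2+ distinct spellings collide.
    return [sorted(group) for _, group in sorted(groups.items()) if len(group) > 1]


collisions = case_collisions(["Page1.TIF", "page1.tif", "page2.tif"])
```

Running a check like this against the source file list shows which files would overwrite each other on the target system.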
31. the problem
the 2009 CHAOS Report (The Standish Group)
reports that of all software projects surveyed, 44%
are “challenged”, 24% failed, and only 32%
succeeded
32. the problem
Roger Sessions estimates that the worldwide cost
of IT failure is USD $500 billion per month
Roger Sessions: CTO of ObjectWatch. He has written seven books including
Simple Architectures for Complex Enterprises and many articles. He is a
founding member of the Board of Directors of the International Association of
Software Architects.
33. the problem
in a recent survey of 1230 IT professionals
conducted by Embarcadero Technologies, 2 of the
3 biggest project challenges cited by the IT pros
are “poor planning” and “poor or no requirements”
34. the problem
in a March 2007 web poll conducted by the
Computing Technology Industry Association "nearly
28 percent of the more than 1,000 respondents
singled out poor communications as the number one
cause of project failure"
35. the problem
in a white paper written for Project Perfect, Taimour Al
Neimat lists
• poor planning
• unclear goals and objectives
• objectives changing during the project
• unrealistic time or resource estimates
• lack of executive support and user involvement
• failure to communicate and act as a team
• inappropriate skills
as primary causes for the failure of complex IT projects
36. the problem
a recent tender from an (anonymous) government agency
• project to convert ~ 170,000 text images to xml
• value of project ~ USD $180,000
• 19 pages of definitions, governing law, proposal
evaluation criteria, contractual conditions, instructions
about tender response format, etc
• technical requirements description? < 1 page
• data acceptance criteria? “a high level of accuracy”
37. the problem
a recent program established by a prominent national
library
• digitize more than 20 million text pages
• high level image and xml requirements
• value of work awarded? > USD $5,000,000
• after award of work, technical requirements
expand to 43+ pages from ~3 pages
• acceptance criteria? added as an afterthought
and not well defined
38. the problem
typical tender evaluation criteria in priority order
1. understanding of requirements
2. reputation of service bureau
3. price
45. communication
“projects are about
communication, communication,
and communication”
Elenbaas, B. (2000). “Staging a Project: Are You Setting Your Project Up for Success?”.
Proceedings of the Project Management Institute Annual Seminars & Symposiums.
46. references
• METS, MODS, ALTO, PREMIS, etc :
http://www.loc.gov/standards
• OAIS : http://public.ccsds.org/publications/RefModel.aspx
• NISO standards and guidelines :
http://www.niso.org/publications/rp
• good practice guides : http://www.ukoln.ac.uk
• And many, many more
47. questions?
Frederick Zarndt
frederick@frederickzarndt.com
This work is licensed under the Creative Commons
Attribution-ShareAlike (CC BY-SA)
License. To view a copy of this license visit
http://creativecommons.org/licenses/by-sa/3.0/
Editor's notes
digital libraries are (relatively) new. best practices are still (rapidly) evolving. computing technologies, storage media, communication protocols, and standards are changing. iArchives story.
this talk will probably not give you answers but rather a bunch of questions that you should ask as you undertake a digitization project. it will also give you a list of things to do before, during, and after a digitization project, but not tell you how to do them. mention communications, requirements, acceptance criteria
primarily to enhance access. access to a digital collection is not restricted to 1 user in 1 place. now it is possible for many users in many places to concurrently access the collection. may also be to preserve a deteriorating collection
digital collections are similar to analog collections – books, newspapers, magazines, photographs, records – only in digital form. digital collections differ from analog collections in that they are more flexible. A digital collection consists of digital objects that are selected and organized to facilitate their discovery, access, and use. Digital objects, metadata, and the user interface together create the user experience of a collection.
A digital object represents a discrete unit and is comprised of a digital file or files as well as descriptive metadata. Digital objects begin life in one of two ways: as a digitized file produced as a surrogate for materials that exist in analog format, or as a "born digital" entity with no analog counterpart. examples: scanned text, scanned photos; born digital text, digital photos; archived websites; census records, land records
metadata is similar to a card catalog but more flexible. richer descriptive and administrative metadata. may contain data about the digital objects themselves. metadata is structured information associated with an object for purposes of discovery, description, use, management, and preservation.
phases implies separation / sequential. not necessarily sequential! more about this later…
digital collection users may be different from analog collection users (genealogists). copyright holders are generally not happy about digital surrogates! know Turkish / EU copyright law! collaborate with copyright holder if possible. examples: Singapore, Australia, USA
METS XML since version 1.1 ~2001. administered by LOC but developed by libraries around the world. METS editorial board. METS now at version 1.9. METS sections: header, descriptive, administrative, files, structural map (heart of METS), structural links (between elements of structural map), behavior. MARC not often used with digital collections. replaced by MODS (administered by LOC) and / or Dublin Core (administered by OCLC)
TIFF since 1986. last update (version 6.0) 1992. now under control of Adobe. JPEG2000 since 2000. intended to supersede JPEG. PDF, PDF/A under control of Adobe. PDF/A subset of PDF version 1.4 and an ISO standard. latest PDF version is 1.7. In 2008 Adobe granted royalty-free rights for all patents owned by Adobe that are necessary to make, use, sell and distribute PDF compliant implementations. look for open, community developed, tried and tested standard formats
Crowdsourcing is a distributed problem-solving and production model. Problems are broadcast to an unknown group of solvers in the form of an open call for solutions. Users, also known as the crowd, typically form into online communities, and the crowd submits solutions. The crowd also sorts through the solutions, finding the best ones. FamilySearch documents are drawn primarily from a collection of 2.4 million microfilms made of historical documents from 110 countries. 130,000+ volunteers from around the world. Records based data. Australia NDP 5,800,000 newspaper pages online. 50,000,000+ lines of newspaper text corrected, 2,000,000+ per month in 2011. Wikipedia founded 2001. 90,000 active contributors. Website ranks 6th in world usage according to Alexa. Editions in 282 languages.
Recognizing that MARC is no longer fit for the purpose, work with the library and other interested communities to specify and implement a carrier for bibliographic information that is capable of representing the full range of data of interest to libraries, and of facilitating the exchange of such data both within the library community and with related communities.