Webinar given on November 12, 2008 as part of an O'Reilly Tools of Change series on publishing and technology.
More information on Liza Daly and threepress can be found at http://www.threepress.org/
1. What publishers need to
know about digitization
Liza Daly
Consultant, Threepress Consulting Inc.
http://threepress.org/
Thursday, November 13, 2008
2. Introduction
Liza Daly liza@threepress.org
Software engineer and consultant specializing in
web-based publishing applications
Digitization projects for Ford Foundation, Arnold
Arboretum, Rosen Publishing and SAGE Publications
Online reference products for Oxford University Press
and Columbia University Press
Current: ebook applications and consulting
3. Introduction
What I’ll cover
1. Digitization 101: from scanning to OCR to XML
2. Smart vendor selection
3. A gentle introduction to XML
4. I’ve got digital content: now what?
4. What we talk about
when we talk about digitization
Turning printed content...
...or microfilm archives
...or documents in legacy systems
...into modern digital forms.
(sometimes starting from print is easier)
5. Digitization 101
Assume that we’re starting from a print archive.
(If you’re starting from a digital file, congratulations,
your costs just went down -- but not to zero!)
6. Scan
From paper to digital images...
7. OCR
...to digital text...
8. XML
...to reusable markup.
9. Digitization 101
Scanning
http://www.flickr.com/photos/heather-dietz/448629362/
11. Digitization 101
Scanning methods
Destructive scanning
Pages are cut out of the binding and
machine-fed into the scanner in batch.
(Imagine a huge office copier.)
Scanned copies are normally destroyed.
12. Digitization 101
Scanning methods
Non-destructive scanning
Pages kept in their original binding
Manual page-turning
Originals are returned to the source
Primarily for rare or historical works
13. Digitization 101
Scanning methods
High-volume, non-destructive automated scanning also exists.
14. Digitization 101
OCR
Optical Character Recognition
OCR software “guesses” the letters that appear in an
image. A dictionary is used to help correct errors.
Common errors include wordsruntogether or
speling mistakes.
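The dictionary step can be sketched roughly as follows. This is only an illustration built on Python's standard difflib module; the tiny word list and the correct_word helper are hypothetical, and real OCR engines rely on much more sophisticated language models.

```python
import difflib

# Hypothetical tiny dictionary. A real project would load a full word list
# plus any custom terms (proper names, specialized jargon).
DICTIONARY = ["spelling", "mistakes", "common", "errors", "include"]


def correct_word(word, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the word unchanged if none is close."""
    lowered = word.lower()
    if lowered in dictionary:
        return word
    # get_close_matches ranks candidates by similarity and drops any below cutoff.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(token) for token in text.split())
```

Words that are nowhere near a dictionary entry are left untouched, which is one reason a human check remains useful.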
15. Digitization 101
OCR
OCR quality is sensitive to a number of factors.
Is the document in good condition with clear type?
Is the layout simple or complex?
Is a custom dictionary required for proper names or
obscure terms?
19. Digitization 101
OCR
                 Better OCR          Worse OCR
Layout           Simple text         Multicolumn, sidebars
Vocabulary       Common              Specialized
Source quality   Clean and legible   Damaged, dirty or partial
20. Digitization 101
OCR
Limitations and cautions:
Documents with specialized jargon, such as medical
journals or archaic texts, will require custom
dictionaries.
Tables and equations aren’t suitable for OCR.
A human check is always advisable.
21. If the goal of digitization is to
make content findable on
the web, the text needs to
be correct.
22. SCAN the documents to
convert to digital files
Apply OCR to the scans to get
computer-ready text
Convert the text into XML
23. Digitization 101
XML
Not all digitization projects end with XML.
Why?
26. Consider:                But also:
Quantity of material         Project management
Quality of the originals     Shipping
Layout complexity            Heterogeneous content
Vocabulary                   Front/back matter & indexes
28. Vendor tips
Send samples before considering any estimate
...and have the output evaluated.
Compare not just cost-per-page but estimated time.
Feel comfortable with their project management.
Check references!
32. It’s too early to say whether
Google Books is right for all
publishers.
But you’re certainly giving up:
1. Control
2. Revenue share
3. Ownership
33. Creative partnerships
Consider whether some of
your backlist is public
domain or can be released
under a Creative
Commons license.
35. XML 101
What’s XML?
XML is just plain text, with markers to
tell a computer what the text means
and how it should be laid out.
36. XML 101
What’s XML?
Text with “markup” is an old idea.
This is a paragraph.¶
This is another paragraph.
37. XML 101
What’s XML?
XML just changes the symbols around.
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
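To make the "changing the symbols" idea concrete, here is a minimal sketch that turns pilcrow-separated text into <p> elements using Python's standard xml.etree.ElementTree module. The pilcrow_to_xml function and the <body> wrapper element are assumptions made for illustration, not part of any particular schema.

```python
import xml.etree.ElementTree as ET

PILCROW = "\u00b6"  # the paragraph mark used in the old-style markup


def pilcrow_to_xml(text):
    """Wrap each pilcrow-separated chunk of text in a <p> element."""
    body = ET.Element("body")  # hypothetical wrapper element
    for chunk in text.split(PILCROW):
        chunk = chunk.strip()
        if chunk:  # skip empty chunks, e.g. after a trailing pilcrow
            ET.SubElement(body, "p").text = chunk
    return ET.tostring(body, encoding="unicode")


marked_up = "This is a paragraph." + PILCROW + "This is another paragraph."
print(pilcrow_to_xml(marked_up))
# prints <body><p>This is a paragraph.</p><p>This is another paragraph.</p></body>
```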
38. XML 101
What’s XML good for?
1. Everybody speaks it.
2. Once you have one kind of XML,
it’s easy to turn it into another kind.
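As a rough illustration of turning one kind of XML into another, the sketch below renames tags according to a lookup table. The TAG_MAP contents and the rename_tags helper are hypothetical; real conversions usually involve more than renaming and are often written in XSLT rather than Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one vocabulary's tag names to another's.
TAG_MAP = {"p": "para"}


def rename_tags(xml_text, tag_map=TAG_MAP):
    """Return the document with every mapped tag renamed; other tags are kept."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = tag_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


print(rename_tags("<doc><p>This is a paragraph.</p></doc>"))
# prints <doc><para>This is a paragraph.</para></doc>
```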
39. When you decide to digitize to XML,
you’ll need to pick what kind of XML you want.
46. Kinds of XML
Language
DTD
Schema
Format
XSD
47. XML 101
Schema vocabulary
The schema defines the list of <tags> that appear in a
document, and what they mean.
A paragraph ¶ in one schema might be <p>, but in
another it might be <para>.
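One way to picture a schema's vocabulary is as a list of permitted tag names. The sketch below reports any tags in a document that fall outside such a list. ALLOWED_TAGS and unknown_tags are hypothetical names, and this check is far weaker than real validation, since an actual DTD or XSD also defines where each tag may appear.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary for illustration only.
ALLOWED_TAGS = {"book", "title", "p"}


def unknown_tags(xml_text, allowed=ALLOWED_TAGS):
    """Return a sorted list of tag names used in the document but not allowed."""
    root = ET.fromstring(xml_text)
    used = {element.tag for element in root.iter()}
    return sorted(used - allowed)


print(unknown_tags("<book><title>T</title><para>Text</para></book>"))
# prints ['para']
```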
49. XML
DocBook, DAISY, ePub, METS/ALTO, PRISM, TEI
50. XML 101
Choosing a schema
Books                   DocBook, DAISY, ePub, TEI
Magazines/Newspapers    METS/ALTO, PRISM
Scholarly               TEI, MathML
51. XML 101
DIY schemas
Creating your own schema
should be a last resort.
Expensive to build and maintain.
High training and hiring costs.
Reduced opportunities for interoperability.
Regulatory compliance.
53. Complex schemas cost more...
[Chart: cost rises from $ to $$$ as schema complexity goes from Low to High]
...but also provide more opportunity
for product development.
60. Remixing content
XML allows content
to be distributed, altered,
and recontextualized
in unexpected ways.
http://flickr.com/photos/thomashawk/2492298772/