Austrian Books Online. The Austrian National Library's Large-Scale Digitisation Public-Private Partnership with Google
1. @maxkaiser
Austrian Books Online
The Austrian National Library’s
large-scale digitisation public-private partnership
with Google
Max Kaiser
Head R&D, Austrian National Library
Library Science Talk
Geneva, 15 October 2012
Bern, 16 October 2012
13. @maxkaiser
→ Picture Archives and Graphics Department
→ Map Department
→ Music Department
→ Literary Archives
→ Papyri Department
→ Department of Planned Languages
→ Department of Rare Books and Manuscripts
43. @maxkaiser
Vision 2025
Knowledge for the world of tomorrow
Our holdings are digitized
We collect and sustain knowledge
Access to our knowledge is simple
With us, research is more faceted and effective
We enrich cultural and social life
44. @maxkaiser
→substantial parts of holdings digitized
→cooperation with private partners
→full text search
→added-value services like semantic search
→unified access system
Our holdings are digitized
45. @maxkaiser
→focal point of collection policy is digital
→preference for digital versions of publications
→user generated content and social networks
→digital photography
→preservation of analogue and digital
collections
→scalable digital archive
We collect and sustain knowledge
46. @maxkaiser
→unified access system for all collections
→focus of cataloguing: metadata enrichment
→linking of metadata with external resources
→open data
→APIs and support for third party apps
Access to our knowledge is simple
47. @maxkaiser
→integration of digital content in virtual
research environments
→support for digital humanities
→strong research collections and libraries
→cooperation with universities and research
centres
With us, research is more faceted and effective
48. @maxkaiser
→digital services, reading rooms and
museums
→innovative interfaces
→mobile services
→cooperation with private partners: reuse
of data for innovative services
→reinforce library as social space
We enrich cultural and social life
60. @maxkaiser
service contract or service outsourcing
→long duration of the relationship
→substantial investment by private
partner
→distribution of risks
≠
61. @maxkaiser
rationales for PPPs
→private funding for the public sector
→benefit from know-how and working
methods of the private sector
→but not a „miracle solution“
for the public sector
(EC Green Paper on Public Private Partnerships, 2004)
63. @maxkaiser
objectives for public partners
→funding for digitisation
→enhanced access
→engaging new audiences
→access to technology
→access to private sector competencies
→commercial income through user fees,
royalties or revenue share
→lobbying effort to increase public funding
64. @maxkaiser
objectives for private partners
→commercial objectives
→access to new markets or customer groups
→association with strong public brands
→access to (rare, unique) content
→corporate social responsibility
65. @maxkaiser
benefits for citizens
→increased online access
→democratisation of access to knowledge
→added-value services
→benefit for learning and tourism
→new creative endeavours
67. @maxkaiser
„Stimulating the flow of private funds
for the digitisation of cultural assets through
equitable public private partnerships
appears as a viable and sustainable way
of tackling the pressing question
of making Europe’s cultural wealth
accessible online and preserving it
for future generations.“
68. @maxkaiser
„The key question is not
whether public-private
partnerships for digitisation
should be encouraged, but
how, and under which
conditions.“
70. @maxkaiser
„(...) recommends that Member States (...)
encourage partnerships between cultural
institutions and the private sector in
order to create new ways of funding
digitisation of cultural material and to
stimulate innovative uses of the material,
while ensuring that public private
partnerships for digitisation are fair and
balanced (…).“
71. @maxkaiser
key principles:
1. respect for intellectual property rights
→ ONB-Google: only public-domain works
digitised
2. non-exclusivity
→ ONB-Google: ONB free to digitise material
with other partners
3. transparency of the process
→ ONB-Google: public tender
72. @maxkaiser
key principles:
4. transparency of agreements
→ ONB-Google: Very detailed FAQs online
5. accessibility through Europeana
→ ONB-Google:
→ all files available for non-commercial use
→ access via platforms like Europeana
→ provision to research partners
6. key criteria
→ [Next slide]
73. @maxkaiser
key criteria for assessing PPPs
→ total investment by private partner / effort of
public partner
→ (free) access to material for general public,
including through Europeana
→ cross-border access
→ length of any period of preferential commercial
use by private partner
→ quality of digital copies for public partner
→ usage conditions for public partner in non-
commercial context
→ time-scale of project
74. @maxkaiser
additional key elements in
ONB-Google cooperation:
→selection of books by library
→Institute for Conservation involved
→termination
76. @maxkaiser
„Genuine PPPs currently not a widespread
method for financing digitisation by cultural
institutions in Europe.“
Commission Staff Working Paper Accompanying the document Commission Recommendation
on the digitisation and online accessibility of cultural material and digital preservation, p18
http://ec.europa.eu/information_society/activities/digital_libraries/doc/recommendation/recom28nov_all_versions/staff_working_paper.pdf
77. @maxkaiser
aim to maximize access
and re-use via digitisation
access restrictions /
re-use limitations in PPPs
79. @maxkaiser
Cultural Commons
→Body of work freely available to the public for
legal use, sharing, repurposing, and remixing
→Source for cultural creativity
→http://creativecommons.org/culture
83. @maxkaiser
Public Domain Mark
„This work has been identified
as being free of known
restrictions under copyright
law, including all related and
neighbouring rights.
You can copy, modify,
distribute and perform the
work, even for commercial
purposes, all without asking
permission.“
http://creativecommons.org/publicdomain/mark/1.0/
84. @maxkaiser
Public Domain Charter
„Public-Private Partnerships have become one
option for funding large scale digitisation efforts.
Commercial content aggregators pay for the
digitisation in exchange for privileged access to the
digitised collections. These activities are seen as a
reason for attempting to exercise as much control as
possible over digital reproductions of Public Domain
works. Organisations are claiming exclusive rights in
digitised versions of Public Domain works and are
entering into exclusive relationships with commercial
partners that hinder free access.”
87. @maxkaiser
PSI Directive
→EC “Directive on the Re-Use of Public Sector
Information” (31 Dec. 2003)
→aim: Foster re-use of PSI
→legally binding document
→implemented by all Member States in 2008
→currently: Cultural & research institutions
excluded from directive
88. @maxkaiser
key provisions of PSI Directive
→clear procedures for re-use requests
→upper limit for charging
→transparency of conditions and standard
charges for re-use
→avoid discrimination between players
→prohibition of exclusive agreements
90. @maxkaiser
proposed changes
→withdraw current exemption for cultural
institutions
→restrict public sector bodies to charging
for re-use on the basis of marginal costs
only
→exemption for libraries, archives, museums
→prohibit agreement of terms for re-use
which grant exclusive rights to any one
party
99. @maxkaiser
Austrian National Library:
→ provision of Metadata
→ selection
→ internal logistics
→ conservational assessment
→ barcoding
→ metadata adjustments
→ data download and control
→ data storage & digital preservation
→ Digital Library
156. @maxkaiser
digitisation
→ scanning centre in Germany
→ procedures agreed
→ Austrian Federal Office for Monuments involved
→ each volume checked after return
→ books unavailable to users for ~ 3 months
168. @maxkaiser
quality control
→goal: automated jobs
→representative samples
→IT assisted discovery of error clusters
→error candidates checked manually
→detect systematic
and critical errors
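The sampling and error-cluster steps above can be sketched in Python as follows. This is only an illustration: the grouping key (`batch`), the sample size and the 20% threshold are assumptions chosen for the example, not the library's actual workflow.

```python
import random
from collections import Counter

def draw_sample(volume_ids, size, seed=0):
    """Draw a reproducible random sample of volumes for checking."""
    rng = random.Random(seed)
    ids = sorted(volume_ids)
    return rng.sample(ids, min(size, len(ids)))

def error_clusters(checked, threshold=0.2):
    """Return batches whose share of error candidates exceeds the threshold.

    `checked` is a list of (batch, has_error_candidate) pairs; the batch
    key is a hypothetical grouping such as a delivery or scanning batch.
    """
    totals, errors = Counter(), Counter()
    for batch, has_error in checked:
        totals[batch] += 1
        if has_error:
            errors[batch] += 1
    return sorted(b for b in totals if errors[b] / totals[b] > threshold)
```

Batches returned by `error_clusters` would then be the ones handed over for manual checking.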
169. @maxkaiser
error model
→ level 1: data / information
→ image (thick, broken)
→ illustration (scanner effects, tone, color etc)
→ full-text (OCR errors per page-image)
→ level 2: entire page
→ blur / warp / skew
→ cropping
→ obscure / cleaned
→ colorization
→ full-text (OCR error patterns at page level)
Informed by „Validating Quality in
Large-Scale Digitization“ project
of Univ. of Michigan & Univ. of Minnesota,
http://hathitrust-quality.projects.si.umich.edu/
170. @maxkaiser
error model
→ level 3: whole volume
→ order of pages
→ missing pages
→ duplicate pages
→ false pages
→ full text (OCR error patterns at volume level)
Informed by „Validating Quality in
Large-Scale Digitization“ project
of Univ. of Michigan & Univ. of Minnesota,
http://hathitrust-quality.projects.si.umich.edu/
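Some of the volume-level checks (order of pages, missing pages, duplicate pages) can be illustrated with a short Python sketch. It assumes a hypothetical input: the printed page number recognised on each scanned image, listed in scan order. How those numbers are obtained is outside the sketch.

```python
from collections import Counter

def check_page_sequence(page_numbers):
    """Report volume-level error candidates from detected page numbers.

    `page_numbers` lists the printed page number recognised on each scanned
    image, in scan order (a hypothetical input, e.g. derived from OCR).
    """
    if not page_numbers:
        return {"duplicate": [], "missing": [], "out_of_order": []}
    counts = Counter(page_numbers)
    # Page numbers that occur on more than one image
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    # Page numbers absent from the range between first and last page
    expected = range(min(page_numbers), max(page_numbers) + 1)
    missing = [n for n in expected if n not in counts]
    # Pages that appear after a higher-numbered page
    out_of_order = [
        later
        for earlier, later in zip(page_numbers, page_numbers[1:])
        if later < earlier
    ]
    return {"duplicate": duplicates, "missing": missing, "out_of_order": out_of_order}
```

The result lists error candidates only; unnumbered plates or inserted leaves would still need manual review.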
171. @maxkaiser
use cases
→reading online images
→printing on demand
→processing full text data
→managing collections
Informed by „Validating Quality in
Large-Scale Digitization“ project
of Univ. of Michigan & Univ. of Minnesota,
http://hathitrust-quality.projects.si.umich.edu/
186. hadoop / map reduce
MASTER: Job Tracker, Name Node
SLAVE 1, SLAVE 2, ... SLAVE n: Task Tracker and Data Node on each
Hadoop Distributed File System (HDFS)
→ experimental 5 server cluster at ONB:
→ 40 cores in total
→ 30 cores assigned to task trackers
187. @maxkaiser
use case 1: duplicate pages
in one book
→books with duplicated pages
→due to scanning process & post processing
→use key points of images to determine
structural image similarity
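A minimal Python sketch of key-point-based duplicate detection follows. The slides do not name the key point detector or the matching method, so the sketch assumes that descriptors (tuples of floats, one per key point) have already been extracted for every page, and it uses a nearest-neighbour ratio test as a stand-in matching rule. The function names and thresholds are hypothetical.

```python
import math

def match_ratio(desc_a, desc_b, ratio=0.75):
    """Share of descriptors in desc_a with a distinctive match in desc_b.

    A match counts only if the nearest neighbour is clearly closer than
    the second nearest (ratio test).
    """
    if not desc_a or len(desc_b) < 2:
        return 0.0
    good = 0
    for d in desc_a:
        dists = sorted(math.dist(d, e) for e in desc_b)
        if dists[0] < ratio * dists[1]:
            good += 1
    return good / len(desc_a)

def duplicate_candidates(pages, threshold=0.8):
    """Return index pairs of pages whose key points largely match."""
    pairs = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if match_ratio(pages[i], pages[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Comparing every page with every other page is quadratic in the number of pages; restricting the comparison to a window of neighbouring pages would be one way to reduce the cost within a single book.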
190. @maxkaiser
use case 2: book comparison
based on image similarity
→different instances of one book, coming
e.g. from different downloads of one book
at different points in time
→book similarity measure
→based on comparison of book page images
from two different book instances
191. use case 2: book comparison
based on image similarity
measure for book similarity
based on book page image
similarity
helps find prominent
changes in book re-downloads
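One possible form of such a book similarity measure is sketched below in Python. It assumes that pages of the two instances are compared position by position and that a page similarity function returning a value between 0 and 1 is supplied by the caller; both are assumptions for the example, since the slides do not define the measure in detail.

```python
def book_similarity(pages_a, pages_b, page_similarity):
    """Average page similarity over pages at the same position.

    Pages present in only one instance count as similarity 0, so missing
    or extra pages lower the score.
    """
    longest = max(len(pages_a), len(pages_b))
    if longest == 0:
        return 1.0
    total = sum(page_similarity(a, b) for a, b in zip(pages_a, pages_b))
    return total / longest

def changed_pages(pages_a, pages_b, page_similarity, threshold=0.9):
    """Positions whose page similarity falls below the threshold."""
    return [
        i
        for i, (a, b) in enumerate(zip(pages_a, pages_b))
        if page_similarity(a, b) < threshold
    ]
```

A low overall score flags a re-download for inspection, and `changed_pages` points to the positions that differ.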
192. @maxkaiser
large scale document processing
→extract image metadata using Exiftool
→large scale batch processing using Apache
Hadoop Streaming API
→bash script using Exiftool is executed on the
cluster
→book page image data is accessible from
each node of the cluster
→parallelisation of batch processing
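The slides describe a bash script run through the Hadoop Streaming API. As an illustration only, the same idea is sketched here as a Python mapper: Hadoop Streaming passes input lines on stdin and collects records from stdout. The sketch assumes each input line is a file path readable from every node, and it calls the `exiftool` command line tool with its JSON output option (`-j`). The function names and the tab-separated output layout are hypothetical.

```python
import json
import subprocess
import sys

def run_exiftool(path):
    """Call the exiftool CLI with JSON output (-j) and return its stdout."""
    result = subprocess.run(
        ["exiftool", "-j", path], capture_output=True, text=True, check=True
    )
    return result.stdout

def map_paths(lines, extract=run_exiftool):
    """Yield one 'path<TAB>metadata-json' record per input path."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        # exiftool -j prints a JSON array with one object per file
        metadata = json.loads(extract(path))[0]
        yield f"{path}\t{json.dumps(metadata, sort_keys=True)}"

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as a streaming mapper: stdin in, stdout out."""
    for record in map_paths(stdin):
        stdout.write(record + "\n")
```

When used as a mapper script, `main()` would be called at the end of the file. The `extract` parameter exists so the mapping logic can be tried out without exiftool installed.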
194. @maxkaiser
large scale document processing
→ small files (TXT, HTML) stored in HDFS
→ files of each file type stored as one big file
(SequenceFile)
→ principle: store once in HDFS and read many times
→ example:
→ storing OCR results of 24 million pages (ca. 60,000
books): reading data from the file server and storing it on the
cluster takes more than 1 day
→ subsequent processing of a Map/Reduce job (calculate
average block width) takes 6 hours
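The average block width job can be illustrated with a small map and reduce pair in Python. The slides do not describe the OCR file layout, so the sketch assumes hOCR-style HTML in which each text block is an `ocr_carea` element carrying a `bbox x0 y0 x1 y1` entry. The regular expression and function names are hypothetical, and the code runs locally rather than on a cluster.

```python
import re

# Assumed hOCR-style markup: <div class="ocr_carea" title="bbox x0 y0 x1 y1">
BBOX = re.compile(r"ocr_carea[^>]*bbox (\d+) (\d+) (\d+) (\d+)")

def map_block_widths(html):
    """Map step: emit (key, (width_sum, count)) for one OCR page."""
    widths = [int(m.group(3)) - int(m.group(1)) for m in BBOX.finditer(html)]
    if widths:
        yield "block_width", (sum(widths), len(widths))

def reduce_average(pairs):
    """Reduce step: combine partial sums and counts into one average."""
    total = count = 0
    for width_sum, n in pairs:
        total += width_sum
        count += n
    return total / count if count else 0.0
```

Emitting partial sums and counts, rather than per-page averages, keeps the final result correct when pages contain different numbers of blocks. For scale, 24 million pages in 6 hours corresponds to roughly 1,100 pages per second (24,000,000 / 21,600 s).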
203. @maxkaiser
outlook
→ full-text: new possibilities for research
→ data enrichment
→ named entity recognition
→ linked data
→ new data centric research in the Humanities
& Social Sciences
→ http://www.diggingintodata.org/
205. @maxkaiser
DM2E
→http://dm2e.eu/
→European Commission co-funded project
→stimulate creation of new tools and
services for re-use of Europeana data in
the Digital Humanities
→implementation of semantic annotation
tool
→Austrian Books Online data part of the
project
206. @maxkaiser
next steps
→80,000 books already accessible via
Google Books
→Spring 2013: launch of Austrian Books
Online Viewer
→full text search