Libraries collect books, magazines and newspapers, as they always have. Today, however, the volume of digital information resources is growing at dizzying speed. Faced with the demand for digital resources that are available 24/7, libraries have seen a significant shift in their core responsibilities. Today's libraries curate large digital collections, index millions of full-text documents, preserve terabytes of data for future generations, and at the same time explore innovative ways of providing access to their collections. This is exactly where Hadoop comes into play. Libraries have to process a rapidly increasing amount of data as part of their day-to-day business, and computing tasks such as file format migration, text recognition and linguistic processing require significant computing resources. Many data processing scenarios are emerging in which Hadoop might become an essential part of the digital library's ecosystem. Hadoop is sometimes described as a hammer that makes you throw away everything that is not a nail. To stay with that metaphor: we will present some actual use cases for Hadoop in libraries, show how we decide what is a nail in a library and what is not, and share some initial results.
5. Our libraries
National Library of the Netherlands (www.kb.nl)
• The Hague, Netherlands
• Founded in 1798
• 120.000 visitors per year
• 6 million documents
• 260 FTE
Austrian National Library (www.onb.ac.at)
• Vienna, Austria
• Founded in 14th century
• 300.000 visitors per year
• 8 million documents
• 300 FTE
10. Our data – cultural heritage
• Traditionally
• Bibliographic and other metadata
• Images (Portraits/Pictures, Maps, Posters, etc.)
• Text (Books, Articles, Newspapers, etc.)
• More recently
• Audio/Video
• Websites, Blogs, Twitter, Social Networks
• Research Data/Raw Data
• Software? Apps?
11. 2. Numbers
“A good decision is based on knowledge
and not on numbers”
Plato, 400 BC
12. Numbers (I)
National Library of the Netherlands
• Digital objects
• > 500 million files
• 18 million digital publications (+ 2M/year)
• 8 million newspaper pages (+ 4M/year)
• 152.000 books (+ 100k/year)
• 730.000 websites (+ 170k/year)
• Storage
• 1.3 PB (currently 458 TB used)
• Growing approx. 150 TB a year
13. Numbers (II)
Austrian National Library
• Digital objects
• 600.000 volumes to be digitised over the coming years (currently 120.000 volumes, 40 million pages)
• 10 million newspapers and legal texts
• 1.16 billion files in web archive from > 1 million domains
• Several hundred thousand images and portraits
• Storage
• 84 TB
• Growing approx. 15 TB a year
14. Numbers (III)
• Google Books Project
• 2012: 20 million books scanned
(approx. 7,000,000,000 pages)
• www.books.google.com
• Europeana
• 2012: 25 million digital objects
• All metadata licensed CC-0
• www.europeana.eu/portal
16. Numbers (V)
• What can we expect?
• Enumerate 2012: only about 4% digitised so far
• Strong growth of born digital information
Source: www.idc.com Source: security.networksasia.net
19. SCAPE
• SCAPE = SCAlable Preservation Environments
• €8.6M EU funding, Feb 2011 – July 2014
• 20 partners from public sector, academia, industry
• Main objectives:
• Scalability
• Automation
• Planning
www.scape-project.eu
20. Use cases (I)
• Document recognition: From image to XML
• Business case:
• Better presentation options
• Creation of eBooks
• Full-text indexing
21. Use cases (II)
• File type migration: JP2k → TIFF
• Business case:
• Originally migration to JP2k to reduce storage costs
• Reverse process used in case JP2k becomes obsolete
22. Use cases (III)
• Web archiving: Characterization of web content
• Business case:
• What is in a Top Level Domain?
• What is the distribution of file formats?
• http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
xkcd.com/688
23. Use cases (IV)
• Digital Humanities: Making sense of the millions
• Business case:
• Text mining & NLP
• Statistical analysis
• Semantic enrichment
• Visualizations
Source: www.open.ac.uk/
26. Execution environment
• Cluster: Hadoop Jobtracker
• File server
• Taverna Server (REST API)
• Apache Tomcat: Web Application
27. Scenarios (I)
Log file analysis
• Metadata log files generated by the web crawler during the harvesting process (no MIME type identification, just the MIME types returned by the web server)
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
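The log analysis described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code. It assumes that the sixth whitespace-separated field is the MIME type reported by the web server and that the last two fields are the response size and the HTTP status code; the function names `parse_line` and `mime_histogram` are hypothetical.

```python
from collections import Counter

# Sample lines copied from the slide; "http://URL at IP" is the slide's own placeholder.
SAMPLE_LOG = """\
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130709 9684 46 16 3 image/jpeg http://URL at IP 22123 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130707 9684 46 16 3 image/jpeg http://URL at IP 40056 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 text/html http://URL at IP 415 301
20110830130710 9684 46 16 3 text/html http://URL at IP 7873 200
20110830130712 9684 46 16 3 text/html http://URL at IP 632 302
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
"""


def parse_line(line):
    """Split one log line into (timestamp, mime_type, size, status).

    Assumed layout (not confirmed by the slides): field 0 is a timestamp,
    field 5 is the server-reported MIME type, and the last two fields are
    the response size and the HTTP status code.
    """
    fields = line.split()
    if len(fields) < 8:
        return None  # skip blank or malformed lines
    return fields[0], fields[5], int(fields[-2]), int(fields[-1])


def mime_histogram(lines):
    """Count how often each server-reported MIME type occurs."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record[1]] += 1
    return counts


histogram = mime_histogram(SAMPLE_LOG.splitlines())
for mime, count in histogram.most_common():
    print(mime, count)
```

On a cluster the same counting would be split into a map step (emit one MIME type per line) and a reduce step (sum per MIME type).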
28. Scenarios (II)
Web archiving: File format identification
→ Run file type identification on archived web content
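• HERITRIX web crawler reads/writes (W)ARC containers holding the archived files (JPG, GIF, HTM, MID)
• A (W)ARC RecordReader feeds the container records into MapReduce
• Map: Apache Tika detects the MIME type of each record, e.g. JPG → image/jpg
• Reduce: counts the records per MIME type
image/jpg 1
image/gif 1
text/html 2
audio/midi 1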
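The map and reduce steps of this scenario can be imitated locally in plain Python. This is only a sketch under stated assumptions: the tiny magic-number table in `detect_mime` is a hypothetical stand-in for Apache Tika, the records are made-up byte strings rather than real (W)ARC records, and `run_job` simulates the shuffle phase in memory instead of running on Hadoop.

```python
from collections import defaultdict

# Hypothetical stand-in for Apache Tika: a few well-known magic numbers only.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"MThd", "audio/midi"),
]


def detect_mime(payload):
    """Guess a MIME type from the first bytes of a record payload."""
    for magic, mime in MAGIC_NUMBERS:
        if payload.startswith(magic):
            return mime
    if payload.lstrip().lower().startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def map_record(record_id, payload):
    """Map step: emit (mime_type, 1) for one archived record."""
    yield detect_mime(payload), 1


def reduce_counts(mime, values):
    """Reduce step: sum the ones emitted for a single MIME type."""
    return mime, sum(values)


def run_job(records):
    """Simulate map, shuffle (group by key) and reduce in memory."""
    grouped = defaultdict(list)
    for record_id, payload in records:
        for key, value in map_record(record_id, payload):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Made-up payloads standing in for records read from a (W)ARC container.
records = [
    ("rec-1", b"\xff\xd8\xff\xe0" + b"\x00" * 8),
    ("rec-2", b"GIF89a" + b"\x00" * 8),
    ("rec-3", b"<html><body>a</body></html>"),
    ("rec-4", b"<!DOCTYPE html><html></html>"),
    ("rec-5", b"MThd" + b"\x00" * 8),
]
format_counts = run_job(records)
print(format_counts)
```

The resulting counts mirror the example on the slide: one JPEG, one GIF, two HTML documents and one MIDI file.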
29. Scenarios (II)
Web archiving: File format identification
→ Using MapReduce to calculate statistics
• Tools: DROID 6.01, TIKA 1.0
30. Scenarios (III)
File format migration
• Risk of format obsolescence
• Quality assurance
• File format validation
• Original/target image comparison
• Imagine a runtime of 1 minute per image for 200 million pages ...
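The arithmetic behind this thought experiment is easy to check. The sketch below uses the slide's figures (1 minute per image, 200 million pages); the worker count of 1000 is a hypothetical value, and the parallel estimate assumes ideal linear speed-up with no scheduling or I/O overhead.

```python
MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def sequential_years(pages, minutes_per_page):
    """Years needed when pages are processed one after another."""
    return pages * minutes_per_page / MINUTES_PER_YEAR


def parallel_days(pages, minutes_per_page, workers):
    """Days needed with `workers` parallel tasks, assuming ideal linear speed-up."""
    return pages * minutes_per_page / workers / MINUTES_PER_DAY


years = sequential_years(200_000_000, 1)
days = parallel_days(200_000_000, 1, workers=1000)  # hypothetical worker count
print(round(years, 1), "years sequentially")
print(round(days, 1), "days with 1000 workers")
```

Processed sequentially, the job would take roughly 380 years, which is why the quality assurance steps have to be distributed.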
32.
• Feature extraction requires sharing resources between processing steps
• Challenge to model more complex image comparison scenarios, e.g. book page duplicates detection or digital book comparison
45. What have WE learned?
• We need to carefully assess the effort for data preparation vs. the actual processing load
• HDFS prefers large files over many small ones and is basically “append-only”
• The Hadoop ecosystem still has much more to offer, e.g. YARN, Pig, Mahout
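Why large files suit HDFS better can be illustrated with a rough count of input splits. This is a back-of-the-envelope sketch, not a measurement: it assumes a block size of 64 MiB (configurable per cluster), assumes that every file produces at least one split, ignores container overhead, and uses a hypothetical collection of one million 100 KiB files. Packing into a container format such as a SequenceFile is named only as one possible approach.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size in bytes


def input_splits_unpacked(file_sizes):
    """Splits when every small file is stored and read on its own."""
    return sum(max(1, math.ceil(size / BLOCK_SIZE)) for size in file_sizes)


def input_splits_packed(file_sizes):
    """Splits when the same data is packed into one large container file."""
    return max(1, math.ceil(sum(file_sizes) / BLOCK_SIZE))


# Hypothetical collection: one million page images of 100 KiB each.
sizes = [100 * 1024] * 1_000_000
unpacked = input_splits_unpacked(sizes)
packed = input_splits_packed(sizes)
print(unpacked, "splits unpacked,", packed, "splits packed")
```

Under these assumptions the packed layout needs about 1,500 splits instead of a million, which is the kind of data preparation effort that has to be weighed against the processing load.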
46. What can YOU do?
• Come join our “Hadoop in cultural heritage” hackathon on 2-4 December 2013 in Vienna (see http://www.scape-project.eu/events)
• Check out some tools from our GitHub at https://github.com/openplanets/ and help us make them better and more scalable
• Follow us at @SCAPEProject and spread the word!
47. What’s in it for US?
• Digital (free) access to centuries of cultural heritage data, 24x7 and from anywhere
• Ensuring our cultural history is not lost
• New innovative applications using cultural heritage data (education, creative industries)