The Elephant in the Library

Integrating Hadoop

Clemens Neudecker Sven Schlarb
@cneudecker @SvenSchlarb

Contents

1. Background: Digitization of cultural heritage

2. Numbers: Scaling up!

3. Challenges: Use cases and scenarios

4. Outlook

1. Background

“The digital revolution is far more
significant than the invention of
writing or even of printing”
Douglas Engelbart

Our libraries

National Library of the Netherlands (www.kb.nl)
• The Hague, Netherlands
• Founded in 1798
• 120.000 visitors per year
• 6 million documents
• 260 FTE

Austrian National Library (www.onb.ac.at)
• Vienna, Austria
• Founded in the 14th century
• 300.000 visitors per year
• 8 million documents
• 300 FTE

Digitization

Libraries are rapidly transforming from physical…

to digital…

Transformation

Curation Lifecycle Model from Digital Curation Centre www.dcc.ac.uk

Our data – cultural heritage

• Traditionally
• Bibliographic and other metadata
• Images (Portraits/Pictures, Maps, Posters, etc.)
• Text (Books, Articles, Newspapers, etc.)
• More recently
• Audio/Video
• Websites, Blogs, Twitter, Social Networks
• Research Data/Raw Data
• Software? Apps?

2. Numbers

“A good decision is based on knowledge
and not on numbers”
Plato, 400 BC

Numbers (I)
National Library of the Netherlands

• Digital objects
• > 500 million files
• 18 million digital publications (+ 2M/year)
• 8 million newspaper pages (+ 4M/year)
• 152.000 books (+ 100k/year)
• 730.000 websites (+ 170k/year)
• Storage
• 1.3 PB (currently 458 TB used)
• Growing approx. 150 TB a year

Numbers (II)
Austrian National Library

• Digital objects
• 600.000 volumes to be digitised over the coming years
(currently 120.000 volumes, 40 million pages)
• 10 million newspapers and legal texts
• 1.16 billion files in web archive from > 1 million domains
• Several 100.000 images and portraits
• Storage
• 84 TB
• Growing approx. 15 TB a year

Numbers (III)

• Google Books Project
• 2012: 20 million books scanned
(approx. 7,000,000,000 pages)
• www.books.google.com

• Europeana
• 2012: 25 million digital objects
• All metadata licensed CC-0
• www.europeana.eu/portal

Numbers (IV)

• Hathi Trust
• 3,721,702,950 scanned pages
• 477 TBytes
• www.hathitrust.org

• Internet Archive
• 245 billion web pages archived
• 10 PBytes
• www.archive.org

Numbers (V)

• What can we expect?
• Enumerate 2012: only about 4% digitised so far
• Strong growth of born digital information

Source: www.idc.com Source: security.networksasia.net

3. Challenges

“What do you do with a million books?”
Gregory Crane, 2006

Making it scale

Scalability in terms of …
• size
• number
• complexity
• heterogeneity

SCAPE

• SCAPE = SCAlable Preservation Environments
• €8.6M EU funding, Feb 2011 – July 2014
• 20 partners from public sector, academia, industry
• Main objectives:
• Scalability
• Automation
• Planning

www.scape-project.eu

Use cases (I)

• Document recognition: From image to XML

• Business case:
• Better presentation options
• Creation of eBooks
• Full-text indexing

Use cases (II)

• File type migration: JP2k ↔ TIFF

• Business case:
• Originally migration to JP2k to reduce storage costs
• Reverse process used in case JP2k becomes obsolete

Use cases (III)

• Web archiving: Characterization of web content

• Business case:
• What is in a Top Level Domain?
• What is the distribution of file formats?
• http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits

xkcd.com/688

Use cases (IV)

• Digital Humanities: Making sense of the millions

• Business case:
• Text mining & NLP
• Statistical analysis
• Semantic enrichment
• Visualizations

Source: www.open.ac.uk/

Enter the Elephants…

Source: Biopics

Execution environment

• Cluster: Hadoop Jobtracker
• Taverna Server (REST API): Apache Tomcat Web Application
• File server

Scenarios (I)
Log file analysis

• Metadata log files generated by the web crawler during the harvesting process
(no MIME type identification, just the MIME types returned by the web server)
20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200
20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200
20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200
20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200
20110830130712 9684 46 16 3 image/png http://URL at IP 679 200
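
A minimal sketch of this kind of log analysis, assuming the MIME type is always the sixth whitespace-separated field as in the sample lines above. The function name `count_mime_types` is hypothetical and not part of any SCAPE tool.

```python
from collections import Counter

def count_mime_types(lines):
    """Count the server-reported MIME type found in each crawl log line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumption: field 6 holds the MIME type, as in the slide's sample.
        if len(fields) >= 6:
            counts[fields[5]] += 1
    return counts

# The five sample lines from the slide ("http://URL at IP" is a placeholder).
sample = [
    "20110830130705 9684 46 16 3 image/jpeg http://URL at IP 17311 200",
    "20110830130710 9684 46 16 3 image/gif http://URL at IP 9794 200",
    "20110830130704 9684 46 16 3 text/html http://URL at IP 13149 200",
    "20110830130712 9684 46 16 3 image/gif http://URL at IP 2285 200",
    "20110830130712 9684 46 16 3 image/png http://URL at IP 679 200",
]

for mime, n in count_mime_types(sample).most_common():
    print(mime, n)
```

On a cluster the same counting would be split into a map step (emit the MIME type) and a reduce step (sum per type).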

Scenarios (II)
Web archiving: File format identification
→ Run file type identification on archived web content

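• Input: (W)ARC container with the archived files (JPG, GIF, HTM, HTM, MID)
• (W)ARC RecordReader: based on the HERITRIX web crawler's read/write (W)ARC
• Map: Apache Tika detects the MIME type of each record (e.g. JPG → image/jpg)
• Reduce: number of records per MIME type

image/jpg 1
image/gif 1
text/html 2
audio/midi 1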
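
The map and reduce steps above can be sketched in plain Python. This is an illustration only: `detect_mime` is a hypothetical stand-in for Apache Tika that checks a few well-known magic-byte signatures, the records are invented byte strings rather than real (W)ARC records, and the registered name `image/jpeg` is used where the slide writes `image/jpg`.

```python
from collections import defaultdict

# Hypothetical stand-in for Apache Tika: a few well-known file signatures.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"MThd", "audio/midi"),
]

def detect_mime(payload):
    """Guess a MIME type from the first bytes of a record payload."""
    for magic, mime in SIGNATURES:
        if payload.startswith(magic):
            return mime
    head = payload[:64].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "text/html"
    return "application/octet-stream"

def map_record(payload):
    """Map: emit (MIME type, 1) for one archived record."""
    yield detect_mime(payload), 1

def reduce_counts(pairs):
    """Reduce: sum the emitted counts per MIME type."""
    totals = defaultdict(int)
    for mime, n in pairs:
        totals[mime] += n
    return dict(totals)

# Invented payloads mirroring the slide's input: JPG, GIF, HTM, HTM, MID.
records = [
    b"\xff\xd8\xff\xe0 jpeg bytes",
    b"GIF89a gif bytes",
    b"<html><body>one</body></html>",
    b"<!DOCTYPE html><html></html>",
    b"MThd midi bytes",
]

pairs = [pair for payload in records for pair in map_record(payload)]
counts = reduce_counts(pairs)
for mime in sorted(counts):
    print(mime, counts[mime])
```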

Scenarios (II)
Web archiving: File format identification
→ Using MapReduce to calculate statistics

Charts: identification statistics for DROID 6.01 and TIKA 1.0

Scenarios (III)
File format migration

• Risk of format obsolescence
• Quality assurance
• File format validation
• Original/target image comparison
• Imagine runtime of 1 minute per image for 200 million pages ...

• Parallel execution of file format validation using Mapper
• Jpylyzer (Python)
• Jhove2 (Java)
• Feature extraction requires sharing resources between processing steps
• Challenge to model more complex image comparison scenarios, e.g. book page duplicate detection or digital book comparison
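
The "1 minute per image for 200 million pages" figure can be worked out as a quick back-of-the-envelope calculation. The 1.000 parallel tasks used below are a purely hypothetical number chosen for illustration, and the sketch assumes perfect scaling with no overhead.

```python
PAGES = 200_000_000
MINUTES_PER_PAGE = 1

def runtime_days(pages, minutes_per_page, parallel_tasks=1):
    """Wall-clock days needed, assuming the work splits perfectly."""
    return pages * minutes_per_page / parallel_tasks / 60 / 24

# Sequential processing on a single worker.
sequential_years = runtime_days(PAGES, MINUTES_PER_PAGE) / 365
print(f"sequential: about {sequential_years:.0f} years")

# Hypothetical cluster with 1000 parallel map tasks.
parallel_days = runtime_days(PAGES, MINUTES_PER_PAGE, parallel_tasks=1000)
print(f"1000 tasks: about {parallel_days:.0f} days")
```

Sequentially this is roughly 380 years, which is why the validation has to run in parallel mappers.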

Scenarios (IV)
Book page analysis

Create text file containing JPEG2000 input file paths and read
image metadata using Exiftool via the Hadoop Streaming API

Reading image metadata
• Jp2PathCreator: find … on the NAS writes the file paths to a text file (1,4 GB)
• HadoopStreamingExiftoolRead: reading files from NAS, output is a text file (1,2 GB)

Input path (Jp2PathCreator)      Output (HadoopStreamingExiftoolRead)
/NAS/Z119585409/00000001.jp2     Z119585409/00000001 2345
/NAS/Z119585409/00000002.jp2     Z119585409/00000002 2340
/NAS/Z119585409/00000003.jp2     Z119585409/00000003 2543
…                                …
/NAS/Z117655409/00000001.jp2     Z117655409/00000001 2300
/NAS/Z117655409/00000002.jp2     Z117655409/00000002 2300
/NAS/Z117655409/00000003.jp2     Z117655409/00000003 2345
…                                …
/NAS/Z119585987/00000001.jp2     Z119585987/00000001 2300
/NAS/Z119585987/00000002.jp2     Z119585987/00000002 2340
/NAS/Z119585987/00000003.jp2     Z119585987/00000003 2432
…                                …
/NAS/Z119584539/00000001.jp2     Z119584539/00000001 5205
/NAS/Z119584539/00000002.jp2     Z119584539/00000002 2310
/NAS/Z119584539/00000003.jp2     Z119584539/00000003 2134
…                                …
/NAS/Z119599879/00000001.jp2     Z119599879/00000001 2312
/NAS/Z119589879/00000002.jp2     Z119589879/00000002 2300
/NAS/Z119589879/00000003.jp2     Z119589879/00000003 2300
...                              ...

60.000 books, 24 Million pages: ~5h + ~38h = ~43h
Create text file containing HTML input file paths and create
one sequence file with the complete file content in HDFS

SequenceFile creation
• HtmlPathCreator: find … on the NAS writes the file paths to a text file (1,4 GB)
• SequenceFileCreator: reading files from NAS, output is one sequence file (997 GB, uncompressed)

Input paths (HtmlPathCreator)
/NAS/Z119585409/00000707.html
/NAS/Z119585409/00000708.html
/NAS/Z119585409/00000709.html
…
/NAS/Z138682341/00000707.html
/NAS/Z138682341/00000708.html
/NAS/Z138682341/00000709.html
…
/NAS/Z178791257/00000707.html
/NAS/Z178791257/00000708.html
/NAS/Z178791257/00000709.html
…
/NAS/Z967985409/00000707.html
/NAS/Z967985409/00000708.html
/NAS/Z967985409/00000709.html
…
/NAS/Z196545409/00000707.html
/NAS/Z196545409/00000708.html
/NAS/Z196545409/00000709.html
...

Sequence file keys (SequenceFileCreator)
Z119585409/00000707
Z119585409/00000708
Z119585409/00000709
Z119585409/00000710
Z119585409/00000711
Z119585409/00000712

60.000 books, 24 Million pages: ~5h + ~24h = ~29h

Execute Hadoop MapReduce job using the sequence file created
before in order to calculate the average paragraph block width

HTML Parsing
HadoopAvBlockWidthMapReduce

Input key (SequenceFile)   Map output (block widths)   Reduce output (Textfile)
Z119585409/00000001        2100, 2200, 2300, 2400      Z119585409/00000001 2250
Z119585409/00000002        2100, 2200, 2300, 2400      Z119585409/00000002 2250
Z119585409/00000003        2100, 2200, 2300, 2400      Z119585409/00000003 2250
Z119585409/00000004        2100, 2200, 2300, 2400      Z119585409/00000004 2250
Z119585409/00000005        2100, 2200, 2300, 2400      Z119585409/00000005 2250
...

60.000 books, 24 Million pages: ~6h
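
The average block width calculation can be sketched as a map and a reduce function. The sketch assumes the HTML files are hOCR output in which each paragraph carries `class="ocr_par"` and a `title="bbox x0 y0 x1 y1"` attribute. The class and function names are hypothetical, and the sample page is invented so that its widths match the slide (2100, 2200, 2300, 2400).

```python
import re
from html.parser import HTMLParser
from statistics import mean

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

class ParagraphWidths(HTMLParser):
    """Collect the bounding-box width of every hOCR paragraph element."""

    def __init__(self):
        super().__init__()
        self.widths = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocr_par" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                x0, _, x1, _ = map(int, match.groups())
                self.widths.append(x1 - x0)

def map_page(key, html):
    """Map: emit (page key, paragraph width) for each paragraph block."""
    parser = ParagraphWidths()
    parser.feed(html)
    for width in parser.widths:
        yield key, width

def reduce_page(key, widths):
    """Reduce: average all widths emitted for one page key."""
    return key, round(mean(widths))

# Invented hOCR fragment with four paragraph blocks.
page = """
<div class='ocr_page'>
  <p class='ocr_par' title='bbox 100 0 2200 50'>a</p>
  <p class='ocr_par' title='bbox 100 60 2300 110'>b</p>
  <p class='ocr_par' title='bbox 50 120 2350 170'>c</p>
  <p class='ocr_par' title='bbox 0 180 2400 230'>d</p>
</div>
"""

emitted = list(map_page("Z119585409/00000001", page))
result = reduce_page("Z119585409/00000001", [w for _, w in emitted])
print(result)
```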

Create Hive table and load generated data into the Hive database

Analytic Queries
HiveLoadExifData & HiveLoadHocrData

htmlwidth
CREATE TABLE htmlwidth (hid STRING, hwidth INT)

hid                    hwidth
Z119585409/00000001    1870
Z119585409/00000002    2100
Z119585409/00000003    2015
Z119585409/00000004    1350
Z119585409/00000005    1700

jp2width
CREATE TABLE jp2width (jid STRING, jwidth INT)

jid                    jwidth
Z119585409/00000001    2250
Z119585409/00000002    2150
Z119585409/00000003    2125
Z119585409/00000004    2125
Z119585409/00000005    2250

60.000 books, 24 Million pages: ~6h
Analytic Queries
HiveSelect

jp2width                          htmlwidth
jid                    jwidth     hid                    hwidth
Z119585409/00000001    2250       Z119585409/00000001    1870
Z119585409/00000002    2150       Z119585409/00000002    2100
Z119585409/00000003    2125       Z119585409/00000003    2015
Z119585409/00000004    2125       Z119585409/00000004    1350
Z119585409/00000005    2250       Z119585409/00000005    1700

select jid, jwidth, hwidth from jp2width inner join htmlwidth on jid = hid

jid                    jwidth     hwidth
Z119585409/00000001    2250       1870
Z119585409/00000002    2150       2100
Z119585409/00000003    2125       2015
Z119585409/00000004    2125       1350
Z119585409/00000005    2250       1700

60.000 books, 24 Million pages: ~6h

Perform a simple Hive query to test if the
database has been created successfully
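
The join itself can be tried out locally without a Hive installation. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Hive (so `TEXT`/`INTEGER` replace `STRING`/`INT`) and loads the five sample rows shown on the slides. The width ratio printed at the end is a hypothetical follow-up analysis, not something stated in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE htmlwidth (hid TEXT, hwidth INTEGER)")
conn.execute("CREATE TABLE jp2width (jid TEXT, jwidth INTEGER)")

# Sample rows from the slides.
conn.executemany("INSERT INTO htmlwidth VALUES (?, ?)", [
    ("Z119585409/00000001", 1870),
    ("Z119585409/00000002", 2100),
    ("Z119585409/00000003", 2015),
    ("Z119585409/00000004", 1350),
    ("Z119585409/00000005", 1700),
])
conn.executemany("INSERT INTO jp2width VALUES (?, ?)", [
    ("Z119585409/00000001", 2250),
    ("Z119585409/00000002", 2150),
    ("Z119585409/00000003", 2125),
    ("Z119585409/00000004", 2125),
    ("Z119585409/00000005", 2250),
])

# Same join as the slide's Hive query, with a stable ordering added.
rows = conn.execute(
    "SELECT jid, jwidth, hwidth FROM jp2width "
    "INNER JOIN htmlwidth ON jid = hid ORDER BY jid"
).fetchall()

for jid, jwidth, hwidth in rows:
    # Hypothetical follow-up: share of the image width covered by text blocks.
    print(jid, jwidth, hwidth, round(hwidth / jwidth, 2))
```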

Outlook

“Progress generally appears much
greater than it really is”
Johann Nestroy, 1847

What have WE learned?

• We need to carefully assess the efforts for data
preparation vs. the actual processing load

• HDFS prefers large files over many small ones and
is basically “append-only”

• There is still much more the Hadoop ecosystem
has to offer, e.g. YARN, Pig, Mahout
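
The small-files point can be made concrete with the figures from the SequenceFile creation step (997 GB for 24 million HTML pages). The calculation below is only a rough sketch: it assumes decimal gigabytes and an HDFS block size of 64 MB, which is a configurable setting and may differ on any given cluster.

```python
import math

TOTAL_BYTES = 997 * 10**9          # 997 GB sequence file (decimal GB assumed)
PAGES = 24 * 10**6                 # 24 million HTML pages
BLOCK_SIZE = 64 * 1024**2          # assumed HDFS block size of 64 MB

# Average size of one HTML page if stored as an individual file.
avg_file_bytes = TOTAL_BYTES / PAGES
print(f"average page size: about {avg_file_bytes / 1000:.1f} KB")

# Every individual file needs at least one block entry of its own.
blocks_as_small_files = PAGES

# One large sequence file only needs as many blocks as its size requires.
blocks_as_sequence_file = math.ceil(TOTAL_BYTES / BLOCK_SIZE)

print("blocks for individual files:", blocks_as_small_files)
print("blocks for one sequence file:", blocks_as_sequence_file)
```

Under these assumptions the average page is far smaller than a single block, which is why packing the pages into one sequence file pays off.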

What can YOU do?

• Come join our “Hadoop in cultural heritage”
hackathon on 2-4 December 2013, Vienna
(See http://www.scape-project.eu/events )

• Check out some tools from our github at
https://github.com/openplanets/ and help
us make them better and more scalable

• Follow us at @SCAPEProject and spread the word!

What’s in it for US?

• Digital (free) access to centuries of cultural
heritage data, 24x7 and from anywhere

• Ensuring our cultural history is not lost

• New innovative applications using cultural
heritage data (education, creative industries)

Thank you! Questions?
(btw, we’re hiring)

www.kb.nl
www.onb.ac.at
www.scape-project.eu
www.openplanetsfoundation.org
