LITA National Forum 2012

Building the New Open Linked Library (Revisited)

Joel Richard
LITA National Forum 2012
October 5, 2012

Smithsonian Libraries
• Founded in 1846
• 1.5 million volumes in collection, plus assorted
archival collections
• 15,000 volumes scanned and online
• 20 libraries serving ~500 researchers/curators
+ hundreds of fellows and interns
• 105 library staff
• 1.5 web staff
• Founding member of the Biodiversity
Heritage Library

Le Garde-meuble, ancien et moderne [Furniture repository, ancient and modern], 1839-1935

(From 2011)
Drupal and Linked Data
• Native support for RDFa in Drupal 7.
• RDF Extensions (rdfx) – even more features.
• Vocabularies can be imported and cached for
reuse.
• Few or no modifications to HTML to support
RDFa.

What’s the difference between RDF,
RDF/XML and RDFa?
LITA National Forum, September 30,
2011

(From 2011)
RDF/XML Sample
URI: http://library.si.edu/book/origin-of-species.rdf

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:bibo="http://purl.org/ontology/bibo/"
         xmlns:owl="http://www.w3.org/2002/07/owl#">

  <rdf:Description rdf:about="http://localhost:8087/content/origin-species">

    <rdf:type rdf:resource="http://purl.org/ontology/bibo/Book"/>
    <dc:title>The Origin of Species</dc:title>
    <dc:created>November 24, 1859</dc:created>
    <bibo:numPages>1000</bibo:numPages>
    <dc:language>english</dc:language>
    <bibo:authorList
        rdf:resource="http://localhost:8087/content/darwin-charles"/>

    <owl:sameAs rdf:resource="http://www.worldcat.org/oclc/1184647"/>
  </rdf:Description>
</rdf:RDF>
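A minimal sketch of how a client might read a record like this one, using only Python's standard library. The `SAMPLE` string is a trimmed copy of the slide's record and `extract_triples` is a hypothetical helper. It handles only flat `rdf:Description` blocks, so a real consumer would more likely use a full RDF parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the sample record (assumption: owl namespace declared).
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/terms/"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <rdf:Description rdf:about="http://localhost:8087/content/origin-species">
    <dc:title>The Origin of Species</dc:title>
    <owl:sameAs rdf:resource="http://www.worldcat.org/oclc/1184647"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples from flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree reports tags as "{namespace}local"; join them into a URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            # Use the rdf:resource link if present, otherwise the literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource") or (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


for triple in extract_triples(SAMPLE):
    print(triple)
```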


TL-2 Page Sample (From 2011)

http://library.si.edu/tl2/author/darwin
tl2:creatorOf http://library.si.edu/tl2/book/1313
owl:sameAs http://viaf.org/viaf/27063124

http://library.si.edu/tl2/book/1313
dc:creator http://library.si.edu/tl2/author/darwin
owl:sameAs http://www.archive.org/details/originofspecies00darwuoft


TL-2 Page Sample Results (From 2011)

http://library.si.edu/tl2/author/darwin
tl2:creatorOf “http://library.si.edu/tl2/book/1313”
owl:sameAs “http://viaf.org/viaf/27063124”
foaf:lastName “Darwin”
foaf:familyName “Darwin”
foaf:firstName “Charles”
foaf:givenName “Charles”
foaf:name “Darwin, Charles Robert”
skos:prefLabel “Darwin, Charles Robert”
tl2:birthYear “1809”
tl2:deathYear “1882”
tl2:description “British evolutionary biologist”
tl2:personAbbrev “Darwin”

http://library.si.edu/tl2/book/1313
dc:creator “http://library.si.edu/tl2/author/darwin”
owl:sameAs “http://www.archive.org/details/originofspecies00darwuoft”
tl2:bookNumber “1313”
bibo:shortTitle “On the origin of species”
dc:title “On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life.”
event:place “London”
dc:publisher “John Murray”
dc:created “1859”
tl2:bookAbbreviation “Origin sp.”
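In this sample the author page points to the book with tl2:creatorOf and the book points back with dc:creator. The sketch below checks that such pairs stay in sync. Treating the two properties as inverses is a reading of this one sample, not a documented rule of either vocabulary, and `missing_inverses` is a hypothetical helper.

```python
AUTHOR = "http://library.si.edu/tl2/author/darwin"
BOOK = "http://library.si.edu/tl2/book/1313"

# The four link triples from the sample, as (subject, predicate, object).
TRIPLES = [
    (AUTHOR, "tl2:creatorOf", BOOK),
    (BOOK, "dc:creator", AUTHOR),
    (AUTHOR, "owl:sameAs", "http://viaf.org/viaf/27063124"),
    (BOOK, "owl:sameAs",
     "http://www.archive.org/details/originofspecies00darwuoft"),
]


def missing_inverses(triples, forward="tl2:creatorOf", inverse="dc:creator"):
    """Return (subject, object) pairs of forward links with no matching inverse link."""
    inverse_pairs = {(s, o) for s, p, o in triples if p == inverse}
    return [
        (s, o)
        for s, p, o in triples
        if p == forward and (o, s) not in inverse_pairs
    ]


print(missing_inverses(TRIPLES))  # an empty list means every link is reciprocated
```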



(From 2011)
Who is reusing our data?
Ryan Schenk – http://ryanschenk.com/2011/02/visualizing-taxonomic-synoymns/


(From 2011)
Who is reusing our data?
Encyclopedia of Life – http://eol.org/


Linked Data Review
• Publishing structured data on the web
• RDF (Resource Description Framework)
• Enables computer-to-computer queries
• Uses standard ontologies (vocabularies)
• Data is presented as “triples”

Subject http://library.si.edu/tl2/author/charles-darwin
Predicate owl:sameAs
Object http://viaf.org/viaf/27063124
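A minimal sketch of a triple as a data structure, with a wildcard lookup loosely in the spirit of a SPARQL pattern. `Triple`, `STORE`, and `match` are hypothetical names. The foaf:name triple is borrowed from the earlier TL-2 sample purely for illustration.

```python
from collections import namedtuple

Triple = namedtuple("Triple", "subject predicate object")

DARWIN = "http://library.si.edu/tl2/author/charles-darwin"

STORE = [
    Triple(DARWIN, "owl:sameAs", "http://viaf.org/viaf/27063124"),
    # Illustrative second triple, taken from the earlier TL-2 sample.
    Triple(DARWIN, "foaf:name", "Darwin, Charles Robert"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in store
        if subject in (None, t.subject)
        and predicate in (None, t.predicate)
        and obj in (None, t.object)
    ]


# "Which external identifiers is this author the same as?"
for t in match(STORE, subject=DARWIN, predicate="owl:sameAs"):
    print(t.object)
```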

Linked Data In Action
Google Knowledge Graph

Linked Data Review

Charles Darwin → Type → Person
Charles Darwin → Born On → “Feb 12 1809”
Charles Darwin → Born In → Shrewsbury
Shrewsbury → Type → City
Shrewsbury → Is In → England
England → Type → Country
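A small sketch of why the graph form is useful: a question such as "in which country was Charles Darwin born?" is answered by following two edges. The edge list is one reading of the diagram on this slide, and `follow` is a hypothetical helper.

```python
# Edges as (subject, predicate, object), read off the slide's diagram.
GRAPH = [
    ("Charles Darwin", "Type", "Person"),
    ("Charles Darwin", "Born On", "Feb 12 1809"),
    ("Charles Darwin", "Born In", "Shrewsbury"),
    ("Shrewsbury", "Type", "City"),
    ("Shrewsbury", "Is In", "England"),
    ("England", "Type", "Country"),
]


def follow(graph, start, *predicates):
    """Follow a chain of predicates from start; return the node reached, or None."""
    node = start
    for predicate in predicates:
        node = next(
            (o for s, p, o in graph if s == node and p == predicate), None
        )
        if node is None:
            return None
    return node


country = follow(GRAPH, "Charles Darwin", "Born In", "Is In")
print(country, "is a", follow(GRAPH, country, "Type"))
```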

Our Website
Organically grown since 1995

• 83,000 HTML pages
• 3,700 ColdFusion pages
• 253,000 JPEG files
• 27,000 PNG files
• 46,000 PDFs

No CMS for legacy information

Now using Drupal for “Brochure-ware”

Content Analysis
• 400+ Online “books”
• Exhibitions
• Research Tools
• Image Collections (16,000+ images)
• “Brochure” content (About us, Locations, Hours)
• Bibliographies, Fact Sheets, Subject Guides
• Databases, inventories, and database-like books

 Collections not on our website:
• ~15,000 digitized volumes, with many more planned
• Other analog collections that will be digitized

Bureau of American Ethnology Bulletin 164; Sewing Machine Trade Literature; Underwater Web Exhibition, Smithsonian Libraries

Linked Data in our Library
Books (and book-like objects)
• Expose bibliographic data for reuse
• Consume links to other internal
content and external authoritative
data
Databases
• Expose data previously unavailable
• Provide authoritative data
• Consume our data and others’ to
create new aggregate websites

Linked Data in our Books
RDF Type = foaf:Person

foaf:lastName, foaf:familyName

foaf:firstName, foaf:givenName

foaf:name, skos:prefLabel

tl2:birthYear

tl2:deathYear

tl2:description

tl2:personAbbrev

RDF Type = bibo:Book

tl2:bookNumber

dc:title

event:place

dc:publisher

tl2:bookAbbreviation

dc:created

Linked Data Tools (Drupal)
• Fields, Views, Views UI
• Node Reference
• SPARQL Endpoint, SPARQL API
• RESTful Web Services
• SPARQL Views
• RDF External Vocabulary Importer

Caveat: Some modules not ready for Drupal 7
• e.g., Biblio module (no CCK, RDF capabilities)

Disclaimer
We are still learning!

How to effectively use Drupal

What goes into a Digital Library

How to best leverage
Linked Open Data

(Also: We will always be learning.)

J. L. Hammett Illustrated Catalogue of School Merchandise 1872-1873…, 1872-1874

What is a Digital Library?
• More than a virtual stack of books
• Digital allows more capabilities, access
• Interlinked Content (See more from this item)

What content will be in our digital library?

• Digitized Books
• Image Library
• Collections (of things)
• Exhibitions
• Databases
• Lists / Bibliographies
• Smithsonian Publications
• Videos
• “Trade Literature” and other non-cataloged items

Knowledge/Data Sharing
Taxonomic Literature II
• Essential botanical reference
• 15 volumes
• 9,000 Botanists
• 37,000 Titles authored by these botanists
• More modern, simpler to handle

Index Animalium
• 35 Volumes
• 430,000 Scientific Names
• Each with a citation to first description
• 7,000+ items in the bibliography, many linked to WorldCat
• Older, challenging in nature

Our Process for TL-2
Scanned the pages

Hired contractor for OCR and correction
(99.97% accuracy)

Received XML dataset from Contractor

Verified and Imported to SQL Server
Built a website to search the data

Before we import…

What exactly does 99.97% accuracy mean?

~12,000 Errors
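The two figures above imply the size of the text. The sketch below works that out, assuming accuracy is measured per character (the slides do not say which unit was used).

```python
from fractions import Fraction

accuracy = Fraction("99.97") / 100   # contractor's stated accuracy
errors = 12_000                      # approximate error count from the slide

error_rate = 1 - accuracy            # 3 errors per 10,000 characters
implied_characters = errors / error_rate

print(f"error rate: {float(error_rate):.4%}")
print(f"implied characters: {int(implied_characters):,}")  # roughly 40 million
```

A 0.03% error rate sounds negligible, but across tens of millions of characters it still leaves thousands of corrections to find.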

Importing
Millions of records are no problem for
modern databases. But, how to get data
into Drupal?

• Use existing tools?

• Create my own import?

The Muralo Company Muralo: Sanitary Wall Coatings in the Home, 1912

Importing
Import via existing tools

• Used Drupal’s Feeds Importer
• Typically used for importing RSS or similar
• Fast to set up (< 5 minutes)
• Slow to import (47,000 records = 8+ hours)
• Poor error recovery (imported 5 times)
• What if the data changes in the future?

Faster ≠ Better

Importing
Write my own import. But how?

• Make a Drupal Module!
• Steep Learning Curve (many APIs)
• Faster to import (48,000 records = 85 minutes)
• Added bonus: Modules can be versioned
• Can use the “version update” code to update our data
• Versioned modules good for Dev / Prod servers
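Drupal modules can ship numbered update functions (`hook_update_N`) that run once as a site moves to a newer module version. The Python below is only a language-neutral analogy of that idea applied to data, not Drupal code. Every function and field name in it is hypothetical.

```python
def update_1(records):
    """Hypothetical update: trim stray whitespace left over from OCR."""
    return [{**r, "title": r["title"].strip()} for r in records]


def update_2(records):
    """Hypothetical update: add a default language to records that lack one."""
    return [{"language": "en", **r} for r in records]


# Registry of updates, keyed by the version each one brings the data to.
UPDATES = {1: update_1, 2: update_2}


def run_pending_updates(records, installed_version):
    """Apply every update newer than installed_version, in numeric order."""
    for number in sorted(UPDATES):
        if number > installed_version:
            records = UPDATES[number](records)
            installed_version = number
    return records, installed_version


data, version = run_pending_updates([{"title": " Origin sp. "}], 0)
print(version, data)
```

Because each server records the version it has reached, development and production can apply the same sequence of changes at different times.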

Importing
Digitized Books Online

• Similar module for importing
• Module also handles a page for reading books online
• Uses Internet Archive book reader in an <IFRAME>
• Links to WorldCat / VIAF
• FAST Subjects
• Table of Contents Navigation
• Eligible for Linked Open Data

http://archive.org/details/smithsonian

Data Schema: British Library

http://talis-systems.com/wp-content/uploads/2011/07/British-Library-Data-Model-v1.01.pdf

Data Schema
What data model are we going to use?
• British Library
• Schema.org
• Something else?

What vocabularies are we using?
• Dublin Core
• OWL
• SKOS
• BIBO
• BIO
• FOAF
• Event?
• Org?
• Geo?
• Our own vocabulary for TL-2
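Each vocabulary is identified by a namespace URI, and a short name such as dc:title is shorthand for a full URI. A minimal sketch of that expansion for five of the vocabularies above follows. The dc prefix is mapped to the Dublin Core terms namespace, as in the RDF/XML sample. `expand` is a hypothetical helper, and tl2 is left out because its namespace is not given in these slides.

```python
# Published namespace URIs for five of the vocabularies listed above.
PREFIXES = {
    "dc": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "bibo": "http://purl.org/ontology/bibo/",
}


def expand(curie):
    """Expand a prefixed name such as 'dc:title' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {curie!r}")
    return PREFIXES[prefix] + local


print(expand("dc:title"))
print(expand("foaf:name"))
```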

Other Content
Galaxy of Images
• Image collection of plates from our digitized books
• 18,000 images and growing
• Richer set of metadata
• Data needs to be massaged / imported
• Images served from another system

http://www.sil.si.edu/imagegalaxy/

Other Content
Videos
• All are currently on YouTube
• Will remain there for now
• Metadata to be imported to Digital Library
• Will eventually be served from our network

http://www.youtube.com/smithsonianlibraries

Other Content
• Collections and Exhibitions
• Bibliographies, lists, subject guides
• Trade Literature
  • Sewing machines!
  • Scientific equipment!
  • Seed Catalogs!
• Smithsonian Publications (DSpace)
• Smithsonian Libraries Blog
• Art and Artist Vertical Files

W. Atlee Burpee & Co. Burpee's New Annual for 1910, 1910

Future Work
• More planning!
• Developing a LOD Vocabulary for
TL-2
• Continued parsing of content in
TL-2
• Continuing the development of
the Index Animalium content
• Publishing the Index Animalium
on the web as LOD

• How to leverage linked data to
create… what?

Leopoldo Galluzzo Altre scoverte fatte nella luna dal Sigr. Herschel , 1836

Thank you!

Joel Richard
richardjm@si.edu
@cajunjoel
http://slideshare.net/joelrichard
http://library.si.edu/staff/richardjm
