A follow-up to our 2011 presentation on the new Linked Open Digital Library, discussing how we are creating a digital library centered around Linked Open Data. Includes details on how we are creating a dataset of botanists and their publications that is to be shared as linked open data.
1. Building the New
Open Linked
Library
(Revisited)
Joel Richard
LITA National Forum 2012
October 5, 2012
2. Smithsonian Libraries
• Founded in 1846
• 1.5 m volumes in collection, plus assorted
archival collections
• 15,000 volumes scanned and online
• 20 libraries serving ~500 researchers/curators
+ hundreds of fellows and interns
• 105 library staff
• 1.5 web staff
• Founding member of the Biodiversity
Heritage Library
Le Garde-meuble, ancien et moderne [Furniture repository, ancient and modern], 1839-1935
3. (From 2011)
Drupal and Linked Data
• Native support for RDFa in Drupal 7.
• RDF Extensions (rdfx) – even more features.
• Vocabularies can be imported and cached for
reuse.
• Few or no modifications to HTML to support
RDFa.
What’s the difference between RDF,
RDF/XML and RDFa?
LITA National Forum, September 30,
2011
6. TL-2 Page Sample Results (From 2011)
http://library.si.edu/tl2/author/darwin
tl2:creatorOf “http://library.si.edu/tl2/book/1313”
owl:sameAs “http://viaf.org/viaf/27063124”
foaf:lastName “Darwin”
foaf:familyName “Darwin”
foaf:firstName “Charles”
foaf:givenName “Charles”
foaf:name “Darwin, Charles Robert”
skos:prefLabel “Darwin, Charles Robert”
tl2:birthYear “1809”
tl2:deathYear “1882”
tl2:description “British evolutionary biologist”
tl2:personAbbrev “Darwin”
http://library.si.edu/tl2/book/1313
dc:creator “http://library.si.edu/tl2/author/darwin”
owl:sameAs “http://www.archive.org/details/originofspecies00darwuoft”
tl2:bookNumber “1313”
bibo:shortTitle “On the origin of species”
dc:title “On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life.”
event:place “London”
dc:publisher “John Murray”
dc:created “1859”
tl2:bookAbbreviation “Origin sp.”
8. (From 2011)
Who is reusing our data?
Ryan Schenk – http://ryanschenk.com/2011/02/visualizing-taxonomic-synoymns/
9. (From 2011)
Who is reusing our data?
Encyclopedia of Life – http://eol.org/
10. Linked Data Review
• Publishing structured data on the web
• RDF (Resource Description Framework)
• Enables computer-to-computer queries
• Uses standard ontologies (vocabularies)
• Data is presented as “triples”
URI http://library.si.edu/tl2/author/charles-darwin
Predicate owl:sameAs
Object http://viaf.org/viaf/27063124
12. Linked Data Review
Charles Darwin → Type → Person
Charles Darwin → Born On → “Feb 12 1809”
Charles Darwin → Born In → Shrewsbury
Shrewsbury → Type → City
Shrewsbury → Is In → England
England → Type → Country
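The diagram on this slide can be sketched as a small list of triples. This is only an illustration: the short names stand in for full URIs, the edges are read from the slide as best they can be recovered, and `match`, `objects`, and `birth_country` are hypothetical helpers rather than part of any RDF library.

```python
# Illustrative only: short names stand in for full URIs.
TRIPLES = [
    ("Charles Darwin", "type", "Person"),
    ("Charles Darwin", "bornOn", "Feb 12 1809"),
    ("Charles Darwin", "bornIn", "Shrewsbury"),
    ("Shrewsbury", "type", "City"),
    ("Shrewsbury", "isIn", "England"),
    ("England", "type", "Country"),
]


def match(subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the parts that were given."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def objects(subject, predicate):
    """Return the objects of all triples with this subject and predicate."""
    return [o for (_, _, o) in match(subject, predicate)]


def birth_country(person):
    """Follow bornIn, then isIn, until something typed as a Country is found."""
    for city in objects(person, "bornIn"):
        for place in objects(city, "isIn"):
            if "Country" in objects(place, "type"):
                return place
    return None
```

Calling `birth_country("Charles Darwin")` walks two links to reach England, which is the kind of question a machine can answer once the facts are published as triples.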
13. Our Website
Organically grown since 1995
• 83,000 HTML pages
• 3,700 ColdFusion pages
• 253,000 JPEG files
• 27,000 PNG files
• 46,000 PDFs
No CMS for legacy information
Now using Drupal for “Brochure-ware”
14. Content Analysis
• 400+ Online “books”
• Exhibitions
• Research Tools
• Image Collections (16,000+ images)
• “Brochure” content (About us, Locations, Hours)
• Bibliographies, Fact Sheets, Subject Guides
• Databases, inventories, and database-like books
Collections not on our website:
• ~15,000 digitized volumes, with many more planned
• Other analog collections that will be digitized
Bureau of American Ethnology Bulletin 164; Sewing Machine Trade Literature; Underwater Web Exhibition, Smithsonian Libraries
15. Linked Data in our Library
Books (and book-like objects)
• Expose bibliographic data for reuse
• Consume links to other internal
content and external authoritative
data
Databases
• Expose data previously unavailable
• Provide authoritative data
• Consume our data and others’ to
create new aggregate websites
16. Linked Data in our Books
http://library.si.edu/tl2/author/darwin
RDF Type = foaf:Person
foaf:lastName, foaf:familyName
foaf:firstName, foaf:givenName
foaf:name, skos:prefLabel
tl2:birthYear
tl2:deathYear
tl2:description
tl2:personAbbrev
http://library.si.edu/tl2/book/1313
RDF Type = bibo:Book
tl2:bookNumber
dc:title
event:place
dc:publisher
tl2:bookAbbreviation
dc:created
17. Linked Data Tools (Drupal)
• Fields, Views, Views UI
• Node Reference
• SPARQL Endpoint, SPARQL API
• RESTful Web Services
• SPARQL Views
• RDF External Vocabulary Importer
Caveat: Some modules not ready for Drupal 7
• e.g., Biblio module (no CCK, RDF capabilities)
18. Disclaimer
We are still learning!
How to effectively use Drupal
What goes into a Digital Library
How to best leverage
Linked Open Data
(Also: We will always be learning.)
J. L. Hammett Illustrated Catalogue of School Merchandise 1872-1873…, 1872-1874
19. What is a Digital Library?
More than a virtual stack of books
Digital allows more capabilities, access
Interlinked Content (See more from this item)
What content will be in our digital library?
Digitized Books
Image Library
Collections (of things)
Exhibitions
Databases
Lists / Bibliographies
Smithsonian Publications
Videos
“Trade Literature” and other non-cataloged items
20. Knowledge/Data Sharing
Taxonomic Literature II
Essential botanical reference
15 volumes
9,000 Botanists
37,000 Titles authored by these botanists
More modern, simpler to handle
Index Animalium
35 Volumes
430,000 Scientific Names
Each with a citation to first description
7000+ items in the bibliography, many linked to WorldCat
Older, challenging in nature
21. Our Process for TL-2
Scanned the pages
Hired contractor for OCR and correction
(99.97% accuracy)
Received XML dataset from Contractor
Verified and Imported to SQL Server
Built a website to search the data
23. Before we import…
What exactly does 99.97% accuracy mean?
~12,000 Errors
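A quick way to see where a figure like ~12,000 comes from is to work the arithmetic both ways. The sketch below assumes the accuracy is measured per character, which the slide does not state; `implied_total` and `expected_errors` are names made up for this example.

```python
from fractions import Fraction


def implied_total(errors, accuracy):
    """Total units (assumed: characters) implied by an error count and accuracy."""
    error_rate = 1 - Fraction(accuracy)  # "0.9997" -> 3/10000
    return errors / error_rate


def expected_errors(total, accuracy):
    """Errors to expect in `total` units at the given accuracy."""
    return total * (1 - Fraction(accuracy))


# 12,000 errors at 99.97% accuracy implies about 40 million characters.
print(int(implied_total(12_000, "0.9997")))
```

Passing the accuracy as a string keeps the arithmetic exact. Under the per-character assumption, a small-sounding 0.03% error rate still leaves thousands of corrections to make.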
24. Importing
Millions of records are no problem for
modern databases. But, how to get data
into Drupal?
Use existing tools?
Create my own import?
The Muralo Company Muralo: Sanitary Wall Coatings in the Home, 1912
25. Importing
Import via existing tools
Used Drupal’s Feeds Importer
Typically used for importing RSS or similar
Fast to set up (< 5 minutes)
Slow to import (47,000 records = 8+ hours)
Poor error recovery (imported 5 times)
What if the data changes in the future?
Faster ≠ Better
26. Importing
Write my own import. But how?
Make a Drupal Module!
Steep Learning Curve (many APIs)
Faster to import (48,000 records = 85 minutes)
Added bonus: Modules can be versioned
Can use the “version update” code to update our data
Versioned modules good for Dev / Prod servers
27. Importing
Digitized Books Online
Similar module for importing
Module also handles a page for reading books online
Uses Internet Archive book reader in an <IFRAME>
Links to WorldCat / VIAF
FAST Subjects
Table of Contents Navigation
Eligible for Linked Open Data
http://archive.org/details/smithsonian
28. Data Schema: British Library
http://talis-systems.com/wp-content/uploads/2011/07/British-Library-Data-Model-v1.01.pdf
29. Data Schema
What data model are we going to use?
British Library
Schema.org
Something else?
What vocabularies are we using?
Dublin Core
OWL
SKOS
BIBO
BIO
FOAF
Event?
Org?
Geo?
Our own vocabulary for TL-2
30. Other Content
Galaxy of Images
Image collection of plates from our digitized books
18,000 images and growing
Richer set of metadata
Data needs to be massaged / imported
Images served from another system
http://www.sil.si.edu/imagegalaxy/
31. Other Content
Videos
All are currently on YouTube
Will remain there for now
Metadata to be imported to Digital Library
Will eventually be served from our network
http://www.youtube.com/smithsonianlibraries
32. Other Content
Collections and Exhibitions
Bibliographies, lists, subject guides
Trade Literature
Sewing machines!
Scientific equipment!
Seed Catalogs!
Smithsonian Publications (DSpace)
Smithsonian Libraries Blog
Art and Artist Vertical Files
W. Atlee Burpee & Co. Burpee's New Annual for 1910, 1910
33. Future Work
More planning!
Developing a LOD Vocabulary for
TL-2
Continued parsing of content in
TL-2
Continuing the development of
the Index Animalium content
Publishing the Index Animalium
on the web as LOD
How to leverage linked data to
create… what?
Leopoldo Galluzzo Altre scoverte fatte nella luna dal Sigr. Herschel , 1836
(2-3 min) Open with an introduction of who SIL is and what we do. (Old Slide 1 and 2)
Questions: How many know SI has libraries? How many have visited the libraries? How many want to visit?
To recap from last year, we covered a solid introduction on linked data and how Drupal 7 supports it out of the box via the built-in RDF and RDFx modules.
We talked about what RDFa might look like in a webpage or RDF/XML stream that we are creating.
We discussed TL-2 (Taxonomic Literature II), a reference tool for botanists, and how we are converting it to Linked Open Data.
And finally for TL-2 we offered some idea of the kind of data that we might be producing in RDF. This is yet another representation of the linked data, this time in N-Triples format.
Finally, we talked about how open data (not linked open data) is benefiting the Biodiversity Heritage Library. If you spend any amount of time around me, you’ll find that I will eventually come around to talking about this.
And some examples of how people have used open data. This person mapped the usage of certain animal names over time and how they fall in or out of favor as time progresses. Those bars are time periods across 200 years of natural history literature.
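To make the N-Triples idea concrete, here is a minimal sketch that writes two of the Darwin statements from the sample slide as N-Triples lines. The `owl:` and `foaf:` namespaces are the published ones; the helper names (`iri`, `literal`, `ntriple`) are invented for this example, the escaping covers only the common cases, and the `tl2:` terms are left out because their namespace URI is not given in the slides.

```python
OWL = "http://www.w3.org/2002/07/owl#"
FOAF = "http://xmlns.com/foaf/0.1/"
DARWIN = "http://library.si.edu/tl2/author/darwin"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Quote a plain string literal, escaping backslash, quote and newline."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def ntriple(subject, predicate, obj, object_is_iri=False):
    """Build one N-Triples line: subject, predicate, object, then a full stop."""
    rendered = iri(obj) if object_is_iri else literal(obj)
    return f"{iri(subject)} {iri(predicate)} {rendered} ."


lines = [
    ntriple(DARWIN, OWL + "sameAs", "http://viaf.org/viaf/27063124", object_is_iri=True),
    ntriple(DARWIN, FOAF + "name", "Darwin, Charles Robert"),
]
print("\n".join(lines))
```

Each printed line is a complete statement on its own, which is what makes the format easy to stream, split, and load in bulk.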
SLIDE: Overview of Linked Data (concept, statistics)
This is linked data in action. Google knowledge graph. Google acquired Metaweb in 2010 and in that process, they got Freebase, which eventually was used to create this new pane of information on Google.
SLIDE: Details of Linked Data (diagram of triple)
(5 min) Review our discussion from last year. Sharing knowledge is our prime directive. Linked Data is a no-brainer. Not going to review what linked data is (unless we need to?)
SLIDE: Overview of our website (statistics, content, etc.) (LITA 2011 page 6)
Questions: How many know what linked data is? Do we need to review?
SLIDE: Content that could be linked data (LITA 2011 page 9)
Quick review of what things have good metadata for linking.
We said we would have something up in about one year. (Ha!)
Last year I reviewed some of the details of how we are converting to linked data.
(Show this again, but only briefly) (Old Slide 20, 21, 22)
SLIDE: Details of Darwin's linked data fields (LITA 2011 page 22)
TAKEAWAY: Know your data (or whatever it is you’re sharing). Become intimately familiar with it. Take it on a date.
List some of the modules we are using (Old Slide 15)
SLIDE: List of Drupal Modules (LITA 2011 page 15)
Questions: How many of you are using linked data? What data do you have that could be useful if linked? Know that if you raise your hand, I'm going to pick on you throughout the rest of the talk. :)
Disclaimer: We are still learning as we go! Even we, the Smithsonian, are figuring things out. We are also constrained by budgets, personnel and other requirements, possibly more so as a government entity.
SLIDE: We are still learning
First we had to decide what a Digital Library was. Our instinct is to go online and see what other people are doing. This is fine and all, but I think it's safe to say that we know what data we have, and we know what we are doing as we move from an old website to a new one. What of it belongs in the digital library? Well... here's what we have.
It’s safe to say that we know our data, though we may go to others to see how to present that data. We’ll also use focus groups and usability studies to analyze the site once we have a beta.
TAKEAWAY: You’ll always be learning. :) If you stop, you become irrelevant.
SLIDE: What is a digital library? Books? Images? Exhibitions? Databases? Research papers? All of these things?
Question: How many of you have a “digital library”? Want one?
Question: Is anyone out there working with data that doesn't fall into these? I'm curious as to what else might be out there.
As far as vast amounts of linked data go, currently there are two that stand out as really good, useful datasets:
SLIDE: Two data sets: TL2 and Index Animalium, numbers of records, types of data
For us, the things that make sense to publish as linked data are TL2 (47k records) and Index Animalium (500k records). TL2 is almost there. Index Animalium has a long, long way to go. We'll come back to that.
The first phase of our process was to get us on Drupal. This actually took longer than we'd hoped due to the planning required by our nature as a government institution. We have a certain level of planning and security analysis that must be done. That said, we have a simple brochure-ware website that is online at library.si.edu.
CHM licensed their base metadata for their collections as CC0. Talk to the lawyers first.
TAKEAWAY: Creative Commons (or CC0) licensing of metadata is becoming popular. We have a CC-BY license for TL2. Index Animalium is public domain. I think. We are libraries and we have a lot to share with the internet. Let’s make it happen so that others don’t.
SLIDE: TL2 website as it is today. How do we get it into Drupal? We use a module!
Drupal is capable of handling millions of records, but getting those records into Drupal is not the easiest thing in the world. How do we import 430,000 species names for Index Animalium?
Question: How many others are using a CMS? Drupal? (what is the name of that MS Technology to compete with Drupal?) PHP? ASP? Java? Others?
Question: Is anyone developing in Drupal? Modules? Themes?
Now that we are on Drupal, we can move forward with some data! Yeah! Bring on the import!
Disclaimer: The actual steps are specific to Drupal, but you may find yourself in a similar situation of trial and error.
Last time I reviewed how we were going to take this taxonomic literature thing to linked data. We have something almost online, but let's review where we are...
We first imported via Feeds Importer (Question: anyone familiar?). Then we had to import again. Oops, the data was wrong again, so we had to import AGAIN. Three weeks later, I gave up. It was too slow and too painful.
SLIDE: Feeds importer: 7 hours. 47,000 records in 7 hours? 1.8 rec/sec - Dismal!
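The rate on that slide is simple arithmetic, and the same arithmetic shows why the approach would not scale to Index Animalium. This is a back-of-the-envelope sketch: the function names are invented, and the projection assumes the 430,000 names would import at the same rate as the TL-2 records, which is only a guess.

```python
def records_per_second(records, hours=0.0, minutes=0.0):
    """Average import rate for a run that took the given elapsed time."""
    seconds = hours * 3600 + minutes * 60
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return records / seconds


def hours_needed(records, rate):
    """Hours required to import `records` at `rate` records per second."""
    return records / rate / 3600


feeds_rate = records_per_second(47_000, hours=7)
index_animalium_hours = hours_needed(430_000, feeds_rate)

print(f"Feeds importer: {feeds_rate:.2f} records/sec")
print(f"430,000 names at that rate: {index_animalium_hours:.0f} hours")
```

At roughly 1.87 records per second (the slide truncates this to 1.8), the Index Animalium names would take around 64 hours for a single pass, before any re-imports.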
So I wrote a module! Yay! Module development! This makes sense. But there was one major challenge: I didn't know how to build modules in Drupal. So I learned. And then I realized that I could import the data as part of the installation of the module. Import times dropped to 81 minutes (an improvement, as I could control what the database was doing and minimize database traffic).
SLIDE: Drupal Module development is hard! Steep learning curve. List APIs that I had to become familiar with: Field. Node. Theme. Styling. Preprocess Functions. Render Elements.
And THEN I learned that we could use the versioning of modules to update the data down the road, either to create new database fields, munge the data, etc. This is a nice feature since we couldn't do that before. (A 12-15 hour downtime for our TL-2 site would have been a bad thing indeed.)
TAKEAWAY: Consider your options; the easy way is not always faster/better.
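The versioned-update idea can be sketched outside Drupal. The Python below is not the module's code (Drupal modules are PHP, where numbered update functions live in the module's install file); it only illustrates the pattern of applying, in order, every update newer than the installed version. The two updates and their field names are hypothetical.

```python
def update_1(records):
    """Hypothetical update: tidy stray whitespace in names."""
    for record in records:
        record["name"] = record["name"].strip()


def update_2(records):
    """Hypothetical update: add a new field with an empty default."""
    for record in records:
        record.setdefault("death_year", None)


# Version number -> function that brings the data up to that version.
UPDATES = {1: update_1, 2: update_2}


def apply_updates(records, installed_version):
    """Run each pending update once, in order, and return the new version."""
    for version in sorted(UPDATES):
        if version > installed_version:
            UPDATES[version](records)
            installed_version = version
    return installed_version


records = [{"name": " Darwin "}]
version = apply_updates(records, installed_version=0)
```

Because the installed version is recorded, running `apply_updates` again does nothing, so the same code can be deployed to Dev and Prod and each server catches up from wherever it is.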
So, we decided to use another module! Home grown! Versioned Data! We needed something to manage the delivery of the books using the IFRAME version of the Internet Archive book-reader. But uploading the data is even better. This time we were able to import in about 5 minutes. This handles the books, authors, vocabularies, subjects (FAST?), places as subject, timeframe as subject. It also handles the links between them. Much of this data came from the MARCXML record, but sometimes we used MODS (where it was easier).
Synchronization issues regarding the book metadata between IA, SIRIS, Picklist and Drupal. FUN!
What do you do when your data lives in multiple places? One master many slaves? Multi-master? Mixed bag of drunken cats?
SLIDE: And books have linked data, too! We're not sure how we are going to link it, but at least we'll have OCLC number, author name to VIAF, etc.
Before we really began building our site, we needed to firm up our data model and make sure we had a good idea of how everything is going to relate to each other. This is an example of what the British Library created. I think they were very thorough and included a lot of detail. It is probably overkill for what we want to do, but who knows, we may end up in the same place, though maybe not in such an explicit manner.
How do we structure our data? How do we organize it? What vocabularies will we be using? QUESTION: For those who are familiar with LOD, are you using any vocabularies other than these? Anyone making their own?
Talk about Galaxy of Images, other elements in the digital library
Plates and other pretty pictures. Show the website for GOI. Search page, etc.
Highlight the balloon that was StumbleUpon-ed and boosted our traffic 100-fold. Show a picture of the GA chart of the traffic.
The data needs some cleanup. Standardization of the subjects metadata.
Images need to be moved into DAMS (Artesia digital asset management system).
This is being done in coordination with the manager of the GOI and our metadata team.
As an aside, one of the things we do to get new pretty images is to capture the plates from our metadata collection thingy for the BHL. We divert a stream of data of the "pretty pictures" from there into the Galaxy of Images through a mostly automated process. This will automatically upload
(For ongoing projects, stress automation where possible. Take humans out of the equation. As smart as we are, we make mistakes. Code doesn't, unless we make mistakes in our code, and it frees us to do other things.)
Talk about Videos
Over 8 or 10 years of them; we needed to round them all up and get them organized. Lectures, animations, videos, interviews, demos, informational things. 30-40 of them? All are (or should be) on YouTube at this point in time. Ultimately we will serve them from our DAMS.
Centered around our content, exhibitions, etc.
Collections / Exhibitions
Arbitrary collections of things. Exhibitions, too.
Collections: arbitrary grouping of things under a heading (category) with maybe some introductory text.
Exhibition: Same thing, but more sequential, telling a story or narrative. Order becomes more important than in collections. Possibly more words.
Bibliographies, lists of things, subject guides
Legacy content. Not sure if we need to keep it alive. Is it something that people continue to use? We’ll check out our analytics. HOWEVER, as they are tied to the library itself, we’ve already had to migrate them to the new site. Perhaps a bit of wasted effort, but at least it’s easier to manage now.
Trade Literature
Describe them: scientific instruments, sewing machines!
How are they catalogued? (They are not.) Catalogued by manufacturer; well, inventoried. Nothing is scanned. We would like to scan them, but it poses some of its own challenges in how we organize the content. Each catalog can’t be a record in our, um, catalog, can it?
SI Publications
Collecting the output of the researchers at the Smithsonian to gauge their … effectiveness, reach, influence (Klout?)
Currently in DSpace, will likely stay there, but we want to index and search it via the website; see: Summon Discovery Layer
Blog
The blog is part of the website, too, but as it lives off in its own world, we don’t really need to concern ourselves with it because it’s not really part of the digital library per se.
TAKEAWAY: Each set of content that you have may be different from the others. Creating a digital library is not going to be an easy task.
To do in the future:
Made our own vocabulary for TL2: turns out we only needed two or three terms. The Biography vocabulary had much of what we needed already.
Plan the migration of our exhibitions, which will lay the foundation for other online collections.
Migrate our images into our DAMS, refining the metadata in the process, which will spare us from having to store all these images on our web server.
Figure out a method of handling collections and the arbitrary ordering of things. Is there a module? Should we make one? Should we reuse things that already exist? (Yes!)
List some of the other tools that people might use for LOD. Take from my talk at SLA.
Discuss Summon and the giant black box that it is.
It’s on the way; it will be the discovery layer for our entire site. All our data needs to get into it, including our catalog, our licensed content, all website content, and blog content. API development and integration with Drupal are a big mystery. Do I see another module in my future? :) If so, it will be similar to that of the Google Search Appliance module.
How to leverage LOD for more stuff. Artists Files, Trade Lit, etc. linking to our catalog, history books, etc.
TAKEAWAY: A website is a living, breathing, growing beast. It needs care and feeding and love and attention to keep it going.