This document summarizes Joel Richard's presentation on implementing a Linked Open Data set for the Taxonomic Literature II dataset. The presentation covers converting the dataset to Linked Data by assigning HTTP URIs to identifiers, choosing appropriate vocabularies, and generating RDF triples. It also discusses the challenges of storing a large Linked Data set, potentially in a database rather than in Drupal, and provides examples of other Linked Data projects and resources.
1. Implementing a Linked
Open Data set
Joel Richard
Smithsonian Libraries
richardjm@si.edu
SLA Annual Conference, July
2. Who are the Smithsonian Libraries?
• 20 Libraries in the U.S. and Panama
• Supports research by staff and the public
• Strong effort to digitize pre-1923 texts
• Taxonomic Literature II is one of these texts
3. Summary of Agenda
• Our data set and process
• Conversion to Linked Data
• Storing Linked Data
• Examples and More Info
• Summary
• … and the best brew pubs in Chicago
4. Disclaimer
We are still learning.
5. What is Linked Data?
HTTP URIs identify things to humans and computers
Identifiers are related to other identifiers (or values) via predicates in a “triple”:
Charles Darwin // Creator // On the Origin of Species
See also :
http://linkeddata.org/
http://en.wikipedia.org/wiki/Linked_Data
http://richard.cyganiak.de/2007/10/lod/
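In code terms, a triple is just an ordered (subject, predicate, object) statement. A minimal Python sketch of the informal example above, using Darwin's VIAF URI (which appears later in this deck) and the Dublin Core "creator" term; the literal object and the direction of the statement mirror the slide, not any published dataset:

```python
# A triple relates an identifier to another identifier or a literal value
# via a predicate. The subject here is Charles Darwin's VIAF URI (also
# used later in this presentation); the predicate is Dublin Core "creator".
triple = (
    "http://viaf.org/viaf/27063124",      # subject: Charles Darwin
    "http://purl.org/dc/terms/creator",   # predicate: "Creator"
    "On the Origin of Species",           # object: a plain literal
)

subject, predicate, obj = triple
print(f"{subject} --[{predicate}]--> {obj}")
```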
9. Our process
Scanned the pages
Hired a contractor for OCR and correction (99.97% accuracy)
Received the XML dataset from the contractor
Verified and imported it into SQL Server
Built a website to search the data
11. Great! Let’s make some linked data!
First… what does 99.97% accuracy mean?
~12,000 Errors
12. Great! Let’s make some linked data!
Select Identifiers for your data
http://library.si.edu/tl-2/author/darwin
http://library.si.edu/tl-2/title/origin_of_species
http://library.si.edu/tl-2/title/1313
Choose vocabularies for predicates (harder than it sounds)
OWL, FOAF, DublinCore, OpenGraph,
SIOC, SKOS, BIBO, etc.
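A hedged sketch of the identifier-minting step. The slug rule below (lowercase, non-alphanumerics to underscores) is inferred from the example URIs above, not taken from the actual TL-2 site code:

```python
import re

def mint_uri(base, category, label):
    """Build an HTTP URI for a record from a human-readable label.
    The slug rule is an assumption inferred from the example URIs
    on this slide, not the site's real implementation."""
    slug = re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")
    return f"{base}/{category}/{slug}"

print(mint_uri("http://library.si.edu/tl-2", "author", "Darwin"))
print(mint_uri("http://library.si.edu/tl-2", "title", "Origin of Species"))
```

This reproduces the two human-friendly URIs shown above; the numeric form (`…/title/1313`) would simply use the TL-2 number directly instead of a slug.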
13. Mondeca Labs
Linked Open Vocabularies (LOV)
Vocabulary of a Friend (VOAF): a vocabulary for describing other vocabularies
http://labs.mondeca.com/dataset/lov
14. http://library.si.edu/tl2/author/darwin
tl2:creator
http://library.si.edu/tl2/title/1313
owl:sameAs
http://viaf.org/viaf/27063124
http://library.si.edu/tl2/title/origin…
dc:creator
http://library.si.edu/tl2/author/darwin
owl:sameAs
http://www.archive.org/details/originofspecies00darwuoft
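The statements on this slide can be written out mechanically as N-Triples. The `tl2:` namespace URI below is an assumption (the presentation implies a TL-2 ontology but does not publish one); `owl:` and `dc:` are the standard namespaces:

```python
# Format the slide's URI-to-URI statements as N-Triples lines.
PREFIXES = {
    "tl2": "http://library.si.edu/tl2/ns#",  # assumed; no TL-2 ontology is published here
    "owl": "http://www.w3.org/2002/07/owl#",
    "dc":  "http://purl.org/dc/terms/",
}

def ntriple(subject, curie, obj):
    """Expand a prefix:localName predicate and emit one N-Triples line."""
    prefix, _, local = curie.partition(":")
    return f"<{subject}> <{PREFIXES[prefix]}{local}> <{obj}> ."

statements = [
    ("http://library.si.edu/tl2/author/darwin", "tl2:creator",
     "http://library.si.edu/tl2/title/1313"),
    ("http://library.si.edu/tl2/author/darwin", "owl:sameAs",
     "http://viaf.org/viaf/27063124"),
]
for s, p, o in statements:
    print(ntriple(s, p, o))
```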
15. http://library.si.edu/tl2/author/darwin
RDF Type = foaf:Person
foaf:lastName, foaf:familyName
foaf:firstName, foaf:givenName
foaf:name, skos:prefLabel
tl2:birthYear
tl2:deathYear
skos:definition
tl2:personAbbreviation
http://library.si.edu/tl2/title/origin…
RDF Type = bibo:Book
tl2:titleNumber
dc:title
event:place
dc:publisher
tl2:titleAbbreviation
dc:created
16. Great! Let’s make some linked data!
How are we going to store all this?
We’re using Drupal: RDFa support is built in, and RDF extensions are available as an add-on module.
This is probably not a good idea for very large datasets.
TL-2: 10,000 authors + 37,000 titles becomes about 400,000 triples.
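A back-of-envelope check of that figure; the per-record triple count is an assumed average, roughly matching the predicate lists on slide 15:

```python
# Rough sanity check of the ~400,000-triple estimate. The average number
# of triples per record is an assumption, not a measured figure.
authors = 10_000
titles = 37_000
triples_per_record = 8.5  # assumed average; slide 15 lists ~8 predicates per type

estimate = (authors + titles) * triples_per_record
print(f"~{estimate:,.0f} triples")
```

With these assumptions the estimate comes out just under 400,000, in line with the slide's figure.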
17. Storage considerations
Performance of Drupal Import:
Feeds import: 7 hours for 35,000 records
Other options? Still searching…
Our linked data set will grow to at least 600,000-700,000 Drupal nodes.
Is Drupal the best way to do this?
18. Storage considerations
2000 US Census
19 million households received the “long form”
Joshua Tauberer converted the responses to over 1 billion triples
http://www.rdfabout.com/demo/census/
Carefully consider your storage options!
19. Storage
ARC2 (used by Drupal 7)
RDBMS via D2RQ
RDBMS via Triplify
OpenLink Virtuoso
See Also:
http://www.w3.org/2001/sw/rdb2rdf/use-cases/
20. Linked Data. What’s the point?
Disambiguation
Connecting Relevant Information
More visible via search
Enrichment of your data
Easier reuse of data
26. Other Examples and Info
Library of Congress: Linked Data Services
http://id.loc.gov/
Schema.org
http://www.schema.org
Data.gov / Semantic
http://www.data.gov/semantic
Linked Data.org
http://linkeddata.org/
Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244
27. Thank you!
richardjm@si.edu
http://slideshare.net/joelrichard
?
Editor’s notes
Originally this presentation was going to center around a discussion of our conversion of TL2 to linked data and what we learned, but I felt that it would be better to use it as an example of things to keep in mind when creating your own data sets.
Situated at the center of the world's largest museum complex, the Smithsonian Libraries forms a vital part of the research, exhibition, and educational enterprise of the Institution. The Libraries unites 20 libraries into one system supported by central collections support services. We maintain publication exchanges with more than 4,000 institutions worldwide that supply Smithsonian scientists and curators with current periodicals, exhibition catalogs, and professional society publications. Through preservation treatments, experts work to save the Smithsonian's 1.5 million printed books and manuscripts for future generations. Our Digital Library creates electronic versions of rare books and other distinctive collections, as well as exhibitions and specialized finding aids. We can be found on the web at http://library.si.edu
A brief summary of what this presentation includes.
I dislike disclaimers, but we’re still new to linked open data and are learning as we go. The idea of LOD has been around for several years now, so we are also playing a bit of catch-up. Our first goals are to get some data online, then start linking our data out to other sources, and encourage others to link to us. We don’t yet know how our data relates to others. It’s not scientific data created as part of a research project per se, but initially we see it as valuable, useful information, at least for some segments of the research world.
Since this presentation doesn’t center on what linked data is, we’re not going to spend much time on it. But just in case… Question: How many are familiar with linked data? Have linked data online? Wish they had linked data? Wish you had a website? This page is a quick summary for those who don’t know what linked data is. RTFM (Read The Friendly Manual!)
This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak’s searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there are 295 data sets listed. There are probably more today, and more being added every day. It is likely that not all data sets are represented here, so this is only a sample of what’s available. What’s the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
So as an example of how to create a data set, I’ll use Taxonomic Literature II. It is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. It contains almost 10,000 authors and about 37,000 publications. The reason to focus on TL-2 is that we aim to be the authority on the web for this information. We have received permission from the IAPT (International Association for Plant Taxonomy) to digitize and release this information on the web under an open license. TL-2 is used by many botanists, and their work is made easier by this data being online. Prior to 2012 this information was either located in a library or locked behind a paywall of sorts.
This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with highlights on those items that are immediately visible and can be parsed and turned into Linked Data. There is other data on the page that could be turned into linked data, but at this time we have only parsed the data that is highlighted here. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
Briefly, this was our process to create the data. In Jan 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we imported into a SQL Server database. Finally, we created a searchable, browseable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 visitors and 6,000 pageviews, with about 60% of those coming from outside the U.S.
This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
Earlier we mentioned 99.97% accuracy. This means that, if we assume 38 million characters in all of TL-2, there are upwards of 12,000 errors in our text. (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, it will prove to be a challenge. Other data sets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
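The arithmetic behind that error figure, as a quick check:

```python
# 38 million characters at 99.97% accuracy leaves 0.03% of characters wrong.
characters = 38_000_000
error_rate = 1 - 0.9997  # 0.03%

errors = characters * error_rate
print(f"~{errors:,.0f} character errors")  # ~11,400, i.e. "upwards of 12,000"
```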
So how do we create linked data? Basically, this is the approach we are using. There’s probably more that needs to be done, but today, this is what we know we need to do. The choice of identifier is important because, if possible, it should be human-friendly, but numbers are also common in places such as OCLC WorldCat. Additionally, the TL-2 number is a strong component, so we will very likely go with that as our primary identifier of publications.
Mondeca, an information management company based in Paris, as part of their “labs”, created a directory of linked open vocabularies and grouped them together by similar disciplines. Starting from largest to smallest, they are General and Meta, Library, City, Web, Space-Time, Science, Market (and finance), and Media. Library is the second largest on this list, which may be a matter of how the visualization is created, but may also mean that libraries are playing a big part in the LOD movement. This might help you figure out which vocabularies would be useful to you.
A sample of our TL-2 identifiers and four triples. Note that “tl2:creator” is not the same as “dc:creator”, which indicates that we will likely need to create our own ontology for describing the TL-2 dataset. (dc:creator is a reference from a title to an author. We also need the reverse, author to title.) Also note that we’ve crosslinked our two identifiers and, as an example, linked out to other information on the web. The link to the Internet Archive may not be appropriate, as it is not a LOD data set, but there is likely a predicate available for “read more” or “see also” for non-LOD websites that are related to the identifier.
A further breakdown of our data into linked data showing the predicates we might use for each. Again, the items in orange are specific to TL2 and may not exist in other LOD data sets. For example, the FOAF vocabulary has date of birth, but can we use only a year in that field? Will that foul up other computers? FOAF also doesn’t include date of death, which we definitely have. What predicate do we use? Do we create our own ontology and publish it? (Probably.) Finally, we haven’t yet begun a formal analysis of which existing ontologies might fit our needs.
Storage is a consideration. We’re not using a triplestore per se, but are instead relying on Drupal and ARC2 to handle the magic for us. This may or may not be a good solution for the long term.The next four slides are all text. You’ve been warned.
Performance is also a concern. It’s been challenging enough to get 47,000 records imported into Drupal. When we start to talk about an additional 500K items, we have some serious concerns about how well Drupal will hold up, just on the import side of things. We may need to investigate other methods of getting this data into Drupal, or other systems altogether, but that may create added complexity.
Another example that makes clear how much data you may be creating and how to manage it: the US Census sent the “long form” to a subset of 19 million households. The responses were converted to LOD by Joshua Tauberer, resulting in over a billion triples. I’m going to think very carefully before I start working with a billion of anything.
A few notes on software that can be used to open up your existing data as linked data. I have not had the opportunity to use any of these tools yet, but we may still use them in the future. ARC2 provides parsers, content negotiation, RDF storage, and a SPARQL endpoint. D2RQ allows accessing relational databases as virtual RDF graphs. Triplify is a plugin for web applications to expose your data as RDF, Linked Data, or JSON. Virtuoso is an enterprise-level product for normalizing all of your data sources, including providing that data as RDF.
Why should we create linked data? Disambiguation: are you searching for Venus the planet, Venus the sculpture, Venus the painting, or Venus the tennis player? Connecting relevant info: linking your data to other data may reveal things related to your data that you were unaware of. Search visibility: search engines, via schema.org and Google’s purchase of Freebase, are enhancing search, and things will only get better as we move forward. Enrichment of your data: as mentioned earlier, you may learn things you didn’t know about your data, or provide greater context to your data via LOD. Easier reuse: this is one of the central tenets of LOD. I, as a human, no longer need to say that Column B in your spreadsheet corresponds to the first_name field in my database.
Example of LOD in action. Google’s Knowledge Graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available, your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as “Lafayette” (the college, various U.S. cities, or the Marquis de Lafayette). http://google.com/
Example of LOD in action. Combines data from the Energy Information Administration (EIA) on Data.gov with data from OpenEI.org, the U.S. Census and SmartGrid in a mashup that’s easier to create with LOD. http://en.openei.org/apps/mashathon2010/
Example of LOD in action. NYTimes is offering a large dataset as LOD. As an example, they provided a tool to enter a university or college and find those people from their database who attended that institution. From there, we are able to see links to other databases and articles from NYTimes that refer to that person. All linked together.From the site: “As of 13 January 2010, The New York Times has published approximately 10,000 subject headings as linked open data under a CC BY license. We provide both RDF documents and a human-friendly HTML versions. The table below gives a breakdown of the various tag types and mapping strategies on data.nytimes.com.”http://data.nytimes.com/
This is an example of the “raw data” available at NYTimes, presented in a user-readable form. I could also argue that the identifier at NYTimes is not as good as it should be. A human-readable version would be better, but we see that one is among the owl:sameAs links.
At OCLC Worldcat, they have begun publishing the data about an individual item in Linked Open Data using schema.org. This is an example from Darwin’s Origin of Species. You’ll find the “Linked Data” section at the bottom of the page for the details of any individual book on WorldCat.http://www.worldcat.org/oclc/7619054
Finally, a few other examples of places where you can learn more about linked data, examples of other tools built with and for linked open data. The Library of Congress has made available their subject headings in linked data form to both humans and machines. Schema.org encourages the use of your metadata as a variant of linked data in your webpages. The US Government’s source for open data. Other countries are also making their data open on similar websites. There are many, many more sources, so search the web and see what you can find.
Thank you! As for brew pubs, I don’t live in Chicago and this is only my second time here, so I’m open to suggestions. There are a lot of bars in this town (as seen in the map).