A short talk in which I briefly discuss the Smithsonian Libraries' plans for Linked Open Data related to our Taxonomic Literature II and Index Animalium digitization projects.
Linked Open Data and Systematic Taxonomy
1. Linked Open Data and
Systematic Taxonomy
Joel Richard
Smithsonian Libraries
richardjm@si.edu
A tale of two publications
In three acts
2. Who are the Smithsonian Libraries?
• 20 Libraries in the U.S. and Panama
• Supports research of staff and the public
• Strong effort to digitize pre-1923 texts
• Index Animalium and Taxonomic
Literature II are two examples
4. Act I: The Players
(or, identifying the data with which
we are working and their meaning
and usefulness to the scientific
community.)
5. Taxonomic Literature II
Essential Reference
Tool for Botanists
Botanists/Authors
and Publications
from 1753–1940
Multiple indexes, “unique identifiers”
It is a “database in book form”
8. Index Animalium
Genus name, author
& citation for
430,000 animals
Covers Publications
from 1758–1850
Also a database, but
many challenges
still exist in the data.
10. Act II: The Linking
(or, identifying those data elements to
be linked, inherent challenges of
parsing OCR text, and identifying
linkable remote data sources)
12. foaf:lastName, foaf:familyName
foaf:firstName, foaf:givenName
foaf:name, skos:prefLabel
bio:birth
bio:death
skos:definition
tl2:personAbbreviation
tl2:titleNumber
dc:title
event:place
dc:publisher
dc:created
tl2:titleAbbreviation
http://library.si.edu/tl2/author/darwin
RDF Type = foaf:Person
http://library.si.edu/tl2/title/origin…
RDF Type = bibo:Book
13. Challenges with Our Data
• Errors in the Corrected OCR
• Challenges in Parsing Citations
• The 80/20 rule: manually making
connections unable to be made by
automated means
• Finding suitable sources of data to
link to (DBpedia? VIAF? EOL? Others?)
14. Linked Data Sources
Low-Hanging Fruit:
• DBpedia
• OCLC WorldCat
• Biodiversity Heritage Library
• Virtual International Authority File
• Encyclopedia of Life
• Library of Congress Subject Headings
• GeoNames
• Open Library
15. Act III: The Sum of the Parts
(or, our goals and desires for this
data, what it means to the linked
data world and the scientific
community in general)
16. What’s the point?
• This data may already exist online.
• It may also not always be as accurate
as needed for science.
• We are in a position to be the
authoritative source for this
information.
• Linked Data allows it to be easily
reused and shared.
Originally this presentation was going to center around a discussion of our conversion of TL2 to linked data and what we learned, but I felt that it would be better to use it as an example of things to keep in mind when creating your own data sets.
Situated at the center of the world's largest museum complex, the Smithsonian Libraries forms a vital part of the research, exhibition, and educational enterprise of the Institution. The Libraries unites 20 libraries into one system supported by central collections support services. We maintain publication exchanges with more than 4,000 institutions worldwide that supply Smithsonian scientists and curators with current periodicals, exhibition catalogs, and professional society publications. Through preservation treatments, experts work to save the Smithsonian's 1.5 million printed books and manuscripts for future generations. Our Digital Library creates electronic versions of rare books and other distinctive collections, as well as exhibitions and specialized finding aids. We can be found on the web at http://library.si.edu
I dislike disclaimers, but we’re still new to linked open data and are learning as we go. The idea of LOD has been around for several years now, so we are also playing a bit of catch-up. Our first goals are to get some data online, then start linking our data out to other sources and encourage others to link to us. We don’t yet know how our data relates to others’. It’s not scientific data created as part of a research project per se, but initially we see it as valuable, useful information, at least for some segments of the research world.
So as an example of how to create a data set, I’ll use Taxonomic Literature II. It is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. It contains almost 10,000 authors and about 37,000 publications. The reason to focus on TL-2 is that we aim to be the authority on the web for this information. We have received permission from the IAPT (International Association for Plant Taxonomy) to digitize and release this information on the web under an open license. TL-2 is used by many, perhaps most, botanists, and their work is made easier by this data being online. Prior to 2012 this information was either located in a library or locked behind a paywall of sorts.
This is a page of TL-2 showing Charles Darwin and On the Origin of Species, with the items that are immediately visible and can be parsed and turned into Linked Data. There is other data on the page that could be turned into linked data, but at this time we have only parsed the data that is highlighted on this page. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
Index Animalium, published in the late 1800s and early 1900s, contains 430,000 species names from 7,000 scientific volumes published between 1758 and 1850. Charles Davies Sherborn dedicated much of his life to this work. The volumes consist of the index to species, with one species plus citation per line, and a bibliography listing the titles that Sherborn read. Challenges in the data include inconsistent citation formats, two kinds of abbreviations (in both the index and the bibliography), and errors introduced during the printing process.
This is one example of a page from Index Animalium for Papilio (Danaus) plexippus, a.k.a. the Monarch butterfly. The abbreviations:
• Linnaeus: Carl Linnaeus
• Syst. Nat.: Systema Naturae
• Ed 10: 10th edition
• 1758: Publication year
• 471: Page 471
Also the 12th edition, published in 1767, page 767.
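To make the parsing problem concrete, here is a minimal Python sketch of turning one such abbreviated citation into named fields. The input layout ("author, title abbreviation, ed. N, year, page"), the function name parse_citation, and the one-entry abbreviation table are all assumptions for illustration; the real Index Animalium lines are far less regular, which is exactly the challenge described above.

```python
import re

# Hypothetical lookup table; the real bibliography holds many more abbreviations.
TITLE_ABBREVIATIONS = {"Syst. Nat.": "Systema Naturae"}

# Assumed, simplified layout: "<author>, <title abbrev>, ed. <n>, <year>, <page>"
CITATION_RE = re.compile(
    r"^(?P<author>[^,]+),\s*"
    r"(?P<title>.+?),\s*"
    r"[Ee]d\.?\s*(?P<edition>\d+),\s*"
    r"(?P<year>\d{4}),\s*"
    r"(?P<page>\d+)\.?$"
)


def parse_citation(text):
    """Return the citation fields as a dict, or None if the line does not fit."""
    match = CITATION_RE.match(text.strip())
    if match is None:
        # Lines that do not fit the pattern are left for manual review.
        return None
    title = match.group("title").strip()
    return {
        "author": match.group("author").strip(),
        # Expand the abbreviation when we know it, otherwise keep it as printed.
        "title": TITLE_ABBREVIATIONS.get(title, title),
        "edition": int(match.group("edition")),
        "year": int(match.group("year")),
        "page": int(match.group("page")),
    }


if __name__ == "__main__":
    print(parse_citation("Linnaeus, Syst. Nat., ed. 10, 1758, 471"))
    print(parse_citation("Linnaeus, Syst. Nat., ed. 12, 1767, 767"))
```

Returning None for lines that do not match keeps the automated pass honest: anything it cannot parse is set aside for a person to look at rather than guessed.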
Identified here are the data elements that are “easy” to identify and can be brought to linked data. We still need to contend with the challenges of parsing these into actual citations. The TL-2 data at the top has already been parsed and loaded into a database. Index Animalium is posing a greater challenge and will take longer to complete.
A further breakdown of our data for TL-2 into linked data, showing the predicates we might use for each. Again, the items in orange are specific to TL-2 and may not exist in other LOD data sets. For example, the FOAF vocabulary has date of birth, but can we use only a year in that field? Will that foul up other computers? FOAF also doesn’t include date of death, which we definitely have. What predicate do we use? Do we create our own ontology and publish it? (Probably.) Finally, we haven’t yet begun a formal analysis of which existing ontologies might fit our needs.
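As a rough illustration of what such a record could look like, the Python sketch below writes N-Triples for the Darwin author entry using the author URI and the foaf predicates from the slide. The tl2 namespace URI and the birthYear and deathYear predicates are hypothetical placeholders, not a published ontology. Typing the year as xsd:gYear is one possible answer to the year-only question, not a decision we have made.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
# Hypothetical namespace for TL-2 specific terms; no such ontology is published yet.
TL2 = "http://example.org/tl2/terms/"


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Format a quoted literal, optionally with a datatype."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    if datatype is not None:
        return f'"{escaped}"^^<{datatype}>'
    return f'"{escaped}"'


def author_triples(subject, given, family, abbreviation, birth_year, death_year):
    """Return one N-Triples line per statement about an author."""
    rows = [
        (RDF + "type", iri(FOAF + "Person")),
        (FOAF + "givenName", literal(given)),
        (FOAF + "familyName", literal(family)),
        (FOAF + "name", literal(f"{given} {family}")),
        (TL2 + "personAbbreviation", literal(abbreviation)),
        # Year-only values as xsd:gYear; predicate names are assumptions.
        (TL2 + "birthYear", literal(str(birth_year), XSD + "gYear")),
        (TL2 + "deathYear", literal(str(death_year), XSD + "gYear")),
    ]
    return [f"{iri(subject)} {iri(predicate)} {obj} ." for predicate, obj in rows]


if __name__ == "__main__":
    lines = author_triples(
        "http://library.si.edu/tl2/author/darwin",
        "Charles", "Darwin", "Darwin", 1809, 1882,
    )
    print("\n".join(lines))
```

Keeping the TL-2 specific terms in their own namespace means they can be swapped for terms from an existing vocabulary later without touching the FOAF statements.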
80/20 Rule: You spend 20% of your time on 80% of the work and 80% of your time on the remaining 20% of the work. We are at that point with Index Animalium. We would like to do further parsing of the TL-2 data, but it will pose challenges similar to those of Index Animalium.
Some potential sources of data that we can link to. We’d like to one day have some of these link back to us, thereby completing the circuit for a linked data web of knowledge.
This is what we would like to do: A researcher enters a botanist name or a species name and is taken directly to the page in the book referenced by that entry. If the book is not known to be digitized and online, then we can redirect them to OCLC WorldCat to find a copy of that book in their local library. This is a great improvement for those who wouldn’t normally have access to these books in their local library.
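The lookup-with-fallback idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the DIGITIZED_PAGES table, its example.org URL, and the resolve function are hypothetical, and the WorldCat search URL pattern is assumed rather than taken from any service we have built.

```python
from urllib.parse import urlencode

# Hypothetical index from TL-2 title number to a known digitized copy.
DIGITIZED_PAGES = {
    "1313": "https://example.org/scans/origin-of-species",
}

# Assumed WorldCat search URL pattern, used only when no scan is known.
WORLDCAT_SEARCH = "https://www.worldcat.org/search?"


def resolve(title_number, title):
    """Return a link to the digitized book, or a WorldCat search as a fallback."""
    url = DIGITIZED_PAGES.get(title_number)
    if url is not None:
        return url
    # Not known to be online: send the researcher to a library catalog search.
    return WORLDCAT_SEARCH + urlencode({"q": title})


if __name__ == "__main__":
    print(resolve("1313", "On the Origin of Species"))
    print(resolve("9999", "Systema Naturae"))
```

The point of the fallback is that the researcher always gets somewhere useful, even when we have not digitized the book ourselves.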