Implementing a Linked
    Open Data set


          Joel Richard
      Smithsonian Libraries
        richardjm@si.edu



                              SLA Annual Conference, July
Who are the Smithsonian Libraries?
   • 20 Libraries in the U.S. and Panama
   • Supports research of staff and the public
   • Strong effort to digitize pre-1923 texts
   • Taxonomic Literature II is one of these
     texts




Joel Richard,                    SLA Annual Conference, July
Summary of Agenda

   • Our data set and process
   • Conversion to Linked Data
   • Storing Linked Data
   • Examples and More Info
   • Summary
   • … and Best brew pubs in Chicago


Disclaimer




                We are still learning.




What is Linked Data?
   HTTP URIs identify things to humans and
    computers
   Identifiers are related to other identifiers (or
     values) via predicates in a “triple”:

            Charles Darwin // Creator // On the Origin of Species


   See also:
       http://linkeddata.org/
       http://en.wikipedia.org/wiki/Linked_Data
       http://richard.cyganiak.de/2007/10/lod/

http://richard.cyganiak.de/2007/10/lod/


Taxonomic Literature II

Essential Reference
  Tool for Botanists

Authors and their
 Publications from
 1753 to 1940

It is a “database in book form.”

Our process

   Scanned the pages
   Hired contractor for OCR and correction
    (99.97% accuracy)
   Received XML dataset from Contractor

   Verified and Imported to SQL Server
   Built a website to search the data

Great! Let’s make some linked data!



   First...what does 99.97% accuracy mean?



                ~12,000 Errors
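As a rough check of that figure, here is a minimal sketch of the arithmetic. It assumes, as the speaker notes do, about 38 million characters in all of TL-2, and it treats 99.97% as a per-character accuracy. The function name is illustrative only.

```python
def estimated_errors(total_chars: int, accuracy: float) -> int:
    """Return the expected number of wrong characters at a given accuracy."""
    error_rate = 1.0 - accuracy
    return round(total_chars * error_rate)


# Assumption: roughly 38 million characters in TL-2 (from the speaker notes).
TL2_CHARS = 38_000_000

print(estimated_errors(TL2_CHARS, 0.9997))  # 11400
```

Under these assumptions the estimate is about 11,400 characters, which the slide rounds to ~12,000.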


Great! Let’s make some linked data!

   Select Identifiers for your data
       http://library.si.edu/tl-2/author/darwin
       http://library.si.edu/tl-2/title/origin_of_species
       http://library.si.edu/tl-2/title/1313


   Choose vocabularies for
    predicates (harder than it sounds)

   OWL, FOAF, Dublin Core, Open Graph,
    SIOC, SKOS, BIBO, etc.
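
The identifier patterns above could be generated along these lines. This is only a sketch: the helper names and the slug rule (lowercase, spaces to underscores) are assumptions for illustration, and the base path is copied from the example URIs on this slide.

```python
from urllib.parse import quote

# Base path as it appears in the example identifiers on this slide.
BASE = "http://library.si.edu/tl-2"


def author_uri(abbreviation: str) -> str:
    """Build an author URI from the author's TL-2 abbreviation (hypothetical rule)."""
    slug = quote(abbreviation.strip().lower().replace(" ", "_"))
    return f"{BASE}/author/{slug}"


def title_uri(tl2_number: int) -> str:
    """Build a title URI from the TL-2 title number."""
    return f"{BASE}/title/{tl2_number}"


print(author_uri("Darwin"))  # http://library.si.edu/tl-2/author/darwin
print(title_uri(1313))       # http://library.si.edu/tl-2/title/1313
```

Numeric title URIs avoid slug collisions between similarly named works, at the cost of being less readable than a name-based form.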

Mondeca Labs
 Linked Open
 Vocabularies (LOV)

 Vocabulary of a Friend
 (VOAF)

 A vocabulary for
 describing other
 vocabularies




 http://labs.mondeca.com/dataset/lov

http://library.si.edu/tl2/author/darwin

                  tl2:creator
                  http://library.si.edu/tl2/title/1313

                  owl:sameAs
                  http://viaf.org/viaf/27063124




                http://library.si.edu/tl2/title/origin…
                  dc:creator
                  http://library.si.edu/tl2/author/darwin

                  owl:sameAs
                  http://www.archive.org/details/
                      originofspecies00darwuoft
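
The triples on this slide can be sketched as plain subject/predicate/object tuples and written out in N-Triples syntax. The `dc` and `owl` namespace URIs are the published ones; the `tl2` namespace URI is a hypothetical placeholder, since the slides do not define it. The sketch also assumes the title resource is the one numbered 1313, and the helper names are illustrative.

```python
DC = "http://purl.org/dc/elements/1.1/"
OWL = "http://www.w3.org/2002/07/owl#"
TL2 = "http://library.si.edu/tl2/ontology#"  # assumption: not given in the slides

AUTHOR = "http://library.si.edu/tl2/author/darwin"
TITLE = "http://library.si.edu/tl2/title/1313"

triples = [
    (AUTHOR, TL2 + "creator", TITLE),                            # author -> title
    (AUTHOR, OWL + "sameAs", "http://viaf.org/viaf/27063124"),   # link out to VIAF
    (TITLE, DC + "creator", AUTHOR),                             # title -> author
]


def to_ntriples(rows):
    """Serialize URI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in rows)


def inverse(rows, predicate, inverse_predicate):
    """Derive the reverse-direction triples for one predicate."""
    return [(o, inverse_predicate, s) for s, p, o in rows if p == predicate]


print(to_ntriples(triples))
```

Calling `inverse(triples, DC + "creator", TL2 + "creator")` yields the author-to-title triple from the title-to-author one, which is one way to keep the two directions consistent.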




http://library.si.edu/tl2/author/darwin
       RDF Type = foaf:Person

         foaf:lastName, foaf:familyName

         foaf:firstName, foaf:givenName

         foaf:name, skos:prefLabel

         tl2:birthYear

         tl2:deathYear

         skos:definition

         tl2:personAbbreviation



         http://library.si.edu/tl2/title/origin…
         RDF Type = bibo:Book

         tl2:titleNumber

         dc:title

         event:place

         dc:publisher

         tl2:titleAbbreviation

         dc:created


Great! Let’s make some linked data!

   How are we going to store all this?
   We’re using Drupal. RDFa is built-in, RDF
    extensions is an add-on module.

   Probably not a good idea for very large
     datasets.

   TL-2: 10,000 authors + 37,000 titles
     becomes about 400,000 triples.


Storage considerations

   Performance of Drupal Import:
       Feeds Import: 7 Hours for 35k Records
       Other options? Still searching…

   Our linked data set will grow to at least
    600-700k Drupal nodes.

       Is Drupal the best way to do this?
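
To put the question in numbers, here is a back-of-the-envelope sketch that extrapolates the measured Feeds import rate to the projected node counts. It assumes import time scales linearly with the number of records, which may not hold in practice. The function name is illustrative.

```python
def import_hours(records: int,
                 measured_records: int = 35_000,
                 measured_hours: float = 7.0) -> float:
    """Extrapolate import time from the measured rate, assuming linear scaling."""
    records_per_hour = measured_records / measured_hours
    return records / records_per_hour


for nodes in (600_000, 700_000):
    print(nodes, "nodes:", import_hours(nodes), "hours")
```

At the measured rate of 5,000 records per hour, 600-700k nodes would take roughly 120 to 140 hours, or five to six days of continuous importing.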



Storage considerations

   2000 US Census
       19 million households received “long form”

       Joshua Tauberer: converted to 1 billion triples
       http://www.rdfabout.com/demo/census/


   Carefully consider your storage options!

Storage

   ARC2 used by Drupal 7
   RDBMS via D2RQ

   RDBMS via Triplify
   OpenLink Virtuoso
   See Also:
   http://www.w3.org/2001/sw/rdb2rdf/use-cases/


Linked Data. What’s the point?

   Disambiguation
   Connecting Relevant Information

   More visible via search
   Enrichment of your data

   Easier reuse of data


http://en.openei.org/apps/mashathon2010/

http://data.nytimes.com/schools/schools.html


http://data.nytimes.com/N38444093941437235523



http://www.worldcat.org/oclc/7619054

Other Examples and Info
   Library of Congress: Linked Data Services
   http://id.loc.gov/

   Schema.org
   http://www.schema.org

   Data.gov / Semantic
   http://www.data.gov/semantic


   Linked Data.org
   http://linkeddata.org/


   Stephen Dale: Linked Data in Action
   http://www.slideshare.net/stephendale/linked-data-in-action-4487244


Thank you!
                        richardjm@si.edu
                http://slideshare.net/joelrichard




                              ?

Editor's notes

  1. Originally this presentation was going to center around a discussion of our conversion of TL2 to linked data and what we learned, but I felt that it would be better to use it as an example of things to keep in mind when creating your own data sets.
  2. Situated at the center of the world's largest museum complex, the Smithsonian Libraries forms a vital part of the research, exhibition, and educational enterprise of the Institution. The Libraries unites 20 libraries into one system supported by central collections support services. We maintain publication exchanges with more than 4,000 institutions worldwide that supply Smithsonian scientists and curators with current periodicals, exhibition catalogs, and professional society publications. Through preservation treatments, experts work to save the Smithsonian's 1.5 million printed books and manuscripts for future generations. Our Digital Library creates electronic versions of rare books and other distinctive collections, as well as exhibitions and specialized finding aids. We can be found on the web at http://library.si.edu
  3. A brief summary of what this presentation includes.
  4. I dislike disclaimers, but we’re still new to linked open data and are learning as we go. The idea of LOD has been around for several years now, so we are also playing a bit of catch-up. Our first goals are to get some data online and then start linking our data out to other sources, and encourage others to link to us. We don’t yet know how our data relates to others. It’s not scientific data created as part of a research project per se, but initially we see it as valuable, useful information at least for some segments of the research world.
  5. Since this presentation doesn’t center around the idea of what linked data is, we’re not going to spend any time on it. But just in case… Question: How many are familiar with linked data? Have linked data online? Wish they had linked data? Wish they had a website? This page is a quick summary for those who don’t know what linked data is. RTFM (Read The Friendly Manual!)
  6. This is a quick demonstration of how linked data has grown over the past five years. Back in 2007 we had only a handful of data sets, at least according to Richard Cyganiak’s searching. Between 2009 and 2010 the number of items doubled. As of Sept 2011 there are 295 data sets listed. There are probably more today and more being added every day. It is likely that not all data sets are represented here, so this is only a sample of what’s available. What’s the point? This is all data that has the potential to enhance YOUR data. This is all linked data. This is all open data.
  7. So as an example of how to create a data set, I’ll use Taxonomic Literature II. It is a fifteen-volume guide to the literature of systematic botany published between 1753 and 1940. It contains almost 10,000 authors and about 37,000 publications. The reason to focus on TL-2 is that we aim to be the authority on the web for this information. We have received permission from the IAPT (International Association for Plant Taxonomy) to digitize and release this information on the web under an open license. TL-2 is used by most (?) botanists, and their work is made easier by this data being online. Prior to 2012 this information was either located in a library or locked behind a paywall of sorts.
  8. This is a page of TL-2 showing Charles Darwin and On the Origin of Species, along with the immediately visible items that can be parsed and turned into Linked Data. There is other data in the page that could be turned into linked data, but at this time, we have only parsed the data that is highlighted on this page. Clearly, moving from something such as a printed book to a Linked Open Data set is an arduous task. If you are working on creating your own data sets, your experiences will differ depending on the source(s) of your data. One important thing to note here is the “Darwin” in parentheses, which is a unique abbreviation for an author. Each author has one. Another important item is the “1313” identifying the title, On the Origin of Species. Each publication in TL-2 has its own number. There are about 9,900 authors and 37,000 titles in all.
  9. Briefly, this was our process to create the data. In Jan 2011, we scanned the books and placed them online at the Internet Archive. Later, after selecting a contractor, we sent the scans and the OCR text (created at the Internet Archive) to the contractor, who ultimately created a 99.97% accurate text version of TL-2. They then parsed that data to a limited degree and delivered to us an XML dataset that we then imported to a SQL Server database. Finally, we created a searchable, browsable website to access the TL-2 data, opening it up to researchers around the world. Two of them use it on a regular basis. (rimshot!) In reality, in a month we get about 500 people visiting and 6,000 pageviews, with about 60% of those coming from outside of the U.S.
  10. This is our current website, showing a sample of the search results for Charles Darwin. This is not Linked Data. You can find this page at: http://www.sil.si.edu/digitalcollections/tl-2/
  11. Earlier we mentioned 99.97% accuracy. If we assume 38 million characters in all of TL-2, this means there are roughly 11,400 errors in our text (0.03% of 38 million). (In reality this is more like 5,000-6,000 due to the nature of our data.) This may not be bad for the textual components of the content, but when it comes to parsing citations or more structured information, it will prove to be a challenge. Other data sets may not have this problem, but as we are scanning and converting to text, this is something that will always be present for us.
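The arithmetic behind that estimate is easy to check. This is only a rough sketch: the 38 million character count is the assumption stated above, and the function name is made up for illustration.

```python
# Back-of-the-envelope estimate of OCR errors left in the text.
# Both inputs come from the talk; the function name is illustrative only.

def expected_errors(total_chars: int, accuracy_pct: float) -> int:
    """Return the expected number of wrong characters at a given accuracy."""
    error_rate = (100.0 - accuracy_pct) / 100.0
    return round(total_chars * error_rate)


# About 11,400 characters, the same order of magnitude as the figure quoted above.
print(expected_errors(38_000_000, 99.97))
```

The same function shows why the accuracy figure matters so much: every additional 0.01% of accuracy removes about 3,800 errors at this size.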
  12. So how do we create linked data? Basically this is the approach we are using. There’s probably more that needs to be done, but today, this is what we know we need to do. The choice of identifier is important: if possible, it should be human friendly, though numeric identifiers are also common in places such as OCLC WorldCat. Additionally, the TL-2 number is a strong candidate, so we will very likely go with that as our primary identifier of publications.
  13. Mondeca, an information management company based in Paris, created a directory of linked open vocabularies as part of their “labs” and grouped them together by similar disciplines. From largest to smallest, the groups are General and Meta, Library, City, Web, Space-Time, Science, Market (and finance) and Media. Library is the second largest on this list, which may be a matter of how the visualization is created, but may also be because libraries are playing a big part in the LOD movement. This directory might help you figure out which vocabularies are useful to you.
  14. A sample of our TL-2 identifiers and four triples. Note that “tl2:creator” is not the same as “dc:creator”, which indicates that we will likely need to create our own ontology for describing the TL-2 dataset. (dc:creator is a reference from a title to an author. We also need the reverse, author to title.) Also note that we’ve crosslinked our two identifiers and, as an example, linked out to other information on the web. The link to the Internet Archive may not be appropriate as it is not a LOD data set, but there is likely a predicate available to “read more” or “see also” for non-LOD websites that are related to the identifier.
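To make the shape of these triples concrete, here is a minimal sketch in plain Python. Everything under example.org is a placeholder, including the tl2 namespace and the link-out target; only the Dublin Core and RDF Schema namespace URIs are real. It shows an author-to-title triple, the reverse title-to-author triple, and a "see also" link, written out in N-Triples form.

```python
# ASSUMPTION: all example.org URIs, including the tl2 namespace, are placeholders.
TL2 = "http://example.org/tl2/"
DC = "http://purl.org/dc/elements/1.1/"          # Dublin Core elements
RDFS = "http://www.w3.org/2000/01/rdf-schema#"   # RDF Schema

darwin = TL2 + "author/Darwin"   # the author abbreviation as identifier
origin = TL2 + "title/1313"      # the TL-2 number as identifier

triples = [
    (darwin, TL2 + "creator", origin),   # author -> title (TL-2 specific)
    (origin, DC + "creator", darwin),    # title -> author (Dublin Core)
    (origin, RDFS + "seeAlso", "http://example.org/scanned-copy"),  # link out
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) URI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


print(to_ntriples(triples))
```

Keeping the triples as plain tuples makes it obvious that the two "creator" predicates point in opposite directions, which is the reason a TL-2 specific predicate is needed at all.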
  15. A further breakdown of our data into linked data, showing the predicates we might use for each. Again, the items in orange are specific to TL-2 and may not exist in other LOD data sets. For example, the FOAF vocabulary has date of birth, but can we use only a year in that field? Will that foul up other computers? FOAF also doesn’t include date of death, which we definitely have. What predicate do we use? Do we create our own ontology and publish it? (Probably.) Finally, we haven’t yet begun a formal analysis of which existing ontologies might fit our needs.
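One possible answer to the year-only question, offered as a sketch and not as a decision we have made, is to type the value as xsd:gYear, an XML Schema datatype meant for a bare year. The tl2:yearOfBirth and tl2:yearOfDeath predicates and the example.org URIs below are hypothetical.

```python
# ASSUMPTION: the tl2 namespace and both predicates are hypothetical placeholders.
XSD = "http://www.w3.org/2001/XMLSchema#"
TL2 = "http://example.org/tl2/"


def year_literal(year: int) -> str:
    """Format a year as an N-Triples literal typed as xsd:gYear."""
    return f'"{year:04d}"^^<{XSD}gYear>'


def life_dates(author_uri: str, born: int, died: int) -> list:
    """Return N-Triples lines for an author's year of birth and year of death."""
    return [
        f"<{author_uri}> <{TL2}yearOfBirth> {year_literal(born)} .",
        f"<{author_uri}> <{TL2}yearOfDeath> {year_literal(died)} .",
    ]


for line in life_dates(TL2 + "author/Darwin", 1809, 1882):
    print(line)
```

A typed literal tells a consuming program that the value is a year and not an incomplete date, which is one way to avoid fouling up other computers.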
  16. Storage is a consideration. We’re not using a triplestore per se, but are instead relying on Drupal and ARC2 to handle the magic for us. This may or may not be a good solution for the long term. The next four slides are all text. You’ve been warned.
  17. Performance is also a concern. It’s been challenging enough to get 47,000 records imported into Drupal. When we start to talk about an additional 500K items, we have some serious concerns about how well Drupal will hold up, just on the import side of things. We may need to investigate other methods of getting this data into Drupal, or other systems altogether, but that may create added complexity.
  18. Another example of why you should be clear about how much data you are creating and how you will manage it. The US Census sent the “long form” to a subset of 19 million households. These responses were converted to LOD by Joshua Tauberer and resulted in over a billion triples. I’m going to think very carefully before I start working with a billion of anything.
  19. A few notes on software that can be used to open up your existing data as linked data. I have not had the opportunity to use all of this software yet, but we may still use it in the future.
   • ARC2 – Provides parsers, content negotiation, RDF storage and a SPARQL endpoint.
   • D2RQ – Allows accessing relational databases as virtual RDF graphs.
   • Triplify – Plugin for Web applications to expose your data as RDF, Linked Data or JSON.
   • Virtuoso – Enterprise-level product for normalizing all of your data sources, including providing that data as RDF.
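The idea of exposing relational rows as RDF can be illustrated without any of these tools. The sketch below is hand-rolled and does not use D2RQ, Triplify or any of their APIs; the table layout, column names and example.org URIs are assumptions made up for the example.

```python
import sqlite3

# ASSUMPTION: placeholder base URI; the table and column names are hypothetical.
BASE = "http://example.org/tl2/"
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"


def rows_to_triples(conn):
    """Yield (subject, predicate, object) tuples for every row in `titles`."""
    query = (
        "SELECT tl2_number, title, author_abbrev "
        "FROM titles ORDER BY tl2_number"
    )
    for number, title, abbrev in conn.execute(query):
        subject = f"{BASE}title/{number}"
        yield (subject, DC_TITLE, title)                         # literal value
        yield (subject, DC_CREATOR, f"{BASE}author/{abbrev}")    # link to author


# An in-memory database stands in for the real SQL Server tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE titles ("
    "tl2_number INTEGER PRIMARY KEY, title TEXT, author_abbrev TEXT)"
)
conn.execute(
    "INSERT INTO titles VALUES (?, ?, ?)",
    (1313, "On the Origin of Species", "Darwin"),
)

for triple in rows_to_triples(conn):
    print(triple)
```

Each row becomes one subject, and each column becomes one predicate. The tools listed above do roughly this mapping for you, with configuration in place of hand-written code.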
  20. Why should we create linked data?
   • Disambiguation – Are you searching for Venus the planet, Venus the sculpture, Venus the painting or Venus the tennis player?
   • Connecting Relevant Info – Linking your data to other data may reveal things related to your data that you were unaware of.
   • Search Visibility – Search engines, via schema.org and Google’s purchase of Freebase, are enhancing search. Things will only get better as we move forward.
   • Enrichment of your Data – As mentioned earlier, you may learn things you didn’t know about your data or provide greater context to your data via LOD.
   • Easier Reuse – This is one of the central tenets of LOD. I, as a human, no longer need to say that Column B in your spreadsheet corresponds to the first_name field in my database.
  21. Example of LOD in action. Google’s knowledge graph knows that Darwin is a person and that Shrewsbury is a place, allowing it to offer different, more specialized results in your search. As LOD becomes available, your data may be used to enhance these results. Google is also able to help disambiguate common terms, such as “Lafayette” (the college, various U.S. cities, or the Marquis de Lafayette). http://google.com/
  22. Example of LOD in action. Combines data from the Energy Information Administration (EIA) on Data.gov with data from OpenEI.org, the U.S. Census and SmartGrid in a mashup that’s easier to create with LOD. http://en.openei.org/apps/mashathon2010/
  23. Example of LOD in action. NYTimes is offering a large dataset as LOD. As an example, they provided a tool to enter a university or college and find those people from their database who attended that institution. From there, we are able to see links to other databases and articles from NYTimes that refer to that person. All linked together. From the site: “As of 13 January 2010, The New York Times has published approximately 10,000 subject headings as linked open data under a CC BY license. We provide both RDF documents and a human-friendly HTML versions. The table below gives a breakdown of the various tag types and mapping strategies on data.nytimes.com.” http://data.nytimes.com/
  24. This is an example of the “raw data” available at NYTimes, presented in a user-readable form. I could also make the argument that the identifier at NYTimes is not as good as it should be. A human-readable version would be better, but we see that one is available among the owl:sameAs links.
  25. OCLC WorldCat has begun publishing the data about an individual item as Linked Open Data using schema.org. This is an example from Darwin’s Origin of Species. You’ll find the “Linked Data” section at the bottom of the details page for any individual book on WorldCat. http://www.worldcat.org/oclc/7619054
  26. Finally, a few other places where you can learn more about linked data and see other tools built with and for linked open data. The Library of Congress has made their subject headings available in linked data form to both humans and machines. Schema.org encourages you to embed your metadata, as a variant of linked data, in your webpages. Data.gov is the US Government’s source for open data. Other countries are also making their data open on similar websites. There are many, many more sources, so search the web and see what you can find.
  27. Thank you! As for brew pubs, I don’t live in Chicago and this is only my second time here, so I’m open to suggestions. There are a lot of bars in this town (as seen in the map).