1. How to Become a First Class Citizen of the Web: Linked Data and the LOCAH project. Jane Stevenson & Adrian Stevenson
2. Remit This session will give a brief overview of the concepts behind Linked Data and will explain how we are applying these ideas to archival and bibliographic data. Archives Hub: merged catalogue of archival descriptions from 200 institutions across the UK Copac: merged catalogue of bibliographic records from libraries across the UK
4. The goal of Linked Data is to enable people to share structured data on the Web as easily as they can share documents today. [The creation of] a space where people and organizations can post and consume data about anything. Bizer/Cyganiak/Heath Linked Data Tutorial, linkeddata.org
5. In essence, it marks a shift in thinking from publishing data in human readable HTML documents to machine readable documents. That means that machines can do a little more of the thinking work for us. http://www.linkeddatatools.com/semantic-web-basics
6. Linked Data encourages open data, open licences and reuse. ...but Linked Data does not have to be open.
7. Core questions Is it achievable? Will it bring substantial benefits? "It is the unexpected re-use of information which is the value added by the web"
8. What is Linked Data? 4 "rules" for the web of data: Use URIs as names for things. Use HTTP URIs so that people can look up those names. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL). Include links to other URIs, so that they can discover more things. http://www.w3.org/DesignIssues/LinkedData.html
9. Giving Things identifiers We can make statements about things and establish relationships by assigning identifiers to them. Jane Stevenson = http://archiveshub.ac.uk/janefoaf.rdf Manchester = http://dbpedia.org/resource/manchester English = http://lexvo.org/id/iso639-3/eng
10. URIs Uniform Resource Identifiers (URIs) are identifiers for entities (people, places, subjects, records, institutions). They identify resources, and ideally allow you to access representations of those resources. Think not of locations, but of identifiers! For Linked Data you use HTTP URIs Jane Stevenson = http://archiveshub.ac.uk/janefoaf.rdf Manchester = http://dbpedia.org/resource/manchester English = http://lexvo.org/id/iso639-3/eng
14. An RDF Graph [Diagram: nodes Archival Resource, Repository, Finding Aid and EAD document, with a Title value; edges labelled has, heldAt, describedBy, encodedAs]
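The graph on this slide can be written down as subject-predicate-object triples. A minimal sketch in plain Python follows. The `ex:` names are hypothetical placeholders rather than the project's real URIs, and the arrow directions are an assumption read from the edge labels.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# All "ex:" names are hypothetical stand-ins for real HTTP URIs.
triples = [
    ("ex:archivalResource", "ex:title", "Example papers"),
    ("ex:archivalResource", "ex:heldAt", "ex:repository"),
    ("ex:archivalResource", "ex:describedBy", "ex:findingAid"),
    ("ex:findingAid", "ex:encodedAs", "ex:eadDocument"),
]


def objects_of(subject, predicate):
    """Return every object linked from `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# The finding aid is the object of one statement and the subject of another,
# which is what makes this a graph rather than a flat record.
print(objects_of("ex:archivalResource", "ex:describedBy"))
print(objects_of("ex:findingAid", "ex:encodedAs"))
```

No node is a root here: any resource can be the starting point for a query.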
15. So...? If something is identified, it can be linked to. We can then take items from one dataset and link them to items from other datasets: BBC, Copac, VIAF, DBPedia, GeoNames, Archives Hub
16. The Linking benefits of Linked Data BBC:Cranford Copac:Cranford VIAF:Dickens DBPedia: Gaskell Hub:Gaskell Geonames:Manchester DBPedia: Dickens Hub:Dickens
17. The Web of "Documents" Global information space (for humans) Document paradigm Hyperlinks Search engines index and infer relevance Implicit relationships between documents Lack of semantics
18. The Web of Linked Data Global data space (for humans and machines) Making connections between entities across domains (people, books, films, music, genes, medicines, health, statistics...) LD is not about searching for specific documents or visiting particular websites, it is about things - identifying and connecting them. Closely aligned to the general architecture of the Web
19. From one thing...to the same thing <sameAs> http://dbpedia.org/resource/manchester http://sws.geonames.org/2643123 http://data.archiveshub.ac.uk/id/concept/ncarules/manchester Are they the same?
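One way to see what a sameAs assertion implies: the links are symmetric and transitive, so they partition URIs into groups that are treated as one thing. The sketch below uses a small union-find over the three URIs from the slide. Whether they really are the same is still the open question; the code only shows the consequence of asserting it.

```python
def same_as_groups(links):
    """Group URIs connected by sameAs links (symmetric and transitive)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for a, b in links:
        parent[find(a)] = find(b)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


links = [
    ("http://dbpedia.org/resource/manchester",
     "http://sws.geonames.org/2643123"),
    ("http://sws.geonames.org/2643123",
     "http://data.archiveshub.ac.uk/id/concept/ncarules/manchester"),
]
groups = same_as_groups(links)
print(groups)
```

Two links are enough to pull all three URIs into one group, which is why a careless sameAs can spread wrong properties a long way.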
21. Vocabularies & Ontologies Vocabulary: set of terms Ontology: organisation of terms (hierarchy, relationships)
22. Shared vocabularies Problems of data integration: information exchange across independently designed systems. Two different databases: one for films, one for actors. To collaborate using their current databases, the owners of both sites would have to agree on a common data format for sharing information, and on a unique ID scheme of their own invention for films and actors.
23. Need "film title", "actor name", "actor birthdate", etc. to mean the same thing to each Use the same vocabulary Query both databases. No need for transformations, mappings, contracts
24. Vocabularies in Linked Data Common vocabulary to describe the data, e.g. "film-title" means the same thing Adopt the same ontologies for expressing meaning Use semantics to link data Want to avoid transformation, mapping, contracts between data providers
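A small illustration of why shared vocabularies matter: if both datasets already use the same property names and the same identifiers, merging them is just putting the triples together. This is a sketch only; the `ex:` URIs, the `ex:starring` property and the sample values are hypothetical.

```python
# Two independently maintained datasets, both expressed as triples.
films = [
    ("ex:film/1", "dcterms:title", "An Example Film"),
    ("ex:film/1", "ex:starring", "ex:actor/7"),
]
actors = [
    ("ex:actor/7", "foaf:name", "An Example Actor"),
]

# No transformation or mapping step: the merged graph is the union.
merged = films + actors


def value(subject, predicate, graph):
    """Return the first object for (subject, predicate), or None."""
    for s, p, o in graph:
        if s == subject and p == predicate:
            return o
    return None


def cast_of(title, graph):
    """Names of actors starring in the film with the given title."""
    names = []
    for film, p, o in graph:
        if p == "dcterms:title" and o == title:
            for s2, p2, actor in graph:
                if s2 == film and p2 == "ex:starring":
                    names.append(value(actor, "foaf:name", graph))
    return names


print(cast_of("An Example Film", merged))
```

The query crosses both sources without either owner having agreed a bespoke exchange format.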
25. Shared use of vocabularies [Diagram: Copac RDF and Hub RDF drawing on shared vocabularies (DC, foaf, skos, bibo), with dcterms:title and dcterms:identifier as example shared properties]
26. Ontologies Many widely used ontologies Use others as far as possible Use your own where necessary Dublin Core Friend of a Friend (FOAF) Simple Knowledge Organisation System (SKOS) Bibo Open Cyc
27. Linked Data on the Hub & Copac Linked Open Copac and Archives Hub: Locah JISC funded project August 2010 to July 2011 Mimas UKOLN Eduserv
28. What is LOCAH doing? Part 1: Exposing the Linked Data Part 2: Creating a prototype visualisation Part 3: Reporting on opportunities and barriers
29. How are we exposing the Data? Model our "things" into RDF; Transform the existing data into RDF/XML; Enhance the data; Load the RDF/XML into a triple store; Create Linked Data Views; Document the process, opportunities and barriers on LOCAH Blog
30. 1. Modelling "things" into RDF Hub data in "Encoded Archival Description" (EAD) XML form Copac data in "Metadata Object Description Schema" (MODS) XML form Take a step back from the data format Think about your "things" What is the EAD document "saying" about "things in the world"? What questions do we want to answer about those "things"? http://www.loc.gov/ead/ http://www.loc.gov/standards/mods/
31. 1. Modelling "things" into RDF Need to decide on patterns for URIs we generate Following guidance from W3C "Cool URIs for the Semantic Web" and UK Cabinet Office "Designing URI Sets for the UK Public Sector" http://data.archiveshub.ac.uk/id/findingaid/gb1086skinner "thing" URI ... is HTTP 303 "See Other" redirected to ... http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner document URI ... which is then content negotiated to ... http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.html http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.rdf http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.turtle http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.json http://www.w3.org/TR/cooluris/ http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector
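The redirect-then-negotiate pattern on this slide can be sketched as two small functions. This is an illustration, not the project's server code: the media-type-to-extension table, the default to HTML, and ignoring q-values are all assumptions.

```python
# Assumed mapping from media types to the extensions listed on the slide.
EXTENSIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "text/turtle": ".turtle",
    "application/json": ".json",
}


def see_other(thing_uri):
    """Return (status, Location) redirecting a 'thing' URI to its document URI."""
    if "/id/" not in thing_uri:
        raise ValueError("not a thing URI: " + thing_uri)
    return 303, thing_uri.replace("/id/", "/doc/", 1)


def negotiate(doc_uri, accept):
    """Pick the first supported media type in the Accept header (q-values ignored)."""
    for part in accept.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in EXTENSIONS:
            return doc_uri + EXTENSIONS[media_type]
    return doc_uri + ".html"  # assumed fallback for browsers and */*


status, location = see_other(
    "http://data.archiveshub.ac.uk/id/findingaid/gb1086skinner")
print(status, location)
print(negotiate(location, "text/turtle, text/html;q=0.5"))
```

The thing URI never returns a document itself, which keeps the identifier for the finding aid distinct from the identifiers of the pages describing it.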
32. 1. Modelling "things" into RDF Using existing RDF vocabularies: DC, SKOS, FOAF, BIBO, WGS84 Geo, Lexvo, ORE, LODE, Event and Time Ontologies Define additional RDF terms where required: hub:ArchivalResource copac:BibiographicResource hub:maintenanceAgency copac:Creator It can be hard to know where to look for vocabs and ontologies Decide on licence: CC BY-NC 2.0, CC0, ODC PDD
33. Archives Hub Model (as at 14/2/2011) [Diagram: entities include ArchivalResource, Finding Aid, EAD Document, Repository (Agent), Place, PostcodeUnit, Level, Language, Biographical History, Extent, Creation, TemporalEntity, Concept, ConceptScheme, Agent (is-a: Person, Family, Organisation), Object, Book, Genre, Function, Birth, Death; relationships include in, administeredBy/administers, maintainedBy/maintains, encodedAs/encodes, hasPart/partOf, accessProvidedBy/providesAccessTo, hasBiogHist/isBiogHistFor, topic/page, level, language, origination, extent, product of, at time, participates in, associatedWith, inScheme, representedBy, foaf:focus]
35. Feedback Requested! We would like feedback on the model Appreciate this will be easier when the data is available Via blog http://blogs.ukoln.ac.uk/locah/2010/09/28/model-a-first-cut/ http://blogs.ukoln.ac.uk/locah/2010/11/08/some-more-things-some-extensions-to-the-hub-model/ http://blogs.ukoln.ac.uk/locah/2010/10/07/modelling-copac-data/ Via email, twitter, in person
36. 2. Transforming into RDF/XML Transform EAD and MODS to RDF/XML based on our models Hub: created XSLT stylesheet and used the Saxon XSLT processor http://saxon.sourceforge.net/ Saxon runs the XSLT against a set of EAD files and creates a set of RDF/XML files Copac: created in-house Java transformation program
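To make the transformation step concrete, here is a toy version in Python using only the standard library. It is not the project's method (the Hub used XSLT with Saxon, Copac a Java program). The EAD fragment is deliberately tiny, and building the URI from `unitid` and choosing `dcterms:identifier` and `dcterms:title` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT = "http://purl.org/dc/terms/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dcterms", DCT)

# A deliberately tiny EAD-like fragment; real finding aids are far richer.
EAD = """<ead>
  <archdesc level="fonds">
    <did>
      <unitid>gb1086skinner</unitid>
      <unittitle>Example papers</unittitle>
    </did>
  </archdesc>
</ead>"""


def ead_to_rdfxml(ead_xml,
                  base="http://data.archiveshub.ac.uk/id/findingaid/"):
    """Turn the <did> of an EAD fragment into a one-resource RDF/XML document."""
    did = ET.fromstring(ead_xml).find("./archdesc/did")
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF}}}Description",
        {f"{{{RDF}}}about": base + did.findtext("unitid")})
    ET.SubElement(desc, f"{{{DCT}}}identifier").text = did.findtext("unitid")
    ET.SubElement(desc, f"{{{DCT}}}title").text = did.findtext("unittitle")
    return ET.tostring(root, encoding="unicode")


rdfxml = ead_to_rdfxml(EAD)
print(rdfxml)
```

Even at this scale the modelling question shows up: the code has to decide what the URI identifies before it can say anything about it.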
37. 3. Enhancing our data Language - lexvo.org Time periods - reference.data.gov.uk Geolocation - UK Postcodes URIs and Ordnance Survey URIs Names - Virtual International Authority File Matches and links widely-used authority files - http://viaf.org/ Names (and subjects) - DBPedia Subjects - Library of Congress Subject Headings
38. 4. Load RDF/XML into triple store Using the Talis Platform triple store RDF/XML is HTTP POSTed We're using Pynappl Python client for the Talis Platform http://code.google.com/p/pynappl/ Store provides us with a SPARQL query interface
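"RDF/XML is HTTP POSTed" amounts to a request like the one below. This sketch uses Python's standard `urllib` rather than Pynappl, whose API is not reproduced here, and the store URL is a hypothetical placeholder. The request is built but not sent.

```python
import urllib.request


def build_post(store_url, rdfxml):
    """Prepare (but do not send) an HTTP POST carrying an RDF/XML document."""
    return urllib.request.Request(
        store_url,
        data=rdfxml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
        method="POST",
    )


# Hypothetical store address; substitute the URL of a real triple store.
req = build_post(
    "http://example.org/store/meta",
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>',
)
print(req.get_method(), req.full_url, req.get_header("Content-type"))
# Sending it would be: urllib.request.urlopen(req)
```

A real load would also need authentication and error handling, both omitted here.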
39. 5. Create Linked Data Views Expose "bounded" descriptions from the triple store over the Web Make available as documents in both human-readable HTML and RDF formats (also JSON, Turtle, CSV) Using Paget "Linked Data Publishing Framework" http://code.google.com/p/paget/ PHP scripts query SPARQL endpoint
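Behind each view is a query to the SPARQL endpoint. The sketch below builds the URL for a `DESCRIBE` query using the SPARQL protocol's `query` parameter. The endpoint address is hypothetical, and exactly what a `DESCRIBE` returns is up to the store, so treating it as the "bounded" description is an assumption. The project's views were built with Paget and PHP, not this code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def describe_url(endpoint, thing_uri):
    """Build a SPARQL protocol GET URL asking the store to describe one URI."""
    query = f"DESCRIBE <{thing_uri}>"
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint address.
url = describe_url(
    "http://example.org/sparql",
    "http://data.archiveshub.ac.uk/id/findingaid/gb1086skinner",
)
print(url)

# Round-trip check: decoding the URL gives back the original query text.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The same URL-building approach would serve the visualisation work, which also reads from the endpoint.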
42. Can I access the Locah Linked Data? Will be releasing the Hub data very soon! Copac data will follow approx 1 month later Release will include Linked Data views, Sparql endpoint details, example queries and supporting documentation
43. Reporting on opportunities and barriers Locah Blog (tags: "opportunities", "barriers") Feed into #JiscEXPO programme evidence gathering More at: http://blogs.ukoln.ac.uk/locah/2010/09/22/creating-linked-data-more-reflections-from-the-coal-face/ http://blogs.ukoln.ac.uk/locah/2010/12/01/assessing-linked-data
44. Creating the Visualisation Prototype Based on researcher use cases Data queried from SPARQL endpoint Use tools such as Simile, Many Eyes, Google Charts For the first Hub visualisation we are using Timemap (Googlemaps and Simile) http://code.google.com/p/timemap/
45. Visualisation Prototype Using Timemap (Googlemaps and Simile) http://code.google.com/p/timemap/ Early stages with this Will give location and "extent" of archive. Will link through to Archives Hub
46. Sir Ernest Henry Shackleton http://archiveshub.ac.uk/data/gb15sirernesthenryshackleton Archives related to Shackleton: VIAF URL: http://viaf.org/viaf/12338195/ Books related to Shackleton: Biographical History: Ernest Henry Shackleton was born on 15 February 1874 in Kilkea, Ireland, one of six children of Anglo-Irish parents. The family moved from their farm to Dublin, where his father, Henry studied medicine. On qualifying in 1884, Henry took up a practice in south London, and between 1887 and 1890, Ernest was educated at Dulwich College. On leaving school, he entered the merchant service, serving in the square-rigged ship Hoghton Tower until 1894 when he transferred to tramp steamers. In 1896, he qualified as first mate, and two years later, was certified as master, joining the Union Castle line in 1899. [more]
48. The learning process Model the data, not the description The description is one of the entities Understand the importance of URIs Think about your world before others ...but external links are important Try to get to grips with terminology
49. Names 6947115KNAPPF F Knapp associated with record 6947115 /id/agent/6947115KNAPPF <copac:isCreatorOf rdf:resource="http://data.copac.ac.uk/id/mods/6947115"/> 6957115KNAPPF 6947115 <isCreatorOf>
50. Index terms (names, subjects, places) "AssociatedWith" as the relationship Benefits of structured index terms Use /person/ and /organisation/ in the URI Distinguish /person/pilkington (the person) and /organisation/pilkington Distinguish place/reading/ and subject/reading/
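Putting the entity type into the URI path is what keeps identical labels apart. A minimal sketch, assuming a simplified slug rule (lowercase, alphanumerics only) and a flat `/id/<type>/<slug>` pattern; the project's actual minting rules may differ.

```python
import re

BASE = "http://data.archiveshub.ac.uk/id/"  # base taken from the slides


def mint_uri(kind, label):
    """Build a URI from an entity type and a label (hypothetical slug rule)."""
    slug = re.sub(r"[^a-z0-9]+", "", label.lower())
    return f"{BASE}{kind}/{slug}"


# Same label, different things, different URIs.
print(mint_uri("person", "Pilkington"))
print(mint_uri("organisation", "Pilkington"))
print(mint_uri("place", "Reading"))
print(mint_uri("subject", "Reading"))
```

Without the type segment, the town of Reading and the activity of reading would collapse into one identifier.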
51. Problems with source data EAD very permissive: whole range of finding aids Copac more consistent but still wide variety Hub EAD: We limited the tags we worked with Large files (around 5 MB) tend to need splitting up
52. Duplication of data "So statements which relate things in the two documents must be repeated in each. This clearly is against the first rule of data storage: don't store the same data in two different places: you will have problems keeping it consistent." (T B-L www.w3.org/designissues/linkeddata.html)
53. Archival Inheritance "Do not repeat information at a lower level of description that has already been given at a higher level." ISAD(G) Many elements do not apply to "child" descriptions Simple rule of inheritance not always appropriate LD does assert hierarchical relationships but no requirement to follow these links
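Because a Linked Data consumer may land on a lower-level description without ever following the link to its parent, some inherited values need to be made explicit. The sketch below copies only a chosen set of fields down a hierarchy, since a blanket rule is not appropriate. The nested-dict structure and field names are hypothetical.

```python
def inherit(node, inherited=None, fields=("repository",)):
    """Copy selected fields from parent descriptions into children lacking them."""
    inherited = dict(inherited or {})
    for field in fields:
        if field in node:
            inherited[field] = node[field]   # a level's own value wins
        elif field in inherited:
            node[field] = inherited[field]   # otherwise take the parent's
    for child in node.get("children", []):
        inherit(child, inherited, fields)
    return node


collection = {
    "title": "Example papers",
    "repository": "Example Library",
    "subject": "Polar exploration",
    "children": [
        {"title": "Series 1", "children": [{"title": "Item 1"}]},
    ],
}
inherit(collection)
item = collection["children"][0]["children"][0]
print(item)
```

The item gains a repository but not the collection-level subject, which may not apply further down.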
54. Copac Larger community: more potential vocabularies/documentation/support/confusion/inconsistencies Merged catalogues: a unique scenario "Creator" and "Others" (editor, authors, illustrator) Learning from Hub / Doing what is appropriate Usually not right or wrong answers
55. Copac model Groundwork done with Archives Hub. Then had to decide what we wanted to say about the data Challenges over what a "record" is: "Bleak House" from each contributor? or one merged record? In many ways simpler than archival data; but also can decide to create a simpler model
57. Copac specification Hard to start but proved to be crucial Very iterative process between spec and RDF output Important to establish the structure of the spec (we used tabs for each "entity")
59. Copac decisions Where to create Copac URIs: copac:creator copac:contributor copac:heldBy When to create URIs Title = literal Publication place = URI How to deal with problematic/ambiguous data Date? = productionDate
61. Risks Can you rely on data sources long-term? Persistence of persistent URIs? New technologies Investment of time â unsure of benefits Licensing issues
62. Provenance Track which data comes from our sources: URIs identify your entities Linked Data tends towards disassembling Copac/Hub as trusted sources...is DBPedia (for example) as reliable? Contributors may want data to be identified Issues around administrative/biographical history Benefits of trust? Users may want to know where data is from
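One common way to keep track of where each statement came from is to store a fourth value alongside each triple, naming its source. The sketch below shows that idea in plain Python. It is not a description of how LOCAH handled provenance, and every identifier and source name in it is hypothetical.

```python
# Quads: a triple plus the name of the source it came from.
quads = [
    ("ex:person/shackleton", "foaf:name", "Ernest Henry Shackleton",
     "archives-hub"),
    ("ex:person/shackleton", "ex:associatedWith", "ex:archive/1",
     "archives-hub"),
    ("ex:person/shackleton", "owl:sameAs", "ex:external/shackleton",
     "external-enhancement"),
]


def statements_from(source):
    """All triples contributed by one source."""
    return [(s, p, o) for s, p, o, g in quads if g == source]


def sources_of(subject):
    """Which sources say anything about this subject."""
    return sorted({g for s, _, _, g in quads if s == subject})


print(sources_of("ex:person/shackleton"))
print(statements_from("archives-hub"))
```

With the source attached, a user or contributor can still see which statements come from the trusted catalogue and which were added by linking out.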
63. Licensing Nature of Linked Data: each triple as a piece of data "Ownership" of data? Data often already freely available (M2M interfaces)
64. Licensing Public Domain Licences: simple, explicit, and permit widest possible reuse. Waive all rights to the data BL, British National Bibliography uses public domain licence Limit commercial uses? Build in community norms: attribution, share alike - to reinforce desire for acknowledgement Legal situation?
66. Attribution and CC licence Sections of this presentation adapted from materials created by other members of the LOCAH Project This presentation is available under Creative Commons Non Commercial-Share Alike: http://creativecommons.org/licenses/by-nc/2.0/uk/
Editor's notes
Has been described as a "data commons", or more usually a Web of Data.
Problem for machines to extract meaning. At present, the raw data is not really available.
Persistent URIs for names of things: HTTP URIs are names, not addresses. Provide information: properties and classes for a URI. More links.
Things are resources because someone created a URI to identify them, not because they have some particular properties in and of themselves. HTTP URIs provide a simple way to create globally unique names without centralized management; and URIs work not just as a name but also as a means of accessing information about a resource over the Web.
In a data graph, there is no concept of roots (or a hierarchy). A graph consists of resources related to other resources, with no single resource having any particular intrinsic importance over another.
This subject (the archive itself) has a page (foaf:page being the property) with name "finding aid". The "finding aid" is the object of this statement, but is also itself a subject. A subject in an RDF document may also be referenced as an object of a property in another RDF statement.
We have four "things" here: unit of description; repository; finding aid; EAD document. We have given Unit of description a number of properties. Other things can also have properties (this is simplified). These properties are indicated in the green boxes. They are also called predicates.
In hypertext web sites it is considered generally rather bad etiquette not to link to related external material. The value of your own information is very much a function of what it links to, as well as the inherent value of the information within the web page. So it is also in the Semantic Web. Remember, this is about machines linking: machines need identifiers; humans generally know when something is a place or when it is a person. BBC + DBPedia + GeoNames + Archives Hub + Copac + VIAF = the Web as an exploratory space
Once you say that they are the same, the implication is that they share the same classes and properties.
Ontology defines a "knowledge domain"
Encoded Archival Description (EAD) is an XML standard for encoding archival finding aids. Metadata Object Description Schema (MODS) is an XML-based schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications. EAD: "things" include concepts and abstractions as well as material objects. We want location: archives are physical things, so location is important. Also wanted event data, partly steered by the visualisation prototype. Also "extent" data: number of boxes.
303 and content negotiation from "Cool URIs for the Semantic Web"
Open Data Commons Public Domain Dedication. Creative Commons CC0 license.
e.g. index terms may not always apply down the hierarchy of the description. We are pulling <repository> down into lower-level descriptions.