1. Forging New Links:
Libraries in the Semantic Web
Lisa Goddard & Gillian Byrne
Memorial University Libraries
Computers in Libraries, Washington D.C.
March 23rd, 2011
2. The Gist
General Semantic Web
• How it works.
• A few tools.
• Who’s involved?
Lisa
Libraries & Linked Data
• What it solves.
• Issues & obstacles.
• Where we are now.
Gillian
10. Complex Queries
Find all soccer players, who played as goalkeeper
for a club that has a stadium with more than
40,000 seats and who are born in a country with
more than 10 million inhabitants.
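A query like this can only be answered once the facts are held as structured data rather than free text. The sketch below is a minimal Python illustration; every player, club, stadium, country, and number in it is invented placeholder data, not real statistics.

```python
# Toy structured data; all names and figures are hypothetical.
countries = {"Northland": 25_000_000, "Smallia": 2_000_000}   # population
stadiums = {"Big Arena": 55_000, "Town Ground": 12_000}       # seats
clubs = {"FC Alpha": "Big Arena", "FC Beta": "Town Ground"}   # club -> stadium
players = [
    {"name": "Player A", "position": "goalkeeper", "club": "FC Alpha", "born_in": "Northland"},
    {"name": "Player B", "position": "goalkeeper", "club": "FC Beta", "born_in": "Northland"},
    {"name": "Player C", "position": "striker", "club": "FC Alpha", "born_in": "Northland"},
    {"name": "Player D", "position": "goalkeeper", "club": "FC Alpha", "born_in": "Smallia"},
]


def complex_query(players, clubs, stadiums, countries):
    """Goalkeepers whose club's stadium has more than 40,000 seats and
    whose country of birth has more than 10 million inhabitants."""
    return [
        p["name"]
        for p in players
        if p["position"] == "goalkeeper"
        and stadiums[clubs[p["club"]]] > 40_000
        and countries[p["born_in"]] > 10_000_000
    ]


matches = complex_query(players, clubs, stadiums, countries)
print(matches)
```

Each condition in the question becomes one lookup across linked records, which is exactly what string matching over unstructured text cannot do.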
15. Structured Data: RDF
Data model for writing simple statements about web objects.
RDF statements are written as “triples”.
Subject: Shakespeare
Predicate: Wrote
Object: Macbeth
Statement: Shakespeare Wrote Macbeth
16. RDF Triples
Subject          Predicate   Object
Shakespeare      Wrote       King Lear
Shakespeare      Wrote       Macbeth
Anne Hathaway    Married     Shakespeare
Shakespeare      Lived in    Stratford
Stratford        Is in       England
Macbeth          Set in      Scotland
England          Part of     UK
Scotland         Part of     UK
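The triples on this slide can be held as plain (subject, predicate, object) tuples and queried by pattern. This is a minimal Python sketch, not a real RDF library; the `match` helper and its wildcard convention are assumptions made for illustration.

```python
# The eight statements from the slide as (subject, predicate, object) tuples.
triples = [
    ("Shakespeare", "wrote", "King Lear"),
    ("Shakespeare", "wrote", "Macbeth"),
    ("Anne Hathaway", "married", "Shakespeare"),
    ("Shakespeare", "lived in", "Stratford"),
    ("Stratford", "is in", "England"),
    ("Macbeth", "set in", "Scotland"),
    ("England", "part of", "UK"),
    ("Scotland", "part of", "UK"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# "What did Shakespeare write?"
works = [o for (_, _, o) in match(triples, subject="Shakespeare", predicate="wrote")]
print(works)

# "What is part of the UK?"
uk_parts = [s for (s, _, _) in match(triples, predicate="part of", obj="UK")]
print(uk_parts)
```

Leaving any position as a wildcard is the same idea that graph query languages build on: a question is a triple with blanks in it.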
17. RDF Graph: A Semantic Net
[Graph diagram. Nodes: AnneHathaway, Shakespeare, Stratford, England, UK, Scotland, Macbeth, KingLear. Edge labels shown: wrote, isIn, setIn.]
18. Unique Identifiers: URIs
Shakespeare = http://www.mun.ca/project#shakespeare
wrote = http://www.mun.ca/project#wrote
Macbeth = http://www.mun.ca/project#macbeth
URIs should resolve.
22. Ontologies
An ontology describes a particular domain of knowledge (e.g. bikes, whiskey).
• Establishes controlled vocabulary.
• Models relationships between entities & concepts.
• Built-in rules and datatypes that support reasoning.
23. Controlled Vocabulary
Terms and definitions are posted online, so they can be shared by
many different organizations.
http://www.mun.ca/lit.owl
#wrote
#setIn
#play #book
#poem
#narrated
http://www.mun.ca/lit.owl#wrote
http://www.mun.ca/lit.owl#book
http://www.mun.ca/lit.owl#play
24. Shared Ontologies
Namespaces allow us to combine several vocabularies
while maintaining the distinct meaning of each element.
#person
#partOf
http://gmu.edu/bio.owl
http://mit.edu/geo.rdf
http://mun.ca/lit.owl
#wrote
#setIn
#married
#play #book
#poem
#narrated
#isIn
#country
#city
#region
#birthdate
#deathdate
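One way to picture how namespaces keep vocabularies apart is prefix expansion: a short prefixed name is expanded to a full URI, so the same local name in two vocabularies never collides. The sketch below is a hedged Python illustration. The three namespace URIs come from the slide; the prefix labels (lit, geo, bio) and the pairing of each term with a vocabulary are assumptions, since the slide's diagram layout does not survive.

```python
# Namespace URIs from the slide; the short prefix labels are assumptions.
NAMESPACES = {
    "lit": "http://mun.ca/lit.owl",
    "geo": "http://mit.edu/geo.rdf",
    "bio": "http://gmu.edu/bio.owl",
}


def expand(qname):
    """Expand a prefixed name such as 'lit:wrote' into a full URI."""
    prefix, local = qname.split(":", 1)
    if prefix not in NAMESPACES:
        raise KeyError(f"unknown prefix: {prefix}")
    return f"{NAMESPACES[prefix]}#{local}"


print(expand("lit:wrote"))

# The same local name in two vocabularies yields two distinct URIs, so the
# meanings stay separate (which vocabulary really defines 'partOf' is assumed).
print(expand("geo:partOf") != expand("lit:partOf"))
```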
46. Growth of RDFa
Usage of RDFa increased 510% between
Mar 2009 and Oct 2010
From: http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
64. Competing Vocabularies
...how many ways to describe a book,
journal article or a place?
Ian Millard, Hugh Glaser, Manuel Salvadores, Nigel Shadbolt
http://eprints.ecs.soton.ac.uk/21681/5/cold2010-slides.pdf
71. Trust
The largest hurdle to library adoption of
Linked Data, though, may not be
educational or technological …The
sticking point for librarians may be an
issue of trust.
- Ross Singer, “Linked Data Now!”
74. Licensing
“You shall not use the data made available
through the GC Open Data Portal in any way
which, in the opinion of Canada, may bring
disrepute to or prejudice the reputation of
Canada.”
75. VoID
:DBpedia a void:Dataset ;
    dcterms:license <http://www.gnu.org/copyleft/fdl.html> .
• Schema to describe linked datasets
76. Oh - One more thing…
“who’s minding the ranch?”
77. RDA
• Works within MARC, but also works as a linked
data metadata vocabulary
82. Age of Chaotic Innovation?
LIBRIS (Swedish Union Catalog)
Library of Congress (LCSH, OSI)
German National Library
Hungarian National Library
British Library
Europeana
Linked Periodicals Data
Virtual International Authority File
Dewey Decimal Classification
BIBSYS’ authority files
Thesaurus for Economics
Rameau
Swedish Subject Headings
German Subject Headings
Metadata Authority Description Schema (MADS)
83. The Chaos Tamers
• W3C Linked Library Data Incubator Group
• IFLA Semantic Web Interest Group
• CKAN Linked Library Data Group
• LITA/ALCTS Linked Library Data Interest Group
I’m going to talk generally about the semantic web – what it is, what it can do – then Gillian will talk specifically about the potential of the semantic web to solve some of the challenges faced by libraries.
Why do we need a new web at all? Let’s review some of the things that our current search engines don’t do well.
Current web search engines operate on string matching. They have no way to extract meaning from unstructured masses of textual data.
http://semanticweb.com/happy-birthday-wikipedia-but-dbpedia-has-reason-to-celebrate-too_b17320
When the web is structured more like a database, computers will be able to do a lot more filtering, grouping, and reasoning. The first thing that we need to do is to stop publishing big unstructured blurbs of HTML information, and provide better metadata. (@udcmrk is Martin Kalfatovic.)
One element in the FOAF ontology is “workplaceHomepage”. Like all RDF data, this element has its own unique URI, which you see in pink at the bottom of the screen.
When I enter this URI in a browser, I get information back about the object. http://xmlns.com/foaf/spec/#term_workInfoHomepage
Reasoning is one very powerful aspect of the semantic web. The ability to reason allows computers to infer new information from explicit statements. It’s a complex concept, so the easiest way to describe it is to give you a few examples of computer-based reasoning.
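As one small illustration of inferring new information from explicit statements, the Python sketch below derives that Stratford is located in the UK from two stated facts. Treating "isIn" and "partOf" as transitive containment is an assumption made for this example, and the `infer_located_in` helper is hypothetical; a real reasoner would take such rules from the ontology.

```python
# Explicit statements, reusing the Shakespeare examples from the slides.
facts = {
    ("Stratford", "isIn", "England"),
    ("England", "partOf", "UK"),
    ("Scotland", "partOf", "UK"),
    ("Shakespeare", "livedIn", "Stratford"),
}

# Assumption for this sketch: both predicates express transitive containment.
CONTAINMENT = {"isIn", "partOf"}


def infer_located_in(facts):
    """Apply the transitivity rule repeatedly until nothing new is derived."""
    located = {(s, o) for (s, p, o) in facts if p in CONTAINMENT}
    changed = True
    while changed:
        changed = False
        for (a, b) in list(located):
            for (c, d) in list(located):
                if b == c and (a, d) not in located:
                    located.add((a, d))  # a is in b, b is in d => a is in d
                    changed = True
    return located


located = infer_located_in(facts)
print(("Stratford", "UK") in located)
```

Nobody stated that Stratford is in the UK; the pair appears only because the rule was applied to the two facts that were stated.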
Firefox Plugin - good for viewing RDF and OWL files. http://dig.csail.mit.edu/2007/tab/
Using Tabulator to view the Family Ontology: http://protege.cim3.net/file/pub/ontologies/family.swrl.owl/family.swrl.owl
We’ve talked about how RDF allows us to create structured data, and how ontologies provide controlled vocabularies that can be shared. The last step is to link all of that data together in as many ways as possible.
The first step towards the semantic web vision as articulated by the W3C is for organizations to publish their data as RDF, using shared vocabularies and ontologies.
The second step is to establish links between the data exposed by different organizations. In order for linked data to become a web-scale discovery solution it is really important to link your own RDF data with other people’s RDF data.
One of the challenges is finding relevant RDF links from many different sources. You especially want to be connected with major linking hubs.
Silk: http://www4.wiwiss.fu-berlin.de/bizer/silk/
Discovering links between data items across data sets requires record linkage and duplicate detection techniques (e.g. Jaro-Winkler).
Interlinking DBpedia movies with LinkedMDB directors. Silk was fed with the 50000 movies from DBpedia and 2500 directors from LinkedMDB. Silk was configured to set a dbpedia:director link from the movie to its director.
Identifying duplicate person descriptions in a data stream. owl:sameAs links for URIs which effectively identify the same entity.
http://www.wiwiss.fu-berlin.de/en/institute/pwo/bizer/research/publications/JentzschIseleBizer-Silk-Poster-ISWC2010.pdf
LinQuer: http://dblab.cs.toronto.edu/project/linquer/
LinQuer is a tool for semantic link discovery over relational data, based on string and semantic matching techniques and their combinations. Discovering links between different entities in data sources is a challenging task and an attractive research area. Existence of links add value to data sources, enhance data access and information discovery, and allow or enhance many increasingly important data mining tasks. When data sources are not linked, they resemble islands of data (or data silos), where each island maintains only part of the data necessary to satisfy a user's information needs. Penetrating these silos to both understand their contents and understand potential semantic connections is a daunting task. What users and data publishers need is automated support for creating referential links between data that reside in different sources and that are semantically related. Finding such links often requires the use of approximate matching (to overcome syntactic representational differences and errors) and semantic matching (to find specific semantic relationships). Furthermore, both types of matching must be tightly integrated to accommodate for the tremendous heterogeneity found in the data that reside in today's information systems.
OKKAM: http://www.okkam.org/okkam-more
"Entities should not be multiplied beyond necessity" [Ockham's razor, XIV century]
"Entity identifiers should not be multiplied beyond necessity" [OKKAM's razor, XXI century]
OKKAM will contribute to this vision by supporting the convergence towards the use of a single and globally unique identifier for any entity which is named on the Web. Therefore, OKKAM will make available to content creators, editors and developers a global infrastructure and a collection of new tools and plugins which support them to easily find public identifiers for the entities named in their contents/services. The ENS will be a distributed service which permanently stores identifiers for entities and provides a collection of core services (e.g. entity matching, ID mapping and resolution) needed to support their pervasive reuse; provide a general service for entity-level integration of virtually any type of data and service into the global Web of Entities.
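Link discovery by approximate matching, as described in these notes, can be sketched in a few lines of Python. This is not how Silk, LinQuer, or OKKAM are implemented: it swaps in the standard library's `difflib.SequenceMatcher` ratio in place of Jaro-Winkler, and the example.org URIs, labels, and 0.9 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] (difflib ratio, not Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source, target, threshold=0.9):
    """Propose an owl:sameAs link from each source URI to its best-matching
    target URI, but only when the label similarity reaches the threshold."""
    links = []
    for s_uri, s_label in source.items():
        best_uri, best_score = None, 0.0
        for t_uri, t_label in target.items():
            score = similarity(s_label, t_label)
            if score > best_score:
                best_uri, best_score = t_uri, score
        if best_uri is not None and best_score >= threshold:
            links.append((s_uri, "owl:sameAs", best_uri))
    return links


# Two hypothetical datasets describing overlapping people (URI -> label).
dataset_a = {
    "http://example.org/a/shakespeare": "William Shakespeare",
    "http://example.org/a/hathaway": "Anne Hathaway",
}
dataset_b = {
    "http://example.org/b/p1": "Shakespeare, William",
    "http://example.org/b/p2": "William Shakespear",
    "http://example.org/b/p3": "Ann Hathaway",
}

links = discover_links(dataset_a, dataset_b)
for link in links:
    print(link)
```

Note that the inverted form "Shakespeare, William" scores poorly under a purely syntactic measure, which is the gap that the semantic matching mentioned above is meant to close.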
This is a graphical representation of the linked data cloud. Every circle represents the RDF data set that has been exposed by a specific organization. The lines show how those datasets have been linked together. Some of the circles have a lot of inbound and outbound links, and we refer to those nodes as linking hubs.
Some linking hubs represent entities from a particular knowledge domain - information about music, or protein sequencing, for example. Some linking hubs, like Freebase or DBpedia, are more general, and contain RDF representations covering a lot of different subjects. DBpedia is an interesting example, because it harvests the Wikipedia database, and converts it into RDF.
One of the problems that we have at the moment is that masses of unstructured text already exist on the internet, and we need ways to insert RDF links into that existing data.
DBpedia Spotlight is an example of a web service that performs semantic annotation of unstructured text. You can see this on the web by simply pasting a paragraph into the textbox on the Spotlight demo page. http://spotlight.dbpedia.org/demo/index.xhtml
By connecting text documents with DBpedia, our system enables a range of interesting use cases. For instance, the ontology can be used as background knowledge to display complementary information on web pages or to enhance information retrieval tasks. Moreover, faceted browsing over documents and customization of web feeds based on semantics become feasible. Finally, by following links from DBpedia into other data sources, the Linked Open Data cloud is pulled closer to the Web of Documents.
When you click the annotate button, DBpedia’s processing engine identifies concepts and entities within this text blurb, and suggests links to the RDF descriptions of those objects within DBpedia. [One of the ways DBpedia Spotlight aims at flexibility is by letting users determine what degree of precision makes the most sense for the application to which they would like to apply its semantic annotation. The current version of DBpedia Spotlight was built from a DBpedia3.6+Wikipedia dump from Oct. 2010, and users can configure the confidence value for returning annotations about content entities. Setting it higher may result in fewer annotations but the ones returned are more likely to be correct, while a lower confidence value will try to get you as many annotations as possible but the likelihood of mistakes grows. http://semanticweb.com/the-spotlight%E2%80%99s-on-dbpedia_b17942]
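The precision trade-off behind the confidence setting can be illustrated with a small Python sketch. It does not call DBpedia Spotlight and does not reproduce its algorithm or API: the candidate list, the scores, and the `annotate` helper are all invented for illustration.

```python
# Hypothetical candidate annotations with made-up confidence scores.
candidates = [
    {"surface": "Apple", "uri": "http://dbpedia.org/resource/Apple_Inc.", "confidence": 0.85},
    {"surface": "Apple", "uri": "http://dbpedia.org/resource/Apple", "confidence": 0.30},
    {"surface": "Macbeth", "uri": "http://dbpedia.org/resource/Macbeth", "confidence": 0.65},
    {"surface": "Stratford", "uri": "http://dbpedia.org/resource/Stratford-upon-Avon", "confidence": 0.15},
]


def annotate(candidates, min_confidence):
    """Keep only the candidates whose score reaches the chosen confidence."""
    return [c for c in candidates if c["confidence"] >= min_confidence]


strict = annotate(candidates, min_confidence=0.8)   # fewer, more reliable links
lenient = annotate(candidates, min_confidence=0.2)  # more links, more mistakes

print(len(strict), len(lenient))
```

With the lenient setting, the fruit sense of "Apple" slips through alongside the company, which is the kind of mistake a higher confidence value is meant to filter out.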
If we click through to the DBpedia page for Apple Corporation, we can see DBpedia’s highly structured data relating to the company, and all of the RDF links to related entities. If you link the word Apple in your text to this extended information, then semantically aware tools can use all of this data to search and reason.
http://dbpedia.org/page/Apple_Inc.
“Connecting your text to DBpedia enables this use case of more semantic processing or browsing of your text,” says Mendes.
http://semanticweb.com/the-spotlight%E2%80%99s-on-dbpedia_b17942
A lot of this stuff may seem intimidating, but you are not expected to know how to write RDF in Notepad, in the same way that you don’t need to know HTML in order to publish a blog post. Lots of semantic authoring tools exist that allow you to produce RDF as you publish new information.
Drupal is a content management system that many libraries already use to publish their websites. The latest version of Drupal has RDF publishing tools built right into the core. As you create your website and add new content, RDF data will automatically be added to your pages, even if you aren’t aware that this is happening. http://drupal.org/
Uses forms to collect information so that data is structured as it is entered.
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
http://sandbox.semantic-mediawiki.org/wiki/Special:ExportRDF
http://smwdemo.ontoprise.com/index.php/User:Lisagoddard
http://www.zemanta.com/
http://lisagoddard.blogspot.com
Supports RDF output, links to Linking Open Data entities and has properly defined namespace. Zemanta suggests appropriate in-text links, so if you type a name, for example, it will suggest a Wikipedia page, or a blog or an online portfolio for that person.
Optimized for user-generated content: Other semantic APIs are built to manage only well-formatted documents and texts. Zemanta is built with the fluid nature of today's Web in mind and will not fail to extract the meaning even in the most dubious of situations.
Implicit disambiguation means that it never confuses Apple for apples: We achieve this by comparing numerous meanings of each extracted term and acting based on that evaluation.
We’ve been hearing about semantic technologies for a long time now, and a lot of people think that linked data is just a lot of blue sky thinking that has no support on the current web.
http://tripletalk.wordpress.com/2011/01/25/rdfa-deployment-across-the-web/
The data shows that the usage of RDFa has increased 510% between March, 2009 and October, 2010, from 0.6% of webpages to 3.6% of webpages (or 430 million webpages in our sample of 12 billion). This is largely thanks to the efforts of the folks at Yahoo! (SearchMonkey), Google (Rich Snippets) and Facebook (Open Graph), all of whom recommend the usage of RDFa.
Many of the major technology companies have already invested in Semantic Tech.
Facebook = OpenGraph Protocol
Twitter = Twitter Annotations
Cisco inked deal with SW company DERI.
Apple bought Siri personal assistant app.
Google acquired Metaweb & Freebase, supports RDFa in Rich Snippets.
Microsoft bought PowerSet in 2008 to integrate with Bing. In 2010 they licensed semantic technologies from Cognition.
Facets appear down the left side that allow you to refine your search.
Now that you have command over some of the basic semantic web concepts, Gillian is going to talk about linked data specifically in libraries.