A talk from the London SemWeb meetup hosted by the BBC Academy in London, Mar 30 2012.
Video of the talk: http://www.youtube.com/watch?v=_-6mhdjE1XE
See also
http://www.meetup.com/LondonSWGroup/events/56987682/
http://openglam.org/2012/03/29/libraries-media-and-the-semantic-web-event-at-the-bbc/
https://twitter.com/#!/kansandhaus/status/185064835694862337
1. 'Schema.org and One Hundred Years of Search'
Libraries, Media and The Semantic Web
BBC Academy, March 28th 2012, London
Dan Brickley <danbri@danbri.org>
Friday, March 30, 2012
2. In 20 minutes
• Introduce you to the schema.org initiative
• Revisit 'the Web before the Web' of 1912
• Use this to describe what's new with
schema.org, ... and the practical choices we
face when scaling to billions of users and
pages
3. Intro: Dan Brickley
• Ex-W3C, helped start Semantic Web project
• Worked on RDF/S, FOAF, SKOS & other
standards around W3C
• Currently danbri@google.com working on
<http://schema.org/> project
• See also <http://danbri.org/>, @danbri
6. ■ The Republic of China is proclaimed.
■ Albert Berry makes the first parachute jump from a moving airplane.
■ Prague Party Conference: Vladimir Lenin and the Bolshevik Party
break away from the rest of the Russian Social Democratic Labour
Party.
■ France establishes a protectorate over Morocco.
■ RMS Titanic strikes an iceberg in the northern Atlantic Ocean.
■ Paramount Pictures, the oldest American motion picture studio still in
operation, is founded
■ Albania declares independence from the Ottoman Empire.
■ First Balkan War
■ Alan Turing, British mathematician is born
■ Semantic search over structured data goes mainstream, in Belgium.
source: http://en.wikipedia.org/wiki/1912
8. Sample queries from 1912
Moteur Diesel. Philosophie des mathematiques. Les pecheries
au Maroc et sur la cote d'Espagne. Finances Bulgares.
Gyroscope. Culte de feu. Motocolture (garden). Evolution de
la dent humaine. Emigration italienne. Casier civil. Chemin de
fer de Bagdad (railroad...). Planete Mars. Suffrage universel.
Nevrose traumatique. Eugenism. Le saumon; Saumons
manques et repeches. Boomerang. Fabrication de la
cyanamide. Emigration des Juifs. Intoxications par le tabac.
Quantite d'huile d'olive importee en Belgique. Jurisprudence
des compagnies d'assurances en Angleterre, Hollande et
Danemark...
10. Search before search
• Paul Otlet, "the man who dreamed the Internet", http://www.youtube.com/watch?v=fmsOI5SdLkE
• "The International Centre organises collections of world-wide
importance. These collections are the International Museum, the
International Library, the International Bibliographic Catalogue and
the Universal Documentary Archives. These collections are
conceived as parts of one universal body of documentation, as an
encyclopedic survey of human knowledge, as an enormous
intellectual warehouse of books, documents, catalogues and
scientific objects."
• Start at http://en.wikipedia.org/wiki/Mundaneum for the full story
11. Libraries, media & ...?
• Universal Decimal Classification (UDC)
used in many 1000s of libraries today
• In BBC archive for 40 years, as 'Lonclass'
• Shows the challenge and promise of
structured description
• So what's in Lonclass? What's not in Lonclass!
21. Lonclass by example
• R672:32.007(47)YELTSIN:342.518.1THATCHER
“TWO SHOTS OF MARGARET THATCHER AND BORIS YELTSIN”
• [BRITISH AEROSPACE].007.11PEARCE:656.881:342.518.1THATCHER
“LETTER TO MRS THATCHER FROM SIR AUSTIN PEARCE”
• 656.881:301.162.721:32.007THATCHER:654.192.731TV-AM
“MARGARET THATCHER'S LETTER OF APOLOGY TO TV AM”
22. Compositional Semantics
• 656.881:301.162.721 “LETTERS OF APOLOGY”
• 656.881 “LETTERS (POSTAL SERVICES)”
• 656.881:06.022.6 “RESIGNATION LETTERS”
• 654.192.731TV-AM “TV AM (TELEVISION AM)”
(this work pre-dated modern linguistics, never mind computing...)
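The composition shown on this slide can be sketched in a few lines of Python. This is only an illustration: it assumes that ':' separates the components of a composite code and that a component is a numeric notation optionally followed by a name in capitals, which is what the examples suggest but not a documented Lonclass grammar. The label table holds only the four entries from the slide, and the function names are made up for this sketch.

```python
import re

# Labels copied from the slide; a real Lonclass table would be far larger.
KNOWN_LABELS = {
    "656.881": "LETTERS (POSTAL SERVICES)",
    "656.881:301.162.721": "LETTERS OF APOLOGY",
    "656.881:06.022.6": "RESIGNATION LETTERS",
    "654.192.731TV-AM": "TV AM (TELEVISION AM)",
}

# Assumption: numeric notation, optionally followed by a name in capitals.
FACET = re.compile(r"^(?P<notation>[0-9.()]+)(?P<name>[A-Z][A-Z -]*)?$")


def split_code(code):
    """Split a composite code on ':' into its components."""
    return [part for part in code.split(":") if part]


def parse_facet(facet):
    """Separate the numeric notation from a trailing name, if the pattern fits."""
    match = FACET.match(facet)
    if match is None:
        # Components such as 'R672' do not fit the assumed pattern; keep them whole.
        return {"notation": facet, "name": None}
    return {"notation": match.group("notation"), "name": match.group("name")}


def known_labels(code):
    """Find every run of adjacent components that has a label in the table."""
    facets = split_code(code)
    found = []
    for start in range(len(facets)):
        for end in range(start + 1, len(facets) + 1):
            key = ":".join(facets[start:end])
            if key in KNOWN_LABELS:
                found.append((key, KNOWN_LABELS[key]))
    return found


code = "656.881:301.162.721:32.007THATCHER:654.192.731TV-AM"
for key, label in known_labels(code):
    print(key, "=", label)
```

Applied to the TV-AM apology letter code, the lookup finds both the general "letters" term and the more specific "letters of apology" composite, which is the sentence-like layering the slide points at.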
23. Archives and classification
• Lonclass tells a story of the world; of this
country at least; and a lot about the rest
• It is huge - 1000s of terms, composite
sentence-like codes, and rather sparse
• It began with UDC in the 1890s, and remains
key to BBC's media archives even today
25. And now for something new.
26. Schema.org
• Search engine collaboration:
• Google, Bing, Yahoo! & Yandex
• Simple factual data for better search
• Launched June 2011, schema.org schema
• 300 classes, 261 properties & growing
• discussions: W3C WebSchemas group
27. Example: Google Rich Snippets
From: http://www.google.com/webmasters/tools/richsnippets
See also Yandex's http://webmaster.yandex.ru/microtest.xml
28. On IMDB:
<div id="content-2-wide" itemscope itemtype="http://schema.org/CreativeWork">
  <div class="txt-block">
    <h4 class="inline">Stars:</h4>
    <a onclick="(new Image()).src='/rg/title-overview/star-1/images/b.gif?link=%2Fname%2Fnm0010930%2F';" href="/name/nm0010930/" itemprop="actors">Douglas Adams</a>,
    <a onclick="(new Image()).src='/rg/title-overview/star-2/images/b.gif?link=%2Fname%2Fnm0048982%2F';" href="/name/nm0048982/" itemprop="actors">Tom Baker</a> and
    <a onclick="(new Image()).src='/rg/title-overview/star-3/images/b.gif?link=%2Fname%2Fnm3035100%2F';" href="/name/nm3035100/" itemprop="actors">Hans Peter Brondmo</a>
  </div>
  <div class="star-box" itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
Linked Data: see http://www.imdb.com/name/nm0010930/ for schema.org markup describing Douglas
Adams as a http://schema.org/Person (jobTitle, birthDate, description, performerIn, ...).
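To show how a consumer might read markup like this, here is a rough Python sketch using only the standard library's html.parser. It is not how any search engine processes pages, and it is not a complete microdata parser: it ignores nesting of items and takes the text content as the property value. The sample HTML is a simplified, well-formed version of the excerpt (the onclick attributes are omitted), and the ItempropCollector class and extract function are names invented for this sketch.

```python
from html.parser import HTMLParser

# Simplified version of the IMDB excerpt, for illustration only.
SAMPLE = """
<div itemscope itemtype="http://schema.org/CreativeWork">
  <a href="/name/nm0010930/" itemprop="actors">Douglas Adams</a>,
  <a href="/name/nm0048982/" itemprop="actors">Tom Baker</a>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collect itemtype URLs and (itemprop, text) pairs from an HTML string."""

    def __init__(self):
        super().__init__()
        self.types = []
        self.props = []
        self._open = []  # stack of [tag, itemprop name or None, text parts]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.types.append(attrs["itemtype"])
        self._open.append([tag, attrs.get("itemprop"), []])

    def handle_data(self, data):
        # Text belongs to every open element that carries an itemprop.
        for entry in self._open:
            if entry[1] is not None:
                entry[2].append(data)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag, recording any properties seen.
        while self._open:
            open_tag, prop, parts = self._open.pop()
            text = "".join(parts).strip()
            if prop is not None and text:
                self.props.append((prop, text))
            if open_tag == tag:
                break


def extract(html):
    collector = ItempropCollector()
    collector.feed(html)
    return collector.types, collector.props


types, props = extract(SAMPLE)
print(types)
print(props)
```

The point of the sketch is that the annotations ride along inside ordinary page markup: the same anchors that a browser renders as links also carry the "actors" property for a machine reader.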
29. What’s in the schema?
• Classes (types) e.g. LocalBusiness, Person,
Organization, VideoObject, TVSeries...
• Properties (attributes) e.g. openingHours,
transcript, productionCompany, streetAddress
• That’s all - a dictionary of terms, used for
annotating data within normal Web pages
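The "dictionary of terms" idea can be made concrete with a small Python sketch. It assumes nothing beyond the handful of class and property names listed on this slide; the real vocabulary is much larger, and the unknown_terms helper is hypothetical, written only to show a page's annotations being checked against a shared list of terms.

```python
# Only the terms named on the slide; the full schema.org vocabulary is much larger.
CLASSES = {"LocalBusiness", "Person", "Organization", "VideoObject", "TVSeries"}
PROPERTIES = {"openingHours", "transcript", "productionCompany", "streetAddress"}


def unknown_terms(item_type, property_names):
    """Return the terms used in an annotation that are not in the dictionary."""
    unknown = []
    if item_type not in CLASSES:
        unknown.append(item_type)
    unknown.extend(name for name in property_names if name not in PROPERTIES)
    return unknown


# 'transcipt' is a deliberate misspelling to show what the check catches.
print(unknown_terms("VideoObject", ["transcript", "transcipt"]))
```

A check like this is all a shared dictionary strictly provides: agreement on which names mean something, leaving the data itself in the publisher's pages.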
30. [Diagram of schema.org types: CreativeWork, event, UserInteraction, LocalBusiness, intangible, place, Organization, CivicStructure, Landform]
33. Schema.org scope
• In-page structured data for search
• Not asking an unconstrained “so, how do we
describe cars?”, but “how can we improve
markup on existing pages that describe
cars?” (or Comics, SoftwareApps, Sports, ...)
• Simplify publisher/webmaster experience
• Record agreements between search engines
• Central use case: augmented search results
35. Schema.org and UDC
• In many ways the opposite of UDC
• Small (by contrast), pragmatic, Web-based
• Yet by Semantic Web standards and
culture, it is a big 'centralised' schema
• The art is finding ways to decentralise
without creating chaos
• We don't want to re-invent UDC, or
Wikipedia; but integrate such things into
simple descriptive templates for search
36. Lots missing! e.g. sports
• Current vocabulary emphasizes 'points of
interest' on a map and sporting activities
rather than sports content 'as entertainment'
• We also have terms to describe videos, TV
shows etc., ...but no sports-specifics yet
• How deep to go? How to integrate with
existing vocabulary? How to identify players,
teams, kinds of 'football'? Video clips for that
'hand of God' goal?
38. Everything overlaps*
• We added JobPosting; what if the job was
sports-related?
• We're adding educational markup; does it
help describe sports education, training?
• Is there a sports perspective on the health/
medical vocabulary we're working on?
• Can't coordinate everything! Pragmatism...
* 'intertwingularity'
39. Practicalities
• Delegation to external sources for
enumerations and detail
• e.g. country codes from UN FAO or
Wikipedia/DBpedia/Wikidata
• We don’t want to create big enumerations
• all the countries? sports? things that go on maps?
• Decentralised subclassing & property values
40. Process
• Search partners retain ultimate oversight
• W3C hosts community group, discussion,
wiki and proposal tracking
• Web Schemas group - planning monthly
telecons at W3C, based around proposals
• Evolving, pragmatic, collaborative
41. Compositional Semantics revisited
• If we have SportsCentre and Karate,
can we describe a Karate Club?
• If we have recipes vocab, and medical
vocab, and restaurants, can we describe
allergy-free food?
• If UN have country codes, Wikipedia list
religions, ... then we just re-use those
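One way to picture this kind of composition is a description that combines a general type with values borrowed from external sources. The Python sketch below is an assumption-laden illustration: "SportsCentre" is the type name used on the slide, the "sport" and "country" property names are hypothetical, and the DBpedia URIs stand in for whichever external list would actually be re-used.

```python
from urllib.parse import urlparse


def is_external_reference(value):
    """True if the value is an absolute http(s) URL rather than a plain literal."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def describe(item_type, **properties):
    """Build a simple description: one type plus arbitrary property values."""
    description = {"type": item_type}
    description.update(properties)
    return description


def external_references(description):
    """Pick out the property values that point at someone else's vocabulary."""
    return {
        key: value
        for key, value in description.items()
        if isinstance(value, str) and is_external_reference(value)
    }


# Hypothetical Karate club: general type from the slide, specifics delegated.
karate_club = describe(
    "SportsCentre",
    name="Example Karate Club",
    sport="http://dbpedia.org/resource/Karate",
    country="http://dbpedia.org/resource/United_Kingdom",
)

print(external_references(karate_club))
```

The design choice illustrated here is that the schema only needs the general type and a property slot; the long lists of sports and countries live elsewhere and are referenced by URL.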
42. And libraries
• If the library world share their controlled
vocabularies as open SKOS linked data
• ...can we plug them directly into
schema.org descriptions?
• of videos? news? scholarly articles? (yes)
• Why re-invent when you can collaborate?
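As a sketch of what "plugging in" could look like, the Python below renders microdata for a video whose subject is given as a link to a concept in a library vocabulary. It assumes the schema.org VideoObject type and the "about" property; the concept URI under example.org is made up, since no published SKOS address for any particular vocabulary is given in this talk, and the video_with_subject function is a hypothetical helper.

```python
from html import escape


def video_with_subject(name, concept_uri):
    """Render a microdata snippet linking a video to an external subject concept."""
    return (
        '<div itemscope itemtype="http://schema.org/VideoObject">\n'
        f'  <span itemprop="name">{escape(name)}</span>\n'
        f'  <link itemprop="about" href="{escape(concept_uri, quote=True)}">\n'
        "</div>"
    )


# Hypothetical concept URI; a real one would come from the published vocabulary.
print(video_with_subject("Letter of apology", "http://example.org/concepts/656.881"))
```

The page author only supplies a URL; the labels, broader and narrower terms, and translations stay with whoever maintains the controlled vocabulary.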
43. WebSchemas public-vocabs list
• Schema.org process
• Looking for rough consensus and
incremental improvements
• Realistic examples, simplicity for
publishers, and re-use of existing
vocabulary are important
• <http://www.w3.org/wiki/WebSchemas/>