7. www.sti-innsbruck.at
1. Motivation
1) How many hotels use schema.org?
2) How is schema.org used?
   a) Which classes?
   b) Which attributes?
   c) Is schema.org used correctly?
3) Who is using schema.org in tourism?
2. Data
What is schema.org?
• Initiative founded in 2011
• Vocabulary for structuring data on websites
• Embedded into HTML
– Microdata
– RDFa
– JSON-LD
Source: http://www.schema.org
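To make the embedding concrete, here is a minimal sketch of a schema.org/Hotel annotation in JSON-LD, built in Python. The helper names (`hotel_jsonld`, `as_script_tag`) and the sample values are assumptions for illustration only; the property names (`name`, `telephone`, `address`, `streetAddress`, `addressLocality`, `addressCountry`) are taken from the schema.org vocabulary.

```python
import json


def hotel_jsonld(name, street, locality, country, telephone):
    """Build a schema.org/Hotel description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Hotel",
        "name": name,
        "telephone": telephone,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "addressLocality": locality,
            "addressCountry": country,
        },
    }


def as_script_tag(data):
    """Serialize the annotation so it can be placed inside an HTML page."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a value can never close the script element early.
    body = body.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


if __name__ == "__main__":
    # Made-up sample values, not taken from the analysed data set.
    hotel = hotel_jsonld("Hotel Example", "Hauptstrasse 1", "Innsbruck", "AT", "+43 000 000000")
    print(as_script_tag(hotel))
```

Microdata and RDFa express the same statements as attributes on existing HTML elements, whereas JSON-LD keeps them in a separate script block.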
2. Data
Analysis of all web sites: Common Crawl
• Founded in 2007
• Non-profit organisation
• Crawls the web four times per year
• Data dumps are publicly available
• November 2013: 2.3 billion web pages, 148 TB
• December 2014: 2.1 billion web pages, 160 TB
Source: http://commoncrawl.org/the-data/get-started/
2. Data
Survey of structured data only:
WebDataCommons:
• Started in 2012 at Freie Universität Berlin & KIT
• Currently at the University of Mannheim
• Operated by Chris Bizer
• Extracts structured data from the Common Crawl
– WebTables: 147 million relational tables (from 11 billion HTML tables)
– Hyperlink Graph: 3.5 billion web pages, 128 billion links
– Semantically annotated data:
• November 2013: 44 TB, 2.2 bn URLs
• December 2014: 160 TB, 2 bn URLs
Source: http://webdatacommons.org/structureddata/
2. Data
• November 2013 corpus
• Subset: schema.org/Hotel
– 35 GB
– 127 million triples
• OWLIM-SE repository – thanks to Ontotext
• SPARQL queries
• Linux Debian 3.2, STI – thanks to David
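Cutting a schema.org/Hotel subset out of the corpus can be sketched as follows. This is not the pipeline used for the talk; it assumes the extracted data is available as N-Quads text (one statement per line) and that the class IRI is `http://schema.org/Hotel`. The function names are hypothetical.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
HOTEL = "<http://schema.org/Hotel>"


def split_quad(line):
    """Split one N-Quads line into (subject, predicate, rest).

    Subjects and predicates never contain whitespace, so two splits suffice;
    'rest' keeps the object, the graph IRI and the closing dot untouched.
    """
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        return None  # blank or malformed line
    return parts[0], parts[1], parts[2]


def hotel_subset(lines):
    """Keep every statement whose subject is typed as schema.org/Hotel."""
    quads = [q for q in map(split_quad, lines) if q is not None]
    # First pass: collect the subjects that carry the Hotel type.
    hotels = {s for s, p, rest in quads if p == RDF_TYPE and rest.startswith(HOTEL)}
    # Second pass: keep all statements about those subjects.
    # Simplification: blank-node labels are assumed unique across pages.
    return [q for q in quads if q[0] in hotels]
```

For a corpus of this size the two passes would run over files on disk rather than an in-memory list, which is one reason to load the subset into a triple store and query it with SPARQL instead.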
3. Analysis
1) How many hotels are annotated with schema.org?
4,841,353
• Hotels are annotated several times
– own website
– booking websites
740,298
• This count loses all hotels that share a name
– Adler, Post, ...
Identify hotels by name and address instead!
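The counting problem above can be illustrated with a small sketch. It assumes each annotation has already been reduced to a name, an address and a source page; the `Annotation` type, the normalisation rule and the sample records are hypothetical and only show why the name alone is too weak a key.

```python
from collections import namedtuple

# Hypothetical record type: one schema.org/Hotel annotation found on one page.
Annotation = namedtuple("Annotation", ["name", "address", "source"])


def normalize(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def count_hotels(annotations):
    """Return (all annotations, distinct names, distinct name+address pairs)."""
    total = len(annotations)
    by_name = {normalize(a.name) for a in annotations}
    by_name_and_address = {
        (normalize(a.name), normalize(a.address)) for a in annotations
    }
    return total, len(by_name), len(by_name_and_address)


if __name__ == "__main__":
    # Made-up records: one hotel annotated twice, plus a namesake elsewhere.
    records = [
        Annotation("Hotel Adler", "Street 1, Town A", "own website"),
        Annotation("Hotel  adler", "Street 1, Town A", "booking website"),
        Annotation("Hotel Adler", "Street 9, Town B", "own website"),
        Annotation("Hotel Post", "Street 2, Town A", "booking website"),
    ]
    print(count_hotels(records))
```

Counting raw annotations overcounts the hotel that appears on two sites, counting names merges the two different hotels called Adler, and the name plus address key separates them again.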
3. Analysis
3) Who uses schema.org in tourism?
Hypothesis:
"Schema.org is mainly used by booking and rating
websites, and barely by hotels themselves."
3. Analysis
Approach:
• Take hotels listed on booking and rating sites
– Search for annotations on their own websites
• Cross-check against annotated hotel websites
– Do they appear multiple times in the data set?
Currently: a sample (top booking sites)
Next step: full data set
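One way to sketch this cross-check is to group annotations by the host of the page they were found on. It assumes each annotation is available as a (hotel name, page URL) pair and that the set of booking and rating hosts is supplied by hand; the function names and the example hosts in the usage note are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host of a page URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def annotations_per_host(records):
    """Count annotations per host to see which sites contribute most."""
    return Counter(host_of(url) for _, url in records)


def split_by_source(records, portal_hosts):
    """Sort hotel names by where their annotations were found.

    portal_hosts: hand-picked set of booking and rating hosts (an assumption).
    """
    portal, own = set(), set()
    for name, url in records:
        (portal if host_of(url) in portal_hosts else own).add(name)
    return {
        "portal_only": portal - own,
        "own_only": own - portal,
        "both": portal & own,
    }
```

Calling `split_by_source(records, {"booking.example"})` puts a hotel that only appears on `booking.example` pages into `portal_only`. Matching by name alone inherits the namesake problem from the counting step, so a real check would use name and address as the key.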
3. Analysis
Summary:
• Main users of schema.org/Hotel:
booking and rating sites
• Errors:
– Incomplete annotations
– Wrong classes
– Wrong attributes
– Wrong datatypes
Comprehensive error analysis: University of Mannheim
(R. Meusel & H. Paulheim) [1]
[1] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/MeuselPaulheim-HeuristicsForFixingCommonErrorsInDeployedSchemaOrgMicrodata-ESWC2015.pdf
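The four error types can be illustrated with a small checker. This is a sketch under stated assumptions and not the heuristics from [1]: the property table covers only a tiny hand-picked part of schema.org, and the "required" properties are a choice made for this example, since schema.org itself does not mandate any.

```python
# Illustrative subset only; the real vocabulary is far larger.
KNOWN_CLASSES = {"Hotel", "PostalAddress"}
EXPECTED = {
    "name": str,
    "telephone": str,
    "url": str,
    "address": (str, dict),  # plain text or a nested PostalAddress
}
REQUIRED = ("name", "address")  # assumption for this sketch


def check_hotel(annotation):
    """Return a list of problems found in one annotation dictionary."""
    problems = []
    cls = annotation.get("@type")
    if cls not in KNOWN_CLASSES:
        problems.append(f"wrong class: {cls!r}")
    for prop, value in annotation.items():
        if prop.startswith("@"):
            continue  # JSON-LD keywords such as @type and @context
        if prop not in EXPECTED:
            problems.append(f"wrong attribute: {prop!r}")
        elif not isinstance(value, EXPECTED[prop]):
            problems.append(f"wrong datatype for {prop!r}: {type(value).__name__}")
    for prop in REQUIRED:
        if prop not in annotation:
            problems.append(f"incomplete: missing {prop!r}")
    return problems
```

An annotation typed as `hotel` instead of `Hotel` is flagged as a wrong class, which matches the kind of capitalisation mistake that is easy to make when writing markup by hand.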
3. Analysis
What did we do with that knowledge?
• Talk at TFF 2015 in Mayrhofen
• Paper for SEMANTICS 2015
• Consulting for some TFF participants
– Hotel Adlers Innsbruck
– Hotel in Seefeld
– ...