© Copyright 2015 STI INNSBRUCK www.sti-innsbruck.at
Elias Kärle - June 15th, 2015 - OC Meeting
schema.org usage for hotels
An analysis based on the Web Data Commons data set
1. Motivation
1) How many hotels use schema.org?
2) How is schema.org used?
   a) Which classes?
   b) Which attributes?
   c) Is schema.org used correctly?
3) Who is using schema.org in tourism?
2. Data
What is schema.org?
• Initiative founded in 2011
• Vocabulary for structuring data in web sites
• Embedded in HTML via
  - Microdata
  - RDFa
  - JSON-LD
Source: http://www.schema.org
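
To make the embedding concrete, here is a minimal sketch of a schema.org/Hotel annotation in JSON-LD, generated with Python. The hotel, its address, and the helper name `to_jsonld_script` are made up for illustration; only the property names come from the schema.org vocabulary.

```python
import json

# Hypothetical hotel; every value below is invented for illustration.
hotel = {
    "@context": "https://schema.org",
    "@type": "Hotel",
    "name": "Hotel Example",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "Example Street 1",
        "addressLocality": "Innsbruck",
        "postalCode": "6020",
        "addressCountry": "AT",
    },
    "telephone": "+43 000 000000",
    "url": "https://hotel.example/",
}


def to_jsonld_script(data: dict) -> str:
    """Wrap a JSON-LD document in the script tag used to embed it in HTML."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = to_jsonld_script(hotel)
print(snippet)
```

The same statements could equally be expressed as Microdata or RDFa attributes on the visible HTML elements; JSON-LD is used here only because it is the most compact to show.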
Analysis of all web sites: Common Crawl
• Founded in 2007
• Non-profit organisation
• Crawls the web 4 times per year
• Data dumps are openly available to the public
• November 2013: 2.3 billion web pages, 148 TB
• December 2014: 2.1 billion web pages, 160 TB
Source: http://commoncrawl.org/the-data/get-started/
Surveying only structured data:
WebDataCommons:
• Started in 2012 by Freie Universität Berlin & KIT
• Currently at the University of Mannheim
• Operated by Chris Bizer
• Extracts structured data from the Common Crawl
  - WebTables: 147 million relational tables (out of 11 billion HTML tables)
  - Hyperlink Graph: 3.5 billion web pages, 128 billion links
  - Semantically annotated data:
    • November 2013: 44 TB, 2.2 bn URLs
    • December 2014: 160 TB, 2 bn URLs
Source: http://webdatacommons.org/structureddata/
• November 2013 corpus
• Subset: schema.org/Hotel
  - 35 GB
  - 127 million triples
• OWLIM-SE repository (thanks, Ontotext)
• SPARQL queries
• Linux Debian 3.2, STI (thanks, David)
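
As an illustration of how a schema.org/Hotel subset can be cut out of a quad dump, here is a rough Python sketch. It is not the pipeline used for this analysis (that ran SPARQL against OWLIM-SE); it assumes N-Quads input with single-space separators, and the sample lines and function names are invented.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
HOTEL = "<http://schema.org/Hotel>"


def parse_quad(line):
    """Split one N-Quads line into (subject, predicate, object, graph).

    Simplified on purpose: assumes single spaces between terms and a
    graph IRI on every line.
    """
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    subject, predicate, rest = body.split(" ", 2)
    obj, graph = rest.rsplit(" ", 1)
    return subject, predicate, obj, graph


def hotel_subset(lines):
    """Keep only quads whose subject is typed as schema.org/Hotel."""
    quads = [parse_quad(line) for line in lines if line.strip()]
    # Key on (subject, graph) so equal blank node labels on different
    # pages are never mixed up.
    hotels = {(s, g) for s, p, o, g in quads if p == RDF_TYPE and o == HOTEL}
    # Note: nested nodes (e.g. a separate address node) are not followed.
    return [q for q in quads if (q[0], q[3]) in hotels]


# Invented sample data.
sample = [
    '_:n1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Hotel> <http://hotel.example/> .',
    '_:n1 <http://schema.org/name> "Hotel Example"@en <http://hotel.example/> .',
    '_:n2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Restaurant> <http://food.example/> .',
    '_:n2 <http://schema.org/name> "Some Restaurant" <http://food.example/> .',
]

subset = hotel_subset(sample)
for quad in subset:
    print(quad)
```

A fuller version would stream the dump instead of loading it into memory and would also pull in the nodes a hotel links to, such as its address.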
3. Analysis
1) How many hotels are annotated with schema.org?
4,841,353
• Many hotels are annotated several times
  - on their own website
  - on booking websites
740,298 after removing duplicate names
• This merges all distinct hotels that share a name
  - Adler, Post, ...
→ Match on name and address instead!
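
The effect of the two deduplication keys can be sketched on a toy example. The records and the `normalize` helper below are invented; the point is only that counting by name alone merges different hotels, while name plus address keeps them apart.

```python
def normalize(text):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


# Invented records: one Hotel Adler annotated twice, a second Hotel Adler
# in another town, and one Hotel Post.
records = [
    {"name": "Hotel Adler", "address": "Street A 1, Town X", "source": "hotel-adler-x.example"},
    {"name": "Hotel Adler", "address": "Street A 1, Town X", "source": "booking-portal.example"},
    {"name": "Hotel Adler", "address": "Street B 2, Town Y", "source": "booking-portal.example"},
    {"name": "Hotel Post", "address": "Street C 3, Town Z", "source": "hotel-post-z.example"},
]

by_name = {normalize(r["name"]) for r in records}
by_name_and_address = {
    (normalize(r["name"]), normalize(r["address"])) for r in records
}

print(len(records), len(by_name), len(by_name_and_address))  # 4 2 3
```

In real data the address strings would need more careful normalisation than this, since the same address is rarely written identically on two sites.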
3) Who uses schema.org in tourism?
Hypothesis:
"Schema.org is mainly used by booking and rating websites, and barely by hotels themselves."
Approach:
• Take hotels found on booking and rating sites
→ Search for annotations on their own web sites
• Cross-check with annotated hotel websites
→ Do they appear multiple times in the data set?
Currently: spot checks (top booking sites)
Next step: full data set
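
One way the cross-check could be run over the full data set is sketched below. The portal list, the records, and the rule "any non-portal domain counts as the hotel's own site" are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical list of booking and rating portals.
PORTALS = {"booking-portal.example", "rating-portal.example"}


def classify(records):
    """Map each (name, address) key to where its annotations were found."""
    sources = defaultdict(set)
    for name, address, domain in records:
        sources[(name, address)].add(domain)

    result = {}
    for key, domains in sources.items():
        on_portal = bool(domains & PORTALS)
        # Assumption: a domain that is not a known portal is the hotel's own.
        on_own_site = bool(domains - PORTALS)
        if on_portal and on_own_site:
            result[key] = "both"
        elif on_portal:
            result[key] = "portal only"
        else:
            result[key] = "own site only"
    return result


# Invented sample records: (name, address, domain of the annotated page).
found = [
    ("Hotel Adler", "Street A 1", "booking-portal.example"),
    ("Hotel Adler", "Street A 1", "hotel-adler.example"),
    ("Hotel Post", "Street C 3", "rating-portal.example"),
]

usage = classify(found)
for key, label in usage.items():
    print(key, label)
```

The share of "portal only" entries would then be the figure that supports or contradicts the hypothesis.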
Summary:
• Main users of schema.org/Hotel:
→ booking and rating sites
Errors:
→ Incomplete annotations
→ Wrong classes
→ Wrong attributes
→ Wrong datatypes
→ Comprehensive error analysis: Uni Mannheim
(R. Meusel & H. Paulheim) [1]
[1] http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/MeuselPaulheim-HeuristicsForFixingCommonErrorsInDeployedSchemaOrgMicrodata-ESWC2015.pdf
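
The four error categories can be illustrated with a small checker. This is a toy sketch, not the method of [1]: the expectation table covers only a handful of properties, and the "required" set is our own assumption, since schema.org itself does not mandate any property.

```python
# Deliberately tiny, hypothetical expectation table.
KNOWN_TYPES = {"Hotel", "PostalAddress"}
EXPECTED = {
    "name": str,
    "telephone": str,
    "url": str,
    "address": (str, dict),
}
REQUIRED = {"name", "address"}  # assumption for this sketch only


def check(item):
    """Return a list of error strings for one flat annotation."""
    errors = []
    if item.get("@type") not in KNOWN_TYPES:
        errors.append(f"wrong class: {item.get('@type')!r}")

    props = {k: v for k, v in item.items() if not k.startswith("@")}
    for missing in sorted(REQUIRED - set(props)):
        errors.append(f"incomplete: missing {missing}")

    for key, value in props.items():
        if key not in EXPECTED:
            errors.append(f"wrong attribute: {key}")
        elif not isinstance(value, EXPECTED[key]):
            errors.append(f"wrong datatype: {key}")
    return errors


# Invented annotation containing one error of each kind.
bad = {"@type": "hotel", "name": "Hotel Example", "phone": "1", "telephone": 123}
problems = check(bad)
for problem in problems:
    print(problem)
```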
What did we do with that knowledge?
• Talk at TFF 2015 in Mayrhofen
• Paper for SEMANTICS 2015
• Consulting for some TFF participants
  - Hotel Adlers Innsbruck
  - Hotel in Seefeld
  - ...