Slideset of the keynote talk given by Christian Bizer at the Language Data and Knowledge (LDK 2019) conference in Leipzig, Germany
Abstract of the talk:
Millions of websites have started to annotate data describing products, local business, events, jobs, places, recipes, and reviews within their HTML pages using the schema.org vocabulary. These annotations are widely used by search engines to render rich snippets within search results. Surprisingly, the annotations are hardly used by the research community. In the talk, Christian Bizer investigates the potential of schema.org annotations for being used as training data for tasks such as entity matching, information extraction, and sentiment analysis. Web pages that offer semantic annotations often also contain additional structured data in the form of HTML tables. In the second part of the talk, Christian Bizer discusses the interplay of semantic annotations and web tables for information extraction as well as the general potential of relational HTML tables for complementing knowledge bases such as DBpedia, focusing on the discovery of formerly unknown long tail entities as well as the extraction of n-ary relations.
Link to the conference website:
http://2019.ldk-conf.org/invited-speakers/
Schema.org Annotations and Web Tables Underexploited Semantic Nuggets on the Web
1. Data and Web Science GroupSchema.org Annotations and Web Tables
Underexploited Semantic Nuggets on the Web?
120-23 May 2019 in Leipzig, Germany
Prof. Dr. Christian Bizer
2. Data and Web Science GroupCool Resources are often Build
Exploiting Some Structure Somewhere
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 2
3. Data and Web Science Group
Web Tables
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 3
4. Data and Web Science Group
Schema.org Annotations
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 4
<div itemtype="http://schema.org/Hotel">
<span itemprop="name">Vienna Marriott Hotel</span>
<span itemprop="address" itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Parkring 12a</span>
<span itemprop="addressLocality">Vienna</span>
</span>
</div>
<div itemtype="http://schema.org/Product">
<h1 itemprop="name">Signature Instant Dome 7</h1>
Item # <span itemprop=âproductID">200734246807</span>
<img itemprop=âimageâ src=â../20073424680.jpg">
<span itemprop=âreview" itemtype="http://schema.org/Review">
<span itemprop=âratingValue">5</span>
<span itemprop=âreviewBody">Great waterproof tent!</span>
</span>
</div>
5. Data and Web Science Group
Hypothesis
â Web tables and Schema.org annotations are useful
training data for tasks such as
â Named Entity Recognition
â Information Extraction
â Entity Matching
â Taxonomy Induction
â Sentiment Analysis
â Knowledge Base Completion
â but they are hardly used by public research yet
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 5
6. Data and Web Science Group
Outline
1. Schema.org Annotations
â Adoption on the Web
â Current applications and further potential
2. Web Tables
â Usage on the Web
â Application potential
3. Conclusions
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 6
7. Data and Web Science Group
1. Schema.org Annotations
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 7
â ask site owners since 2011 to
annotate data for enriching search results
â 675 Types: Event, Place, Local Business, Product, Review, Person
â Encoding: Microdata, RDFa, JSON-LD
8. Data and Web Science Group
Usage of Schema.org Data @ Google
Data snippets
within
search results
Google maps
location data
Data snippets
within
info boxes
9. Data and Web Science Group
Usage of Schema.org Data @ Google
https://developers.google.com/search/docs/guides/search-gallery
Product
offers
Job
offers
10. Data and Web Science Group
Web Data Commons Project
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 10
â extracts all Microformat, Microdata,
RDFa, JSON-LD data from the Common Crawl
â analyzes and provides the extracted data for download
â statistics about some extraction runs
â 2018 CC Corpus: 2.5 billion HTML pages ï 31.5 billion RDF triples
â 2017 CC Corpus: 3.1 billion HTML pages ï 38.2 billion RDF triples
â 2014 CC Corpus: 2.0 billion HTML pages ï 20.4 billion RDF triples
â 2010 CC Corpus: 2.8 billion HTML pages ï 5.1 billion RDF triples
â uses 100 machines on Amazon EC2
â approx. 2000 machine/hours ïš 250 Euro
http://webdatacommons.org/structureddata/
11. Data and Web Science Group
Overall Adoption 2018
http://webdatacommons.org/structureddata/2018-12/
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 11
944 million HTML pages out of the 2.5 billion pages
provide semantic annotations (37.1%).
9.6 million pay-level-domains (PLDs) out of the
32.8 million pay-level-domains covered by the crawl
provide semantic annotations (29.3%).
12. Data and Web Science Group
Frequently used Schema.org Classes
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 12
Top Classes # Websites (PLDs)
Microdata JSON-LD
schema:WebPage 1,124,583 121,393
schema:Product 812,205 40,169
schema:Offer 676,899 57,756
schema:BreadcrumbList 621,344 205,971
schema:Article 612,361 57,082
schema:Organization 510,069 1,349,775
schema:PostalAddress 502,615 176,500
schema:ImageObject 360,875 111,946
schema:Blog 337,843 12,174
schema:Person 324,349 335,784
schema:LocalBusiness 294,390 249,017
schema:AggregateRating 258,078 23,105
schema:Review 124,022 6,622
schema:Place 92,127 66,396
schema:Event 88,130 63,605
http://webdatacommons.org/structureddata/2018-12/
13. Data and Web Science Group
Attributes used to Describe Products
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 13
Top Attributes PLDs Microdata
# %
schema:Product/name 754,812 92 %
schema:Product/offers 645,994 79 %
schema:Offer/price 639,598 78 %
schema:Offer/priceCurrency 606,990 74 %
schema:Product/image 573,614 70 %
schema:Product/description 520,307 64 %
schema:Offer/availability 477,170 58 %
schema:Product/url 364,889 44 %
schema:Product/sku 160,343 19 %
schema:Product/aggregateRating 141,194 17 %
schema:Product/brand 113,209 13 %
schema:Product/category 62,170 7 %
schema:Product/productID 47,088 5 %
⊠⊠âŠ
http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
14. Data and Web Science Group
Use Case 1: Gather Large Amounts of
Hyponym Relationships
â plenty of entity
names are
annotated
â Hearst Pattern
of the 21st century
(if class is covered)
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 14
http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html
Class # Entities Names
Product 368,386,801
Organization 176,365,220
Local Business 44,100,625
Place 31,748,612
Event 26,035,295
Job Posting 5,078,941
Movie 5,233,520
Hotel 2,537,089
College or University 2,392,797
Recipe 2,036,417
Music Album 1,594,941
Restaurant 1,480,443
Airport 1,145,323
15. Data and Web Science Group
Coverage: Top 15 Hotel Websites
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 15
Top 15 Travel Websites schema:Hotel
Booking.com P
TripAdvisor P
Expedia P
Agoda P
Hotels.com P
Kayak P
Priceline P
Travelocity P
Orbitz P
ChoiceHotels P
HolidayCheck P
ChoiceHotels P
InterContinental Hotels Group P
Marriott International P
Global Hyatt Corp. T
Adoption:
93 %
KĂ€rle, Fensel, Toma, Fensel: Why Are There More Hotels in Tyrol than in Austria? Analyzing Schema.org Usage in
the Hotel Domain. ICTT 2016.
16. Data and Web Science Group
Use Case 2: Distant Supervision for
Information Extraction
â Goal: Learn how to extract data from websites that do
not provide schema.org annotations using annotations as
training data.
â Example: Google Small Events Paper @ SIGIR 2015
â extract data about small venue concerts, theatre performances,
garage sales, etc.
â they use 217,000 schema.org events from ClueWeb2012
as distant supervision, e.g.
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 16
Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
<div itemtype="http://schema.org/MusicEvent">
<h1 itemprop="name">School Cool Concert</h1>
<p itemprop="startDate">25.09.2018 15:00</p>
<p itemprop=âlocation" itemtype="http://schema.org/Place">School Gym Puero High</span>
<div>
17. Data and Web Science Group
Google Small Events Paper @ SIGIR 2015
â Approach
â train an information extraction method that
combines structural and text features
â extracted attributes: What? When? Where?
â apply method to pages without semantic
annotations
â Result
â extract 200,000 additional small-scale
events from ClueWeb2012
â double recall at 0.85 precision
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 17
Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
18. Data and Web Science Group
Underexploited?
â Using schema.org data as
distant supervision will likely
also work for other information
extraction tasks.
â Research questions
â How to best exploit annotations
as distant supervision?
â How good do the annotations
perform compared to other
sources of supervision?
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 18
Class # Entities
Product 368,386,801
Organization 176,365,220
Local Business 44,100,625
Place 31,748,612
Event 26,035,295
Job Posting 5,078,941
Movie 5,233,520
Hotel 2,537,089
College or University 2,392,797
Recipe 2,036,417
Music Album 1,594,941
Restaurant 1,480,443
Airport 1,145,323
19. Data and Web Science Group
Use Case 3: Training Data for
Entity Matching / Link Discovery
â Goal: Find e-shops that offer a specific product in order to
compare product prices
â Observation: Some e-shops annotates product IDs, many
e-shops do not
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 19
Properties PLDs
# %
schema:Product/name 754,812 92 %
schema:Product/description 520,307 64 %
schema:Product/sku 160,343 19 %
schema:Product/productID 47,088 5 %
schema:Product/mpn 12,882 1.6%
schema:Product/gtin13 7,994 1%
20. Data and Web Science Group
Learn How to Match Products using
Schema.org Data as Supervision
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 20
Product
offer
Product
offer
Product
offer
Product
offerProduct
offerProduct
offer
Product
offer
Product
offer
Product
offer
Product
offer
Product
offer
Product
offer
Clusters of offers
having the same product ID
Unseen offers
without product IDs
Learn
Matcher
Matcher Same product?Product
offer
Product
offer
Product
offer
21. Data and Web Science Group
Data Cleansing
for Building the Clusters
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 21
Filtering of product
offers with annotated
product identifiers
Removal of
listing pages
Filtering by
identifier value
length
Cluster creation
based on identifier
value co-occurrence
Split wrong
clusters due to
category IDs
121M offers 58M offers 26M offers 16.4M clusters
All Languages
26.5M offers
79K distinct websites
16.6M clusters (products)
English Offers
16M offers
43K distinct websites
10M clusters (products)
16.6M clusters
22. Data and Web Science Group
Cluster Size Distribution by Category
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 22
A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching.
ECNLP 2019 Workshop @ WWW2019.
23. Data and Web Science Group
Training Subsets
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 23
Category #Positive
Pairs
#Negative
Pairs
s:name s:description specification
table
Computers_small 20,444 21,676 100% 82% 55%
Cameras_small 7,539 9,093 100% 61% 4%
Watches_small 5,449 8,819 100% 48% 4%
Shoes_small 3,476 5,924 99% 36% 1%
Computers_large 120,568 139,020 100% 87% 60%
Cameras_large 16,390 84,788 100% 65% 5%
Watches_large 7,729 65,516 100% 55% 3%
Shoes_large 3,476 36,224 100% 35% 1%
Gold standard: 550 offer pairs per category
24. Data and Web Science Group
Results: Traditional Learning Methods
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 24
Category Classifier Precision Recall F1
Computers_small LinearSVC 0.75 0.94 0.84
Cameras_small Random Forest 0.75 0.87 0.81
Watches_small LinearSVC 0.74 0.91 0.81
Shoes_small Log. Regression 0.72 0.95 0.82
Computers_large Random Forest 0.92 0.95 0.94
Cameras_large Dec. Tree 0.86 0.85 0.86
Watches_large Random Forest 0.90 0.87 0.88
Shoes_large Log. Regression 0.94 0.86 0.90
Feature creation: word co-occurrence
25. Data and Web Science Group
DeepMatcher
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 25
- word-based, character based
- pre-trained (word2vec, GloVe, fastText), learned
- attribute summarization: SIF, RNN, Attention, Hybrid
- attribute comparison: fixed distance, learnable distance
- classification: fully connected neural net
Mudgal, Li, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
26. Data and Web Science Group
Results: DeepMatcher
â Human-level matching performance just using web data for training.
â The 6.6% errors in the training data are averaged out in the learning.
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 26
Category Model Precision Recall F1
Computers_small Combined SIF 0.91 0.98 0.95
Cameras_small Combined RNN 0.95 0.93 0.94
Watches_small Combined RNN 0.93 0.98 0.95
Shoes_small Combined SIF 0.87 0.99 0.92
Computers_large RNN 0.96 0.98 0.97
Cameras_large RNN 0.94 0.98 0.96
Watches_large RNN 0.96 0.99 0.97
Shoes_large RNN 0.97 0.99 0.98
27. Data and Web Science Group
Underexploited?
â If we can get human-level matching performance using just
web data for training, why do product comparison portals
still pay people to manually label offers for training?
â Research questions
â Does this work for all product categories?
â How good does it work for new products?
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 27
28. Data and Web Science Group
Use Case 3: Taxonomy Matching and
Taxonomy Clustering
â E-Shops annotate their product categorization
â The marketing ideas behind these categorizations differ widely
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 28
Relevant Properties # PLDs %
schema:Offer/category 62,170 7 % of all shops
schema:WebPage/BreadcrumbList 621,344 11% of all sites
Apparel > Summer > Mens > Jerseys
Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jersey > over $60 s
Home > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables
Shop > Tables > Dining Tables > Aland Wood Tables
29. Data and Web Science Group
Cool Taxonomy Clustering Problem
â How to cluster 62,000 taxonomies in order to discover the
most frequent categorization ideas within a domain?
â How do shops categorize action cameras?
â What subcategories are frequently used for action cameras?
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 29
Meusel, Primpeli, Meilicke, Paulheim, Bizer: Exploiting Microdata Annotations to Consistently Categorize Product
Offers at Web Scale. EC-Web 2015.
30. Data and Web Science Group
Learn How to Match Product Taxonomies
using Schema.org Data as Supervision
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 30
Category
A
Category
B
Taxonomies that share
some products
Unseen categories
containing some
products
Similar category?
P1
P2
P1
P2
P3
Learn
Matcher
Matcher
31. Data and Web Science Group
Underexploited?
â Cool training data for the rather unexplored research
challenge of large-scale taxonomy clustering
â Research questions
â Which similarity metrics work best?
â How to block 62,000 taxonomies in order to deal with
the runtime complexity?
â On which depths to start clustering in order to answer
interesting business questions?
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 31
32. Data and Web Science Group
Use Case 4: Training Data for Sentiment
Analysis
â 130,000 websites annotate reviews
â 2018 WDC corpus contains 13.5 million reviewswhich
annotate schema:ratingValue and schema:reviewBody
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 32
<div itemtype="http://schema.org/Product">
<h1 itemprop="name">Signature Instant Dome 7</h1>
Item # <span itemprop=âproductID">200734246807</span>
<img itemprop=âimageâ src=â../20073424680.jpg">
<span itemprop=âreview" itemtype="http://schema.org/Review">
<span itemprop=âratingValue">5</span>
<span itemprop=âreviewBody">Great waterproof tent which
worked perfectly for me âŠ.</span>
</span>
</div>
33. Data and Web Science Group
Reviews by Type of Reviewed Item
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 33
C. Bizer, A. Primpeli, R. Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
34. Data and Web Science Group
Underexploited?
â Up to my knowledge, schema.org reviews from many
websites are not used as training data for sentiment
analysis yet.
â Research questions
â For which cases does the training data help to improve
sentiment analysis compared to other training corpora?
â For which types of entities are there improvements?
â For which languages?
â Is the up-to-dateness of the training data relevant?
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 34
35. Data and Web Science Group
Conclusion: Semantic Annotations
â Comprehensive coverage of specific domains
â many entities, many different websites
â Narrow set of attributes used
â often requires additional parsing and specific cleansing
â Underexploited?
â Information extraction: 1 paper, positive results
â Entity matching: 1 paper, positive results
â Taxonomy clustering: 0 experimental papers
â Sentiment analysis: 0 experimental papers
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 35
36. Data and Web Science Group
2. Web Tables
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 36
https://research.google.com/tables
37. Data and Web Science Group
Attribute/Value Table together with
schema.org Annotations
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 37
Qui, et al.: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. VLDB 2015.
Petrovski, et al: The WDC Gold Standards for Product Feature Extraction and Product Matching. ECWeb 2016.
s:name
Attribute/Value
HTML Table
s:breadcrumb
38. Data and Web Science Group
Web Tables
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 38
Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
Our focus
In corpus of 14B raw tables, 154M are relational tables (1.1%).
Cafarella et al. 2008
39. Data and Web Science Group
Web Data Commons â Web Tables Corpus
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 39
Lehmberg, Ritze, Meusel, Bizer: A Large Public Corpus of Web Tables containing Time and Context Metadata.
WWW2016 Companion.
â Large public corpus of Web tables
â extracted from Common Crawl 2015 (1.78 billion pages)
â 90 million relational tables (0.9%)
â 139 million attribute/value tables (1,3%)
â selected out of 10.2 B raw tables
â 1TB zipped
â http://webdatacommons.org/webtables/
40. Data and Web Science Group
Topical Profiling:
Websites contributing many tables
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 40
Website # Tables Topic
apple.com 50,910 Music
baseball-reference.com 25,647 Sports
latestf1news.com 17,726 Sports
nascar.com 17,465 Sports
amazon.com 16,551 Products
wikipedia.org 13,993 Various
inkjetsuperstore.com 12,282 Products
flightmemory.com 8,044 Flights
windshieldguy.com 7,305 Products
citytowninfo.com 6,293 Cities
blogspot.com 4,762 Various
7digital.com 4,462 Music
41. Data and Web Science Group
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 41
Usage Scenario:
Knowledge Base Completion
A1 A2 A3 A4 ⊠AN ? ? ? âŠ
E1 ? ? ? ? ? ? ?
E2 ? ? ? ? ? ?
E3 ? ? ? ?
⊠? ? ? ? ?
Em ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ? ?
⊠? ? ? ? ? ? ? ? ? ?
Slot Filling:
add facts for existing instances
and existing properties
Schema Expansion:
add and fill new attributes
Entity Expansion:
âą add new instances
âą and their descriptions
(values for existing
attributes)
Class in KB
42. Data and Web Science Group
Use Case 1: Slot Filling
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 42
A1 A2 A3 A4 ⊠AN
E1 ? ? ?
E2 ? ?
E3
⊠?
Em ? ? ? ?
Class in KB
43. Data and Web Science Group
Slot Filling: Two Steps
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 43
1. Matching 2. Data Fusion
IATA Code
44. Data and Web Science Group
Holistic Entity- and Schema Matching
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 44
â T2K+ matches
â millions of web tables to the
â DBpedia knowledge base
â Exploits
â wide range of features
â synergies between matching tasks
â table corpus for indirect matching
â patterns from KB for post filtering
â Results
Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia â A Feature Utility Study. EDBT 2017.
Task Precision Recall F1
Instance .90 .76 .82
Property .77 .65 .70
Class .94 .94 .94
45. Data and Web Science Group
Matching Small Web Tables:
Table Stitching
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 45
â Tables with the same schema appear
many times on the same website
â as large original tables are split for
human consumption
â 97% of all websites use between 1 and
10 different schemata [VLDBâ17]
â Stitching merges all web tables on one
website having the same schema
đ =
đ đâđŻ
đđ
Oliver Lehmberg, Christian Bizer: Stitching web tables for improving matching quality. VLDB Endowment, 2017.
46. Data and Web Science Group
Effect of Table Stitching
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 46
â Strongly reduces the amount of
tables
â 19.8 web tables on average per
union table
â 46.9 web tables on average per
stitched union table
â Attribute matching F1:
â original web tables 0.35
â union tables 0.53
â stitched union tables 0.67
â expected from T2K+ 0.70
5,176,160
261,215 104,221
-
1,000,000
2,000,000
3,000,000
4,000,000
5,000,000
6,000,000
Count
web tables union tables stitched union tables
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Precision Recall F1-Measure
Web Tables Union Tables Stitched Union Tables
47. Data and Web Science Group
Table to DBpedia Matching Results
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 47
â about 1 million out of 30 million tables match DBpedia (~3%)
â 13,726,582 row-to-instance correspondences (~0.7%)
â 562,445 attribute-to-property correspondences
â 301,450 tables with attribute correspondences (ca. 32%)
ïš 8 million triples
â Content variety
â 274 different classes, 721 unique properties
â 717,174 unique instances (15.6% of DBpedia)
Ritze, Lehmberg, Oulabi and Bizer: Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge
Bases. WWW2016.
48. Data and Web Science Group
â Fusion Results
â Group Sizes
â Large groups (Head): values likely
already exist in the KB
â Small groups (Tail): often new
value but difficult to fuse correctly
Data Fusion Results
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 48
Fusion Strategy Precision Recall F1
Voting .369 .823 .509
Knowledge-based Trust .639 .785 .705
PageRank .365 .814 .504
49. Data and Web Science Group
Fused Triples from Web Tables
by DBpedia Class
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 49
DBpedia Class Overlap with
existing values
New values
Person 117 522 15 050
Athlete 84 562 9 067
Artist 2 019 427
Office Holder 3 465 510
Politician 3 124 1 167
Organisation 20 522 7 903
Company 6 376 2 547
Sports Team 790 132
Educational Inst. 8 844 3 132
Broadcaster 4 004 1 924
Work 189 131 27 867
Musical Work 118 511 8 427
Film 29 903 12 143
Software 17 554 2 766
Place 32 855 9 871
Populated Place 16 604 6 704
Country 2 084 433
Architectural Struct. 10 441 1 775
Species 9 016 1 429
50. Data and Web Science Group
What is going wrong?
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 50
https://www.bls.gov/oes/2017/may/oes_ca.htm
51. Data and Web Science Group
â Matching and fusion methods expect binary relations
â But many relations in Web tables are not binary!
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 51
What is going wrong?
State Avg. Sal.
California $128,805
New York
City
$113,689
Alabama $39,180
State Avg. Sal.
California $63,805
New York City $53,472
Alabama $39,180
52. Data and Web Science Group
Synthesizing N-ary Relations from Web
Tables
â Exploiting page context: Title and URL
â Stitching tables from the same site
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 52
Oliver Lehmberg, Christian Bizer: Synthesizing N-ary Relations from Web Tables. WIMS2019.
53. Data and Web Science Group
Amount of N-ary Relations
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 53
Oliver Lehmberg, Christian Bizer: Profiling the Semantics of N-ary Web Table Data. SBD Workshop @ SIGMOD 2019.
Results for the WDC2015 corpus:
â 37.50% of all relations are non-binary
â 50.47% of these require context attributes
Time dimension is
often part of keys
but also many
other attributes
54. Data and Web Science Group
Use Case 2: Adding Long-Tail Entities
to Knowledge Base
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 54
A1 A2 A3 A4 ⊠AN ? ? ? âŠ
E1 ? ? ? ? ? ? ?
E2 ? ? ? ? ? ?
E3 ? ? ? ?
⊠? ? ? ? ?
Em ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? ? ?
⊠? ? ? ? ? ? ? ? ? ?
Entity Expansion:
1. Find new entities
2. Compile descriptions of
these new entities
Class in KB
55. Data and Web Science Group
Long-Tail Entity Expansion Pipeline
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 55
Row
Clustering
Entity
Creation
New
Detection
Knowledge
base
New entities added to
knowledge base
Schema
Matching
Web tables
Output of first iteration used to refine the
schema mapping in a second iteration
Yaser Oulabi, Christian Bizer: Extending cross-domain knowledge bases with long tail entities using web table
data. EDBT 2019.
56. Data and Web Science Group
Entity Expansion Results
â 43 % of errors caused by wrong attribute-to-property matching
â 35 % of errors are a result of outdated or wrong data in tables
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 56
Class New Entities New Facts New Entities
Accuracy
New Facts
Accuracy
Football Player 13,983 (+67%) 43,800 (+32%) 0.60 0.95
Song 186,943 (+356%) 393,711 (+125%) 0.70 0.85
- 50,000 100,000 150,000 200,000 250,000
Football Player
Song
Existing New
57. Data and Web Science Group
Conclusion: Web Tables
â Extremely wide set of schemata with very specific attributes
â many n-ary relations with complex keys / time dimension
â Table context is often required to understand the data
â page title, URL, headings, text?
â Everything works OK, but not good enough yet
â fact precision slot filling: 0.64
â new entities accuracy: 0.60-0.70
â For researchers: Rewarding but challenging domain
â For applications: Human in the loop still required for
verification
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 57
58. Data and Web Science Group
Conclusion
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 58
Dimension Semantic Annotations Web Tables
Classes Few classes, deep coverage
(Products, Local Business, âŠ)
Wide range of topics
(Sports, Statistics, âŠ)
Attributes Small set of rather generic
attributes
Wide range of very
specific attributes
Challenges Parsing of below 1NF
attribute values
Understanding the
semantics of attributes
Opportunities Resource creation
Large-scale training
Resource creation
Underexploited Yes Yes, once challenges
are handled