Data and Web Science Group
Schema.org Annotations and Web Tables
Underexploited Semantic Nuggets on the Web?
20-23 May 2019 in Leipzig, Germany
Prof. Dr. Christian Bizer
Cool Resources Are Often Built by
Exploiting Some Structure Somewhere
Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019 2
Web Tables
Schema.org Annotations
<div itemscope itemtype="http://schema.org/Hotel">
<span itemprop="name">Vienna Marriott Hotel</span>
<span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Parkring 12a</span>
<span itemprop="addressLocality">Vienna</span>
</span>
</div>
<div itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name">Signature Instant Dome 7</h1>
Item # <span itemprop="productID">200734246807</span>
<img itemprop="image" src="../20073424680.jpg">
<span itemprop="review" itemscope itemtype="http://schema.org/Review">
<span itemprop="ratingValue">5</span>
<span itemprop="reviewBody">Great waterproof tent!</span>
</span>
</div>
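To use such annotations as data, the itemprop values first have to be pulled out of the HTML. The following is a minimal sketch using only the Python standard library. The class name ItempropCollector is hypothetical, nested items are recorded only by their type URL, and the logic is far simpler than what a full Microdata extractor (such as the one used by Web Data Commons) has to handle.

```python
from html.parser import HTMLParser

# Elements without a closing tag; their value lives in an attribute.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class ItempropCollector(HTMLParser):
    """Collects (itemprop, value) pairs from well-formed Microdata markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []    # collected (property, value) pairs
        self._props = []   # per open element: property name or None
        self._texts = []   # per open element: list of text fragments

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if tag in VOID_TAGS:
            if prop:
                value = attrs.get("src") or attrs.get("content") or attrs.get("href") or ""
                self.pairs.append((prop, value))
            return
        if prop and "itemtype" in attrs:
            # Nested item: record its type instead of its text content.
            self.pairs.append((prop, attrs["itemtype"]))
            prop = None
        self._props.append(prop)
        self._texts.append([])

    def handle_data(self, data):
        if self._texts:
            self._texts[-1].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._props:
            return
        prop = self._props.pop()
        text = "".join(self._texts.pop()).strip()
        if prop:
            self.pairs.append((prop, text))


hotel_html = """
<div itemscope itemtype="http://schema.org/Hotel">
<span itemprop="name">Vienna Marriott Hotel</span>
<span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">Parkring 12a</span>
<span itemprop="addressLocality">Vienna</span>
</span>
</div>
"""

collector = ItempropCollector()
collector.feed(hotel_html)
for prop, value in collector.pairs:
    print(prop, "=", value)
```

The sketch assumes well-formed markup; real pages need a tolerant parser and proper handling of item scopes.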
Hypothesis
– Web tables and Schema.org annotations are useful
training data for tasks such as
– Named Entity Recognition
– Information Extraction
– Entity Matching
– Taxonomy Induction
– Sentiment Analysis
– Knowledge Base Completion
– but they are hardly used by public research yet
Outline
1. Schema.org Annotations
– Adoption on the Web
– Current applications and further potential
2. Web Tables
– Usage on the Web
– Application potential
3. Conclusions
1. Schema.org Annotations
– Search engines have asked site owners since 2011 to
annotate data for enriching search results
– 675 types, e.g. Event, Place, Local Business, Product, Review, Person
– Encoding: Microdata, RDFa, JSON-LD
Usage of Schema.org Data @ Google
– Data snippets within search results
– Google Maps location data
– Data snippets within info boxes
https://developers.google.com/search/docs/guides/search-gallery
– Product offers
– Job offers
Web Data Commons Project
– extracts all Microformat, Microdata,
RDFa, JSON-LD data from the Common Crawl
– analyzes and provides the extracted data for download
– statistics about some extraction runs
– 2018 CC Corpus: 2.5 billion HTML pages → 31.5 billion RDF triples
– 2017 CC Corpus: 3.1 billion HTML pages → 38.2 billion RDF triples
– 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
– 2010 CC Corpus: 2.8 billion HTML pages → 5.1 billion RDF triples
– uses 100 machines on Amazon EC2
– approx. 2000 machine hours → 250 Euro
http://webdatacommons.org/structureddata/
Overall Adoption 2018
http://webdatacommons.org/structureddata/2018-12/
944 million HTML pages out of the 2.5 billion pages
provide semantic annotations (37.1%).
9.6 million pay-level-domains (PLDs) out of the
32.8 million pay-level-domains covered by the crawl
provide semantic annotations (29.3%).
Frequently used Schema.org Classes
Top classes by number of websites (PLDs)
Class | Microdata | JSON-LD
schema:WebPage 1,124,583 121,393
schema:Product 812,205 40,169
schema:Offer 676,899 57,756
schema:BreadcrumbList 621,344 205,971
schema:Article 612,361 57,082
schema:Organization 510,069 1,349,775
schema:PostalAddress 502,615 176,500
schema:ImageObject 360,875 111,946
schema:Blog 337,843 12,174
schema:Person 324,349 335,784
schema:LocalBusiness 294,390 249,017
schema:AggregateRating 258,078 23,105
schema:Review 124,022 6,622
schema:Place 92,127 66,396
schema:Event 88,130 63,605
http://webdatacommons.org/structureddata/2018-12/
Attributes used to Describe Products
Top attributes by number of PLDs (Microdata)
Attribute | # PLDs | % of PLDs
schema:Product/name 754,812 92 %
schema:Product/offers 645,994 79 %
schema:Offer/price 639,598 78 %
schema:Offer/priceCurrency 606,990 74 %
schema:Product/image 573,614 70 %
schema:Product/description 520,307 64 %
schema:Offer/availability 477,170 58 %
schema:Product/url 364,889 44 %
schema:Product/sku 160,343 19 %
schema:Product/aggregateRating 141,194 17 %
schema:Product/brand 113,209 13 %
schema:Product/category 62,170 7 %
schema:Product/productID 47,088 5 %
http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
Use Case 1: Gather Large Amounts of
Hyponym Relationships
– plenty of entity names are annotated
– Hearst Pattern of the 21st century (if class is covered)
http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html
Class # Entities Names
Product 368,386,801
Organization 176,365,220
Local Business 44,100,625
Place 31,748,612
Event 26,035,295
Job Posting 5,078,941
Movie 5,233,520
Hotel 2,537,089
College or University 2,392,797
Recipe 2,036,417
Music Album 1,594,941
Restaurant 1,480,443
Airport 1,145,323
Coverage: Top 15 Hotel Websites
Top 15 Travel Websites | schema:Hotel used
Booking.com | yes
TripAdvisor | yes
Expedia | yes
Agoda | yes
Hotels.com | yes
Kayak | yes
Priceline | yes
Travelocity | yes
Orbitz | yes
ChoiceHotels | yes
HolidayCheck | yes
ChoiceHotels | yes
InterContinental Hotels Group | yes
Marriott International | yes
Global Hyatt Corp. | no
Adoption: 93 %
Kärle, Fensel, Toma, Fensel: Why Are There More Hotels in Tyrol than in Austria? Analyzing Schema.org Usage in
the Hotel Domain. ICTT 2016.
Use Case 2: Distant Supervision for
Information Extraction
– Goal: Learn how to extract data from websites that do
not provide schema.org annotations using annotations as
training data.
– Example: Google Small Events Paper @ SIGIR 2015
– extract data about small venue concerts, theatre performances,
garage sales, etc.
– they use 217,000 schema.org events from ClueWeb2012
as distant supervision, e.g.
Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
<div itemscope itemtype="http://schema.org/MusicEvent">
<h1 itemprop="name">School Cool Concert</h1>
<p itemprop="startDate">25.09.2018 15:00</p>
<p itemprop="location" itemscope itemtype="http://schema.org/Place">School Gym Puero High</p>
</div>
Google Small Events Paper @ SIGIR 2015
– Approach
– train an information extraction method that
combines structural and text features
– extracted attributes: What? When? Where?
– apply method to pages without semantic
annotations
– Result
– extract 200,000 additional small-scale
events from ClueWeb2012
– double recall at 0.85 precision
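The core idea of distant supervision can be sketched in a few lines: annotated values are projected onto the page text to obtain token-level labels, which can then train an extractor for pages without annotations. This is an illustrative sketch only and not the method of Foley et al.; the function name distant_labels, the BIO tagging scheme and the exact-match projection are assumptions made for this example.

```python
def distant_labels(tokens, annotations):
    """Tag tokens with BIO labels wherever an annotated value occurs verbatim.

    tokens: list of whitespace-separated tokens of the page text
    annotations: dict mapping a schema.org property name to its annotated value
    """
    labels = ["O"] * len(tokens)
    for prop, value in annotations.items():
        value_tokens = value.split()
        n = len(value_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            if tokens[start:start + n] == value_tokens:
                labels[start] = "B-" + prop
                for i in range(start + 1, start + n):
                    labels[i] = "I-" + prop
    return labels


# Annotated values taken from the MusicEvent example; the sentence is made up.
annotations = {"name": "School Cool Concert", "location": "School Gym Puero High"}
tokens = "Come to the School Cool Concert at School Gym Puero High".split()
labels = distant_labels(tokens, annotations)
for token, label in zip(tokens, labels):
    print(token, label)
```

Exact matching keeps the sketch short; a real pipeline would need normalization and would add the structural features mentioned above.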
Underexploited?
– Using schema.org data as
distant supervision will likely
also work for other information
extraction tasks.
– Research questions
– How to best exploit annotations
as distant supervision?
– How well do the annotations
perform compared to other
sources of supervision?
Use Case 3: Training Data for
Entity Matching / Link Discovery
– Goal: Find e-shops that offer a specific product in order to
compare product prices
– Observation: Some e-shops annotate product IDs, many
e-shops do not
Property | # PLDs | % of PLDs
schema:Product/name 754,812 92 %
schema:Product/description 520,307 64 %
schema:Product/sku 160,343 19 %
schema:Product/productID 47,088 5 %
schema:Product/mpn 12,882 1.6%
schema:Product/gtin13 7,994 1%
Learn How to Match Products using
Schema.org Data as Supervision
[Figure: clusters of product offers having the same product ID are used to learn a matcher; the matcher then decides for unseen offers without product IDs: same product?]
Data Cleansing
for Building the Clusters
Cleansing pipeline:
1. Filtering of product offers with annotated product identifiers
2. Removal of listing pages
3. Filtering by identifier value length
4. Cluster creation based on identifier value co-occurrence
5. Split wrong clusters due to category IDs
Sizes along the pipeline: 121M offers, 58M offers, 26M offers, 16.4M clusters, 16.6M clusters
All languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
English offers: 16M offers, 43K distinct websites, 10M clusters (products)
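Cluster creation based on identifier value co-occurrence can be read as finding connected components: two offers belong to the same cluster if they share any identifier value, directly or through other offers. The following is a minimal sketch of that reading using a union-find structure; the function name cluster_offers, the offer keys and the second identifier value are hypothetical, and the filtering and cluster-splitting steps of the pipeline are not shown.

```python
from collections import defaultdict


def cluster_offers(offers):
    """Group offers so that offers sharing any identifier value end up together.

    offers: dict mapping an offer key to a set of annotated identifier values
    returns: sorted list of clusters, each a sorted list of offer keys
    """
    parent = {offer_id: offer_id for offer_id in offers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # identifier value -> first offer carrying it
    for offer_id, identifiers in offers.items():
        for value in identifiers:
            if value in first_seen:
                union(first_seen[value], offer_id)
            else:
                first_seen[value] = offer_id

    clusters = defaultdict(set)
    for offer_id in offers:
        clusters[find(offer_id)].add(offer_id)
    return sorted(sorted(cluster) for cluster in clusters.values())


# Toy data: shop names and most identifier values are made up.
offers = {
    "shopA/1": {"200734246807"},
    "shopB/7": {"200734246807", "SID7-TENT"},
    "shopC/3": {"SID7-TENT"},
    "shopD/9": {"4960999581354"},
}
clusters = cluster_offers(offers)
print(clusters)
```

Transitive merging is also why category IDs annotated as product IDs produce wrong clusters that have to be split afterwards.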
Cluster Size Distribution by Category
A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching.
ECNLP 2019 Workshop @ WWW2019.
Training Subsets
Category | #Positive Pairs | #Negative Pairs | s:name | s:description | specification table
Computers_small 20,444 21,676 100% 82% 55%
Cameras_small 7,539 9,093 100% 61% 4%
Watches_small 5,449 8,819 100% 48% 4%
Shoes_small 3,476 5,924 99% 36% 1%
Computers_large 120,568 139,020 100% 87% 60%
Cameras_large 16,390 84,788 100% 65% 5%
Watches_large 7,729 65,516 100% 55% 3%
Shoes_large 3,476 36,224 100% 35% 1%
Gold standard: 550 offer pairs per category
Results: Traditional Learning Methods
Category Classifier Precision Recall F1
Computers_small LinearSVC 0.75 0.94 0.84
Cameras_small Random Forest 0.75 0.87 0.81
Watches_small LinearSVC 0.74 0.91 0.81
Shoes_small Log. Regression 0.72 0.95 0.82
Computers_large Random Forest 0.92 0.95 0.94
Cameras_large Dec. Tree 0.86 0.85 0.86
Watches_large Random Forest 0.90 0.87 0.88
Shoes_large Log. Regression 0.94 0.86 0.90
Feature creation: word co-occurrence
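One plausible reading of word co-occurrence features is a binary vector per offer pair with one position per vocabulary word, set to 1 if the word occurs in both offers. The sketch below follows that reading; the function name, the whitespace tokenization and the toy vocabulary are assumptions, and the exact feature construction used for the reported results may differ.

```python
def cooccurrence_features(offer_a, offer_b, vocabulary):
    """Binary feature vector: 1 if a vocabulary word occurs in both offer texts."""
    words_a = set(offer_a.lower().split())
    words_b = set(offer_b.lower().split())
    shared = words_a & words_b
    return [1 if word in shared else 0 for word in vocabulary]


# Toy example with made-up offer titles.
vocabulary = ["canon", "eos", "80d", "nikon", "body"]
features = cooccurrence_features(
    "Canon EOS 80D body",
    "canon eos 80d DSLR camera",
    vocabulary,
)
print(features)
```

Vectors of this kind can be fed directly into the classifiers listed in the table.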
DeepMatcher
- attribute embedding granularity: word-based, character-based
- attribute embedding training: pre-trained (word2vec, GloVe, fastText), learned
- attribute summarization: SIF, RNN, Attention, Hybrid
- attribute comparison: fixed distance, learnable distance
- classification: fully connected neural net
Mudgal, Li, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
Results: DeepMatcher
– Human-level matching performance just using web data for training.
– The 6.6% errors in the training data are averaged out in the learning.
Category Model Precision Recall F1
Computers_small Combined SIF 0.91 0.98 0.95
Cameras_small Combined RNN 0.95 0.93 0.94
Watches_small Combined RNN 0.93 0.98 0.95
Shoes_small Combined SIF 0.87 0.99 0.92
Computers_large RNN 0.96 0.98 0.97
Cameras_large RNN 0.94 0.98 0.96
Watches_large RNN 0.96 0.99 0.97
Shoes_large RNN 0.97 0.99 0.98
Underexploited?
– If we can get human-level matching performance using just
web data for training, why do product comparison portals
still pay people to manually label offers for training?
– Research questions
– Does this work for all product categories?
– How well does it work for new products?
Use Case 3: Taxonomy Matching and
Taxonomy Clustering
– E-Shops annotate their product categorization
– The marketing ideas behind these categorizations differ widely
Relevant Properties # PLDs %
schema:Offer/category 62,170 7 % of all shops
schema:WebPage/BreadcrumbList 621,344 11% of all sites
Apparel > Summer > Mens > Jerseys
Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jersey > over $60 s
Home > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables
Shop > Tables > Dining Tables > Aland Wood Tables
Cool Taxonomy Clustering Problem
– How to cluster 62,000 taxonomies in order to discover the
most frequent categorization ideas within a domain?
– How do shops categorize action cameras?
– What subcategories are frequently used for action cameras?
Meusel, Primpeli, Meilicke, Paulheim, Bizer: Exploiting Microdata Annotations to Consistently Categorize Product
Offers at Web Scale. EC-Web 2015.
Learn How to Match Product Taxonomies
using Schema.org Data as Supervision
[Figure: taxonomies that share some products (Category A with P1, P2; Category B with P1, P2, P3) are used to learn a matcher; the matcher then decides for unseen categories containing some products: similar category?]
Underexploited?
– Cool training data for the rather unexplored research
challenge of large-scale taxonomy clustering
– Research questions
– Which similarity metrics work best?
– How to block 62,000 taxonomies in order to deal with
the runtime complexity?
– At which depth to start clustering in order to answer
interesting business questions?
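One candidate similarity metric for categories from different shops is the Jaccard overlap of the products they contain. The sketch below is only one possible baseline for the research question above, not a result from the talk; the function name jaccard and the product keys are hypothetical.

```python
def jaccard(products_a, products_b):
    """Jaccard similarity of two categories, each given as a collection of products."""
    a, b = set(products_a), set(products_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Two categories from different shops that share the products P1 and P2.
category_a = {"P1", "P2"}
category_b = {"P1", "P2", "P3"}
similarity = jaccard(category_a, category_b)
print(round(similarity, 3))
```

Such a measure needs products that are already matched across shops, for example through the identifier-based clusters of the entity matching use case.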
Use Case 4: Training Data for Sentiment
Analysis
– 130,000 websites annotate reviews
– 2018 WDC corpus contains 13.5 million reviews which
annotate schema:ratingValue and schema:reviewBody
<div itemscope itemtype="http://schema.org/Product">
<h1 itemprop="name">Signature Instant Dome 7</h1>
Item # <span itemprop="productID">200734246807</span>
<img itemprop="image" src="../20073424680.jpg">
<span itemprop="review" itemscope itemtype="http://schema.org/Review">
<span itemprop="ratingValue">5</span>
<span itemprop="reviewBody">Great waterproof tent which
worked perfectly for me.</span>
</span>
</div>
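Turning such a review into a training example for sentiment analysis only requires mapping the rating to a label. The sketch below assumes a 1 to 5 rating scale and hypothetical thresholds (4 and above positive, 2 and below negative); real annotations use varying scales, so the scale would have to be read from the markup or normalized first.

```python
def to_training_example(review):
    """Map an annotated review to a (text, sentiment label) pair.

    review: dict with the schema.org properties 'ratingValue' and 'reviewBody'
    Assumes a 1-5 rating scale; the thresholds are illustrative.
    """
    rating = float(review["ratingValue"])
    if rating >= 4:
        label = "positive"
    elif rating <= 2:
        label = "negative"
    else:
        label = "neutral"
    return review["reviewBody"], label


review = {
    "ratingValue": "5",
    "reviewBody": "Great waterproof tent which worked perfectly for me.",
}
example = to_training_example(review)
print(example)
```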
Reviews by Type of Reviewed Item
C. Bizer, A. Primpeli, R. Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
Underexploited?
– To my knowledge, schema.org reviews from many
websites are not yet used as training data for sentiment
analysis.
– Research questions
– For which cases does the training data help to improve
sentiment analysis compared to other training corpora?
– For which types of entities are there improvements?
– For which languages?
– Is the up-to-dateness of the training data relevant?
Conclusion: Semantic Annotations
– Comprehensive coverage of specific domains
– many entities, many different websites
– Narrow set of attributes used
– often requires additional parsing and specific cleansing
– Underexploited?
– Information extraction: 1 paper, positive results
– Entity matching: 1 paper, positive results
– Taxonomy clustering: 0 experimental papers
– Sentiment analysis: 0 experimental papers
2. Web Tables
https://research.google.com/tables
Attribute/Value Table together with
schema.org Annotations
Qiu, et al.: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. VLDB 2015.
Petrovski, et al.: The WDC Gold Standards for Product Feature Extraction and Product Matching. EC-Web 2016.
s:name
Attribute/Value
HTML Table
s:breadcrumb
Web Tables
Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
Our focus
In corpus of 14B raw tables, 154M are relational tables (1.1%).
Cafarella et al. 2008
Web Data Commons – Web Tables Corpus
Lehmberg, Ritze, Meusel, Bizer: A Large Public Corpus of Web Tables containing Time and Context Metadata.
WWW2016 Companion.
– Large public corpus of Web tables
– extracted from Common Crawl 2015 (1.78 billion pages)
– 90 million relational tables (0.9%)
– 139 million attribute/value tables (1.3%)
– selected out of 10.2 B raw tables
– 1TB zipped
– http://webdatacommons.org/webtables/
Topical Profiling:
Websites contributing many tables
Website # Tables Topic
apple.com 50,910 Music
baseball-reference.com 25,647 Sports
latestf1news.com 17,726 Sports
nascar.com 17,465 Sports
amazon.com 16,551 Products
wikipedia.org 13,993 Various
inkjetsuperstore.com 12,282 Products
flightmemory.com 8,044 Flights
windshieldguy.com 7,305 Products
citytowninfo.com 6,293 Cities
blogspot.com 4,762 Various
7digital.com 4,462 Music
Usage Scenario:
Knowledge Base Completion
[Figure: a class in the KB shown as a matrix of entities E1 ... Em and attributes A1 ... AN with missing values]
– Slot Filling: add facts for existing instances and existing properties
– Schema Expansion: add and fill new attributes
– Entity Expansion: add new instances and their descriptions (values for existing attributes)
Use Case 1: Slot Filling
[Figure: missing values of existing entities E1 ... Em and existing attributes A1 ... AN of a class in the KB]
Slot Filling: Two Steps
1. Matching
2. Data Fusion
Holistic Entity- and Schema Matching
– T2K+ matches millions of web tables to the DBpedia knowledge base
– Exploits
– wide range of features
– synergies between matching tasks
– table corpus for indirect matching
– patterns from KB for post filtering
– Results
Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia – A Feature Utility Study. EDBT 2017.
Task Precision Recall F1
Instance .90 .76 .82
Property .77 .65 .70
Class .94 .94 .94
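The F1 column is the harmonic mean of precision and recall. A small sketch that recomputes it from the reported precision and recall values; the function name f1 is chosen for this example, and small deviations are expected because the table values are rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as given in the table above
reported = {
    "Instance": (0.90, 0.76, 0.82),
    "Property": (0.77, 0.65, 0.70),
    "Class": (0.94, 0.94, 0.94),
}

for task, (p, r, f) in reported.items():
    print(task, "recomputed:", round(f1(p, r), 2), "reported:", f)
```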
Matching Small Web Tables:
Table Stitching
– Tables with the same schema appear
many times on the same website
– as large original tables are split for
human consumption
– 97% of all websites use between 1 and
10 different schemata [VLDB‘17]
– Stitching merges all web tables on one
website having the same schema
$S = \bigcup_{T_i \in \mathcal{T}} T_i$
Oliver Lehmberg, Christian Bizer: Stitching web tables for improving matching quality. VLDB Endowment, 2017.
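The stitching step described above can be sketched as a group-by over website and schema followed by a union of the rows. The sketch treats two tables as having the same schema if their header rows are identical, which is an assumption made for brevity; the function name stitch, the dictionary layout and the toy rows are hypothetical, and the published approach involves more than this header comparison.

```python
from collections import defaultdict


def stitch(tables):
    """Merge all tables from the same website that have the same header.

    tables: list of dicts with keys 'site', 'header' (list of column names)
            and 'rows' (list of rows)
    returns: dict mapping (site, header) to the concatenated rows
    """
    stitched = defaultdict(list)
    for table in tables:
        key = (table["site"], tuple(table["header"]))
        stitched[key].extend(table["rows"])
    return dict(stitched)


# Toy data: one large table split over two pages of the same site,
# plus a table with the same header from a different site.
tables = [
    {"site": "example.org", "header": ["Player", "Team"], "rows": [["A", "X"]]},
    {"site": "example.org", "header": ["Player", "Team"], "rows": [["B", "Y"]]},
    {"site": "other.example", "header": ["Player", "Team"], "rows": [["C", "Z"]]},
]
stitched = stitch(tables)
for (site, header), rows in stitched.items():
    print(site, header, len(rows))
```

Larger stitched tables give the matcher more rows per attribute, which fits the improvement in attribute matching reported for stitched tables.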
Effect of Table Stitching
– Strongly reduces the amount of
tables
– 19.8 web tables on average per
union table
– 46.9 web tables on average per
stitched union table
– Attribute matching F1:
– original web tables 0.35
– union tables 0.53
– stitched union tables 0.67
– expected from T2K+ 0.70
[Figure: number of tables: 5,176,160 web tables, 261,215 union tables, 104,221 stitched union tables]
[Figure: precision, recall and F1-measure for web tables, union tables and stitched union tables]
Table to DBpedia Matching Results
– about 1 million out of 30 million tables match DBpedia (~3%)
– 13,726,582 row-to-instance correspondences (~0.7%)
– 562,445 attribute-to-property correspondences
– 301,450 tables with attribute correspondences (ca. 32%)
→ 8 million triples
– Content variety
– 274 different classes, 721 unique properties
– 717,174 unique instances (15.6% of DBpedia)
Ritze, Lehmberg, Oulabi and Bizer: Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge
Bases. WWW2016.
Data Fusion Results
– Group Sizes
– Large groups (Head): values likely already exist in the KB
– Small groups (Tail): often new values but difficult to fuse correctly
– Fusion Results
Fusion Strategy Precision Recall F1
Voting .369 .823 .509
Knowledge-based Trust .639 .785 .705
PageRank .365 .814 .504
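The simplest of the three strategies, voting, can be sketched as follows: all candidate values for the same entity and property form a group, and the most frequent value wins. The function name fuse_by_voting, the triple layout and the toy values are assumptions for illustration; Knowledge-based Trust and PageRank weight the sources instead of counting them equally and are not shown.

```python
from collections import Counter, defaultdict


def fuse_by_voting(candidates):
    """Pick the most frequent value per (entity, property) group.

    candidates: iterable of (entity, property, value) triples from matched tables
    returns: dict mapping (entity, property) to the winning value
    """
    groups = defaultdict(Counter)
    for entity, prop, value in candidates:
        groups[(entity, prop)][value] += 1
    return {slot: counts.most_common(1)[0][0] for slot, counts in groups.items()}


# Toy candidates from three hypothetical tables.
candidates = [
    ("Frankfurt Airport", "iataCode", "FRA"),
    ("Frankfurt Airport", "iataCode", "FRA"),
    ("Frankfurt Airport", "iataCode", "FRF"),
    ("Leipzig/Halle Airport", "iataCode", "LEJ"),
]
fused = fuse_by_voting(candidates)
print(fused)
```

Groups with a single candidate, as in the second entity, illustrate why small groups in the tail are difficult to fuse correctly: there is nothing to vote against.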
Fused Triples from Web Tables
by DBpedia Class
DBpedia Class | Overlap with existing values | New values
Person 117 522 15 050
Athlete 84 562 9 067
Artist 2 019 427
Office Holder 3 465 510
Politician 3 124 1 167
Organisation 20 522 7 903
Company 6 376 2 547
Sports Team 790 132
Educational Inst. 8 844 3 132
Broadcaster 4 004 1 924
Work 189 131 27 867
Musical Work 118 511 8 427
Film 29 903 12 143
Software 17 554 2 766
Place 32 855 9 871
Populated Place 16 604 6 704
Country 2 084 433
Architectural Struct. 10 441 1 775
Species 9 016 1 429
What is going wrong?
https://www.bls.gov/oes/2017/may/oes_ca.htm
– Matching and fusion methods expect binary relations
– But many relations in Web tables are not binary!
State | Avg. Sal.
California | $128,805
New York City | $113,689
Alabama | $39,180
State | Avg. Sal.
California | $63,805
New York City | $53,472
Alabama | $39,180
Synthesizing N-ary Relations from Web
Tables
– Exploiting page context: Title and URL
– Stitching tables from the same site
Oliver Lehmberg, Christian Bizer: Synthesizing N-ary Relations from Web Tables. WIMS2019.
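Why page context matters can be shown with a simple key check. When rows from two salary tables of the same site are put together, the State column alone no longer identifies a row; adding a context attribute taken from the page (for example from its title) restores a valid key. The function name is_key, the column name context and the labels "occupation A" and "occupation B" are hypothetical; the salary values are the ones from the two example tables.

```python
def is_key(rows, key_columns):
    """Return True if the given columns uniquely identify every row."""
    seen = set()
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            return False
        seen.add(key)
    return True


# Rows from two tables of the same site; 'context' stands for a value
# derived from the page title or URL (hypothetical labels).
rows = [
    {"State": "California", "Avg. Sal.": "$128,805", "context": "occupation A"},
    {"State": "Alabama", "Avg. Sal.": "$39,180", "context": "occupation A"},
    {"State": "California", "Avg. Sal.": "$63,805", "context": "occupation B"},
    {"State": "Alabama", "Avg. Sal.": "$39,180", "context": "occupation B"},
]

binary_ok = is_key(rows, ["State"])
nary_ok = is_key(rows, ["State", "context"])
print(binary_ok, nary_ok)
```

Treating the relation as binary (State, Avg. Sal.) would produce conflicting facts for California, which is one way to read the matching and fusion problems described above.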
Amount of N-ary Relations
Oliver Lehmberg, Christian Bizer: Profiling the Semantics of N-ary Web Table Data. SBD Workshop @ SIGMOD 2019.
Results for the WDC2015 corpus:
– 37.50% of all relations are non-binary
– 50.47% of these require context attributes
– Time dimension is often part of keys, but also many other attributes
Use Case 2: Adding Long-Tail Entities
to Knowledge Base
Entity Expansion:
1. Find new entities
2. Compile descriptions of these new entities
Long-Tail Entity Expansion Pipeline
[Figure: pipeline from web tables and the knowledge base via Schema Matching, Row Clustering, Entity Creation and New Detection to new entities added to the knowledge base]
Output of first iteration used to refine the schema mapping in a second iteration
Yaser Oulabi, Christian Bizer: Extending cross-domain knowledge bases with long tail entities using web table
data. EDBT 2019.
Entity Expansion Results
– 43 % of errors caused by wrong attribute-to-property matching
– 35 % of errors are a result of outdated or wrong data in tables
Class | New Entities | New Facts | New Entities Accuracy | New Facts Accuracy
Football Player 13,983 (+67%) 43,800 (+32%) 0.60 0.95
Song 186,943 (+356%) 393,711 (+125%) 0.70 0.85
[Figure: number of existing vs. new entities for Football Player and Song]
Conclusion: Web Tables
– Extremely wide set of schemata with very specific attributes
– many n-ary relations with complex keys / time dimension
– Table context is often required to understand the data
– page title, URL, headings, text?
– Everything works OK, but not good enough yet
– fact precision slot filling: 0.64
– new entities accuracy: 0.60-0.70
– For researchers: Rewarding but challenging domain
– For applications: Human in the loop still required for
verification
Conclusion
Dimension | Semantic Annotations | Web Tables
Classes | Few classes, deep coverage (Products, Local Business, …) | Wide range of topics (Sports, Statistics, …)
Attributes | Small set of rather generic attributes | Wide range of very specific attributes
Challenges | Parsing of below 1NF attribute values | Understanding the semantics of attributes
Opportunities | Resource creation, large-scale training | Resource creation
Underexploited | Yes | Yes, once challenges are handled

Koneksys - Offering Services to Connect Data using the Data WebKoneksys - Offering Services to Connect Data using the Data Web
Koneksys - Offering Services to Connect Data using the Data WebKoneksys
 

Ähnlich wie Schema.org Annotations and Web Tables Underexploited Semantic Nuggets on the Web (20)

Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning Techniques
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and Applications
 
dkNET Annual Meeting - June 2017
dkNET Annual Meeting - June 2017dkNET Annual Meeting - June 2017
dkNET Annual Meeting - June 2017
 
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...STI 2022 - Generating large-scale network analyses of scientific landscapes i...
STI 2022 - Generating large-scale network analyses of scientific landscapes i...
 
Web 2008
Web 2008Web 2008
Web 2008
 
3rd International Conference on Data Science and Applications (DSA 2022)
3rd International Conference on Data Science and Applications (DSA 2022)3rd International Conference on Data Science and Applications (DSA 2022)
3rd International Conference on Data Science and Applications (DSA 2022)
 
TFF2015, Christian Bizer, Uni Mannheim "Schema.org-Annotationen in Webseiten"
TFF2015, Christian Bizer, Uni Mannheim "Schema.org-Annotationen in Webseiten"TFF2015, Christian Bizer, Uni Mannheim "Schema.org-Annotationen in Webseiten"
TFF2015, Christian Bizer, Uni Mannheim "Schema.org-Annotationen in Webseiten"
 
Samba resume
Samba resumeSamba resume
Samba resume
 
3rd International Conference on Cloud, Big Data and Web Services (CBW 2022)
3rd International Conference on Cloud, Big Data and Web Services (CBW 2022)3rd International Conference on Cloud, Big Data and Web Services (CBW 2022)
3rd International Conference on Cloud, Big Data and Web Services (CBW 2022)
 
3 rd International Conference on Cloud, Big Data and Web Services (CBW 2022)
3 rd International Conference on Cloud, Big Data and Web Services (CBW 2022)3 rd International Conference on Cloud, Big Data and Web Services (CBW 2022)
3 rd International Conference on Cloud, Big Data and Web Services (CBW 2022)
 
Rank | Analyse | Lead | Search
Rank | Analyse | Lead | SearchRank | Analyse | Lead | Search
Rank | Analyse | Lead | Search
 
Webware Webinar
Webware WebinarWebware Webinar
Webware Webinar
 
Seven Ways to Boost Artificial Intelligence Research
Seven Ways to Boost Artificial Intelligence ResearchSeven Ways to Boost Artificial Intelligence Research
Seven Ways to Boost Artificial Intelligence Research
 
Open Data and CKAN Data Catalogues
Open Data and CKAN Data CataloguesOpen Data and CKAN Data Catalogues
Open Data and CKAN Data Catalogues
 
Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product Matching
 
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
 
Business X Design - People, Planet & Product
Business X Design - People, Planet & ProductBusiness X Design - People, Planet & Product
Business X Design - People, Planet & Product
 
Events on the outside, on the inside and at the core (jaxlondon)
Events on the outside, on the inside and at the core (jaxlondon)Events on the outside, on the inside and at the core (jaxlondon)
Events on the outside, on the inside and at the core (jaxlondon)
 
Events on the outside, on the inside and at the core - Chris Richardson
Events on the outside, on the inside and at the core - Chris RichardsonEvents on the outside, on the inside and at the core - Chris Richardson
Events on the outside, on the inside and at the core - Chris Richardson
 
Koneksys - Offering Services to Connect Data using the Data Web
Koneksys - Offering Services to Connect Data using the Data WebKoneksys - Offering Services to Connect Data using the Data Web
Koneksys - Offering Services to Connect Data using the Data Web
 

Mehr von Chris Bizer

JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebChris Bizer
 
Data Search and Search Joins (UniversitÀt Heidelberg 2015)
Data Search and Search Joins (UniversitÀt Heidelberg 2015)Data Search and Search Joins (UniversitÀt Heidelberg 2015)
Data Search and Search Joins (UniversitÀt Heidelberg 2015)Chris Bizer
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesChris Bizer
 
Extending Tables with Data from over a Million Websites
 Extending Tables with Data from over a Million Websites Extending Tables with Data from over a Million Websites
Extending Tables with Data from over a Million WebsitesChris Bizer
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsChris Bizer
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Chris Bizer
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackChris Bizer
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureChris Bizer
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataChris Bizer
 

Mehr von Chris Bizer (9)

JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
 
Data Search and Search Joins (UniversitÀt Heidelberg 2015)
Data Search and Search Joins (UniversitÀt Heidelberg 2015)Data Search and Search Joins (UniversitÀt Heidelberg 2015)
Data Search and Search Joins (UniversitÀt Heidelberg 2015)
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web Tables
 
Extending Tables with Data from over a Million Websites
 Extending Tables with Data from over a Million Websites Extending Tables with Data from over a Million Websites
Extending Tables with Data from over a Million Websites
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical Domains
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications.
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of Data
 

KĂŒrzlich hochgeladen

Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$kojalkojal131
 
Lucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRL
Lucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRLLucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRL
Lucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRLimonikaupta
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...Neha Pandey
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024APNIC
 
Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...Sheetaleventcompany
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebJames Anderson
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024APNIC
 
✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663Call Girls Mumbai
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort ServiceDelhi Call girls
 
All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445ruhi
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...tanu pandey
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.soniya singh
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...Escorts Call Girls
 
â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...Diya Sharma
 

KĂŒrzlich hochgeladen (20)

Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
Lucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRL
Lucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRLLucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRL
Lucknow ❀CALL GIRL 88759*99948 ❀CALL GIRLS IN Lucknow ESCORT SERVICE❀CALL GIRL
 
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Rohini 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
(INDIRA) Call Girl Pune Call Now 8250077686 Pune Escorts 24x7
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
đ“€€Call On 7877925207 đ“€€ Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❀ 7710465962 Independent Call Girls In C...
 
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark WebGDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
GDG Cloud Southlake 32: Kyle Hettinger: Demystifying the Dark Web
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Sukhdev Vihar Delhi 💯Call Us 🔝8264348440🔝
 
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort ServiceEnjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort Service
Enjoy Night⚡Call Girls Dlf City Phase 3 Gurgaon >àŒ’8448380779 Escort Service
 
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
Russian Call Girls in %(+971524965298  )#  Call Girls in DubaiRussian Call Girls in %(+971524965298  )#  Call Girls in Dubai
Russian Call Girls in %(+971524965298 )# Call Girls in Dubai
 
All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭ 6378878445
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
 
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...(+971568250507  ))#  Young Call Girls  in Ajman  By Pakistani Call Girls  in ...
(+971568250507 ))# Young Call Girls in Ajman By Pakistani Call Girls in ...
 
VVVIP Call Girls In Connaught Place âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Connaught Place âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place âžĄïž Delhi âžĄïž 9999965857 🚀 No Advance 24HRS...
 
â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
â‚č5.5k {Cash Payment}New Friends Colony Call Girls In [Delhi NIHARIKA] 🔝|97111...
 

Schema.org Annotations and Web Tables Underexploited Semantic Nuggets on the Web

  • 1. Data and Web Science Group. Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the Web? 20-23 May 2019 in Leipzig, Germany. Prof. Dr. Christian Bizer
  • 2. Cool Resources are often Built Exploiting Some Structure Somewhere. Christian Bizer: Schema.org Annotations and Web Tables. LDK2019, Leipzig, 22.5.2019
  • 3. Web Tables
  • 4. Schema.org Annotations
      <div itemtype="http://schema.org/Hotel">
        <span itemprop="name">Vienna Marriott Hotel</span>
        <span itemprop="address" itemtype="http://schema.org/PostalAddress">
          <span itemprop="streetAddress">Parkring 12a</span>
          <span itemprop="addressLocality">Vienna</span>
        </span>
      </div>
      <div itemtype="http://schema.org/Product">
        <h1 itemprop="name">Signature Instant Dome 7</h1>
        Item # <span itemprop="productID">200734246807</span>
        <img itemprop="image" src="../20073424680.jpg">
        <span itemprop="review" itemtype="http://schema.org/Review">
          <span itemprop="ratingValue">5</span>
          <span itemprop="reviewBody">Great waterproof tent!</span>
        </span>
      </div>
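To make the annotation examples concrete, the following is a minimal sketch of how such Microdata could be read with Python's standard `html.parser`. The class `MicrodataParser`, the function `extract_microdata`, and the `SAMPLE` string are hypothetical names chosen for illustration. The sample adds the `itemscope` attribute that the Microdata specification expects and that the slide snippet omits for brevity. The sketch assumes well-formed markup and is not the extractor used by Web Data Commons.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and therefore must not be stacked.
VOID_TAGS = {"img", "meta", "link", "br", "hr", "input"}

SAMPLE = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
  </span>
</div>
"""


class MicrodataParser(HTMLParser):
    """Collects (item type, property, value) triples from Microdata markup."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one (itemtype, itemprop) entry per open element
        self.triples = []

    def current_type(self):
        # The nearest enclosing element that declares an itemtype.
        for itemtype, _ in reversed(self.stack):
            if itemtype:
                return itemtype
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        itemtype, itemprop = attrs.get("itemtype"), attrs.get("itemprop")
        if itemprop and itemtype:
            # Nested item: the property value is another typed item.
            self.triples.append((self.current_type(), itemprop, itemtype))
        elif itemprop and "src" in attrs:
            # Properties such as images carry their value in the src attribute.
            self.triples.append((self.current_type(), itemprop, attrs["src"]))
        if tag not in VOID_TAGS:
            self.stack.append((itemtype, itemprop))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            itemtype, itemprop = self.stack[-1]
            if itemprop and not itemtype:
                self.triples.append((self.current_type(), itemprop, text))


def extract_microdata(html):
    parser = MicrodataParser()
    parser.feed(html)
    parser.close()
    return parser.triples


for triple in extract_microdata(SAMPLE):
    print(triple)
```

The triples keep the type of the enclosing item, so the street address is attributed to the `PostalAddress` item rather than to the hotel itself.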
  • 5. Hypothesis
      – Web tables and Schema.org annotations are useful training data for tasks such as
          – Named Entity Recognition
          – Information Extraction
          – Entity Matching
          – Taxonomy Induction
          – Sentiment Analysis
          – Knowledge Base Completion
      – but they are hardly used by public research yet
  • 6. Outline
      1. Schema.org Annotations
          – Adoption on the Web
          – Current applications and further potential
      2. Web Tables
          – Usage on the Web
          – Application potential
      3. Conclusions
  • 7. 1. Schema.org Annotations
      – Search engines (Google, Bing, Yahoo, Yandex) have asked site owners since 2011 to annotate data so that search results can be enriched
      – 675 types, e.g. Event, Place, Local Business, Product, Review, Person
      – Encodings: Microdata, RDFa, JSON-LD
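As an illustration of the JSON-LD encoding named above, the sketch below expresses the hotel from the Microdata example as a JSON-LD document. The helper name `hotel_as_jsonld` is hypothetical, and the property selection simply mirrors the slide example rather than any required minimum.

```python
import json


def hotel_as_jsonld(name, street, locality):
    """Build a Schema.org Hotel description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Hotel",
        "name": name,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "addressLocality": locality,
        },
    }


doc = hotel_as_jsonld("Vienna Marriott Hotel", "Parkring 12a", "Vienna")
# In a web page this JSON is typically embedded in a
# <script type="application/ld+json"> element.
print(json.dumps(doc, indent=2))
```

Unlike Microdata and RDFa, which attach attributes to the visible HTML elements, JSON-LD keeps the annotations in a separate block, which makes them easy to generate from a database record.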
  • 8. Usage of Schema.org Data @ Google
      – Data snippets within search results
      – Google Maps location data
      – Data snippets within info boxes
  • 9. Usage of Schema.org Data @ Google
      – Product offers
      – Job offers
      https://developers.google.com/search/docs/guides/search-gallery
  • 10. Web Data Commons Project
      – extracts all Microformat, Microdata, RDFa, and JSON-LD data from the Common Crawl
      – analyzes the extracted data and provides it for download
      – statistics about some extraction runs
          – 2018 CC Corpus: 2.5 billion HTML pages → 31.5 billion RDF triples
          – 2017 CC Corpus: 3.1 billion HTML pages → 38.2 billion RDF triples
          – 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
          – 2010 CC Corpus: 2.8 billion HTML pages → 5.1 billion RDF triples
      – uses 100 machines on Amazon EC2
          – approx. 2000 machine hours → 250 Euro
      http://webdatacommons.org/structureddata/
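The extraction statistics can be put in proportion with a little arithmetic. The sketch below uses only the rounded figures listed on the slide. The ratios are averages over all crawled pages, not only over pages that carry annotations, and the hours-per-machine figure assumes the work is spread evenly across the 100 machines.

```python
# (year, HTML pages in billions, RDF triples in billions), as listed on the slide
RUNS = [(2018, 2.5, 31.5), (2017, 3.1, 38.2), (2014, 2.0, 20.4), (2010, 2.8, 5.1)]


def triples_per_page(pages_bn, triples_bn):
    """Average number of extracted triples per crawled HTML page."""
    return triples_bn / pages_bn


for year, pages, triples in RUNS:
    print(f"{year}: {triples_per_page(pages, triples):.1f} triples per crawled page")

MACHINE_HOURS, MACHINES, COST_EUR = 2000, 100, 250
print(f"about {MACHINE_HOURS / MACHINES:.0f} hours per machine")
print(f"about {COST_EUR / MACHINE_HOURS:.3f} Euro per machine hour")
```

Between the 2010 and 2018 runs the average rises from roughly 2 to more than 12 triples per crawled page, which reflects how strongly annotation density grew over that period.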
  • 11. Overall Adoption 2018
      – 944 million of the 2.5 billion HTML pages provide semantic annotations (37.1%).
      – 9.6 million of the 32.8 million pay-level domains (PLDs) covered by the crawl provide semantic annotations (29.3%).
      http://webdatacommons.org/structureddata/2018-12/
  • 12. Frequently used Schema.org Classes
      Number of websites (PLDs) using each class, by encoding:
      Top Classes | Microdata | JSON-LD
      schema:WebPage | 1,124,583 | 121,393
      schema:Product | 812,205 | 40,169
      schema:Offer | 676,899 | 57,756
      schema:BreadcrumbList | 621,344 | 205,971
      schema:Article | 612,361 | 57,082
      schema:Organization | 510,069 | 1,349,775
      schema:PostalAddress | 502,615 | 176,500
      schema:ImageObject | 360,875 | 111,946
      schema:Blog | 337,843 | 12,174
      schema:Person | 324,349 | 335,784
      schema:LocalBusiness | 294,390 | 249,017
      schema:AggregateRating | 258,078 | 23,105
      schema:Review | 124,022 | 6,622
      schema:Place | 92,127 | 66,396
      schema:Event | 88,130 | 63,605
      http://webdatacommons.org/structureddata/2018-12/
  • 13. Attributes used to Describe Products
      Top Attributes | # PLDs (Microdata) | % of PLDs
      schema:Product/name | 754,812 | 92 %
      schema:Product/offers | 645,994 | 79 %
      schema:Offer/price | 639,598 | 78 %
      schema:Offer/priceCurrency | 606,990 | 74 %
      schema:Product/image | 573,614 | 70 %
      schema:Product/description | 520,307 | 64 %
      schema:Offer/availability | 477,170 | 58 %
      schema:Product/url | 364,889 | 44 %
      schema:Product/sku | 160,343 | 19 %
      schema:Product/aggregateRating | 141,194 | 17 %
      schema:Product/brand | 113,209 | 13 %
      schema:Product/category | 62,170 | 7 %
      schema:Product/productID | 47,088 | 5 %
      http://webdatacommons.org/structureddata/2018-12/stats/html-md.xlsx
  • 14. Use Case 1: Gather Large Amounts of Hyponym Relationships
      – plenty of entity names are annotated
      – Hearst Pattern of the 21st century (if class is covered)
      Class | # Entity Names
      Product | 368,386,801
      Organization | 176,365,220
      Local Business | 44,100,625
      Place | 31,748,612
      Event | 26,035,295
      Job Posting | 5,078,941
      Movie | 5,233,520
      Hotel | 2,537,089
      College or University | 2,392,797
      Recipe | 2,036,417
      Music Album | 1,594,941
      Restaurant | 1,480,443
      Airport | 1,145,323
      http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html
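The idea behind this use case is that every annotated entity name, together with its Schema.org class, already is an "X is a Y" statement. A minimal sketch follows, assuming the annotations have been extracted into (class, name) records. The function names and the sample records are illustrative and not part of the Web Data Commons tooling.

```python
from collections import defaultdict


def hyponym_pairs(annotations):
    """Turn (schema.org class, annotated name) records into (hyponym, hypernym) pairs."""
    pairs = set()
    for schema_class, name in annotations:
        name = " ".join(name.split())  # normalise whitespace
        if name:                       # skip empty names
            pairs.add((name, schema_class))
    return pairs


def group_by_hypernym(pairs):
    """Group hyponyms under their hypernym for inspection."""
    grouped = defaultdict(list)
    for hyponym, hypernym in sorted(pairs):
        grouped[hypernym].append(hyponym)
    return dict(grouped)


records = [
    ("Hotel", "Vienna Marriott Hotel"),
    ("Product", "Signature Instant Dome 7"),
    ("Hotel", "  Vienna  Marriott Hotel "),  # duplicate with messy whitespace
    ("Product", ""),                         # empty name, ignored
]
print(group_by_hypernym(hyponym_pairs(records)))
```

Real annotations are noisier than this sample, so a production pipeline would likely need additional filtering, for example by how many independent websites support a pair.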
  • 15. Coverage: Top 15 Hotel Websites
      Top 15 Travel Websites | schema:Hotel
      Booking.com | yes
      TripAdvisor | yes
      Expedia | yes
      Agoda | yes
      Hotels.com | yes
      Kayak | yes
      Priceline | yes
      Travelocity | yes
      Orbitz | yes
      ChoiceHotels | yes
      HolidayCheck | yes
      ChoiceHotels | yes
      InterContinental Hotels Group | yes
      Marriott International | yes
      Global Hyatt Corp. | no
      Adoption: 93 %
      Kärle, Fensel, Toma, Fensel: Why Are There More Hotels in Tyrol than in Austria? Analyzing Schema.org Usage in the Hotel Domain. ICTT 2016.
  • 16. Use Case 2: Distant Supervision for Information Extraction
      – Goal: learn how to extract data from websites that do not provide schema.org annotations, using existing annotations as training data.
      – Example: Google Small Events Paper @ SIGIR 2015
          – extracts data about small-venue concerts, theatre performances, garage sales, etc.
          – uses 217,000 schema.org events from ClueWeb2012 as distant supervision, e.g.
      <div itemtype="http://schema.org/MusicEvent">
        <h1 itemprop="name">School Cool Concert</h1>
        <p itemprop="startDate">25.09.2018 15:00</p>
        <p itemprop="location" itemtype="http://schema.org/Place">School Gym Puero High</p>
      </div>
      Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
  • 17. Google Small Events Paper @ SIGIR 2015
      – Approach
          – train an information extraction method that combines structural and text features
          – extracted attributes: What? When? Where?
          – apply the method to pages without semantic annotations
      – Result
          – extracts 200,000 additional small-scale events from ClueWeb2012
          – doubles recall at 0.85 precision
      Foley, Bendersky, Josifovski: Learning to Extract Local Events from the Web. SIGIR2015.
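Distant supervision in this setting means that the annotated property values are projected onto the page text to obtain token-level training labels. The sketch below shows one simple way to do that with BIO tags. It is an assumption for illustration and not the method of the SIGIR paper, which combines structural and text features. Whitespace tokenisation and the function name `bio_labels` are likewise illustrative choices.

```python
def bio_labels(tokens, annotations):
    """Label tokens with B-/I-<property> where an annotated value occurs, else "O"."""
    labels = ["O"] * len(tokens)
    for prop, value in annotations:
        value_tokens = value.split()
        if not value_tokens:
            continue
        n = len(value_tokens)
        # Mark the first exact occurrence of the annotated value.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == value_tokens:
                labels[i] = "B-" + prop
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + prop
                break
    return labels


tokens = "School Cool Concert on 25.09.2018 15:00 at School Gym Puero High".split()
annotations = [
    ("name", "School Cool Concert"),
    ("startDate", "25.09.2018 15:00"),
    ("location", "School Gym Puero High"),
]
for token, label in zip(tokens, bio_labels(tokens, annotations)):
    print(f"{token}\t{label}")
```

A sequence labeller trained on such token/label pairs could then be applied to pages without annotations, which is the step the slide describes as "apply the method to pages without semantic annotations".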
  • 18. Underexploited?
      – Using schema.org data as distant supervision will likely also work for other information extraction tasks.
      – Research questions
          – How can annotations best be exploited as distant supervision?
          – How well do the annotations perform compared to other sources of supervision?
      Class | # Entities
      Product | 368,386,801
      Organization | 176,365,220
      Local Business | 44,100,625
      Place | 31,748,612
      Event | 26,035,295
      Job Posting | 5,078,941
      Movie | 5,233,520
      Hotel | 2,537,089
      College or University | 2,392,797
      Recipe | 2,036,417
      Music Album | 1,594,941
      Restaurant | 1,480,443
      Airport | 1,145,323
  • 19. Use Case 3: Training Data for Entity Matching / Link Discovery
      – Goal: find e-shops that offer a specific product in order to compare product prices
      – Observation: some e-shops annotate product IDs, many e-shops do not
      Properties | # PLDs | % of PLDs
      schema:Product/name | 754,812 | 92 %
      schema:Product/description | 520,307 | 64 %
      schema:Product/sku | 160,343 | 19 %
      schema:Product/productID | 47,088 | 5 %
      schema:Product/mpn | 12,882 | 1.6 %
      schema:Product/gtin13 | 7,994 | 1 %
  • 20. Learn How to Match Products using Schema.org Data as Supervision
      [Diagram: clusters of product offers having the same product ID → learn matcher → the matcher decides "Same product?" for unseen offers without product IDs]
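The supervision described here can be sketched as pair generation: two offers from the same ID cluster form a positive pair, two offers from different clusters form a negative pair. The function name `training_pairs` and the toy clusters are hypothetical. Enumerating all cross-cluster pairs, as done below, is only feasible for small inputs, so a real pipeline would presumably sample negatives.

```python
from itertools import combinations


def training_pairs(clusters):
    """clusters: dict product ID -> list of offer titles. Returns (positives, negatives)."""
    positives, negatives = [], []
    # Offers sharing a product ID describe the same product.
    for offers in clusters.values():
        positives.extend(combinations(offers, 2))
    # Offers from different clusters describe different products.
    for (_, offers_a), (_, offers_b) in combinations(clusters.items(), 2):
        negatives.extend((a, b) for a in offers_a for b in offers_b)
    return positives, negatives


toy_clusters = {
    "id-A": ["Instant Dome 7 tent", "Signature Instant Dome 7", "Dome 7 waterproof tent"],
    "id-B": ["Trail runner shoe 42", "Trail running shoe size 42"],
}
positives, negatives = training_pairs(toy_clusters)
print(len(positives), "positive pairs,", len(negatives), "negative pairs")
```

A matcher trained on such pairs only sees the offer text, never the product ID, so it can later be applied to offers from shops that do not annotate identifiers.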
  • 21. Data Cleansing for Building the Clusters
    – Cleansing steps
      1. Filtering of product offers with annotated product identifiers
      2. Removal of listing pages
      3. Filtering by identifier value length
      4. Cluster creation based on identifier value co-occurrence
      5. Splitting of wrong clusters caused by category IDs
    – Intermediate sizes, in the order shown on the slide: 121M offers, 58M offers, 26M offers, 16.4M clusters, 16.6M clusters
    – All languages: 26.5M offers, 79K distinct websites, 16.6M clusters (products)
    – English offers: 16M offers, 43K distinct websites, 10M clusters (products)
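The length filter and the co-occurrence clustering named on this slide can be illustrated with a small union-find sketch. This is only an assumed reading of the steps, not the WDC implementation: the function name `cluster_offers`, the input layout, and the `min_len` threshold are hypothetical, and the removal of listing pages and the splitting of category-ID clusters are not modelled.

```python
from collections import defaultdict


def cluster_offers(offers, min_len=8):
    """Group offers that share at least one identifier value.

    offers: dict mapping an offer id to a set of annotated identifier strings.
    min_len: hypothetical length threshold; shorter values are ignored.
    """
    # Keep only identifier values that pass the (assumed) length filter,
    # and drop offers that have no identifier left afterwards.
    kept = {o: {v for v in ids if len(v) >= min_len} for o, ids in offers.items()}
    kept = {o: ids for o, ids in kept.items() if ids}

    # Union-find over offers: two offers end up in the same set
    # when they are connected by a chain of shared identifier values.
    parent = {o: o for o in kept}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    first_offer_with = {}
    for offer, ids in kept.items():
        for value in ids:
            if value in first_offer_with:
                root_a, root_b = find(first_offer_with[value]), find(offer)
                if root_a != root_b:
                    parent[root_b] = root_a
            else:
                first_offer_with[value] = offer

    clusters = defaultdict(set)
    for offer in kept:
        clusters[find(offer)].add(offer)
    return list(clusters.values())
```

Because clusters are built transitively, a single identifier that is really a category ID can glue many unrelated offers together, which is presumably why the slide lists a separate splitting step.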
  • 22. Cluster Size Distribution by Category
    A. Primpeli, R. Peeters, C. Bizer: The WDC Training Dataset and Gold Standard for Large-Scale Product Matching. ECNLP 2019 Workshop @ WWW2019.
  • 23. Training Subsets
    Category | #Positive Pairs | #Negative Pairs | s:name | s:description | specification table
    Computers_small | 20,444 | 21,676 | 100% | 82% | 55%
    Cameras_small | 7,539 | 9,093 | 100% | 61% | 4%
    Watches_small | 5,449 | 8,819 | 100% | 48% | 4%
    Shoes_small | 3,476 | 5,924 | 99% | 36% | 1%
    Computers_large | 120,568 | 139,020 | 100% | 87% | 60%
    Cameras_large | 16,390 | 84,788 | 100% | 65% | 5%
    Watches_large | 7,729 | 65,516 | 100% | 55% | 3%
    Shoes_large | 3,476 | 36,224 | 100% | 35% | 1%
    – Gold standard: 550 offer pairs per category
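One plausible way to derive positive and negative pairs from identifier clusters is sketched below: offers in the same cluster form positive pairs, offers from different clusters form negative pairs. The function `build_pairs` is hypothetical, and the exhaustive negative pairing is an assumption; the published training sets may well select negatives differently (for example, only similar-looking offers).

```python
from itertools import combinations


def build_pairs(clusters):
    """Return (positives, negatives) as lists of offer-id pairs.

    clusters: list of sets of offer ids, one set per product.
    """
    positives = []
    negatives = []
    # Offers within the same cluster describe the same product.
    for cluster in clusters:
        positives.extend(combinations(sorted(cluster), 2))
    # Offers from different clusters describe different products.
    for left, right in combinations(clusters, 2):
        for a in sorted(left):
            for b in sorted(right):
                negatives.append((a, b))
    return positives, negatives
```

Pairing every offer with every offer of every other cluster grows quadratically, so a real pipeline would have to sample or block the negatives.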
  • 24. Results: Traditional Learning Methods
    – Feature creation: word co-occurrence
    Category | Classifier | Precision | Recall | F1
    Computers_small | LinearSVC | 0.75 | 0.94 | 0.84
    Cameras_small | Random Forest | 0.75 | 0.87 | 0.81
    Watches_small | LinearSVC | 0.74 | 0.91 | 0.81
    Shoes_small | Log. Regression | 0.72 | 0.95 | 0.82
    Computers_large | Random Forest | 0.92 | 0.95 | 0.94
    Cameras_large | Dec. Tree | 0.86 | 0.85 | 0.86
    Watches_large | Random Forest | 0.90 | 0.87 | 0.88
    Shoes_large | Log. Regression | 0.94 | 0.86 | 0.90
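The slide only names the feature type, "word co-occurrence". A minimal sketch of one possible interpretation follows: one binary feature per vocabulary word that is set when the word occurs in both offers of a pair. The function name, the whitespace tokenization, and the vocabulary are assumptions for illustration.

```python
def cooccurrence_features(text_a, text_b, vocabulary):
    """Binary vector: 1 where a vocabulary word occurs in both texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    shared = words_a & words_b
    return [1 if word in shared else 0 for word in vocabulary]
```

The resulting vectors could be fed to any of the classifiers listed in the table, with the pair label (match or non-match) as the target.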
  • 25. DeepMatcher
    – word-based, character-based
    – pre-trained (word2vec, GloVe, fastText), learned
    – attribute summarization: SIF, RNN, Attention, Hybrid
    – attribute comparison: fixed distance, learnable distance
    – classification: fully connected neural net
    Mudgal, Li, et al.: Deep Learning for Entity Matching: A Design Space Exploration. SIGMOD 2018.
  • 26. Results: DeepMatcher
    – Human-level matching performance using only web data for training.
    – The 6.6% of errors in the training data are averaged out during learning.
    Category | Model | Precision | Recall | F1
    Computers_small | Combined SIF | 0.91 | 0.98 | 0.95
    Cameras_small | Combined RNN | 0.95 | 0.93 | 0.94
    Watches_small | Combined RNN | 0.93 | 0.98 | 0.95
    Shoes_small | Combined SIF | 0.87 | 0.99 | 0.92
    Computers_large | RNN | 0.96 | 0.98 | 0.97
    Cameras_large | RNN | 0.94 | 0.98 | 0.96
    Watches_large | RNN | 0.96 | 0.99 | 0.97
    Shoes_large | RNN | 0.97 | 0.99 | 0.98
  • 27. Underexploited?
    – If we can get human-level matching performance using just web data for training, why do product comparison portals still pay people to manually label offers for training?
    – Research questions
      – Does this work for all product categories?
      – How well does it work for new products?
  • 28. Use Case 3: Taxonomy Matching and Taxonomy Clustering
    – E-shops annotate their product categorization
    – The marketing ideas behind these categorizations differ widely
    Relevant Property | # PLDs | %
    schema:Offer/category | 62,170 | 7% of all shops
    schema:WebPage/BreadcrumbList | 621,344 | 11% of all sites
    – Example category paths
      – Apparel > Summer > Mens > Jerseys
      – Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jersey > over $60 s
      – Home > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables
      – Shop > Tables > Dining Tables > Aland Wood Tables
  • 29. Cool Taxonomy Clustering Problem
    – How to cluster 62,000 taxonomies in order to discover the most frequent categorization ideas within a domain?
      – How do shops categorize action cameras?
      – What subcategories are frequently used for action cameras?
    Meusel, Primpeli, Meilicke, Paulheim, Bizer: Exploiting Microdata Annotations to Consistently Categorize Product Offers at Web Scale. EC-Web 2015.
  • 30. Learn How to Match Product Taxonomies using Schema.org Data as Supervision
    [Diagram: categories from taxonomies that share some products (Category A with P1, P2; Category B with P1, P2, P3) are used to learn a matcher; the matcher then decides for unseen categories containing some products whether they are similar.]
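The idea in this diagram, that two categories from different shops are likely similar when they contain the same products, can be sketched with a Jaccard overlap of product sets. The choice of Jaccard similarity and the function name are assumptions; the slide does not say which overlap measure is used to create the supervision.

```python
def category_overlap(products_a, products_b):
    """Jaccard overlap between the product sets of two categories."""
    if not products_a and not products_b:
        return 0.0
    shared = products_a & products_b
    return len(shared) / len(products_a | products_b)
```

For the example on the slide, Category A = {P1, P2} and Category B = {P1, P2, P3} share two of three products. Category pairs with high overlap could serve as positive training examples for a matcher that later has to judge categories using other features, such as their labels and paths.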
  • 31. Underexploited?
    – Cool training data for the rather unexplored research challenge of large-scale taxonomy clustering
    – Research questions
      – Which similarity metrics work best?
      – How to block 62,000 taxonomies in order to deal with the runtime complexity?
      – At which depth should clustering start in order to answer interesting business questions?
  • 32. Use Case 4: Training Data for Sentiment Analysis
    – 130,000 websites annotate reviews
    – The 2018 WDC corpus contains 13.5 million reviews which annotate schema:ratingValue and schema:reviewBody
    <div itemtype="http://schema.org/Product">
    <h1 itemprop="name">Signature Instant Dome 7</h1>
    Item # <span itemprop="productID">200734246807</span>
    <img itemprop="image" src="../20073424680.jpg">
    <span itemprop="review" itemtype="http://schema.org/Review">
    <span itemprop="ratingValue">5</span>
    <span itemprop="reviewBody">Great waterproof tent which worked perfectly for me ….</span>
    </span>
    </div>
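To show how such annotations turn into labelled sentiment examples, the following sketch collects (rating, text) pairs from Microdata markup using only Python's standard `html.parser`. The class `ReviewExtractor` and the helper `extract_reviews` are hypothetical names, and the parser is deliberately simplistic: it assumes well-formed markup, text content rather than `content` attributes, and no nested reviews.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class ReviewExtractor(HTMLParser):
    """Collects (ratingValue, reviewBody) pairs from schema.org/Review items."""

    def __init__(self):
        super().__init__()
        self.reviews = []     # finished (rating, text) pairs
        self._current = None  # properties of the review being read
        self._prop = None     # itemprop whose text is being collected
        self._depth = 0       # open elements inside the current review

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they do not affect depth
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1
        if attrs.get("itemtype") == "http://schema.org/Review":
            self._current = {}
            self._depth = 1
        elif self._current is not None and attrs.get("itemprop") in ("ratingValue", "reviewBody"):
            self._prop = attrs["itemprop"]

    def handle_data(self, data):
        if self._current is not None and self._prop:
            self._current[self._prop] = self._current.get(self._prop, "") + data

    def handle_endtag(self, tag):
        if self._current is None or tag in VOID_TAGS:
            return
        self._prop = None
        self._depth -= 1
        if self._depth == 0:  # the review element itself was closed
            if "ratingValue" in self._current and "reviewBody" in self._current:
                rating = float(self._current["ratingValue"].strip())
                self.reviews.append((rating, self._current["reviewBody"].strip()))
            self._current = None


def extract_reviews(html):
    parser = ReviewExtractor()
    parser.feed(html)
    return parser.reviews
```

The rating then acts as a noisy sentiment label for the review text. A real extraction over a web corpus would need a full Microdata parser and normalization of rating scales, which differ between websites.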
  • 33. Reviews by Type of Reviewed Item
    C. Bizer, A. Primpeli, R. Peeters: Using the Semantic Web as a Source of Training Data. Datenbank Spektrum, 2019.
  • 34. Underexploited?
    – To my knowledge, schema.org reviews from many websites are not yet used as training data for sentiment analysis.
    – Research questions
      – In which cases does the training data help to improve sentiment analysis compared to other training corpora?
      – For which types of entities are there improvements?
      – For which languages?
      – Is the up-to-dateness of the training data relevant?
  • 35. Conclusion: Semantic Annotations
    – Comprehensive coverage of specific domains
      – many entities, many different websites
    – Narrow set of attributes used
      – often requires additional parsing and specific cleansing
    – Underexploited?
      – Information extraction: 1 paper, positive results
      – Entity matching: 1 paper, positive results
      – Taxonomy clustering: 0 experimental papers
      – Sentiment analysis: 0 experimental papers
  • 36. 2. Web Tables
    https://research.google.com/tables
  • 37. Attribute/Value Table together with schema.org Annotations
    [Figure labels: s:name, Attribute/Value HTML Table, s:breadcrumb]
    Qiu, et al.: DEXTER: Large-Scale Discovery and Extraction of Product Specifications on the Web. VLDB 2015.
    Petrovski, et al.: The WDC Gold Standards for Product Feature Extraction and Product Matching. EC-Web 2016.
  • 38. Web Tables
    – In a corpus of 14B raw tables, 154M are relational tables (1.1%). (Cafarella et al. 2008)
    [Figure label: Our focus]
    Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
    Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
  • 39. Web Data Commons – Web Tables Corpus
    – Large public corpus of Web tables
      – extracted from Common Crawl 2015 (1.78 billion pages)
      – 90 million relational tables (0.9%)
      – 139 million attribute/value tables (1.3%)
      – selected out of 10.2B raw tables
      – 1 TB zipped
      – http://webdatacommons.org/webtables/
    Lehmberg, Ritze, Meusel, Bizer: A Large Public Corpus of Web Tables containing Time and Context Metadata. WWW2016 Companion.
  • 40. Topical Profiling: Websites contributing many tables
    Website | # Tables | Topic
    apple.com | 50,910 | Music
    baseball-reference.com | 25,647 | Sports
    latestf1news.com | 17,726 | Sports
    nascar.com | 17,465 | Sports
    amazon.com | 16,551 | Products
    wikipedia.org | 13,993 | Various
    inkjetsuperstore.com | 12,282 | Products
    flightmemory.com | 8,044 | Flights
    windshieldguy.com | 7,305 | Products
    citytowninfo.com | 6,293 | Cities
    blogspot.com | 4,762 | Various
    7digital.com | 4,462 | Music
  • 41. Usage Scenario: Knowledge Base Completion
    [Figure: matrix for a class in the KB with entities E1 … Em as rows and attributes A1 … AN as columns; missing values are marked "?".]
    – Slot Filling: add facts for existing instances and existing properties
    – Schema Expansion: add and fill new attributes
    – Entity Expansion:
      – add new instances
      – and their descriptions (values for existing attributes)
  • 42. Use Case 1: Slot Filling
    [Figure: matrix for a class in the KB with entities E1 … Em as rows and attributes A1 … AN as columns; the slots to be filled are marked "?".]
  • 43. Slot Filling: Two Steps
    1. Matching
    2. Data Fusion
    [Figure label: IATA Code]
  • 44. Holistic Entity and Schema Matching
    – T2K+ matches
      – millions of web tables to the
      – DBpedia knowledge base
    – Exploits
      – wide range of features
      – synergies between matching tasks
      – table corpus for indirect matching
      – patterns from KB for post filtering
    – Results
    Task | Precision | Recall | F1
    Instance | .90 | .76 | .82
    Property | .77 | .65 | .70
    Class | .94 | .94 | .94
    Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia – A Feature Utility Study. EDBT 2017.
  • 45. Matching Small Web Tables: Table Stitching
    – Tables with the same schema appear many times on the same website
      – as large original tables are split for human consumption
      – 97% of all websites use between 1 and 10 different schemata [VLDB'17]
    – Stitching merges all web tables on one website having the same schema: $S = \bigcup_{T_i \in \mathcal{T}} T_i$
    Oliver Lehmberg, Christian Bizer: Stitching web tables for improving matching quality. VLDB Endowment, 2017.
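The union step described on this slide can be sketched as grouping tables by website and schema and concatenating their rows. This is a simplified illustration, not the method of the cited paper: the function name and input layout are hypothetical, and schemas are treated as equal only when their normalized headers are identical, in the same order.

```python
from collections import defaultdict


def stitch_tables(tables):
    """Merge tables from the same website that have the same schema.

    tables: list of (site, header, rows), where header is a sequence of
    column names and rows is a list of value tuples.
    Returns a dict mapping (site, normalized header) to the merged rows.
    """
    stitched = defaultdict(list)
    for site, header, rows in tables:
        # Normalizing the header is an assumption made for this sketch.
        key = (site, tuple(name.strip().lower() for name in header))
        stitched[key].extend(rows)
    return dict(stitched)
```

The larger merged tables give a matcher more rows per schema to work with, which fits the gain in attribute matching quality reported for union and stitched tables.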
  • 46. Effect of Table Stitching
    – Strongly reduces the number of tables
      – 19.8 web tables on average per union table
      – 46.9 web tables on average per stitched union table
    – Attribute matching F1:
      – original web tables: 0.35
      – union tables: 0.53
      – stitched union tables: 0.67
      – expected from T2K+: 0.70
    [Bar chart, table counts: web tables 5,176,160; union tables 261,215; stitched union tables 104,221]
    [Bar chart: precision, recall and F1-measure for web tables, union tables and stitched union tables]
  • 47. Table to DBpedia Matching Results
    – about 1 million out of 30 million tables match DBpedia (~3%)
      – 13,726,582 row-to-instance correspondences (~0.7%)
      – 562,445 attribute-to-property correspondences
      – 301,450 tables with attribute correspondences (ca. 32%) → 8 million triples
    – Content variety
      – 274 different classes, 721 unique properties
      – 717,174 unique instances (15.6% of DBpedia)
    Ritze, Lehmberg, Oulabi and Bizer: Profiling the Potential of Web Tables for Augmenting Cross-domain Knowledge Bases. WWW2016.
  • 48. Data Fusion Results
    – Fusion Results
    Fusion Strategy | Precision | Recall | F1
    Voting | .369 | .823 | .509
    Knowledge-based Trust | .639 | .785 | .705
    PageRank | .365 | .814 | .504
    – Group Sizes
      – Large groups (Head): values likely already exist in the KB
      – Small groups (Tail): often new values, but difficult to fuse correctly
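The difference between plain voting and a trust-weighted strategy can be illustrated with the sketch below. It is not the fusion code behind these results: the function `fuse`, the input layout, and the trust scores are hypothetical, and how Knowledge-based Trust actually derives its scores is not shown on the slide.

```python
from collections import defaultdict


def fuse(candidates, trust=None):
    """Pick one value per slot from conflicting candidate values.

    candidates: list of (slot, value, source) triples, where a slot is an
    (entity, property) pair and source identifies the originating table.
    trust: optional dict mapping a source to a weight; without it every
    source counts as one vote (plain voting).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for slot, value, source in candidates:
        weight = 1.0 if trust is None else trust.get(source, 0.0)
        scores[slot][value] += weight
    # For each slot, keep the value with the highest accumulated score.
    return {slot: max(values, key=values.get) for slot, values in scores.items()}
```

With plain voting, two unreliable tables that repeat the same wrong value outvote one reliable table. Weighting by a per-table trust score can reverse that outcome, which is one plausible reason for the precision gap between the strategies in the table.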
  • 49. Fused Triples from Web Tables by DBpedia Class
    DBpedia Class | Overlap with existing values | New values
    Person | 117,522 | 15,050
    Athlete | 84,562 | 9,067
    Artist | 2,019 | 427
    Office Holder | 3,465 | 510
    Politician | 3,124 | 1,167
    Organisation | 20,522 | 7,903
    Company | 6,376 | 2,547
    Sports Team | 790 | 132
    Educational Inst. | 8,844 | 3,132
    Broadcaster | 4,004 | 1,924
    Work | 189,131 | 27,867
    Musical Work | 118,511 | 8,427
    Film | 29,903 | 12,143
    Software | 17,554 | 2,766
    Place | 32,855 | 9,871
    Populated Place | 16,604 | 6,704
    Country | 2,084 | 433
    Architectural Struct. | 10,441 | 1,775
    Species | 9,016 | 1,429
  • 50. What is going wrong?
    https://www.bls.gov/oes/2017/may/oes_ca.htm
  • 51. What is going wrong?
    – Matching and fusion methods expect binary relations
    – But many relations in Web tables are not binary!
    First table:
    State | Avg. Sal.
    California | $128,805
    New York City | $113,689
    Alabama | $39,180
    Second table:
    State | Avg. Sal.
    California | $63,805
    New York City | $53,472
    Alabama | $39,180
  • 52. Synthesizing N-ary Relations from Web Tables
    – Exploiting page context: title and URL
    – Stitching tables from the same site
    Oliver Lehmberg, Christian Bizer: Synthesizing N-ary Relations from Web Tables. WIMS2019.
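The two ideas on this slide can be combined in a small sketch: attributes taken from the page context are added to every row as extra columns, and the extended rows of all tables from a site are then stitched into one relation. The function names and the `occupation` context attribute are hypothetical; extracting such attributes from titles and URLs is the hard part and is not modelled here.

```python
def add_context(rows, context):
    """Return copies of the rows extended with page-level context attributes."""
    return [{**context, **row} for row in rows]


def stitch_with_context(tables):
    """Merge tables into one n-ary relation.

    tables: list of (context, rows), where context is a dict of attributes
    taken from the page (for example from its title or URL) and rows is a
    list of dicts with the table's own columns.
    """
    stitched = []
    for context, rows in tables:
        stitched.extend(add_context(rows, context))
    return stitched
```

Two tables that both list State and Avg. Sal. no longer contradict each other once the context column is part of the key: the relation is (occupation, State) → Avg. Sal. rather than State → Avg. Sal.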
  • 53. Amount of N-ary Relations
    – Results for the WDC2015 corpus:
      – 37.50% of all relations are non-binary
      – 50.47% of these require context attributes
    – The time dimension is often part of the keys, but so are many other attributes
    Oliver Lehmberg, Christian Bizer: Profiling the Semantics of N-ary Web Table Data. SBD Workshop @ SIGMOD 2019.
  • 54. Use Case 2: Adding Long-Tail Entities to Knowledge Base
    [Figure: matrix for a class in the KB with entities E1 … Em as rows and attributes A1 … AN as columns, extended by additional rows for new entities; missing values are marked "?".]
    – Entity Expansion:
      1. Find new entities
      2. Compile descriptions of these new entities
  • 55. Long-Tail Entity Expansion Pipeline
    [Diagram: Web tables → Schema Matching → Row Clustering → Entity Creation → New Detection (against the knowledge base) → new entities added to knowledge base. The output of the first iteration is used to refine the schema mapping in a second iteration.]
    Yaser Oulabi, Christian Bizer: Extending cross-domain knowledge bases with long tail entities using web table data. EDBT 2019.
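A toy version of the three steps after schema matching is sketched below, under strong simplifying assumptions: rows are clustered by their normalized label, entities are created by taking the first non-empty value per attribute, and an entity counts as new when its label is not in the knowledge base. The cited paper uses learned components for these steps; the function and field names here are hypothetical.

```python
from collections import defaultdict


def expand_entities(rows, kb_labels):
    """Return descriptions of entities that are not yet in the knowledge base.

    rows: list of dicts that already use KB property names (schema matching
    is assumed to be done) and contain a "label" field.
    kb_labels: labels of the entities that exist in the knowledge base.
    """
    # Row clustering: rows with the same normalized label describe one entity.
    clusters = defaultdict(list)
    for row in rows:
        clusters[row["label"].strip().lower()].append(row)

    # Entity creation: the first non-empty value per attribute wins.
    entities = {}
    for key, members in clusters.items():
        entity = {}
        for row in members:
            for attribute, value in row.items():
                if value and attribute not in entity:
                    entity[attribute] = value
        entities[key] = entity

    # New detection: keep entities whose label is unknown to the KB.
    known = {label.strip().lower() for label in kb_labels}
    return [entity for key, entity in entities.items() if key not in known]
```

Exact label matching is the weakest part of this sketch, since different entities can share a name and the same entity can appear under several names.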
  • 56. Entity Expansion Results
    Class | New Entities | New Facts | New Entities Accuracy | New Facts Accuracy
    Football Player | 13,983 (+67%) | 43,800 (+32%) | 0.60 | 0.95
    Song | 186,943 (+356%) | 393,711 (+125%) | 0.70 | 0.85
    – 43% of errors are caused by wrong attribute-to-property matching
    – 35% of errors are a result of outdated or wrong data in tables
    [Bar chart: existing vs. new entities for Football Player and Song]
  • 57. Conclusion: Web Tables
    – Extremely wide set of schemata with very specific attributes
      – many n-ary relations with complex keys / time dimension
    – Table context is often required to understand the data
      – page title, URL, headings, text?
    – Everything works OK, but not good enough yet
      – fact precision slot filling: 0.64
      – new entities accuracy: 0.60-0.70
    – For researchers: Rewarding but challenging domain
    – For applications: Human in the loop still required for verification
  • 58. Conclusion
    Dimension | Semantic Annotations | Web Tables
    Classes | Few classes, deep coverage (Products, Local Business, …) | Wide range of topics (Sports, Statistics, …)
    Attributes | Small set of rather generic attributes | Wide range of very specific attributes
    Challenges | Parsing of below 1NF attribute values | Understanding the semantics of attributes
    Opportunities | Resource creation, large-scale training | Resource creation
    Underexploited | Yes | Yes, once challenges are handled

Editor's notes

  1. Idealo