Evolving the Web into a Global Dataspace – Advances and Applications

Chris Bizer, Professor at the University of Mannheim
Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 1
Prof. Dr. Christian Bizer
Evolving the Web into a Global Dataspace - Advances and Applications -
18th International Conference on Business Information Systems (BIS 2015)
Hello
Professor Christian Bizer
University of Mannheim
Research Topics
− Web Technologies
− Web Data Profiling
− Web Data Integration
− Web Mining
Data and Web Science Group @ University of Mannheim
− 6 Professors
• Heiner Stuckenschmidt
• Rainer Gemulla
• Christian Bizer
• Simone Ponzetto
• Heiko Paulheim
• Johanna Völker
− 25 researchers and PhD students
− http://dws.informatik.uni-mannheim.de/
1. Research methods for integrating and mining large amounts of heterogeneous information from the Web.
2. Empirically analyze the content and structure of the Web.
Querying the Classic Web
[Figure: DB, HTML]
Long Standing Goal
Query the Web like a single, global database
2001 Article: The Semantic Web
Envisions three things to happen:
1. people publish data in structured form in addition to HTML pages on the Web
2. common vocabularies / ontologies are used to represent data
3. people implement cool applications that do smart things with the available data
Tim Berners-Lee, James Hendler and Ora Lassila:
The Semantic Web. Scientific American, May 2001.
14 Years Later
There are 1.5 million publications about the
Semantic Web on Google Scholar, but
1. Do people publish structured data on the Web?
2. Do people agree on common vocabularies / ontologies?
3. What are the cool applications that do smart things
with the data?
Outline
1. Semantic Annotations in HTML Pages
2. Linked Data
3. Knowledge Graphs
4. Conclusions
1. Semantic Annotations in HTML Pages
Simple idea: Help machines to understand Web content by marking up data in HTML pages.
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
    <span itemprop="addressCountry">Austria</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> stars, based on
    <span itemprop="reviewCount">250</span> reviews.
  </div>
</div>
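To illustrate what "understanding" such markup can mean for a machine, here is a minimal sketch that pulls the leaf `itemprop` values out of the hotel example using only Python's standard `html.parser` module. The names `ItempropParser`, `extract_itemprops` and `HOTEL_HTML` are made up for this illustration, and the sketch assumes well-nested markup without void elements such as `<meta>` or `<br>`. A real Microdata extractor has to handle far more cases.

```python
from html.parser import HTMLParser

# Hotel example from the slide (hypothetical constant name).
HOTEL_HTML = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
    <span itemprop="addressCountry">Austria</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> stars, based on
    <span itemprop="reviewCount">250</span> reviews.
  </div>
</div>
"""


class ItempropParser(HTMLParser):
    """Collects (itemprop, text) pairs for properties that carry a text value."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: its itemprop name or None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A property that opens a nested item (itemscope) has no text value itself.
        prop = attrs.get("itemprop") if "itemscope" not in attrs else None
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1]:
            self.pairs.append((self.stack[-1], text))


def extract_itemprops(html):
    """Return the (property, value) pairs found in an HTML snippet."""
    parser = ItempropParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


for prop, value in extract_itemprops(HOTEL_HTML):
    print(f"{prop}: {value}")
```

The sketch flattens the nested items into one list. Keeping track of which item each property belongs to would require a second stack of open `itemscope` elements.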
Semantic Annotation Formats
Microformats
− date back to 2003
− small set of fixed formats
RDFa
− W3C Recommendation in 2008
− can represent any type of data
Microdata
− proposed in 2009
− tries to be simpler than RDFa
Open Graph Protocol
− allows site owners to determine how entities are displayed in Facebook
− relies on RDFa for marking up data in HTML pages
− available since April 2010
Schema.org
− site owners have been asked since 2011 to annotate data for enriching search results
− 675 Types: Event, Place, Local Business, Product, Review, Person
− Encoding: Microdata, RDFa, JSON-LD
Usage of Schema.org Data @ Google
Rich snippets within search results
Event Data in Google Applications
https://developers.google.com/structured-data/
Flight Offers in Google Search Results
Annotated webpages directly below Google Flights results
Rich-Snippets Get More User Attention
Source: www.looktracker.com
Potential business incentive.
Motivation for Semantic Annotations
− Study by searchmetrics.com in 2013: tens of thousands of search keywords
− Type of rich-snippet displayed by Google:
Source: http://www.searchmetrics.com/de/knowledge-base/schema/
Google displays Rich-Snippets for 40% of all queries.
The Common Crawl
The Web Data Commons Project
− extracts all Microformat, Microdata, RDFa data
from the Common Crawl
− analyzes and provides the extracted data for download
− four extraction runs so far
• 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
• 2013 CC Corpus: 2.2 billion HTML pages → 17.2 billion RDF triples
• 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples
• 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples
− uses 100 machines on Amazon EC2
• approx. 3,000 machine hours (spot instances of type c3.xlarge) → 550 Euro
− http://www.webdatacommons.org/structureddata/
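The ratios behind these figures are easy to recompute. The following is a small sketch that uses only the numbers listed on this slide. The function and variable names are illustrative, and the triples-per-page value is an average over all crawled pages, including pages without any annotations.

```python
# (corpus, HTML pages in billions, RDF triples in billions), as listed on the slide
RUNS = [
    ("2014", 2.0, 20.4),
    ("2013", 2.2, 17.2),
    ("2012", 3.0, 7.3),
    ("2009/2010", 2.5, 5.1),
]


def triples_per_page(pages_bn, triples_bn):
    """Average number of extracted RDF triples per crawled HTML page."""
    return triples_bn / pages_bn


def cost_per_machine_hour(total_eur=550, machine_hours=3000):
    """Approximate cost of one machine hour for the extraction run."""
    return total_eur / machine_hours


for corpus, pages, triples in RUNS:
    print(f"{corpus}: {triples_per_page(pages, triples):.1f} triples per page")
print(f"approx. {cost_per_machine_hour():.2f} Euro per machine hour")
```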
Overall Adoption 2014
620 million HTML pages out of the 2 billion pages provide semantic annotations (30%).
2.72 million pay-level domains (PLDs) out of the 15.68 million pay-level domains covered by the crawl provide annotations (17%).
Google, 2014*:
5 million websites provide Schema.org data.
* Guha in LDOW2014 Keynote
Number of PLDs providing Semantic Annotations
Most Popular Classes
RDFa
Microdata
Topical Focus – Microdata 2014
#   Class                    Instances (2014)   PLDs 2014 (#)   PLDs 2014 (%)   PLDs 2013 (#)   PLDs 2013 (%)
1   schema:WebPage           51,757,000         148,893         18.16           69,712          15.04
2   schema:Article           54,972,000         88,700          10.82           65,930          14.22
3   schema:Blog              3,787,000          110,663         13.50           64,709          13.96
4   schema:Product           288,083,000        89,608          10.93           56,388          12.16
5   schema:PostalAddress     48,804,000         101,086         12.33           52,446          11.31
6   dv:Breadcrumb            269,088,000        76,894          9.38            44,187          9.53
7   schema:AggregateRating   59,070,000         50,510          6.16            36,823          7.94
8   schema:Offer             236,953,000        62,849          7.66            35,635          7.69
9   schema:LocalBusiness     20,194,000         62,191          7.58            35,264          7.61
10  schema:BlogPosting       11,458,000         65,397          7.98            32,056          6.92
11  schema:Organization      101,769,000        52,733          6.43            24,255          5.23
12  schema:Person            115,376,000        47,936          5.85            21,107          4.55
13  schema:ImageObject       35,356,000         25,573          3.12            16,084          3.47
14  dv:Product               12,411,000         16,003          1.95            13,844          2.99
15  schema:Review            42,561,000         20,124          2.45            13,137          2.83
16  dv:Review-aggregate      3,964,000          14,094          1.72            13,075          2.82
17  dv:Organization          3,155,000          10,649          1.30            9,582           2.07
18  dv:Offer                 7,170,000          11,640          1.42            9,298           2.01
19  dv:Address               2,138,000          9,674           1.18            8,866           1.91
20  dv:Rating                1,732,000          9,367           1.14            8,360           1.80
− Top Classes
− Topics:
• CMS and blog metadata
• products and offers
• ratings and reviews
• business listings
• address data
• ...and a massive long tail
schema: = Schema.org
dv: = Google Rich Snippet Vocabulary (deprecated)
Adoption by E-Commerce Websites
Distribution by Top-Level Domain
TLD      #PLDs
com      38,344
co.uk    3,605
net      1,813
de       1,333
pl       1,273
com.br   1,194
ru       1,165
com.au   1,062
nl       1,002
Alexa Top-15 Shopping Sites (checked for schema:Product)
Amazon.com
Ebay.com
NetFlix.com
Amazon.co.uk
Walmart.com
etsy.com
Ikea.com
Bestbuy.com
Homedepot.com
Target.com
Groupon.com
Newegg.com
Lowes.com
Macys.com
Nordstrom.com
Adoption by Top-15: 60 %
Properties used to Describe Products
Top 15 Properties                      PLDs (#)   PLDs (%)
schema:Product/name                    78,292     87 %
schema:Product/image                   59,445     66 %
schema:Product/description             58,228     65 %
schema:Product/offers                  57,633     64 %
schema:Offer/price                     54,290     61 %
schema:Offer/availability              36,789     41 %
schema:Offer/priceCurrency             30,610     34 %
schema:Product/url                     23,723     26 %
schema:Product/aggregateRating         21,166     24 %
schema:AggregateRating/ratingValue     20,513     23 %
schema:AggregateRating/reviewCount     14,930     17 %
schema:Product/manufacturer            10,150     11 %
schema:Product/brand                   9,739      11 %
schema:Product/productID               9,221      10 %
schema:Product/sku                     7,955      9 %
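The percentage column appears to be relative to the number of PLDs that use schema:Product at all. That base is an assumption here: it is taken from the schema:Product row of the class table on slide 23 (89,608 PLDs in 2014), and the recomputed shares agree with the slide. A short sketch with illustrative names:

```python
# Assumed base: PLDs using schema:Product in 2014, from the class table on slide 23.
PRODUCT_PLDS = 89_608

# A few rows of the property table (PLD counts as listed on the slide).
PROPERTY_PLDS = {
    "schema:Product/name": 78_292,
    "schema:Offer/price": 54_290,
    "schema:Product/productID": 9_221,
    "schema:Product/sku": 7_955,
}


def share(count, base=PRODUCT_PLDS):
    """Share of the base in percent, rounded to a whole number."""
    return round(100 * count / base)


for prop, count in PROPERTY_PLDS.items():
    print(f"{prop}: {share(count)} %")
```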
Adoption by Travel Websites
Top 15 Travel Websites (checked for schema:Hotel and for any class)
Booking.com (uses DataVoc)
TripAdvisor
Expedia
Agoda
Hotels.com
Kayak
Priceline
Travelocity
Orbitz
ChoiceHotels
HolidayCheck
ChoiceHotels
InterContinental Hotels Group
Marriott International
Global Hyatt Corp.
Adoption: 73 %
Properties used to Describe Hotels
Top 10 Properties                       PLDs (#)   PLDs (%)
schema:Hotel/name                       4,173      88.35 %
schema:Hotel/address                    3,311      70.10 %
schema:Hotel/telephone                  2,488      52.68 %
schema:PostalAddress/streetAddress      2,362      50.01 %
schema:PostalAddress/addressLocality    2,231      47.24 %
schema:Hotel/url                        2,102      44.51 %
schema:PostalAddress/postalCode         2,096      44.38 %
schema:AggregateRating/ratingValue      1,952      41.33 %
schema:Hotel/aggregateRating            1,866      39.51 %
schema:AggregateRating/bestRating       1,697      35.93 %
Adoption by Job Websites
Distribution by Top-Level Domain
TLD      #PLDs
jobs     908
com      828
org      263
co.uk    194
net      40
nl       38
ca       33
de       32
Top-10 Employment Sites (checked for schema:JobPosting)
Indeed.com
Monster.com
Careerbuilder.com
Snagajob.com
Jobsdb.com
Jobsearch.about.com
Jobs.net
Internships.com
Jobs.aol.com
Quintcareers.com
Adoption by Top-10: 70 %
Properties used to Describe Job Postings
Top 10 Properties                   PLDs (#)   PLDs (%)
JobPosting/title                    2,588      91.16 %
JobPosting/hiringOrganization       1,412      49.74 %
JobPosting/description              1,192      41.99 %
JobPosting/jobLocation              1,062      37.41 %
Organization/name                   862        30.36 %
JobPosting/datePosted               793        27.93 %
Place/address                       471        16.59 %
JobPosting/baseSalary               227        8.00 %
JobPosting/industry                 209        7.36 %
JobPosting/educationRequirements    145        5.11 %
Class / Property Distribution
• Only a small set of classes / properties is used.
• Strong focus on Schema.org and Facebook vocabularies.
schema.org: 675 classes, 965 properties
Opportunity 1: Search Engine Optimization
Get richer visibility in search results and potentially more clicks.
Opportunity 2: Change Push to Pull Communication
− Current situation:
• Information providers need to push data into multiple channels
• multiple search engines
• multiple domain-specific portals
− Web approach:
• You maintain a website
• All interested parties crawl your data
Opportunity 3: Applications beyond Rich-Snippets
− E-Commerce
• Rich source of product data, offers, and reviews
• Opportunity to build global product catalogs
• Opportunity to mine product and rating data on global-scale
− Tourism
• Additional data for tourism applications: nearby local businesses, nearby landmarks, nearby hospitals, nearby events
• Search engines as new competitors put pressure on large booking portals?
− Recruitment
• Increased market transparency
• Search engines as new competitors put pressure on job portals that charge per posting?
− Highly up-to-date data
• as original data providers know about changes first
Main Challenge: Data Integration and Cleansing
The schema is standardized, but
1. entity names differ
2. the schema is rather shallow and only a small number of properties is used
3. data quality differs as the data is created by experts and rookies
Looking Deeper into the E-Commerce Data
Property                       PLDs (#)   PLDs (%)
schema:Product/name            78,292     87%
schema:Product/description     58,228     65%
schema:Product/manufacturer    10,150     11%
schema:Product/brand           9,739      11%
schema:Product/productID       9,221      10%
1. The structure of the data is rather shallow
• Product features are encoded in titles and descriptions
• Example product name:
“Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB”
• Example product description:
“Faster Flash Storage with 64 GB Solid State Drive and USB 3.0 …”
• Product IDs are provided by only 10% of the websites
• Categorization information is provided by only 2% of the websites.
Categorization of Product Offers
− We analyzed 1.9 million product offers from 9,200 shops
− We trained a bag-of-words classifier for 9 product categories on product descriptions from Amazon.
Source: Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. In: 4th Workshop on Data Extraction and Object Search (DEOS2014) @ WWW2014
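The slide does not say which learning algorithm sits behind the bag-of-words classifier. As one possible reading, the sketch below uses a multinomial Naive Bayes model with add-one smoothing. The class name `BagOfWordsNB`, the two category labels and the four training texts are all made up for illustration. The original work used nine categories and product descriptions from Amazon.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Bag-of-words view of a text: lower-cased whitespace tokens."""
    return text.lower().split()


class BagOfWordsNB:
    """Multinomial Naive Bayes over word counts (illustrative, not the authors' code)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of training texts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data with hypothetical category labels.
TRAIN = [
    ("intel core i5 laptop with solid state drive", "computers"),
    ("usb 3.0 flash storage notebook", "computers"),
    ("garden dining table outdoor furniture", "garden"),
    ("barbecue outdoor living garden chair", "garden"),
]
clf = BagOfWordsNB().fit([t for t, _ in TRAIN], [c for _, c in TRAIN])
print(clf.predict("Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB"))
```

Training on a clean catalog and applying the model to offers from thousands of shops mirrors the setup described on the slide, where the labeled data comes from a different source than the data to be categorized.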
Identity Resolution for Electronic Products
− We trained feature extractors for product descriptions on offers for electronic products from Amazon.
− We used the Silk framework for identity resolution.
Precision = 85%
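Silk has its own link specification language, which is not reproduced here. The sketch below only illustrates the general idea of identity resolution and of the precision measure quoted above, using plain token-set Jaccard similarity as a stand-in. The function names, the threshold of 0.5 and the example offers are assumptions made for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of the token sets of two offer titles."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def match_offers(offers_a, offers_b, threshold=0.5):
    """All cross-source pairs whose title similarity reaches the threshold."""
    return [(a, b) for a in offers_a for b in offers_b if jaccard(a, b) >= threshold]


def precision(predicted, gold):
    """Fraction of predicted matches that are correct according to a gold standard."""
    predicted = set(predicted)
    return len(predicted & set(gold)) / len(predicted) if predicted else 0.0


# Hypothetical offers from two shops.
shop_a = ["apple macbook air 11", "dell xps 13 laptop"]
shop_b = ["Apple MacBook Air 13", "Dell XPS 13 Laptop silver"]
print(match_offers(shop_a, shop_b))
```

In this toy run the two MacBook offers are matched although they differ in screen size, which is the kind of false positive that lowers precision and motivates trained feature extractors.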
Starting Points for Further Improvements
− Identity Resolution
• Exploit product identifiers to learn better product recognizers
• 10% of the websites (9,221 PLDs) use s:Product/productID
• 1% of the websites (935 PLDs) use s:Product/gtin13
− Categorization of Products
• Exploit categorization information provided by subset of the websites
• 1.5% of the websites (1,497 PLDs) use s:Offer/category
• 0.5% of the websites (460 PLDs) use s:WebPage/breadcrumb
• Challenge: Integration of ~ 2,000 product taxonomies
Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables > Dining Tables
Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jerseys > over $60
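Breadcrumb strings like the examples above can be turned into category paths before any taxonomy integration starts. The following is a minimal sketch, assuming " > " is the separator as in the examples. `parse_breadcrumb` and `shared_prefix` are hypothetical helper names, and the second breadcrumb in the usage lines is invented.

```python
def parse_breadcrumb(breadcrumb, separator=">"):
    """Split a breadcrumb string into a list of category levels."""
    return [part.strip() for part in breadcrumb.split(separator) if part.strip()]


def shared_prefix(path_a, path_b):
    """Longest common leading part of two category paths (case-insensitive)."""
    prefix = []
    for a, b in zip(path_a, path_b):
        if a.lower() != b.lower():
            break
        prefix.append(a)
    return prefix


tables = parse_breadcrumb(
    "Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > "
    "Garden Furniture > Tables > Dining Tables"
)
chairs = parse_breadcrumb("Home > Shop > Outdoor & Garden > Chairs")  # invented example
print(shared_prefix(tables, chairs))
```

The second example on the slide ends in "over $60", so not every level is a category. A real integration step would have to filter out such facets.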
Conclusion: Semantic Annotations in HTML Pages
1. Wide-spread adoption of semantic annotations
• motivated by major search engines
2. Strong ontology agreement driven by data consumers
• Schema.org, Open Graph Protocol
3. Main application: Rich-snippets
4. Endless data pool for
• Commercial applications
• product and travel data integration and mining
• up-to-date listings of local businesses
• job search engines that increase market transparency
• Research
• large-scale data integration and mining
• information extraction (using annotations as distant supervision*)
* Foley, et al.: Learning to Extract Local Events from the Web. SIGIR 2015
Download and Play with the Data
− http://www.webdatacommons.org/structureddata/
− Only tip of the iceberg, as each website is only partly crawled.
2. Linked Data
Set of best practices for publishing structured data on the Web in the form of a single global data graph:
• by using RDF to publish structured data directly on the Web
• by setting links between data items within different data sources.
[Figure: data sources A, B, C, D and E, each publishing RDF, connected by RDF links]
Links as Integration Hints
• publishing Identity Links on the Web
<http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4> owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> .
• publishing Vocabulary Links on the Web
<http://xmlns.com/foaf/0.1/Person> owl:equivalentClass <http://dbpedia.org/ontology/Person> .
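A consumer can use identity links as integration hints by grouping all URIs that are connected through owl:sameAs into one cluster per real-world entity. The sketch below does this with a small union-find structure. It treats owl:sameAs as symmetric and transitive and ignores the trust issues of real data. The first link is the one shown on the slide, while the `example.org` URIs and the function name are hypothetical.

```python
def identity_clusters(same_as_links):
    """Group URIs into clusters connected by (transitive) owl:sameAs links."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps trees shallow
            uri = parent[uri]
        return uri

    for a, b in same_as_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


SAME_AS = [
    # identity link from the slide
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # hypothetical additional links
    ("http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer",
     "http://example.org/people/cbizer"),
    ("http://example.org/people/alice", "http://example.org/staff/alice"),
]

for cluster in identity_clusters(SAME_AS):
    print(sorted(cluster))
```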
Effort Distribution between Publisher and Consumer
Publishers or third parties provide identity/vocabulary links
Consumer mines missing identity/vocabulary links
LOD Datasets on the Web: April 2014
Growth without new category Social Networking: 94 %
Source: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in
Different Topical Domains. In: 13th International Semantic Web Conference (ISWC2014).
Uptake in the Government Domain
− Various efforts by public sector institutions world-wide
− Forerunners
• UK government
• US government
− Types of data published
• statistical data
• environmental data
• budget and election data
− Goals
• Make data available to the public and other government agencies
• Ease data integration by using standards, providing unique identifiers and by setting links
Uptake in the Libraries Community
− Institutions publishing Linked Data
• Library of Congress (subject headings)
• German National Library (PND dataset and subject headings)
• Swedish National Library (Libris - catalog)
• Hungarian National Library (OPAC and digital library)
• Europeana Digital Library (4 million artifacts)
• Springer (metadata about conference proceedings)
− Goals:
1. Interconnect resources between repositories
(by topic, by location, by historical period, by ...)
2. Integrate library catalogs on global scale
Uptake in the Life Science Domain
− Goals:
1. Connect life science datasets in order to support
• biological knowledge discovery
• drug discovery
2. Reuse results of previous integration efforts
Uptake in the Linguistic Research Community
http://linguistic-lod.org/llod-cloud
http://www.lider-project.eu/
Ontological Agreement
− Strong agreement on some vocabularies
− Proprietary vocabularies are used in addition to common ones, as data is often very specific
[Figures: Widely-Used Vocabularies; Proprietary Vocabularies]
RDF Links
− Some datasets put a lot of effort into linking
− Many datasets only link to a small number of other datasets or do not set RDF links at all
[Figures: Datasets with Top In-Degrees; Out-Degrees per Category]
RDF Links in the LOD Cloud: August 2014
Linked Data as Background Knowledge for Data Mining
Which factors correlate with unemployment in France?
Unemployment Table with Additional Attributes
RapidMiner Linked Open Data Extension
Allows you to
1. link local table to LOD data sources
2. extend local table with additional attributes
3. mine extended tables using all RapidMiner features
Finding Correlations
− Use additional attributes to find interesting correlations
− Example correlations for unemployment in France:
• African islands, islands in the Indian Ocean, outermost regions of the EU (positive)
• Population growth (positive)
• Energy consumption (negative)
• Hospital beds/inhabitants (negative)
• Fast food restaurants (positive)
• Police stations (positive)
Source: Petar Ristoski, Christian Bizer, and Heiko Paulheim: Mining the Web of Linked
Data with RapidMiner. Semantic Web Challenge, Winner of the Open Track, 2014.
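The workflow of extending a local table with attributes from an external source and then looking for correlations can be sketched in a few lines of plain Python. This is not the RapidMiner extension itself. The region names, the attribute values and the helper names `extend_table` and `pearson` are invented toy material for illustration and are not real statistics.

```python
import math


def extend_table(rows, key, lookup, new_attr):
    """Add new_attr to every row by looking up row[key] in an external source."""
    return [{**row, new_attr: lookup.get(row[key])} for row in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Local table: made-up toy values, NOT real unemployment figures.
local_table = [
    {"region": "Region A", "unemployment": 8.0},
    {"region": "Region B", "unemployment": 10.0},
    {"region": "Region C", "unemployment": 12.0},
]
# Stand-in for an attribute fetched from a Linked Data source (hypothetical values).
hospital_beds = {"Region A": 6.0, "Region B": 5.0, "Region C": 4.0}

extended = extend_table(local_table, "region", hospital_beds, "hospital_beds")
r = pearson(
    [row["unemployment"] for row in extended],
    [row["hospital_beds"] for row in extended],
)
print(f"correlation: {r:.2f}")
```

With many generated attributes, some strong correlations will appear by chance, so findings of this kind are starting points for analysis rather than causal explanations.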
Commercial Applications: Content Management at BBC
− Interconnect content management systems of different TV and radio stations.
− Similar efforts to connect content repositories at Elsevier and Springer.
Source: http://www.w3.org/2001/sw/sweo/public/UseCases/BBC
Commercial Applications: Application Integration at IBM
− IBM Rational uses Linked Data technologies to connect data from different
• software development tools
• software lifecycle tools
− Goals:
1. Make data independent of concrete tool (IBM or third party)
2. Allow services (reporting, discovery) to access data from all tools
3. Distributed data space as an alternative to central repository or integration hub / bus
Source: http://www.w3.org/2001/sw/sweo/public/UseCases/IBM
Conclusion: Linked Data vs. HTML-embedded Data
Linked Data                               Microdata, Microformats, RDFa
~ 1000 sources                            millions of sources
covers wider range of specific topics     focused on search engines and Facebook
more complex data structures              very simple and shallow data structures
partial ontology agreement                strong ontology agreement
data integration eased by RDF links       data integration often requires NLP techniques
various application prototypes,           strong application pull by search engines
some industrial uptake
3. Knowledge Graphs
− Google Knowledge Graph
• development started 2012, builds on Freebase
• 570 million objects described by over 18 billion facts (2012)
• 1500 classes, 35,000 properties
− Microsoft Satori Knowledge Base
• revealed to the public in mid-2013
− Yahoo Knowledge Graph
• revealed to the public early-2014
− Knowledge Graphs employ RDF-style graph data models
Large cross-domain knowledge bases which aim to cover all “relevant” entities in the world.
Data Sources used to Build Knowledge Graphs
1. Wikipedia
• infoboxes, category system, information extraction from text
2. Open license sources
• e.g. CIA World Factbook, MusicBrainz, …
3. Commercial third-party data
• e.g. IMDB, company listings, …
4. schema.org annotations in web pages
• e.g. contact information for companies
• e.g. logos of companies
Lots of effort is spent on data integration and manual data curation
Application of the Google Knowledge Graph
− Enrich search results with knowledge cards and lists
− Goal: Fulfil information need without having users navigate to other websites
Application of the Microsoft Knowledge Graph
Applications of the Google Knowledge Graph
1. Answer fact queries: “birthdate michael douglas”
2. Compare things: “compare eiffel tower vs empire state building”
Google Now Smart Cards
− Direct answers are especially important in the mobile context
− Google Now displays direct answers for 19.45% of the queries
(Source: Stone Temple Consulting, 2015)
− Medical facts are reviewed by an average of 11.1 doctors
(Source: Google)
New SEO Topic: How to influence Knowledge Graphs?
Source: http://searchengineland.com/leveraging-wikidata-gain-google-knowledge-graph-result-219706
Behind-the-Scenes Applications
− Google
• uses its knowledge graph to identify entities in web pages (Entity Linking)
• Hummingbird ranking algorithm (deployed in 2013) uses knowledge graph as background knowledge for ranking search results.
− Yahoo
• uses its knowledge graph to “support applications across the company: Web Search, Content Understanding, Recommendation, Personalization, Advertisement”*
− Data Integration
• becomes matching data sources against knowledge graphs as intermediate schemata.
Various tasks become easier if you know all entities in the world.
*Source: Nicolas Torzec, Yahoo 2014
Public Knowledge Graphs
The DBpedia Knowledge Base - Version 2014
− Describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology using 685 classes and 2,679 different properties
• 1,445,000 persons
• 735,000 places
• 241,000 organizations
• 123,000 music albums
− Altogether 3 billion pieces of information (RDF triples)
• 580 million were extracted from the English edition of Wikipedia
• 29,000,000 links to external web pages
• 50,000,000 external links into other RDF datasets
− DBpedia Internationalization
• provides data from 125 Wikipedia language editions for download
• For 28 popular languages DBpedia provides cleaned infobox data
DBpedia @ BIS2015
1. Thursday, 10:00
The Past, Present & Future of DBpedia
Keynote by Dimitris Kontokostas
2. Thursday, 10:45
4th DBpedia Community Meeting
Room 2
Google Knowledge Vault
− Research project to build a knowledge base using facts extracted from 1 billion web pages
1. Web text (TXT): Entity linking, relationship extraction
2. HTML trees (DOM): Wrapper induction
3. HTML tables (TBL): Relational tables
4. Semantic Annotations (ANO): schema.org, OGP
− Employs probabilistic model for data fusion
− Results: 1.6 billion facts
• 271 million with confidence >90%
• 90 million not in Freebase
Source: Luna Dong, Evgeniy Gabrilovich, et al.: Knowledge Vault:
A Web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.
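To give a feel for what fusing extractions from several sources means, the sketch below combines per-extractor confidences for the same fact with a noisy-OR rule and keeps facts above a 90% threshold. This is a deliberately simple stand-in chosen for this illustration and not the model described in the Knowledge Vault paper. The facts, confidence values and function names are hypothetical.

```python
def noisy_or(confidences):
    """Probability that at least one independent extraction is correct."""
    p_all_wrong = 1.0
    for c in confidences:
        p_all_wrong *= 1.0 - c
    return 1.0 - p_all_wrong


def fuse(extractions, threshold=0.9):
    """Fuse per-extractor confidences and keep facts above the threshold."""
    fused = {fact: noisy_or(scores.values()) for fact, scores in extractions.items()}
    return {fact: p for fact, p in fused.items() if p > threshold}


# Hypothetical facts with confidences from the four extractor types on the slide.
EXTRACTIONS = {
    ("Vienna Marriott Hotel", "locatedIn", "Vienna"): {"TXT": 0.6, "ANO": 0.8},
    ("Vienna Marriott Hotel", "locatedIn", "Graz"): {"DOM": 0.3},
}

for fact, probability in fuse(EXTRACTIONS).items():
    print(fact, round(probability, 2))
```

Noisy-OR assumes the extractors err independently, which rarely holds on the Web because sites copy from each other. That is one reason a learned fusion model is attractive.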
Data Sources for Public Research in this Space
1. Common Crawl
• ~ 2 billion HTML pages
• updated every couple of months
2. WebDataCommons HTML Tables Corpus
• 147 million relational web tables
• selected out of the 11 billion tables contained in the Common Crawl
• http://webdatacommons.org/webtables/
3. WebDataCommons Microdata and RDFa Corpora
• 20.4 billion RDF triples
• http://www.webdatacommons.org/structureddata/
4. Billion Triples Challenge Dataset 2014
• 4 billion RDF triples crawled from Linked Data sources
• http://km.aifb.kit.edu/projects/btc-2014/
Conclusion: 2001 Article - The Semantic Web
Envisions three things to happen:
1. people publish data in structured form in addition to HTML pages on the Web
2. common vocabularies / ontologies are used to represent data
3. people implement cool applications that do smart things with the available data
Tim Berners-Lee, James Hendler and Ora Lassila:
The Semantic Web. Scientific American, May 2001.
4. Conclusions
1. Publication of Structured Data
• there is more data available than most people from research and industry think
• especially, schema.org annotations are currently gaining traction
• exciting test-bed for research on data profiling and data integration techniques
2. Ontological Agreement
• exists due to application-pull (Google, Facebook)
• but data source-specific attributes are also important (e.g. in life science or government statistics domain)
3. Applications
• the big players are moving (Rich-Snippets, Knowledge Graphs)
• there is a lot of further application potential in the available data
• experimentation in industry, but many efforts are still in the prototype stage
Thanks
− References
• Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. In: 13th International Semantic Web Conference (ISWC2014).
• Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In: 13th International Semantic Web Conference (ISWC2014).
• Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. In: 4th Workshop on Data Extraction and Object Search (DEOS2014).
− Detailed statistics on RDFa, Microdata and Microformats adoption
• http://www.webdatacommons.org/structureddata/
− Detailed statistics on Linked Data adoption
• http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

Evolving the Web into a Global Dataspace – Advances and Applications

  • 1. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 1 Prof. Dr. Christian Bizer Evolving the Web into a Global Dataspace - Advances and Applications - 18th International Conference on Business Information System (BIS 2015)
  • 2. Hello
       Professor Christian Bizer, University of Mannheim
       Research Topics
       − Web Technologies
       − Web Data Profiling
       − Web Data Integration
       − Web Mining
  • 3. Data and Web Science Group @ University of Mannheim
       − 6 Professors
         • Heiner Stuckenschmidt
         • Rainer Gemulla
         • Christian Bizer
         • Simone Ponzetto
         • Heiko Paulheim
         • Johanna Völker
       − 25 researchers and PhD students
       − http://dws.informatik.uni-mannheim.de/
       1. Research methods for integrating and mining large amounts of heterogeneous information from the Web.
       2. Empirically analyze the content and structure of the Web.
  • 4. Querying the Classic Web
       DB HTML
  • 5. Long Standing Goal
       Query the Web like a single, global database
  • 6. 2001 Article: The Semantic Web
       Envisions three things to happen:
       1. people publish data in structured form in addition to HTML pages on the Web
       2. common vocabularies / ontologies are used to represent data
       3. people implement cool applications that do smart things with the available data
       Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web. Scientific American, May 2001.
  • 7. 14 Years Later
       There are 1.5 million publications about the Semantic Web on Google Scholar, but
       1. Do people publish structured data on the Web?
       2. Do people agree on common vocabularies / ontologies?
       3. What are the cool applications that do smart things with the data?
  • 8. Outline
       1. Semantic Annotations in HTML Pages
       2. Linked Data
       3. Knowledge Graphs
       4. Conclusions
  • 9. 1. Semantic Annotations in HTML Pages
       Simple idea: Help machines to understand Web content by marking up data in HTML pages.
       <div itemscope itemtype="http://schema.org/Hotel">
         <span itemprop="name">Vienna Marriott Hotel</span>
         <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
           <span itemprop="streetAddress">Parkring 12a</span>
           <span itemprop="addressLocality">Vienna</span>
           <span itemprop="addressCountry">Austria</span>
         </span>
         <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
           <span itemprop="ratingValue">4</span> stars, based on
           <span itemprop="reviewCount">250</span> reviews.
         </div>
       </div>
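A minimal sketch of how a machine could read such annotations, using only Python's standard `html.parser`. The names `ItempropCollector`, `extract_itemprops` and `SAMPLE_HTML` are hypothetical and chosen for illustration; the sketch assumes well-nested markup without void elements and collects only text-valued properties, whereas a full Microdata parser would also reconstruct the nesting of items.

```python
from html.parser import HTMLParser

# The hotel markup from the slide, with the outer item closed properly.
SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Hotel">
  <span itemprop="name">Vienna Marriott Hotel</span>
  <span itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">Parkring 12a</span>
    <span itemprop="addressLocality">Vienna</span>
    <span itemprop="addressCountry">Austria</span>
  </span>
  <div itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
    <span itemprop="ratingValue">4</span> stars, based on
    <span itemprop="reviewCount">250</span> reviews.
  </div>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collects (itemprop, text) pairs for properties that hold plain text."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._open = []  # one (property name or None, text chunks) entry per open tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A property that also carries itemscope holds a nested item, not text.
        name = None if "itemscope" in attrs else attrs.get("itemprop")
        self._open.append((name, []))

    def handle_data(self, data):
        # Text belongs to the innermost open element, if that is a text property.
        if self._open and self._open[-1][0] is not None:
            self._open[-1][1].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        name, chunks = self._open.pop()
        if name is not None:
            self.pairs.append((name, "".join(chunks).strip()))


def extract_itemprops(html):
    """Return the text-valued itemprop pairs in document order."""
    collector = ItempropCollector()
    collector.feed(html)
    return collector.pairs


if __name__ == "__main__":
    for prop, value in extract_itemprops(SAMPLE_HTML):
        print(prop, "=", value)
```

Skipping properties that carry `itemscope` keeps the sketch short: the address and the rating are items in their own right, so only their leaf values are reported.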
  • 10. Semantic Annotation Formats
        Microformats
        − date back to 2003
        − small set of fixed formats
        RDFa
        − W3C Recommendation in 2008
        − can represent any type of data
        Microdata
        − proposed in 2009
        − tries to be simpler than RDFa
  • 11. Open Graph Protocol
        − allows site owners to determine how entities are displayed in Facebook
        − relies on RDFa for marking up data in HTML pages
        − available since April 2010
  • 12. Schema.org
        − since 2011, site owners are asked to annotate data for enriching search results
        − 675 Types: Event, Place, Local Business, Product, Review, Person
        − Encoding: Microdata, RDFa, JSON-LD
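To illustrate the JSON-LD encoding named above, here is a hedged sketch that expresses the hotel from slide 9 as a JSON-LD object and wraps it in a script element. The variable `HOTEL` and the helper `to_jsonld_script` are hypothetical names; the property names are the Schema.org terms already used in the Microdata example.

```python
import json

# The hotel from slide 9, expressed with the same Schema.org terms.
HOTEL = {
    "@context": "https://schema.org",
    "@type": "Hotel",
    "name": "Vienna Marriott Hotel",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "Parkring 12a",
        "addressLocality": "Vienna",
        "addressCountry": "Austria",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4,
        "reviewCount": 250,
    },
}


def to_jsonld_script(item):
    """Serialize an item as a JSON-LD script element for embedding in a page."""
    body = json.dumps(item, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    print(to_jsonld_script(HOTEL))
```

Unlike Microdata and RDFa, this encoding keeps the data in one block, separate from the visible HTML.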
  • 13. Usage of Schema.org Data @ Google
        Rich snippets within search results
  • 14. Event Data in Google Applications
        https://developers.google.com/structured-data/
  • 15. Flight Offers in Google Search Results
        Annotated webpages directly below Google Flights results
  • 16. Rich Snippets Get More User Attention
        Source: www.looktracker.com
        Potential business incentive.
  • 17. Motivation for Semantic Annotations
        − Study by searchmetrics.com in 2013: tens of thousands of search keywords
        − Type of rich snippet displayed by Google [chart]
        Source: http://www.searchmetrics.com/de/knowledge-base/schema/
        Google displays rich snippets for 40% of all queries.
  • 18. The Common Crawl
  • 19. The Web Data Commons Project
        − extracts all Microformat, Microdata, RDFa data from the Common Crawl
        − analyzes and provides the extracted data for download
        − four extraction runs so far
          • 2014 CC Corpus: 2.0 billion HTML pages → 20.4 billion RDF triples
          • 2013 CC Corpus: 2.2 billion HTML pages → 17.2 billion RDF triples
          • 2012 CC Corpus: 3.0 billion HTML pages → 7.3 billion RDF triples
          • 2009/2010 CC Corpus: 2.5 billion HTML pages → 5.1 billion RDF triples
        − uses 100 machines on Amazon EC2
          • approx. 3000 machine hours (spot instances of type c3.xlarge) → 550 Euro
        − http://www.webdatacommons.org/structureddata/
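The figures above can be turned into two simple ratios: extracted triples per crawled page, and cost per machine hour. The sketch below is only back-of-the-envelope arithmetic on the rounded numbers from the slide; the names `RUNS`, `triples_per_page` and `cost_per_machine_hour` are hypothetical.

```python
# (HTML pages in billions, RDF triples in billions), as stated on the slide.
RUNS = {
    "2014": (2.0, 20.4),
    "2013": (2.2, 17.2),
    "2012": (3.0, 7.3),
    "2009/2010": (2.5, 5.1),
}


def triples_per_page(pages_bn, triples_bn):
    """Average number of extracted triples per crawled HTML page."""
    return triples_bn / pages_bn


def cost_per_machine_hour(total_eur=550.0, machine_hours=3000.0):
    """Approximate cost of one machine hour for the extraction run."""
    return total_eur / machine_hours


if __name__ == "__main__":
    for corpus, (pages, triples) in RUNS.items():
        print(corpus, round(triples_per_page(pages, triples), 1), "triples per page")
    print(round(cost_per_machine_hour(), 2), "Euro per machine hour")
```

The ratio rises from roughly 2 triples per page in the earliest corpus to about 10 in 2014, although the crawls differ in size and selection, so the comparison is only indicative.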
  • 20. Overall Adoption 2014
        − 620 million HTML pages out of the 2 billion pages provide semantic annotations (30%).
        − 2.72 million pay-level-domains (PLDs) out of the 15.68 million pay-level-domains covered by the crawl provide annotations (17%).
        − Google, 2014*: 5 million websites provide Schema.org data.
        * Guha in LDOW2014 Keynote
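The two adoption shares can be recomputed from the counts on the slide. This is a sketch on rounded inputs, with `share` as a hypothetical helper name: the PLD share reproduces the stated 17%, while the page share comes out near 31%, presumably because the slide's 30% was derived from unrounded page counts.

```python
def share(part, whole):
    """Percentage of `whole` that is covered by `part`."""
    return 100.0 * part / whole


# Rounded counts as given on the slide.
PAGE_SHARE = share(620e6, 2e9)      # pages with semantic annotations
PLD_SHARE = share(2.72e6, 15.68e6)  # pay-level domains with annotations

if __name__ == "__main__":
    print("pages:", round(PAGE_SHARE, 1), "%")
    print("PLDs:", round(PLD_SHARE, 1), "%")
```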
  • 21. Number of PLDs providing Semantic Annotations [chart]
  • 22. Most Popular Classes: RDFa, Microdata [charts]
  • 23. Topical Focus – Microdata
        Top Classes
        #   Class                   Instances 2014   PLDs 2014   % 2014   PLDs 2013   % 2013
        1   schema:WebPage              51,757,000     148,893    18.16      69,712    15.04
        2   schema:Article              54,972,000      88,700    10.82      65,930    14.22
        3   schema:Blog                  3,787,000     110,663    13.50      64,709    13.96
        4   schema:Product             288,083,000      89,608    10.93      56,388    12.16
        5   schema:PostalAddress        48,804,000     101,086    12.33      52,446    11.31
        6   dv:Breadcrumb              269,088,000      76,894     9.38      44,187     9.53
        7   schema:AggregateRating      59,070,000      50,510     6.16      36,823     7.94
        8   schema:Offer               236,953,000      62,849     7.66      35,635     7.69
        9   schema:LocalBusiness        20,194,000      62,191     7.58      35,264     7.61
        10  schema:BlogPosting          11,458,000      65,397     7.98      32,056     6.92
        11  schema:Organization        101,769,000      52,733     6.43      24,255     5.23
        12  schema:Person              115,376,000      47,936     5.85      21,107     4.55
        13  schema:ImageObject          35,356,000      25,573     3.12      16,084     3.47
        14  dv:Product                  12,411,000      16,003     1.95      13,844     2.99
        15  schema:Review               42,561,000      20,124     2.45      13,137     2.83
        16  dv:Review-aggregate          3,964,000      14,094     1.72      13,075     2.82
        17  dv:Organization              3,155,000      10,649     1.30       9,582     2.07
        18  dv:Offer                     7,170,000      11,640     1.42       9,298     2.01
        19  dv:Address                   2,138,000       9,674     1.18       8,866     1.91
        20  dv:Rating                    1,732,000       9,367     1.14       8,360     1.80
        schema: = Schema.org
        dv: = Google Rich Snippet Vocabulary (deprecated)
        Topics:
        • CMS and blog metadata
        • products and offers
        • ratings and reviews
        • business listings
        • address data
        • ...and a massive long tail
  • 24. Adoption by E-Commerce Websites
        Distribution by Top-Level Domain
        TLD       # PLDs
        com       38,344
        co.uk      3,605
        net        1,813
        de         1,333
        pl         1,273
        com.br     1,194
        ru         1,165
        com.au     1,062
        nl         1,002
        Alexa Top-15 Shopping Sites, checked for schema:Product:
        Amazon.com, Ebay.com, NetFlix.com, Amazon.co.uk, Walmart.com, etsy.com, Ikea.com, Bestbuy.com, Homedepot.com, Target.com, Groupon.com, Newegg.com, Lowes.com, Macys.com, Nordstrom.com
        Adoption by Top-15: 60 %
  • 25. Properties used to Describe Products
        Top 15 Properties                     # PLDs      %
        schema:Product/name                   78,292   87 %
        schema:Product/image                  59,445   66 %
        schema:Product/description            58,228   65 %
        schema:Product/offers                 57,633   64 %
        schema:Offer/price                    54,290   61 %
        schema:Offer/availability             36,789   41 %
        schema:Offer/priceCurrency            30,610   34 %
        schema:Product/url                    23,723   26 %
        schema:Product/aggregateRating        21,166   24 %
        schema:AggregateRating/ratingValue    20,513   23 %
        schema:AggregateRating/reviewCount    14,930   17 %
        schema:Product/manufacturer           10,150   11 %
        schema:Product/brand                   9,739   11 %
        schema:Product/productID               9,221   10 %
        schema:Product/sku                     7,955    9 %
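The percentage column appears to be the number of PLDs using a property divided by the number of PLDs that use schema:Product at all. The sketch below checks this under the assumption that the denominator is the 89,608 PLDs listed for schema:Product in 2014 on slide 23; `PRODUCT_PLDS`, `PROPERTY_PLDS` and `coverage_percent` are hypothetical names, and only three of the fifteen rows are included.

```python
# Assumed denominator: PLDs using schema:Product in 2014 (slide 23).
PRODUCT_PLDS = 89_608

# A few rows from the table above: property -> number of PLDs using it.
PROPERTY_PLDS = {
    "schema:Product/name": 78_292,
    "schema:Offer/price": 54_290,
    "schema:Product/sku": 7_955,
}


def coverage_percent(pld_count, total=PRODUCT_PLDS):
    """Share of product-annotating PLDs that use a property, in whole percent."""
    return round(100 * pld_count / total)


if __name__ == "__main__":
    for prop, count in PROPERTY_PLDS.items():
        print(prop, coverage_percent(count), "%")
```

Under this assumption the recomputed values match the slide for the rows shown, which suggests the percentages are relative to product-annotating PLDs rather than to all PLDs in the crawl.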
  • 26. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 26 Adoption by Travel Websites Top 15 Travel Websites schema:Hotel Any Class Booking.com (uses DataVoc)   TripAdvisor   Expedia   Agoda   Hotels.com   Kayak   Priceline   Travelocity   Orbitz   ChoiceHotels   HolidayCheck   ChoiceHotels   InterContinental Hotels Group   Marriott International   Global Hyatt Corp.   Adoption: 73 %
  • 27. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 27 Properties used to Describe Hotels Top 10 Properties PLDs # % schema:Hotel/name 4173 88,35 % schema:Hotel/address 3311 70,10 % schema:Hotel/telephone 2488 52,68 % schema:PostalAddress/streetAddress 2362 50,01 % schema:PostalAddress/addressLocality 2231 47,24 % schema:Hotel/url 2102 44,51 % schema:PostalAddress/postalCode 2096 44,38 % schema:AggregateRating/ratingValue 1952 41,33 % schema:Hotel/aggregateRating 1866 39,51 % schema:AggregateRating/bestRating 1697 35,93 %
  • 28. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 28 Adoption by Job Websites − Distribution by Top-Level Domain (TLD, #PLDs): jobs 908, com 828, org 263, co.uk 194, net 40, nl 38, ca 33, de 32 − Top-10 Employment Sites checked for schema:JobPosting: Indeed.com, Monster.com, Careerbuilder.com, Snagajob.com, Jobsdb.com, Jobsearch.about.com, Jobs.net, Internships.com, Jobs.aol.com, Quintcareers.com − Adoption by Top-10: 70 %
  • 29. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 29 Properties used to Describe Job Postings Top 10 Properties PLDs # % JobPosting/title 2588 91.16 % JobPosting/hiringOrganization 1412 49.74 % JobPosting/description 1192 41.99 % JobPosting/jobLocation 1062 37.41 % Organization/name 862 30.36 % JobPosting/datePosted 793 27.93 % Place/address 471 16.59 % JobPosting/baseSalary 227 8.00 % JobPosting/industry 209 7.36 % JobPosting/educationRequirements 145 5.11 %
  • 30. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 30 Class / Property Distribution  Only a small set of classes / properties is used.  Strong focus on Schema.org and Facebook vocabularies. schema.org 675 classes 965 properties
  • 31. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 31 Opportunity 1: Search Engine Optimization Get richer visibility in search results and potentially more clicks.
  • 32. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 32 Opportunity 2: Change Push to Pull Communication − Current situation: • Information providers need to push data into multiple channels • multiple search engines • multiple domain-specific portals − Web approach: • You maintain a website • All interested parties crawl your data
  • 33. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 33 Opportunity 3: Applications beyond Rich-Snippets − E-Commerce • Rich source of product data, offers, and reviews • Opportunity to build global product catalogs • Opportunity to mine product and rating data on global-scale − Tourism • Additional data for tourism applications: Nearby local businesses, nearby landmarks, nearby hospitals, nearby events • Search engines as new competitors put pressure on large booking portals? − Recruitment • Increased market transparency • Search engines as new competitors put pressure on job portals that charge per posting? − High up-to-dateness of data • as original data providers know about changes first
  • 34. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 34 Main Challenge: Data Integration and Cleansing The schema is standardized, but 1. entity names differ 2. the schema is rather shallow and a rather low number of properties is used 3. data quality differs as the data is created by experts and rookies
  • 35. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 35 Looking Deeper into the E-Commerce Data 1. The structure of the data is rather shallow • Product features are encoded in titles and descriptions • Example product name: “Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB” • Example product description: “Faster Flash Storage with 64 GB Solid State Drive and USB 3.0 …” • Product IDs are provided by only 10% of the websites • Categorization information is provided by only 2% of the websites. Property usage (PLDs, # and %): schema:Product/name 78,292 (87%), schema:Product/description 58,228 (65%), schema:Product/manufacturer 10,150 (11%), schema:Product/brand 9,739 (11%), schema:Product/productID 9,221 (10%)
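Because product features are buried in titles, a consumer has to pull them out of free text. The sketch below shows one simple way to do that with regular expressions, applied to the example product name from the slide. The pattern set and the feature names (`screen_size_in`, `cpu_clock_ghz`, `storage_gb`) are hypothetical and chosen only for illustration; this is not the extraction method used by the authors.

```python
import re

# Hypothetical patterns for illustration; real titles need much broader coverage.
PATTERNS = {
    "screen_size_in": re.compile(r"(\d+(?:\.\d+)?)\s*-?\s*in\b", re.IGNORECASE),
    "cpu_clock_ghz": re.compile(r"(\d+(?:\.\d+)?)\s*GHz", re.IGNORECASE),
    "storage_gb": re.compile(r"(\d+)\s*GB\b", re.IGNORECASE),
}


def extract_features(title: str) -> dict:
    """Return the first numeric value found for each known feature pattern."""
    features = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(title)
        if match:
            features[name] = float(match.group(1))
    return features


title = "Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB"
features = extract_features(title)
print(features)
```

Hand-written patterns like these break quickly across shops and product categories, which is one reason learned extractors are attractive for this data.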
  • 36. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 36 Categorization of Product Offers − We analyzed 1.9 million product offers from 9200 shops − We trained a bag-of-words classifier for 9 product categories on product descriptions from Amazon. Source: Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. In: 4th Workshop on Data Extraction and Object Search (DEOS2014) @ WWW2014
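To make the bag-of-words idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier over word counts. The slide does not say which learning algorithm was used, so naive Bayes is a stand-in; the two categories and the four training descriptions are invented toy data, not the Amazon training set.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BagOfWordsNB:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (hypothetical), standing in for labeled product descriptions.
train_texts = [
    "laptop with intel core processor and solid state drive",
    "wireless headphones with usb charging cable",
    "mens cotton jersey shirt size large",
    "womens running shoes leather",
]
train_labels = ["Electronics", "Electronics", "Clothing", "Clothing"]

model = BagOfWordsNB().fit(train_texts, train_labels)
prediction = model.predict("Faster Flash Storage with 64 GB Solid State Drive and USB 3.0")
print(prediction)
```

The appeal of this setup is that labels come from a source with clean categories, while predictions are made on offers from shops that provide no categorization at all.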
  • 37. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 37 Identity Resolution for Electronic Products − We trained feature extractors for product descriptions on offers for electronic products from Amazon. − We used the Silk framework for identity resolution. Precision= 85%
  • 38. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 38 Starting Points for Further Improvements − Identity Resolution • Exploit product identifiers to learn better product recognizers • 10% of the websites (9,221 PLDs) use s:Product/productID • 1% of the websites (935 PLDs) use s:Product/gtin13 − Categorization of Products • Exploit categorization information provided by subset of the websites • 1,5% of the websites (1,497 PLDs) use s:Offer/category • 0,5% of the websites (460 PLDs) use s:WebPage/breadcrumb • Challenge: Integration of ~ 2,000 product taxonomies Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables > Dining Tables Home > Shop > Outdoor & Garden > Barbecues & Outdoor Living > Garden Furniture > Tables > Dining Tables Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jerseys > over $60 Philadelphia Eagles > Philadelphia Eagles Mens > Philadelphia Eagles Mens Jerseys > over $60
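Where shops do publish identifiers such as s:Product/gtin13, offers can be matched exactly instead of by text similarity. The sketch below validates a GTIN-13 with the standard mod-10 check digit and groups offers that share a valid code. The offer records and shop names are hypothetical, and the slide itself proposes using identifiers as training signal for product recognizers rather than this direct grouping.

```python
from collections import defaultdict


def is_valid_gtin13(code: str) -> bool:
    """Check length, digits, and the mod-10 check digit of a GTIN-13."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    digits = [int(c) for c in code]
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]


def group_offers_by_gtin(offers):
    """Map each valid GTIN-13 to the list of shops offering that product."""
    groups = defaultdict(list)
    for shop, gtin in offers:
        if is_valid_gtin13(gtin):
            groups[gtin].append(shop)
    return dict(groups)


# Hypothetical offers as (shop, gtin13) pairs; the last code has a wrong check digit.
offers = [
    ("shop-a.example", "4006381333931"),
    ("shop-b.example", "4006381333931"),
    ("shop-c.example", "4006381333932"),
]
groups = group_offers_by_gtin(offers)
print(groups)
```

Offers grouped this way could also serve as positive examples when training a text-based matcher for the large share of websites that publish no identifier.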
  • 39. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 39 Conclusion: Semantic Annotations in HTML Pages 1. Wide-spread adoption of semantic annotations • motivated by major search engines 2. Strong ontology agreement driven by data consumers • Schema.org, Open Graph Protocol 3. Main application: Rich-snippets 4. Endless data pool for • Commercial applications • product and travel data integration and mining • up-to-date listings of local businesses • job search engines that increase market transparency • Research • large-scale data integration and mining • information extraction (using annotations as distant supervision*) * Foley, et al.: Learning to Extract Local Events from the Web. SIGIR 2015
  • 40. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 40 Download and Play with the Data − http://www.webdatacommons.org/structureddata/ − Only tip of the iceberg, as each website is only partly crawled.
  • 41. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 41 2. Linked Data Set of best practices for publishing structured data on the Web in the form of a single global data graph: • by using RDF to publish structured data directly on the Web • by setting links between data items within different data sources. [Figure: data sources A to E publish RDF and are connected by RDF links.]
  • 42. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 42 Links as Integration Hints  publishing Identity Links on the Web  publishing Vocabulary Links on the Web <http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4> owl:sameAs <http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer> . <http://xmlns.com/foaf/0.1/Person> owl:equivalentClass <http://dbpedia.org/ontology/Person> .
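A consumer can use identity links by merging every URI connected through owl:sameAs into one cluster that denotes a single real-world entity. The sketch below does this with a small union-find structure. The first and last triples are taken from the slide; the middle triple, with its `example.org` URI, is hypothetical and added only to show that links chain transitively.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"

triples = [
    ("http://www4.wiwiss.fu-berlin.de/is-group/resource/persons/Person4",
     OWL_SAME_AS,
     "http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer"),
    # Hypothetical extra identity link, for illustration only.
    ("http://dblp.l3s.de/d2r/resource/authors/Christian_Bizer",
     OWL_SAME_AS,
     "http://example.org/people/cbizer"),
    # Vocabulary link from the slide; not an identity link, so it is ignored below.
    ("http://xmlns.com/foaf/0.1/Person",
     OWL_EQUIVALENT_CLASS,
     "http://dbpedia.org/ontology/Person"),
]


def same_as_clusters(triples):
    """Group URIs into clusters connected by owl:sameAs links (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for subject, predicate, obj in triples:
        if predicate == OWL_SAME_AS:
            root_s, root_o = find(subject), find(obj)
            if root_s != root_o:
                parent[root_o] = root_s

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


clusters = same_as_clusters(triples)
print(clusters)
```

Vocabulary links such as owl:equivalentClass can be handled the same way on the schema level, mapping classes from different vocabularies onto each other before querying.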
  • 43. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 43 Effort Distribution between Publisher and Consumer Publishers or third parties provides identity/vocabulary links Consumer mines missing identity/vocabulary links Effort Distribution
  • 44. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 44 LOD Datasets on the Web: April 2014 Growth without new category Social Networking: 94 % Source: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. In: 13th International Semantic Web Conference (ISWC2014).
  • 45. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 45 Uptake in the Government Domain − Various efforts by public sector institutions world-wide − Forerunners • UK government • US government − Types of data published • statistical data • environmental data • budget and election data − Goals • Make data available to the public and other government agencies • Ease data integration by using standards, providing unique identifiers and by setting links
  • 46. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 46 Uptake in the Libraries Community − Institutions publishing Linked Data • Library of Congress (subject headings) • German National Library (PND dataset and subject headings) • Swedish National Library (Libris - catalog) • Hungarian National Library (OPAC and digital library) • Europeana Digital Library (4 million artifacts) • Springer (metadata about conference proceedings) − Goals: 1. Interconnect resources between repositories (by topic, by location, by historical period, by ...) 2. Integrate library catalogs on global scale
  • 47. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 47 Uptake in the Life Science Domain − Goals: 1. Connect life science datasets in order to support • biological knowledge discovery • drug discovery 2. Reuse results of previous integration efforts
  • 48. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 48 Uptake in the Linguistic Research Community http://linguistic-lod.org/llod-cloud http://www.lider-project.eu/
  • 49. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 49 Ontological Agreement − Strong agreement on some vocabularies − Proprietary vocabularies are used in addition to common ones, as data is often very specific Widely-Used Vocabularies Proprietary Vocabularies
  • 50. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 50 RDF Links − Some datasets put a lot of effort into linking − Many datasets only link to a small number of other datasets or do not set RDF links at all Datasets with Top In-Degrees Out-Degrees per Category
  • 51. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 51 RDF Links in the LOD Cloud: August 2014
  • 52. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 52
  • 53. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 53
  • 54. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 54 Linked Data as Background Knowledge for Data Mining Which factors correlate with unemployment in France?
  • 55. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 55 Unemployment Table with Additional Attributes
  • 56. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 56 RapidMiner Linked Open Data Extension Allows you to 1. link a local table to LOD data sources 2. extend the local table with additional attributes 3. mine the extended tables using all RapidMiner features
  • 57. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 57 Finding Correlations − Use additional attributes to find interesting correlations − Example correlation for unemployment in France: • African islands, islands in the Indian Ocean, outermost regions of the EU (positive) • Population growth (positive) • Energy consumption (negative) • Hospital beds/inhabitants (negative) • Fast food restaurants (positive) • Police stations (positive) Source: Petar Ristoski, Christian Bizer, and Heiko Paulheim: Mining the Web of Linked Data with RapidMiner. Semantic Web Challenge, Winner of the Open Track, 2014.
  • 58. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 58 Commercial Applications: Content Management at BBC − Interconnect content management systems of different TV and radio stations. − Similar efforts to connect content repositories at Elsevier and Springer. Source: http://www.w3.org/2001/sw/sweo/public/UseCases/BBC
  • 59. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 59 Commercial Applications: Application Integration at IBM − IBM Rational uses Linked Data technologies to connect data from different • software development tools • software lifecycle tools − Goals: 1. Make data independent of concrete tool (IBM or third party) 2. Allow services (reporting, discovery) to access data from all tools 3. Distributed data space as an alternative to central repository or integration hub / bus Source: http://www.w3.org/2001/sw/sweo/public/UseCases/IBM
  • 60. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 60 Conclusion: Linked Data vs. HTML-embedded Data
    Linked Data                                Microdata, Microformats, RDFa
    ~ 1000 sources                             millions of sources
    covers wider range of specific topics      focused on search engines and Facebook
    more complex data structures               very simple and shallow data structures
    partial ontology agreement                 strong ontology agreement
    data integration eased by RDF links        data integration often requires NLP techniques
    various application prototypes,            strong application pull by search engines
    some industrial uptake
  • 61. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 61 3. Knowledge Graphs − Google Knowledge Graph • development started 2012, builds on Freebase • 570 million objects described by over 18 billion facts (2012) • 1500 classes, 35,000 properties − Microsoft Satori Knowledge Base • revealed to the public in mid-2013 − Yahoo Knowledge Graph • revealed to the public early-2014 − Knowledge Graphs employ RDF-style graph data models Large cross-domain knowledge bases which aim to cover all “relevant” entities in the world.
  • 62. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 62 Data Sources used to Build Knowledge Graphs 1. Wikipedia • infoboxes, category system, information extraction from text 2. Open license sources • e.g. CIA World Factbook, MusicBrainz, … 3. Commercial third-party data • e.g. IMDB, company listings, … 4. schema.org annotations in web pages • e.g. contact information for companies • e.g. logos of companies Lots of effort is spent on data integration and manual data curation
  • 63. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 63 Application of the Google Knowledge Graph − Enrich search results with knowledge cards and lists − Goal: Fulfil information need without having users navigate to other websites
  • 64. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 64 Application of the Microsoft Knowledge Graph
  • 65. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 65 Applications of the Google Knowledge Graph 1. Answer fact queries: “birthdate michael douglas” 2. Compare things: “compare eiffel tower vs empire state building”
  • 66. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 66 Google Now Smart Cards − Direct answers are especially important in the mobile context − Google Now displays direct answers for 19.45% of the queries (Source: Stone Temple Consulting, 2015) − Medical facts are reviewed by an average of 11.1 doctors (Source: Google)
  • 67. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 67 New SEO Topic: How to influence Knowledge Graphs? Source: http://searchengineland.com/ leveraging-wikidata-gain-google- knowledge-graph-result-219706
  • 68. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 68 Behind-the-Scenes Applications − Google • uses its knowledge graph to identify entities in web pages (Entity Linking) • Hummingbird ranking algorithm (deployed in 2013) uses knowledge graph as background knowledge for ranking search results. − Yahoo • uses its knowledge graph to “support applications across the company: • Web Search, Content Understanding • Recommendation, Personalization, Advertisement”* − Data Integration • becomes matching data sources against knowledge graphs as intermediate schemata. Various tasks become easier if you know all entities in the world. *Source: Nicolas Torzec, Yahoo 2014
  • 69. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 69 Public Knowledge Graphs
  • 70. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 70 The DBpedia Knowledge Base - Version 2014 − Describes 4.58 million things, out of which 4.22 million are classified in a consistent ontology using 685 classes and 2679 different properties • 1,445,000 persons • 735,000 places • 241,000 organizations • 123,000 music albums − Altogether 3 billion pieces of information (RDF triples) • 580 million were extracted from the English edition of Wikipedia • 29,000,000 links to external web pages • 50,000,000 external links into other RDF datasets − DBpedia Internationalization • provides data from 125 Wikipedia language editions for download • For 28 popular languages DBpedia provides cleaned infobox data
  • 71. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 71 DBpedia @ BIS2015 1. Thursday, 10:00 The Past, Present & Future of DBpedia Keynote by Dimitris Kontokostas 2. Thursday, 10:45 4th DBpedia Community Meeting Room 2
  • 72. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 72 Google Knowledge Vault − Research project to build a knowledge base using facts extracted from 1 billion web pages 1. Web text (TXT): Entity linking, relationship extraction 2. HTML trees (DOM): Wrapper induction 3. HTML tables (TBL): Relational tables 4. Semantic Annotations (ANO): schema.org, OGP − Employs probabilistic model for data fusion − Results: 1.6 billion facts • 271 million with confidence >90% • 90 million not in Freebase Source: Luna Dong, Evgeniy Gabrilovich, et al.: Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion. In SIGKDD, 2014.
  • 73. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 73 Data Sources for Public Research in this Space 1. Common Crawl • ~ 2 billion HTML pages • updated every couple of months 2. WebDataCommons HTML Tables Corpus • 147 million relational web tables • selected out of the 11 billion tables contained in the Common Crawl • http://webdatacommons.org/webtables/ 3. WebDataCommons Microdata and RDFa Corpora • 20.4 billion RDF triples • http://www.webdatacommons.org/structureddata/ 4. Billion Triples Challenge Dataset 2014 • 4 billion RDF triples crawled from Linked Data sources • http://km.aifb.kit.edu/projects/btc-2014/
  • 74. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 74 Conclusion: 2001 Article - The Semantic Web Envisions three things to happen: 1.people publish data in structured form in addition to HTML pages on the Web 2.common vocabularies / ontologies are used to represent data 3.people implement cool applications that do smart things with the available data Tim Berners-Lee, James Hendler and Ora Lassila: The Semantic Web. Scientific American, May 2001.
  • 75. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 75 4. Conclusions 1. Publication of Structured Data • there is more data available than most people in research and industry think • especially, schema.org annotations are currently gaining traction • exciting test-bed for research on data profiling and data integration techniques 2. Ontological Agreement • exists due to application-pull (Google, Facebook) • but data source-specific attributes are also important (e.g. in life science or government statistics domain) 3. Applications • the big players are moving (Rich-Snippets, Knowledge Graphs) • there is a lot of further application potential in the available data • experimentation in industry, but many efforts are still in the prototype stage
  • 76. Bizer: Evolving the Web into a global Dataspace, BIS 2015, 24.6.2015 Slide 76 Thanks − References • Robert Meusel, Petar Petrovski and Christian Bizer: The WebDataCommons Microdata, RDFa and Microformat Dataset Series. 13th International Semantic Web Conference (ISWC2014). • Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains (Slides, Video). 13th International Semantic Web Conference (ISWC2014). • Petar Petrovski, Volha Bryl, Christian Bizer: Integrating Product Data from Websites offering Microdata Markup. 4th Workshop on Data Extraction and Object Search (DEOS2014). − Detailed statistics on RDFa, Microdata and Microformats adoption • http://www.webdatacommons.org/structureddata/ − Detailed statistics on Linked Data adoption • http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

Editor's Notes

  1. Since 2012
  2. High agreement on vocabulary. Biggest datasets in their category (288 million product descriptions, 42 million reviews)
  3. http://www.alexa.com/topsites/category/Top/Shopping Amazon Instant Video: yes, with JSON-LD
  4. Potential reason: HR databases are not structured
  5. Hotels: 60% of bookings via websites, commission 20%. Tricky legal questions involved
  6. Precision (Electronics) = 93%, Precision (Appeal) = 88%
  7. Google: 300 people Microsoft: 120 people Yahoo: 30 people
  8. Very up-to-date info (Oscar nominees)