SlideShare ist ein Scribd-Unternehmen logo
1 von 57
Slide 1
Statistical Natural Language Processing Colloquium
Universität Heidelberg, 26.11.2015
Data Search and Search Joins
Prof. Dr. Christian Bizer
Slide 2
Hello
Professor Christian Bizer
University of Mannheim
Research Topics
 Web Technologies
 Web Data Profiling
 Web Data Integration
 Web Mining
Slide 3
Data and Web Science Group @ University of Mannheim
 5 Professors
 Heiner Stuckenschmidt
 Rainer Gemulla
 Christian Bizer
 Simone Ponzetto
 Heiko Paulheim
 25 postdocs and PhD students
 http://dws.informatik.uni-mannheim.de/
1. Research methods for integrating and mining large amounts
of heterogeneous information from the Web.
2. Empirically analyze the content and structure of the Web.
Slide 4
Outline
1. Motivation for Data Search
2. Types of Data Search
1. Entity Search
2. Table Search
3. Constraint and Unconstraint Search Joins
3. The Mannheim Search Join Engine
1. Methods
2. Evaluation
4. Conclusion and Outlook
Slide 5
1. Motivation for Data Search
Deluge of structured data on the Web
1. Semantic annotations in HTML pages
2. Linked data
3. Relational HTML tables
4. Data portals
Need for search techniques that
exploit the structure of the data.
Slide 6
 ask site owners to embed schema.org
annotations into their HTML pages
 200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, …
 Encoding: Microdata or RDFa
Semantic Annotations in HTML Pages
More and more Websites semantically
markup the content of their HTML pages.
Slide 7
Usage of Schema.org Data @ Google
Data snippets
within
search results
Data snippets
within
info boxes
Slide 8
Adoption of Semantic Annotations in 2014
620 million HTML pages out of the 2 billion pages
provide semantic annotations (30%).
2.72 million pay-level-domains (PLDs) out of the 15.68
million pay-level-domains covered by the crawl provide
annotations (17%).
Google, 2014*:
5 million websites provide Schema.org data.
* Guha in LDOW2014 Keynote
WebDataCommons project extracted all Microformat,
Microdata, RDFa data from the Common Crawl 2014
Slide 9
Topical Focus – Microdata 2014
2014 2013
Class Instances # PLDs PLDs
# % # %
1 schema:WebPage 51.757.000 148,893 18,16% 69.712 15,04
2 schema:Article 54.972.000 88,7 10,82% 65.930 14,22
3 schema:Blog 3.787.000 110,663 13,50% 64.709 13,96
4 schema:Product 288.083.000 89,608 10,93% 56.388 12,16
5 schema:PostalAddress 48.804.000 101,086 12,33% 52.446 11,31
6 dv:Breadcrumb 269.088.000 76,894 9,38% 44.187 9,53
7 schema:AggregateRating 59.070.000 50,510 6,16% 36.823 7,94
8 schema:Offer 236.953.000 62,849 7,66% 35.635 7,69
9 schema:LocalBusiness 20.194.000 62,191 7,58% 35.264 7,61
10 schema:BlogPosting 11.458.000 65,397 7,98% 32.056 6,92
11 schema:Organization 101.769.000 52,733 6,43% 24.255 5,23
12 schema:Person 115.376.000 47,936 5,85% 21.107 4,55
13 schema:ImageObject 35.356.000 25,573 3,12% 16.084 3,47
14 dv:Product 12.411.000 16,003 1,95% 13.844 2,99
15 schema:Review 42.561.000 20,124 2,45% 13.137 2,83
16 dv:Review-aggregate 3.964.000 14,094 1,72% 13.075 2,82
17 dv:Organization 3.155.000 10,649 1,30% 9.582 2,07
18 dv:Offer 7.170.000 11,64 1,42% 9.298 2,01
19 dv:Address 2.138.000 9,674 1,18% 8.866 1,91
20 dv:Rating 1.732.000 9,367 1,14% 8.360 1,8
 Top Classes
 Topics:
 CMS and blog
metadata
 products and
offers
 ratings and
reviews
 business listings
 address data
 ...and a massive
long tail
schema: = Schema.org
dv: = Google Rich Snippet Vocabulary (deprecated)
Slide 10
Adoption by E-Commerce Websites
Distribution by Alexa Top-15 Shopping Sites
Top-Level Domain
TLD #PLDs
com 38344
co.uk 3605
net 1813
de 1333
pl 1273
com.br 1194
ru 1165
com.au 1062
nl 1002
Website schema:Product
Amazon.com T
Ebay.com P
NetFlix.com T
Amazon.co.uk T
Walmart.com P
etsy.com T
Ikea.com P
Bestbuy.com P
Homedepot.com P
Target.com P
Groupon.com T
Newegg.com P
Lowes.com T
Macys.com P
Nordstrom.com P
Adoption by Top-15:
60 %
Slide 11
Linked Data
B C
RDF
RDF
link
A D E
RDF
links
RDF
links
RDF
links
RDF
RDF
RDF
RDF
RDF RDF
RDF
RDF
RDF
Extends the Web with a single global data graph
1. by using RDF to publish structured data on the Web
2. by setting links between data items within different
data sources.
Slide 12
Linked Data on the Web
~ 1000 data sets (2014)
Slide 13
Relational HTML Tables
In corpus of 14B raw tables, 154M are “good” relations (1.1%).
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• WebDataCommons Web Tabel Corpus 2015: 233 million tables for public download.
Slide 14
Data Portals
Several 100.000 datasets are available via data portals.
Slide 15
Summary: Motivation for Data Search
Deluge of structured data on the Web
1. Semantic annotations in HTML pages
2. Linked data
3. Relational HTML tables
4. Data portals
Deluge of structured data on the Intranet
1. Piles of Excel files
2. Piles of CSV files
3. Other data dumps
Slide 16
2. Types of Data Search
1. Entity Search
2. Table Search
3. Constraint and Unconstraint Search Joins
Slide 17
Entity Search
 Well researched area with mass market adoption
 Techniques
 hits on different fields contribute differently to ranking
 surface forms of entity names are gathered via click log analysis
 query refinement using facets
Given some keywords and optionally some facet filters,
generate ranked list of relevant entities.
Slide 18
Table Search
 Example:
Google
Table
Search
Given some keywords generate ranked list of relevant tables.
Slide 19
Table Search: FeatureRank
Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
 Combination of query independent
and query dependent features
 Weights learned using linear regression
 Heavily weighted features
 hits left-most column
 hits header
 Result quality: fraction of high scoring relevant tables
k Naïve FeatureRank
10 0.26 0.43
20 0.33 0.56
30 0.34 0.66
Slide 20
 Improve ranking based on deeper understanding of the tables
 Subject Attribute Detection (Ventis)
 Simple heuristic approach (Accuracy: 83%)
- take first column from the left of Web tables that is not a number or a date
 SVM Classifier (Accuracy: 94%)
- fraction of cells with unique content, variance in the number of tokens in each
cell, column index from the left, …
 Header Detection (Pimplikar)
 one header row: 60%, two or more header rows: 22%, no header: 18%
 Match tables against large-scale knowledge base (Ventis)
 Web IsA database for types, TextRunner for relations
Ventis, et al.: Recovering Semantics of Tables on the Web. VLDB 2011.
Pimplikar, Sarawagi: Answering table queries on the web using column keywords, VLDB 2012.
Ranking using Deeper Understanding of Tables
Slide 21
Microsoft Power Query for Excel
 Excel add-in offering table search
 Provides various adapters to index intranet and web data
Excel files
CVS files
XML files
Text files
SQL databases
Azure marketplace
Sharepoint lists
HDFS
Facebook API
SAP BI Universes
Salesforce Objects
Slide 22
Shortcoming: Manual Data Integration Required
Slide 23
Constraint Search Join
No. Region Unemployment
1 Alsace 11 %
2 Lorraine 12 %
3 Guadeloupe 28 %
4 Centre 10 %
5 Martinique 25 %
… … …
GDP per Capita
45.914 €
51.233 €
19.810 €
59.502 €
21,527 €
…
+
Region „GDP per Capita“
A Constraint Search Join is a join operation which extends
a local table with an additional user-specified attribute
based on a large corpus of heterogeneous tabular data.
Slide 24
Assumptions on Table Corpus
1. Entity-Attributes-Tables
 One entity per row
2. Subject Attribute
 name of the entity
 string, no number or other data type
 relatively unique values
 used as pseudo-key
Rank Film Studio Director Length
1. Star Wars –Episode 1 Lucasfilm George Lucas 121 min
2. Alien Brandwine Ridley Scott 117 min
3. Black Moon NEF Louis Malle 100 min
Slide 25
Input/Output: Constraint Search Join
 Input:
1. Corpus of heterogeneous tables
2. Query table
3. Subject attribute specification
4. Keywords describing extension attribute
 Output:
 Query table complemented with extension attribute
Slide 26
Elements of a Search Join
s() : Search operator determines the set of relevant tables
m() : MultiJoin operator performs a series of left-outer joins
between the query table and all relevant tables
c() : Consolidation operator fuses attribute values in order to
return a concise result table containing high-quality data
𝑅 = 𝑐(𝑚 𝑞, 𝑠 𝑞,𝑠,𝑎 𝑇 𝑤𝑒𝑏 )
Slide 27
The Search Operator
 Dimension of Relevancy
1. Right Attribute
 table should contain the extension attribute
 requires schema matching
2. Entity Coverage
 table should cover many entities from the query table
 requires identity resolution
3. Timeliness
 data should refer to desired point in time
4. Trustworthiness
 data should be trustworthy
The search operator determines the set of relevant tables.
Slide 28
Multi-Join Operator
 Input
q = query table
Tr = set of relevant tables
 Output
te = extended query table
The multi-join operator performs a series of left-outer joins on the
subject attribute between the query table and all relevant tables.
No. Region
1 Alsace
2 Lorraine
3 Guadeloupe
4 Centre
GDP
45.914 €
51.233 €
NULL
NULL
GDP per
Capita
45.000 €
NULL
19.000 €
59.500 €
GDP (2012)
45.363
NULL
NULL
59.230
GDP [current,
US$]
48,900 $
NULL
20,200 $
63,260 $
Slide 29
Consolidation Operator
 Input
te = extended query table
 Output
tr = result table
 The operator might employ various conflict resolution functions
 majority vote
 clustered vote
 considering timeliness of data
 considering source trustworthiness
 ….
The consolidation operator fuses attribute values in order to
return a concise result table containing high-quality data.
No Region GDP (current, €)
1 Alsace 45.914
2 Lorraine 51.233
3 Guadeloupe 19.000
4 Centre 59.500
Slide 30
Infogather
 Prototype developed by Microsoft Research
 Data corpus: 573 million HTML tables from Bing Crawl (2011)
 Splits HTML Tables into Binary-Entity-Attribute Tables
Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic
Matching with Web Tables. SIGMOD 2012.
Region Unemployment
Alsace 11 %
Lorraine 12 %
Guadeloupe 28 %
Region GDP per Capita
Alsace 45.914 €
Lorraine 51.233 €
Guadeloupe 19.810 €
Subject Attribute Subject Attribute
Slide 31
Matching Graph
 Pre-compute matching graph between tables
 Features used for matching
1. Key values overlap
2. Attribute label similarity
3. Attribute values similarity
4. Textual context around tables similarity
5. Table to context similarity
6. Table to table as bag of words similarity
7. URL similarity
8. column width similarity
 Accuracy of the resulting correspondences
 Cameras and movies: 0.95
 Governors and members of parliament: 0.5
Model Brand
S80 Nikon
Easyshare CD44 Kodak
DSC W570 Sony
Optio E60 Pentax
Model Brand
SP 600UZ Olympus
DSC W570 Sony
GX-1S Samsung
A10 Canon
Part No Mfg
DSC W570 Sony
T1460 Benq
Optio E60 Pentax
S8100 Nikon
0.3
0.25
0.5
Slide 32
Query Processing
 Input: Query table, keywords describing extension attribute
 Web tables considered relevant for the query
 share subject attribute value with query table
 and directly match extension attribute
 or are connected to directly matching tables via the matching graph
 Matching score
 directly matching tables: entity overlap / min( l tq l , l tw l )
 indirectly matching tables: propagate score along edges of matching graph
 Predict values
 cluster by value
 sum matching scores per cluster
 choose centroid of cluster with highest score or top-k centroids
Slide 33
Model Brand
S80 ??
T1460 ??
GX-1S ??
A10 ??
Query
Table
Model Brand
S80 Canon
Easyshare CD44 Kodak
DSC W570 Sony
Optio E60 Pentax
Model Brand
S80 Benq
A10 Innostream
A460 Samsung
S710 HTC
0.25
Model Brand
SP 600UZ Olympus
DSC W570 Sony
GX-1S Samsung
A10 Canon
0.5
0.5
Part No Mfg
DSC W570 Sony
T1460 Benq
Optio E60 Pentax
S8100 Nikon
0.4
0.25
0.2
0.25
Direct
matching
tables
Indirect
matching
tables
Slide 34
 Queries
 Ground Truth
 Camera, movies: Bing shopping database
 Baseball, Albums, UK-pm, US-gov: Wikipedia, Freebase
 Size of query table
 12 to 6000 rows
Evaluation
Query Subject Attribute Extension Attribute
Cameras Camera model Brand
Movies Movie name Director
Baseball Team name Player
Albums Musical band Album
UK-pm UK political party Member of parliament
US-gov US state Governor
Slide 35
Evaluation Results
Query Subject attribute Extension Attribute Precision Coverage
Cameras Camera model Brand 0.85 0.93
Movies Movie name Director 0.92 0.97
Baseball Team name Player 0.72 1.00
Albums Musical band Album 0.75 1.00
UK-pm UK political party Member of parliament 0.60 0.91
US-gov US state Governor 0.90 1.00
AVG 0.79 0.97
Slide 36
Shortcoming
 Users need to know the extension attribute beforehand
 No discovery of additional relevant attributes
Slide 37
Unconstraint Search Join
Given a local table and a subject attribute, add all attributes to
the local table that can be filled beyond a density threshold.
No. Region Unemp.
Rate
1 Alsace 11 %
2 Lorraine 12 %
3 Guadeloupe 28 %
4 Centre 10 %
5 Martinique 25 %
… … …
GDP
per Capita
Population
Growth
Overseas
department
…
45.914 € 0,16 % No …
51.233 € -0,05 % No …
19.810 € 1,34 % Yes …
59.502 € NULL NULL …
NULL 2,64 % Yes …
… … …
+
Region density >= 0.8
Slide 38
Features
1. Extend local table with additional attributes from a Linked Data Cloud
2. Follows RDF links into other Linked Data sources
3. Provides for mining extended table using all RapidMiner features
http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/
Prototype: RapidMiner Linked Open Data Extension
Input Table Link Additional Attributes
Slide 39
Resulting Correlations
 Population growth (positive)
 Fast food restaurants (positive)
 Police stations (positive)
 Overseas Department (positive)
 Hospital beds/inhabitants (negative)
 GDP (negative)
 Energy consumption (negative)
Petar Ristoski, Christian Bizer, Heiko Paulheim: Mining the Web of Linked Data with
RapidMiner. Journal of Web Semantics, 2015.
Slide 40
Shortcoming
Requires RDF links
Not applicable to other forms of Web data
(HTML tables, Microdata annotations, data portals, …)
Slide 41
3. The Mannheim Search Join Engine (MSJE)
Collection of tables Table Normalization
Table Storage Table Index
1. Table
Indexing
Input query table Table Preprocessing
Search
2. Table
Search
3. Data
Consolidation
Data collection
User Preferences
Consolidation
MultiJoin Top k Candidates
 Provides for constraint and unconstraint search joins
http://searchjoins.webdatacommons.org
Slide 42
The Search Operator
 Table Ranking
 Subject column overlap
 Two variations:
1. Exact match of tokens
2. Similarity match of tokens
using FastJoin library (fuzzy Jaccard)
 Select TopK Tables
 We use Top 1000 tables in the single column
experiments
All tables that describe the entities from the query table
are relevant.
Relevant
Slide 43
Collective Disambiguation
City Country
Munich Germany
Berlin Germany
Mannheim Germany
Frankfurt Germany
Karlsruhe Germany
City Country
Madison USA
Berlin USA
Chatham USA
Fort Reed USA
Sunville USA
City
Berlin
Madison
Chatham
Perth
Query Table
Web Table 1
Web Table 2
𝒔𝒄𝒐 =
𝟏
𝟒
𝒔𝒄𝒐 =
𝟑
𝟒
Slide 44
Multi-Join Operator
The MultiJoin operator performs a series of left-outer joins
between the query table and all relevant tables.
No. Region
1 Alsace
2 Lorraine
3 Guadeloupe
4 Centre
Unemploy
11 %
12 %
28 %
10 %
Unemp
NULL
NULL
26 %
9.4 %
GDP
45.914 €
51.233 €
NULL
NULL
GDP per C
45.000 €
NULL
19.000 €
59.500 €
Slide 45
Consolidation Operator
 Column Matching
 Combination of label- and instance-based techniques
 Conflict Resolution
1. cluster similar values
2. choose largest cluster
3. choose most frequent value
(alternative: calculate mean)
The consolidation operator merges corresponding columns
and fuses values in order to return a concise result table.
No Region Unemploy GDP
1 Alsace 11 % 45.914 €
2 Lorraine 12 % 51.233 €
3 Guadelo
upe
28 % 19.000 €
4 Centre 10 % 59.500 €
Slide 46
Evaluation: Types of Web Data Used
Microdata
(schema.org)
Wiki Tables
HTML Tables
Linked Data
 Microdata: Web Data Commons Corpus. http://webdatacommons.org/structureddata/
 HTML Tables: Web Data Commons Table Corpus. http://webdatacommons.org/webtables/
 Wiki Tables: Northwestern University Corpus. http://downey-n1.cs.northwestern.edu/public/
 Linked Data: Billion Triples Challenge 2014: http://km.aifb.kit.edu/projects/btc-2014/
Slide 47
Corpus Statistics
 Data originates from over 1 million different websites
 Tables filtered out in preprocessing
 less than 5 rows, less than 3 columns
 no subject attribute
Slide 48
Evaluation Results: Constraint Search Joins
 Size of query tables: 50-200 rows
 Ground truth: IMDB, Wikipedia
Slide 49
Example Result: Extend with Single Column
Slide 50
Provenance Summary
Slide 51
Provenance Details
http://searchjoins.webdatacommons.org/
Slide 52
Evaluation Results: Unconstraint Search Joins
Parameters
 Minimum attribute density: 50%
 Similarity threshold for merging attributes:
0.8 for string attributes and 0.4 for numeric attributes
Slide 53
Reduced Example: Unconstraint Search Joins
Slide 54
Provenance Summary
Slide 55
Summary: Search Join Systems
• Cafarella, et al.: Data Integration for the Relational Web. VLDB 2009.
• Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic
Matching with Web Tables. SIGMOD 2012.
• Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013.
• Lehmberg, et al.: The Mannheim Search Join Engine. Journal of Web Semantics 2015.
Octopus InfoGather WikiTables MSJ Engine
Developer Google Research
University Washington
Microsoft Research
Purdue University
Northwestern
University
University of
Mannheim
Extend
Operation
Single Attribute Single Attribute Multiple Attributes Single Attribute
Multiple Attributes
Corpus Google Web crawl
via Search API
HTML tables from Bing
Web crawl
Tables from
Wikipedia
WDC Web Tables
WDC Microdata
WikiTables data
Linked Data
Slide 56
4. Conclusions and Outlook
 Conclusions
 Search Joins bring together Web Search and DB Joins
 The prototypes show that simple joins are feasible
 Different types of Web data contribute to different queries
 Outlook
 Application of search joins in specialized domains
- knowledge base extension (DBpedia)
- mining intranet data (together with RapidMiner as industry partner)
 How to deal with the overwhelming number of new attributes?
Slide 57
Idea: Schema Complement
 What should be considered “consistent”?
 Possible answer: Combinations of local and Web attributes that
often occur together on the Web.
 Approach
 Generate frequent item set database from Web table schema corpus.
 Retrieve frequency of all combinations of one local and Web attribute.
 Aggregate frequencies for Web attributes.
 Not investigated yet: Combination of “consistency” with other
quality dimensions like “density” and “trustworthiness”.
• Das Sarma, et al.: Finding related tables. SIGMOD 2012.
Add attributes while preserving “consistency” of the schema.

Weitere ähnliche Inhalte

Andere mochten auch

Construction Breakfast Series: Leasing changes will affect your operations. A...
Construction Breakfast Series: Leasing changes will affect your operations. A...Construction Breakfast Series: Leasing changes will affect your operations. A...
Construction Breakfast Series: Leasing changes will affect your operations. A...CBIZ, Inc.
 
Pen and ink on sculpted paper art by Col Mitchell
Pen and ink on sculpted paper art by Col MitchellPen and ink on sculpted paper art by Col Mitchell
Pen and ink on sculpted paper art by Col MitchellCol Mitchell
 
Writing for SEO 101
Writing for SEO 101Writing for SEO 101
Writing for SEO 101Ben Brooks
 
Create Connection: #IABCLI
Create Connection: #IABCLICreate Connection: #IABCLI
Create Connection: #IABCLIMichael Ambjorn
 
Marketing Analytics Report - Aditya, Justin, Kevin and Shweta
Marketing Analytics Report - Aditya, Justin, Kevin and ShwetaMarketing Analytics Report - Aditya, Justin, Kevin and Shweta
Marketing Analytics Report - Aditya, Justin, Kevin and ShwetaKevin Jeffrey
 
Mixed Media Collage with Paint and an Open Heart
Mixed Media Collage with Paint and an Open HeartMixed Media Collage with Paint and an Open Heart
Mixed Media Collage with Paint and an Open Heartglennhirsch
 
Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜
Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜
Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜Mariko Takatori
 
Coworking Europe: Office 4.0
Coworking Europe: Office 4.0Coworking Europe: Office 4.0
Coworking Europe: Office 4.0Vishal Gupta
 
Harvard Business School - Lean Startup Strategy
Harvard Business School  - Lean Startup Strategy Harvard Business School  - Lean Startup Strategy
Harvard Business School - Lean Startup Strategy N Pandya
 
Käthe Kollwitz Krieg Freiwillige
Käthe Kollwitz Krieg FreiwilligeKäthe Kollwitz Krieg Freiwillige
Käthe Kollwitz Krieg FreiwilligeDimitri Kokkonis
 
La historia de la tierra
La historia de la tierraLa historia de la tierra
La historia de la tierramerchealari
 

Andere mochten auch (17)

Union europea
Union europeaUnion europea
Union europea
 
Construction Breakfast Series: Leasing changes will affect your operations. A...
Construction Breakfast Series: Leasing changes will affect your operations. A...Construction Breakfast Series: Leasing changes will affect your operations. A...
Construction Breakfast Series: Leasing changes will affect your operations. A...
 
Pen and ink on sculpted paper art by Col Mitchell
Pen and ink on sculpted paper art by Col MitchellPen and ink on sculpted paper art by Col Mitchell
Pen and ink on sculpted paper art by Col Mitchell
 
Branding • WP Group
Branding • WP GroupBranding • WP Group
Branding • WP Group
 
Karat Marketing and Services, LLC ORM PowerPoint
Karat Marketing and Services, LLC ORM PowerPointKarat Marketing and Services, LLC ORM PowerPoint
Karat Marketing and Services, LLC ORM PowerPoint
 
Procedimiento de acciòn correctiva y preventiva
Procedimiento de acciòn correctiva y preventivaProcedimiento de acciòn correctiva y preventiva
Procedimiento de acciòn correctiva y preventiva
 
Writing for SEO 101
Writing for SEO 101Writing for SEO 101
Writing for SEO 101
 
Create Connection: #IABCLI
Create Connection: #IABCLICreate Connection: #IABCLI
Create Connection: #IABCLI
 
Marketing Analytics Report - Aditya, Justin, Kevin and Shweta
Marketing Analytics Report - Aditya, Justin, Kevin and ShwetaMarketing Analytics Report - Aditya, Justin, Kevin and Shweta
Marketing Analytics Report - Aditya, Justin, Kevin and Shweta
 
Mixed Media Collage with Paint and an Open Heart
Mixed Media Collage with Paint and an Open HeartMixed Media Collage with Paint and an Open Heart
Mixed Media Collage with Paint and an Open Heart
 
Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜
Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜
Poiboy for android 〜激短(1ヶ月)制作フローと気づき〜
 
Global home-care-report-april-2016
Global home-care-report-april-2016Global home-care-report-april-2016
Global home-care-report-april-2016
 
Coworking Europe: Office 4.0
Coworking Europe: Office 4.0Coworking Europe: Office 4.0
Coworking Europe: Office 4.0
 
Harvard Business School - Lean Startup Strategy
Harvard Business School  - Lean Startup Strategy Harvard Business School  - Lean Startup Strategy
Harvard Business School - Lean Startup Strategy
 
differently abled
differently ableddifferently abled
differently abled
 
Käthe Kollwitz Krieg Freiwillige
Käthe Kollwitz Krieg FreiwilligeKäthe Kollwitz Krieg Freiwillige
Käthe Kollwitz Krieg Freiwillige
 
La historia de la tierra
La historia de la tierraLa historia de la tierra
La historia de la tierra
 

Ähnlich wie Data Search and Search Joins (Universität Heidelberg 2015)

Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureChris Bizer
 
Semantic Labeling for Quantitative Data using Wikidata
Semantic Labeling for Quantitative Data using WikidataSemantic Labeling for Quantitative Data using Wikidata
Semantic Labeling for Quantitative Data using WikidataPhuc Nguyen
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Futurefeiwin
 
Asp.net Lab manual
Asp.net Lab manualAsp.net Lab manual
Asp.net Lab manualTamil Dhasan
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web CorpusRobert Meusel
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdfpaijitk
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detectionMostafaAliAbbas
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsChris Bizer
 
Excel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for AuditorsExcel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for AuditorsJim Kaplan CIA CFE
 
Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques
Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques
Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques Jim Kaplan CIA CFE
 
SAP BO Dashboard Training Online
SAP BO Dashboard Training OnlineSAP BO Dashboard Training Online
SAP BO Dashboard Training Onlineashok training
 
Quick & Easy Data Visualization with Google Visualization API + Google Char...
Quick & Easy Data Visualization with Google Visualization API + Google Char...Quick & Easy Data Visualization with Google Visualization API + Google Char...
Quick & Easy Data Visualization with Google Visualization API + Google Char...Bohyun Kim
 
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Chris Bizer
 

Ähnlich wie Data Search and Search Joins (Universität Heidelberg 2015) (20)

Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
Semantic Labeling for Quantitative Data using Wikidata
Semantic Labeling for Quantitative Data using WikidataSemantic Labeling for Quantitative Data using Wikidata
Semantic Labeling for Quantitative Data using Wikidata
 
Data Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and FutureData Mining and the Web_Past_Present and Future
Data Mining and the Web_Past_Present and Future
 
Diadem 1.0
Diadem 1.0Diadem 1.0
Diadem 1.0
 
Asp.net Lab manual
Asp.net Lab manualAsp.net Lab manual
Asp.net Lab manual
 
Mining a Large Web Corpus
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web Corpus
 
Lecture_1_Intro.pdf
Lecture_1_Intro.pdfLecture_1_Intro.pdf
Lecture_1_Intro.pdf
 
Database novelty detection
Database novelty detectionDatabase novelty detection
Database novelty detection
 
GraphQL Basics
GraphQL BasicsGraphQL Basics
GraphQL Basics
 
Evolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and Applications
 
Excel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for AuditorsExcel Pivot Tables and Graphing for Auditors
Excel Pivot Tables and Graphing for Auditors
 
Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques
Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques
Using MS Excel In Your Next Audit - Top Basic & Intermediate Techniques
 
Tableau training course
Tableau training courseTableau training course
Tableau training course
 
Vivarana literature survey
Vivarana literature surveyVivarana literature survey
Vivarana literature survey
 
SAP BO 4.1 Training
SAP BO 4.1 Training SAP BO 4.1 Training
SAP BO 4.1 Training
 
SAP BO Dashboard Training Online
SAP BO Dashboard Training OnlineSAP BO Dashboard Training Online
SAP BO Dashboard Training Online
 
Quick & Easy Data Visualization with Google Visualization API + Google Char...
Quick & Easy Data Visualization with Google Visualization API + Google Char...Quick & Easy Data Visualization with Google Visualization API + Google Char...
Quick & Easy Data Visualization with Google Visualization API + Google Char...
 
Qlikview Training in Bangalore by myTectra
Qlikview Training in Bangalore by myTectraQlikview Training in Bangalore by myTectra
Qlikview Training in Bangalore by myTectra
 
Qlikview Training in Bangalore by myTectra
Qlikview Training in Bangalore by myTectraQlikview Training in Bangalore by myTectra
Qlikview Training in Bangalore by myTectra
 
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
 

Mehr von Chris Bizer

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?Chris Bizer
 
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesChris Bizer
 
Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingChris Bizer
 
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebChris Bizer
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Chris Bizer
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesChris Bizer
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsChris Bizer
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Chris Bizer
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackChris Bizer
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataChris Bizer
 

Mehr von Chris Bizer (10)

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
 
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning Techniques
 
Using the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product Matching
 
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
 
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
 
Exploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web Tables
 
Adoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical Domains
 
Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications. Evolving the Web into a Global Database - Advances and Applications.
Evolving the Web into a Global Database - Advances and Applications.
 
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
 
DBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of DataDBpedia - An Interlinking Hub in the Web of Data
DBpedia - An Interlinking Hub in the Web of Data
 

Kürzlich hochgeladen

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 

Kürzlich hochgeladen (20)

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Data Search and Search Joins (Universität Heidelberg 2015)

  • 1. Slide 1 Statistical Natural Language Processing Colloquium Universität Heidelberg, 26.11.2015 Data Search and Search Joins Prof. Dr. Christian Bizer
  • 2. Slide 2 Hello Professor Christian Bizer University of Mannheim Research Topics  Web Technologies  Web Data Profiling  Web Data Integration  Web Mining
  • 3. Slide 3 Data and Web Science Group @ University of Mannheim  5 Professors  Heiner Stuckenschmidt  Rainer Gemulla  Christian Bizer  Simone Ponzetto  Heiko Paulheim  25 postdocs and PhD students  http://dws.informatik.uni-mannheim.de/ 1. Research methods for integrating and mining large amounts of heterogeneous information from the Web. 2. Empirically analyze the content and structure of the Web.
  • 4. Slide 4 Outline 1. Motivation for Data Search 2. Types of Data Search 1. Entity Search 2. Table Search 3. Constraint and Unconstraint Search Joins 3. The Mannheim Search Join Engine 1. Methods 2. Evaluation 4. Conclusion and Outlook
  • 5. Slide 5 1. Motivation for Data Search Deluge of structured data on the Web 1. Semantic annotations in HTML pages 2. Linked data 3. Relational HTML tables 4. Data portals Need for search techniques that exploit the structure of the data.
  • 6. Slide 6  ask site owners to embed schema.org annotations into their HTML pages  200+ Classes: Product, Review, LocalBusiness, Person, Place, Event, …  Encoding: Microdata or RDFa Semantic Annotations in HTML Pages More and more Websites semantically markup the content of their HTML pages.
  • 7. Slide 7 Usage of Schema.org Data @ Google Data snippets within search results Data snippets within info boxes
  • 8. Slide 8 Adoption of Semantic Annotations in 2014 620 million HTML pages out of the 2 billion pages provide semantic annotations (30%). 2.72 million pay-level-domains (PLDs) out of the 15.68 million pay-level-domains covered by the crawl provide annotations (17%). Google, 2014*: 5 million websites provide Schema.org data. * Guha in LDOW2014 Keynote WebDataCommons project extracted all Microformat, Microdata, RDFa data from the Common Crawl 2014
  • 9. Slide 9 Topical Focus – Microdata 2014 2014 2013 Class Instances # PLDs PLDs # % # % 1 schema:WebPage 51.757.000 148,893 18,16% 69.712 15,04 2 schema:Article 54.972.000 88,7 10,82% 65.930 14,22 3 schema:Blog 3.787.000 110,663 13,50% 64.709 13,96 4 schema:Product 288.083.000 89,608 10,93% 56.388 12,16 5 schema:PostalAddress 48.804.000 101,086 12,33% 52.446 11,31 6 dv:Breadcrumb 269.088.000 76,894 9,38% 44.187 9,53 7 schema:AggregateRating 59.070.000 50,510 6,16% 36.823 7,94 8 schema:Offer 236.953.000 62,849 7,66% 35.635 7,69 9 schema:LocalBusiness 20.194.000 62,191 7,58% 35.264 7,61 10 schema:BlogPosting 11.458.000 65,397 7,98% 32.056 6,92 11 schema:Organization 101.769.000 52,733 6,43% 24.255 5,23 12 schema:Person 115.376.000 47,936 5,85% 21.107 4,55 13 schema:ImageObject 35.356.000 25,573 3,12% 16.084 3,47 14 dv:Product 12.411.000 16,003 1,95% 13.844 2,99 15 schema:Review 42.561.000 20,124 2,45% 13.137 2,83 16 dv:Review-aggregate 3.964.000 14,094 1,72% 13.075 2,82 17 dv:Organization 3.155.000 10,649 1,30% 9.582 2,07 18 dv:Offer 7.170.000 11,64 1,42% 9.298 2,01 19 dv:Address 2.138.000 9,674 1,18% 8.866 1,91 20 dv:Rating 1.732.000 9,367 1,14% 8.360 1,8  Top Classes  Topics:  CMS and blog metadata  products and offers  ratings and reviews  business listings  address data  ...and a massive long tail schema: = Schema.org dv: = Google Rich Snippet Vocabulary (deprecated)
  • 10. Slide 10 Adoption by E-Commerce Websites Distribution by Alexa Top-15 Shopping Sites Top-Level Domain TLD #PLDs com 38344 co.uk 3605 net 1813 de 1333 pl 1273 com.br 1194 ru 1165 com.au 1062 nl 1002 Website schema:Product Amazon.com T Ebay.com P NetFlix.com T Amazon.co.uk T Walmart.com P etsy.com T Ikea.com P Bestbuy.com P Homedepot.com P Target.com P Groupon.com T Newegg.com P Lowes.com T Macys.com P Nordstrom.com P Adoption by Top-15: 60 %
  • 11. Slide 11 Linked Data B C RDF RDF link A D E RDF links RDF links RDF links RDF RDF RDF RDF RDF RDF RDF RDF RDF Extends the Web with a single global data graph 1. by using RDF to publish structured data on the Web 2. by setting links between data items within different data sources.
  • 12. Slide 12 Linked Data on the Web ~ 1000 data sets (2014)
  • 13. Slide 13 Relational HTML Tables In corpus of 14B raw tables, 154M are “good” relations (1.1%). • Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008. • WebDataCommons Web Tabel Corpus 2015: 233 million tables for public download.
  • 14. Slide 14 Data Portals Several 100.000 datasets are available via data portals.
  • 15. Slide 15 Summary: Motivation for Data Search Deluge of structured data on the Web 1. Semantic annotations in HTML pages 2. Linked data 3. Relational HTML tables 4. Data portals Deluge of structured data on the Intranet 1. Piles of Excel files 2. Piles of CSV files 3. Other data dumps
  • 16. Slide 16 2. Types of Data Search 1. Entity Search 2. Table Search 3. Constraint and Unconstraint Search Joins
  • 17. Slide 17 Entity Search  Well researched area with mass market adoption  Techniques  hits on different fields contribute differently to ranking  surface forms of entity names are gathered via click log analysis  query refinement using facets Given some keywords and optionally some facet filters, generate ranked list of relevant entities.
  • 18. Slide 18 Table Search  Example: Google Table Search Given some keywords generate ranked list of relevant tables.
  • 19. Slide 19 Table Search: FeatureRank Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.  Combination of query independent and query dependent features  Weights learned using linear regression  Heavily weighted features  hits left-most column  hits header  Result quality: fraction of high scoring relevant tables k Naïve FeatureRank 10 0.26 0.43 20 0.33 0.56 30 0.34 0.66
  • 20. Slide 20  Improve ranking based on deeper understanding of the tables  Subject Attribute Detection (Ventis)  Simple heuristic approach (Accuracy: 83%) - take first column from the left of Web tables that is not a number or a date  SVM Classifier (Accuracy: 94%) - fraction of cells with unique content, variance in the number of tokens in each cell, column index from the left, …  Header Detection (Pimplikar)  one header row: 60%, two or more header rows: 22%, no header: 18%  Match tables against large-scale knowledge base (Ventis)  Web IsA database for types, TextRunner for relations Ventis, et al.: Recovering Semantics of Tables on the Web. VLDB 2011. Pimplikar, Sarawagi: Answering table queries on the web using column keywords, VLDB 2012. Ranking using Deeper Understanding of Tables
  • 21. Slide 21 Microsoft Power Query for Excel  Excel add-in offering table search  Provides various adapters to index intranet and web data Excel files CVS files XML files Text files SQL databases Azure marketplace Sharepoint lists HDFS Facebook API SAP BI Universes Salesforce Objects
  • 22. Slide 22 Shortcoming: Manual Data Integration Required
  • 23. Slide 23 Constraint Search Join No. Region Unemployment 1 Alsace 11 % 2 Lorraine 12 % 3 Guadeloupe 28 % 4 Centre 10 % 5 Martinique 25 % … … … GDP per Capita 45.914 € 51.233 € 19.810 € 59.502 € 21,527 € … + Region „GDP per Capita“ A Constraint Search Join is a join operation which extends a local table with an additional user-specified attribute based on a large corpus of heterogeneous tabular data.
  • 24. Slide 24 Assumptions on Table Corpus 1. Entity-Attributes-Tables  One entity per row 2. Subject Attribute  name of the entity  string, no number or other data type  relatively unique values  used as pseudo-key Rank Film Studio Director Length 1. Star Wars –Episode 1 Lucasfilm George Lucas 121 min 2. Alien Brandwine Ridley Scott 117 min 3. Black Moon NEF Louis Malle 100 min
  • 25. Slide 25 Input/Output: Constraint Search Join  Input: 1. Corpus of heterogeneous tables 2. Query table 3. Subject attribute specification 4. Keywords describing extension attribute  Output:  Query table complemented with extension attribute
  • 26. Slide 26 Elements of a Search Join s() : Search operator determines the set of relevant tables m() : MultiJoin operator performs a series of left-outer joins between the query table and all relevant tables c() : Consolidation operator fuses attribute values in order to return a concise result table containing high-quality data 𝑅 = 𝑐(𝑚 𝑞, 𝑠 𝑞,𝑠,𝑎 𝑇 𝑤𝑒𝑏 )
  • 27. Slide 27 The Search Operator  Dimension of Relevancy 1. Right Attribute  table should contain the extension attribute  requires schema matching 2. Entity Coverage  table should cover many entities from the query table  requires identity resolution 3. Timeliness  data should refer to desired point in time 4. Trustworthiness  data should be trustworthy The search operator determines the set of relevant tables.
  • 28. Slide 28 Multi-Join Operator  Input q = query table Tr = set of relevant tables  Output te = extended query table The multi-join operator performs a series of left-outer joins on the subject attribute between the query table and all relevant tables. No. Region 1 Alsace 2 Lorraine 3 Guadeloupe 4 Centre GDP 45.914 € 51.233 € NULL NULL GDP per Capita 45.000 € NULL 19.000 € 59.500 € GDP (2012) 45.363 NULL NULL 59.230 GDP [current, US$] 48,900 $ NULL 20,200 $ 63,260 $
  • 29. Slide 29 Consolidation Operator  Input te = extended query table  Output tr = result table  The operator might employ various conflict resolution functions  majority vote  clustered vote  considering timeliness of data  considering source trustworthiness  …. The consolidation operator fuses attribute values in order to return a concise result table containing high-quality data. No Region GDP (current, €) 1 Alsace 45.914 2 Lorraine 51.233 3 Guadeloupe 19.000 4 Centre 59.500
  • 30. Slide 30 Infogather  Prototype developed by Microsoft Research  Data corpus: 573 million HTML tables from Bing Crawl (2011)  Splits HTML Tables into Binary-Entity-Attribute Tables Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012. Region Unemployment Alsace 11 % Lorraine 12 % Guadeloupe 28 % Region GDP per Capita Alsace 45.914 € Lorraine 51.233 € Guadeloupe 19.810 € Subject Attribute Subject Attribute
  • 31. Slide 31 Matching Graph  Pre-compute matching graph between tables  Features used for matching 1. Key values overlap 2. Attribute label similarity 3. Attribute values similarity 4. Textual context around tables similarity 5. Table to context similarity 6. Table to table as bag of words similarity 7. URL similarity 8. column width similarity  Accuracy of the resulting correspondences  Cameras and movies: 0.95  Governors and members of parliament: 0.5 Model Brand S80 Nikon Easyshare CD44 Kodak DSC W570 Sony Optio E60 Pentax Model Brand SP 600UZ Olympus DSC W570 Sony GX-1S Samsung A10 Canon Part No Mfg DSC W570 Sony T1460 Benq Optio E60 Pentax S8100 Nikon 0.3 0.25 0.5
  • 32. Slide 32 Query Processing  Input: Query table, keywords describing extension attribute  Web tables considered relevant for the query  share subject attribute value with query table  and directly match extension attribute  or are connected to directly matching tables via the matching graph  Matching score  directly matching tables: entity overlap / min( l tq l , l tw l )  indirectly matching tables: propagate score along edges of matching graph  Predict values  cluster by value  sum matching scores per cluster  choose centroid of cluster with highest score or top-k centroids
  • 33. Slide 33 Model Brand S80 ?? T1460 ?? GX-1S ?? A10 ?? Query Table Model Brand S80 Canon Easyshare CD44 Kodak DSC W570 Sony Optio E60 Pentax Model Brand S80 Benq A10 Innostream A460 Samsung S710 HTC 0.25 Model Brand SP 600UZ Olympus DSC W570 Sony GX-1S Samsung A10 Canon 0.5 0.5 Part No Mfg DSC W570 Sony T1460 Benq Optio E60 Pentax S8100 Nikon 0.4 0.25 0.2 0.25 Direct matching tables Indirect matching tables
  • 34. Slide 34  Queries  Ground Truth  Camera, movies: Bing shopping database  Baseball, Albums, UK-pm, US-gov: Wikipedia, Freebase  Size of query table  12 to 6000 rows Evaluation Query Subject Attribute Extension Attribute Cameras Camera model Brand Movies Movie name Director Baseball Team name Player Albums Musical band Album UK-pm UK political party Member of parliament US-gov US state Governor
  • 35. Slide 35 Evaluation Results Query Subject attribute Extension Attribute Precision Coverage Cameras Camera model Brand 0.85 0.93 Movies Movie name Director 0.92 0.97 Baseball Team name Player 0.72 1.00 Albums Musical band Album 0.75 1.00 UK-pm UK political party Member of parliament 0.60 0.91 US-gov US state Governor 0.90 1.00 AVG 0.79 0.97
  • 36. Slide 36 Shortcoming  Users need to know the extension attribute beforehand  No discovery of additional relevant attributes
  • 37. Slide 37 Unconstraint Search Join Given a local table and a subject attribute, add all attributes to the local table that can be filled beyond a density threshold. No. Region Unemp. Rate 1 Alsace 11 % 2 Lorraine 12 % 3 Guadeloupe 28 % 4 Centre 10 % 5 Martinique 25 % … … … GDP per Capita Population Growth Overseas department … 45.914 € 0,16 % No … 51.233 € -0,05 % No … 19.810 € 1,34 % Yes … 59.502 € NULL NULL … NULL 2,64 % Yes … … … … + Region density >= 0.8
  • 38. Slide 38 Features 1. Extend local table with additional attributes from a Linked Data Cloud 2. Follows RDF links into other Linked Data sources 3. Provides for mining extended table using all RapidMiner features http://dws.informatik.uni-mannheim.de/en/research/rapidminer-lod-extension/ Prototype: RapidMiner Linked Open Data Extension Input Table Link Additional Attributes
  • 39. Slide 39 Resulting Correlations  Population growth (positive)  Fast food restaurants (positive)  Police stations (positive)  Overseas Department (positive)  Hospital beds/inhabitants (negative)  GDP (negative)  Energy consumption (negative) Petar Ristoski, Christian Bizer, Heiko Paulheim: Mining the Web of Linked Data with RapidMiner. Journal of Web Semantics, 2015.
  • 40. Slide 40 Shortcoming Requires RDF links Not applicable to other forms of Web data (HTML tables, Microdata annotations, data portals, …)
  • 41. Slide 41 3. The Mannheim Search Join Engine (MSJE) Collection of tables Table Normalization Table Storage Table Index 1. Table Indexing Input query table Table Preprocessing Search 2. Table Search 3. Data Consolidation Data collection User Preferences Consolidation MultiJoin Top k Candidates  Provides for constraint and unconstraint search joins http://searchjoins.webdatacommons.org
  • 42. Slide 42 The Search Operator  Table Ranking  Subject column overlap  Two variations: 1. Exact match of tokens 2. Similarity match of tokens using FastJoin library (fuzzy Jaccard)  Select TopK Tables  We use Top 1000 tables in the single column experiments All tables that describe the entities from the query table are relevant. Relevant
  • 43. Slide 43 Collective Disambiguation City Country Munich Germany Berlin Germany Mannheim Germany Frankfurt Germany Karlsruhe Germany City Country Madison USA Berlin USA Chatham USA Fort Reed USA Sunville USA City Berlin Madison Chatham Perth Query Table Web Table 1 Web Table 2 𝒔𝒄𝒐 = 𝟏 𝟒 𝒔𝒄𝒐 = 𝟑 𝟒
  • 44. Slide 44 Multi-Join Operator The MultiJoin operator performs a series of left-outer joins between the query table and all relevant tables. No. Region 1 Alsace 2 Lorraine 3 Guadeloupe 4 Centre Unemploy 11 % 12 % 28 % 10 % Unemp NULL NULL 26 % 9.4 % GDP 45.914 € 51.233 € NULL NULL GDP per C 45.000 € NULL 19.000 € 59.500 €
  • 45. Slide 45 Consolidation Operator  Column Matching  Combination of label- and instance-based techniques  Conflict Resolution 1. cluster similar values 2. choose largest cluster 3. choose most frequent value (alternative: calculate mean) The consolidation operator merges corresponding columns and fuses values in order to return a concise result table. No Region Unemploy GDP 1 Alsace 11 % 45.914 € 2 Lorraine 12 % 51.233 € 3 Guadelo upe 28 % 19.000 € 4 Centre 10 % 59.500 €
  • 46. Slide 46 Evaluation: Types of Web Data Used Microdata (schema.org) Wiki Tables HTML Tables Linked Data  Microdata: Web Data Commons Corpus. http://webdatacommons.org/structureddata/  HTML Tables: Web Data Commons Table Corpus. http://webdatacommons.org/webtables/  Wiki Tables: Northwestern University Corpus. http://downey-n1.cs.northwestern.edu/public/  Linked Data: Billion Triples Challenge 2014: http://km.aifb.kit.edu/projects/btc-2014/
  • 47. Slide 47 Corpus Statistics  Data originates from over 1 million different websites  Tables filtered out in preprocessing  less than 5 rows, less than 3 columns  no subject attribute
  • 48. Slide 48 Evaluation Results: Constraint Search Joins  Size of query tables: 50-200 rows  Ground truth: IMDB, Wikipedia
  • 49. Slide 49 Example Result: Extend with Single Column
  • 52. Slide 52 Evaluation Results: Unconstraint Search Joins Parameters  Minimum attribute density: 50%  Similarity threshold for merging attributes: 0.8 for string attributes and 0.4 for numeric attributes
  • 53. Slide 53 Reduced Example: Unconstraint Search Joins
  • 55. Slide 55 Summary: Search Join Systems • Cafarella, et al.: Data Integration for the Relational Web. VLDB 2009. • Yakout, et al.: InfoGather: Entity Augmentation and Attribute Discovery By Holistic Matching with Web Tables. SIGMOD 2012. • Bhagavatula, et al.: Methods for Exploring and Mining Tables on Wikipedia. KDD IDEA 2013. • Lehmberg, et al.: The Mannheim Search Join Engine. Journal of Web Semantics 2015. Octopus InfoGather WikiTables MSJ Engine Developer Google Research University Washington Microsoft Research Purdue University Northwestern University University of Mannheim Extend Operation Single Attribute Single Attribute Multiple Attributes Single Attribute Multiple Attributes Corpus Google Web crawl via Search API HTML tables from Bing Web crawl Tables from Wikipedia WDC Web Tables WDC Microdata WikiTables data Linked Data
  • 56. Slide 56 4. Conclusions and Outlook  Conclusions  Search Joins bring together Web Search and DB Joins  The prototypes show that simple joins are feasible  Different types of Web data contribute to different queries  Outlook  Application of search joins in specialized domains - knowledge base extension (DBpedia) - mining intranet data (together with RapidMiner as industry partner)  How to deal with the overwhelming number of new attributes?
  • 57. Slide 57 Idea: Schema Complement  What should be considered “consistent”?  Possible answer: Combinations of local and Web attributes that often occur together on the Web.  Approach  Generate frequent item set database from Web table schema corpus.  Retrieve frequency of all combinations of one local and Web attribute.  Aggregate frequencies for Web attributes.  Not investigated yet: Combination of “consistency” with other quality dimensions like “density” and “trustworthiness”. • Das Sarma, et al.: Finding related tables. SIGMOD 2012. Add attributes while preserving “consistency” of the schema.

Hinweis der Redaktion

  1. Say subject attribute is connection to search
  2. Similar vs. exact Problems with: time: Team soccer player, population Sets: cast of movie Heterogenious attribute values: Film genre