Extending Tables with Data from over a Million Websites

Chris Bizer, Professor at University of Mannheim
Slide 1 
International Semantic Web Conference 
Riva del Garda, Italy, 22.10.2014 
Semantic Web Challenge – Big Data Track 
Extending Tables with Data from over a Million Websites
Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Kai Eckert, Heiko Paulheim, Christian Bizer
Slide 2 
Goal
Extend a local table with additional columns using different types of Web data.

Region      Unemployment  +  GDP per Capita  Population Growth
Alsace      11 %             45.914 €        0,16 %
Lorraine    12 %             51.233 €        -0,05 %
Guadeloupe  28 %             19.810 €        1,34 %
Centre      10 %             59.502 €        1,76 %
Martinique  25 %             NULL            2,64 %
Slide 3 
Operation 1: Extend Local Table with Single Column
Given a local table and keywords describing the extension column, add the extension column to the table and fill it with data from the Web.

Keywords: “GDP per Capita”

Region      Unemployment  +  GDP per Capita
Alsace      11 %             45.914 €
Lorraine    12 %             51.233 €
Guadeloupe  28 %             19.810 €
Centre      10 %             59.502 €
Martinique  25 %             21,527 €
…           …                …
Slide 4 
Operation 2: Extend Local Table with Many Columns
Given a local table, add all columns to the table that can be filled beyond a density threshold.

Region      Unemp. Rate  +  GDP per Capita  Population Growth  Overseas departments  …
Alsace      11 %            45.914 €        0,16 %             No                    …
Lorraine    12 %            51.233 €        -0,05 %            No                    …
Guadeloupe  28 %            19.810 €        1,34 %             Yes                   …
Centre      10 %            59.502 €        NULL               NULL                  …
Martinique  25 %            NULL            2,64 %             Yes                   …
…           …               …               …                  …

density >= 0.8
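
The density filter can be sketched in a few lines of Python. This is only an illustration, assuming that density means the share of local-table rows for which a candidate column has a non-NULL value. The function name `dense_columns`, the dictionary layout, and the sparse "Mayor" column are hypothetical and not taken from the slides.

```python
def dense_columns(local_keys, candidate_columns, threshold=0.8):
    """Return {column name: density} for columns that reach the threshold.

    Assumption: density = non-NULL values / number of rows in the local table.
    """
    kept = {}
    for name, values in candidate_columns.items():
        filled = sum(1 for key in local_keys if values.get(key) is not None)
        density = filled / len(local_keys)
        if density >= threshold:
            kept[name] = density
    return kept


regions = ["Alsace", "Lorraine", "Guadeloupe", "Centre", "Martinique"]

# Candidate columns keyed by region; None stands for NULL.
candidates = {
    "GDP per Capita": {
        "Alsace": "45.914 €", "Lorraine": "51.233 €",
        "Guadeloupe": "19.810 €", "Centre": "59.502 €", "Martinique": None,
    },
    "Population Growth": {
        "Alsace": "0,16 %", "Lorraine": "-0,05 %",
        "Guadeloupe": "1,34 %", "Centre": None, "Martinique": "2,64 %",
    },
    "Overseas departments": {
        "Alsace": "No", "Lorraine": "No",
        "Guadeloupe": "Yes", "Centre": None, "Martinique": "Yes",
    },
    # Hypothetical sparse column with placeholder values, added for contrast.
    "Mayor": {"Alsace": "x", "Lorraine": "y"},
}

kept = dense_columns(regions, candidates)
print(sorted(kept))
```

Each of the three columns shown on the slide has four of five values filled, so it sits exactly at the 0.8 threshold and is kept, while the hypothetical sparse column is dropped.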
Slide 5 
Types of Web Data Used 
Microdata 
(schema.org) 
Wiki Tables 
HTML Tables 
Linked Data
Slide 6 
Billion Triple Challenge Dataset 2014 
4 billion triples crawled from 47,000 websites.
Slide 7 
Web Data Commons – Microdata Corpus
250 million triples from 463,000 websites.
• Extracted from the Common Crawl 2013 web corpus
• 2.2 billion HTML pages from 12.8 million websites
• Mostly using the schema.org vocabulary
• Main topics
  ◦ Products
  ◦ Reviews
  ◦ Organisations / LocalBusiness
  ◦ Events
Download: http://webdatacommons.org/structureddata/
Slide 8 
Web Data Commons – Web Tables Corpus
Around 1% of all HTML tables contain structured data.
• We used 35 million English HTML tables
• extracted from the Common Crawl 2012 web corpus
• selected out of 11.2 billion raw tables
Slide 9 
Web Data Commons – Web Tables Corpus

• Column Statistics

Column        #Tables
name          4,600,000
price         3,700,000
date          2,700,000
artist        2,100,000
location      1,200,000
year          1,000,000
manufacturer  375,000
country       340,000
isbn          99,000
area          95,000
population    86,000

• Subject Column Values

Value             #Rows
usa               135,000
germany           91,000
greece            42,000
new york          59,000
london            37,000
athens            11,000
david beckham     3,000
ronaldinho        1,200
oliver kahn       710
twist shout       2,000
yellow submarine  1,400

Download: http://webdatacommons.org/webtables/
Slide 10 
WikiTables
1.4 million tables from English Wikipedia.
• extracted by Northwestern University
• from the 2013 Wikipedia XML dump
• only tables, no infoboxes
Download: http://downey-n1.cs.northwestern.edu/public/
Slide 11 
Internal Data Model: Entity-Attributes-Tables
• One entity per row
• Subject Column = Name of the entity
  ◦ HTML tables: the most unique string column; ties are broken by taking the leftmost.

Rank  Film                   Studio      Director      Length
1.    Star Wars – Episode 1  Lucasfilm   George Lucas  121 min
2.    Alien                  Brandywine  Ridley Scott  117 min
3.    Black Moon             NEF         Louis Malle   100 min
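
The subject-column heuristic for HTML tables can be illustrated with a short Python sketch. It is not the engine's implementation: the function names are hypothetical, uniqueness is assumed to be the ratio of distinct values to rows, and a column is assumed to count as a string column when every value starts with a letter (a crude stand-in for real data type detection).

```python
def is_string_value(value):
    """Crude, assumed type check: a value is a string if it starts with a letter."""
    text = value.strip()
    return bool(text) and text[0].isalpha()


def detect_subject_column(header, rows):
    """Return the index of the most unique string column, or None if there is none.

    Ties are broken by taking the leftmost column, because a later column
    only replaces the current best when its uniqueness is strictly higher.
    """
    best_index, best_uniqueness = None, -1.0
    for index in range(len(header)):
        values = [row[index] for row in rows]
        if not all(is_string_value(v) for v in values):
            continue  # skip numeric-looking columns such as Rank or Length
        uniqueness = len(set(values)) / len(values)
        if uniqueness > best_uniqueness:
            best_index, best_uniqueness = index, uniqueness
    return best_index


header = ["Rank", "Film", "Studio", "Director", "Length"]
rows = [
    ["1.", "Star Wars – Episode 1", "Lucasfilm", "George Lucas", "121 min"],
    ["2.", "Alien", "Brandywine", "Ridley Scott", "117 min"],
    ["3.", "Black Moon", "NEF", "Louis Malle", "100 min"],
]

subject_index = detect_subject_column(header, rows)
print(header[subject_index])
```

In the example table, Film, Studio, and Director are all fully unique string columns, so the leftmost of them, Film, is selected.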
• Table generation from Linked Data and Microdata
  ◦ generate one table per class and website
  ◦ subject column: rdfs:label, foaf:name, x:name
  ◦ we exploit common vocabularies
Slide 12 
Indexed Tables
• Selection Conditions:
  1. Minimum size of 3 columns and 5 rows
  2. Subject column detection successful
• Total # of tables: 36.3 million
• Total # of PLDs: ~ 1.5 million
• Total # of triples: 3.0 billion
Slide 13 
The Mannheim Search Joins Engine (MSJE)
[Architecture diagram. Recoverable labels, grouped by stage:]
1. Table Indexing: Collection of tables, Table Normalization, Table Storage, Table Index
2. Table Search: Input query table, Table Preprocessing, Search, Top k Candidates
3. Data Consolidation: MultiJoin, Data collection, Consolidation, User Preferences
Slide 14 
The Search Operator
The Search operator determines the set of relevant Web tables.
• Table Ranking
  ◦ subject column value overlap
  ◦ extended Jaccard Similarity (FastJoin)
• Select TopK Tables
  ◦ 1000 tables in the single column experiments
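
Ranking by subject column overlap can be sketched as follows. The sketch deliberately swaps in plain Jaccard similarity over exactly matching, normalized values. The extended Jaccard similarity of FastJoin named on the slide also tolerates fuzzy token matches, which is not reproduced here. All function names and table identifiers are hypothetical.

```python
def normalize(value):
    """Lower-case and collapse whitespace so that 'Alsace ' matches 'alsace'."""
    return " ".join(value.lower().split())


def jaccard(a, b):
    """Plain Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def search(query_subjects, indexed_tables, k):
    """Return the ids of the k tables whose subject columns overlap most."""
    query = {normalize(v) for v in query_subjects}
    scored = []
    for table_id, subjects in indexed_tables.items():
        score = jaccard(query, {normalize(v) for v in subjects})
        if score > 0:
            scored.append((score, table_id))
    # Highest score first; the table id makes the order deterministic on ties.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [table_id for _, table_id in scored[:k]]


query_subjects = ["Alsace", "Lorraine", "Guadeloupe", "Centre", "Martinique"]
indexed_tables = {
    "t_gdp": ["alsace", "Lorraine", "Centre", "Guadeloupe"],
    "t_unemployment": ["Centre", "Bretagne"],
    "t_films": ["Alien", "Black Moon"],
}

top_tables = search(query_subjects, indexed_tables, k=2)
print(top_tables)
```

Tables without any overlapping subject value are dropped before the top-k cut, so an unrelated table such as the film list never reaches the join step.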
Slide 15 
Multi-Join Operator
The MultiJoin operator performs a series of left-outer joins between the query table and all tables in the input set.

No.  Region      Unemploy  Unemploy  GDP       GDP per C
1    Alsace      11 %      NULL      45.914 €  45.000 €
2    Lorraine    12 %      NULL      51.233 €  NULL
3    Guadeloupe  28 %      NULL      NULL      19.000 €
4    Centre      10 %      9.4 %     NULL      59.500 €
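
A series of left-outer joins of this kind can be sketched in plain Python. The sketch assumes that rows are matched on exactly equal subject values, which is a simplification, and the table names `t1` to `t3`, the function name, and the prefixed column names are hypothetical.

```python
def multi_join(query_rows, key, web_tables):
    """Left-outer join every Web table onto the query table, one after another.

    Every query row is kept. Where a Web table has no matching row, its
    columns are filled with None (NULL). Joined columns are prefixed with
    the table name so that equally named columns stay apart.
    """
    result = [dict(row) for row in query_rows]
    for table_name, table in web_tables.items():
        lookup = {row[key]: row for row in table}
        # Collect the non-key columns of this table in first-seen order.
        columns = list(dict.fromkeys(c for row in table for c in row if c != key))
        for row in result:
            match = lookup.get(row[key])
            for column in columns:
                value = match.get(column) if match else None
                row[f"{table_name}.{column}"] = value
    return result


query_table = [
    {"Region": "Alsace", "Unemploy": "11 %"},
    {"Region": "Lorraine", "Unemploy": "12 %"},
    {"Region": "Guadeloupe", "Unemploy": "28 %"},
    {"Region": "Centre", "Unemploy": "10 %"},
]
web_tables = {
    "t1": [{"Region": "Centre", "Unemploy": "9.4 %"}],
    "t2": [{"Region": "Alsace", "GDP": "45.914 €"},
           {"Region": "Lorraine", "GDP": "51.233 €"}],
    "t3": [{"Region": "Alsace", "GDP per C": "45.000 €"},
           {"Region": "Guadeloupe", "GDP per C": "19.000 €"},
           {"Region": "Centre", "GDP per C": "59.500 €"}],
}

joined = multi_join(query_table, "Region", web_tables)
for row in joined:
    print(row)
```

The result is deliberately wide and sparse, with one column per source column. Merging corresponding columns is left to the consolidation step.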
Slide 16 
Consolidation Operator
The consolidation operator merges corresponding columns and fuses values in order to return a concise result table.
• Column Matching
  ◦ Combination of label- and instance-based techniques
• Conflict Resolution
  ◦ Strings: majority vote
  ◦ Numeric values: average, median, clustering and vote

No  Region      Unemploy  GDP
1   Alsace      11 %      45.914 €
2   Lorraine    12 %      51.233 €
3   Guadeloupe  28 %      19.000 €
4   Centre      10 %      59.500 €
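
Two of the listed conflict resolution strategies, majority vote for strings and median for numeric values, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: column matching is taken as already done, the function names are hypothetical, and the candidate values in the example are invented for illustration rather than taken from the slides.

```python
from collections import Counter
from statistics import median


def fuse_strings(values):
    """Majority vote over the non-NULL candidates; None if there are none.

    On a tie, the value that was seen first wins.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def fuse_numbers(values):
    """Median of the non-NULL numeric candidates; None if there are none."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return median(present)


# Hypothetical candidate values for one cell, collected from several Web tables.
overseas_candidates = ["No", "No", "Yes", None]
gdp_candidates = [45914.0, 45000.0, 45900.0, None]

fused_overseas = fuse_strings(overseas_candidates)
fused_gdp = fuse_numbers(gdp_candidates)
print(fused_overseas, fused_gdp)
```

The median is less sensitive to a single outlying source than the average, which is one reason to offer both and let the user preference decide.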
Slide 17 
http://searchjoins.webdatacommons.org
Slide 18 
Result: Extend with Single Column
Slide 19 
Provenance Summary
Slide 20 
Provenance Details
Slide 21 
Evaluation Results

Class          Attribute    Coverage  Precision
Book           Author       93%       96%
Company        Headquarter  94%       96%
Company        Industry     94%       94%
Country        Area         100%      95%
Country        Capital      100%      100%
Country        Code         100%      94%
Country        Currency     94%       96%
Country        Population   100%      64%
Drug           Ingredient   87%       89%
Film           Cast         94%       85%
Film           Director     97%       97%
Film           Genre        97%       86%
Film           Year         96%       97%
Song           Artist       99%       95%
Soccer Player  Team         88%       67%

Coverage: Percentage of entities for which a value was found.
Precision: Manually evaluated using Wikipedia, IMDB, Amazon.
Slide 22 
Result: Extend with Many Columns 
505 columns are added and filled with data from 2071 tables.
Slide 23 
Provenance Summary
Slide 24 
Provenance Details for “area (sq. km)”
Slide 25 
Conclusion 
• Search Joins bring together Web Search and DB Joins.
• The prototype shows that simple queries are feasible.
• The Web is one application domain for search joins; corporate intranets are the other.
• The overlooked Big Data Vs: Variety and Veracity
1 von 25

Recomendados

Search Joins with the Web - ICDT2014 Invited Lecture von
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureChris Bizer
3.5K views73 Folien
Evolving the Web into a Global Dataspace – Advances and Applications von
Evolving the Web into a Global Dataspace – Advances and ApplicationsEvolving the Web into a Global Dataspace – Advances and Applications
Evolving the Web into a Global Dataspace – Advances and ApplicationsChris Bizer
5.5K views76 Folien
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch... von
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...
Is the Semantic Web what we expected? Adoption Patterns and Content-driven Ch...Chris Bizer
2.6K views53 Folien
Adoption of the Linked Data Best Practices in Different Topical Domains von
Adoption of the Linked Data Best Practices in Different Topical DomainsAdoption of the Linked Data Best Practices in Different Topical Domains
Adoption of the Linked Data Best Practices in Different Topical DomainsChris Bizer
3.1K views28 Folien
The Graph Structure of the Web - Aggregated by Pay-Level Domain von
The Graph Structure of the Web - Aggregated by Pay-Level DomainThe Graph Structure of the Web - Aggregated by Pay-Level Domain
The Graph Structure of the Web - Aggregated by Pay-Level Domainoli-unima
3.4K views20 Folien
Mining the Web of Linked Data with RapidMiner von
Mining the Web of Linked Data with RapidMinerMining the Web of Linked Data with RapidMiner
Mining the Web of Linked Data with RapidMinerHeiko Paulheim
4K views14 Folien

Más contenido relacionado

Was ist angesagt?

Linked data life cycles von
Linked data life cyclesLinked data life cycles
Linked data life cyclesMichael Hausenblas
5.2K views32 Folien
2013 open analytics-meetup-mortar von
2013 open analytics-meetup-mortar2013 open analytics-meetup-mortar
2013 open analytics-meetup-mortarOpen Analytics
1.6K views28 Folien
The Semantic Data Web, Sören Auer, University of Leipzig von
The Semantic Data Web, Sören Auer, University of LeipzigThe Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of LeipzigLOD2 Creating Knowledge out of Interlinked Data
4K views59 Folien
Graph Structure in the Web - Revisited. WWW2014 Web Science Track von
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackChris Bizer
5.1K views24 Folien
Building an open democracy with open data + von
Building an open democracy with open data + Building an open democracy with open data +
Building an open democracy with open data + Irina Bolychevsky
1.1K views65 Folien
The Semantic Web – A Vision Come True, or Giving Up the Great Plan? von
The Semantic Web – A Vision Come True, or Giving Up the Great Plan?The Semantic Web – A Vision Come True, or Giving Up the Great Plan?
The Semantic Web – A Vision Come True, or Giving Up the Great Plan?Martin Hepp
2.3K views19 Folien

Was ist angesagt?(20)

2013 open analytics-meetup-mortar von Open Analytics
2013 open analytics-meetup-mortar2013 open analytics-meetup-mortar
2013 open analytics-meetup-mortar
Open Analytics1.6K views
Graph Structure in the Web - Revisited. WWW2014 Web Science Track von Chris Bizer
Graph Structure in the Web - Revisited. WWW2014 Web Science TrackGraph Structure in the Web - Revisited. WWW2014 Web Science Track
Graph Structure in the Web - Revisited. WWW2014 Web Science Track
Chris Bizer5.1K views
Building an open democracy with open data + von Irina Bolychevsky
Building an open democracy with open data + Building an open democracy with open data +
Building an open democracy with open data +
Irina Bolychevsky1.1K views
The Semantic Web – A Vision Come True, or Giving Up the Great Plan? von Martin Hepp
The Semantic Web – A Vision Come True, or Giving Up the Great Plan?The Semantic Web – A Vision Come True, or Giving Up the Great Plan?
The Semantic Web – A Vision Come True, or Giving Up the Great Plan?
Martin Hepp2.3K views
How links can make your open data even greater von Cristina Sarasua
How links can make your open data even greaterHow links can make your open data even greater
How links can make your open data even greater
Cristina Sarasua309 views
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap... von Data Beers
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
Data Beers559 views
Uk discovery-jisc-project-showcase von RDTF-Discovery
Uk discovery-jisc-project-showcaseUk discovery-jisc-project-showcase
Uk discovery-jisc-project-showcase
RDTF-Discovery1.1K views
The Power of Semantic Technologies to Explore Linked Open Data von Ontotext
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
Ontotext1.3K views
Putting the L in front: from Open Data to Linked Open Data von Martin Kaltenböck
Putting the L in front: from Open Data to Linked Open DataPutting the L in front: from Open Data to Linked Open Data
Putting the L in front: from Open Data to Linked Open Data
Martin Kaltenböck1.7K views
The methods and practices of Linked Open Data von Dongpo Deng
The methods and practices of Linked Open DataThe methods and practices of Linked Open Data
The methods and practices of Linked Open Data
Dongpo Deng887 views
Registration / Certification Interoperability Architecture (overlay peer-review) von Herbert Van de Sompel
Registration / Certification Interoperability Architecture (overlay peer-review)Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)
Web Data Extraction: A Crash Course von Giorgio Orsi
Web Data Extraction: A Crash CourseWeb Data Extraction: A Crash Course
Web Data Extraction: A Crash Course
Giorgio Orsi2.6K views
20180226 data driven smart governance von Dongpo Deng
20180226 data driven smart governance20180226 data driven smart governance
20180226 data driven smart governance
Dongpo Deng3.9K views
MuseoTorino, first italian project using a GraphDB, RDFa, Linked Open Data von 21Style
MuseoTorino, first italian project using a GraphDB, RDFa, Linked Open DataMuseoTorino, first italian project using a GraphDB, RDFa, Linked Open Data
MuseoTorino, first italian project using a GraphDB, RDFa, Linked Open Data
21Style1.4K views
Data.dcs: Converting Legacy Data into Linked Data von Matthew Rowe
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked Data
Matthew Rowe590 views
Linked data experience at Macmillan: Building discovery services for scientif... von Michele Pasin
Linked data experience at Macmillan: Building discovery services for scientif...Linked data experience at Macmillan: Building discovery services for scientif...
Linked data experience at Macmillan: Building discovery services for scientif...
Michele Pasin1.4K views

Destacado

Infographic - Food and Beverage Barcode Labeling von
Infographic - Food and Beverage Barcode LabelingInfographic - Food and Beverage Barcode Labeling
Infographic - Food and Beverage Barcode LabelingLoftware
392 views1 Folie
The Role of Technology in Food Processing Compliance and Traceability von
The Role of Technology in Food Processing Compliance and TraceabilityThe Role of Technology in Food Processing Compliance and Traceability
The Role of Technology in Food Processing Compliance and TraceabilityBlytheco
2K views34 Folien
Open Knowledge Repositories: Enablers of Data Integration across Business Col... von
Open Knowledge Repositories: Enablers of Data Integration across Business Col...Open Knowledge Repositories: Enablers of Data Integration across Business Col...
Open Knowledge Repositories: Enablers of Data Integration across Business Col...Monika Solanki
627 views6 Folien
Realising the Potential of Algal Biomass Production through Semantic Web an... von
Realising the Potential of Algal Biomass Production   through Semantic Web an...Realising the Potential of Algal Biomass Production   through Semantic Web an...
Realising the Potential of Algal Biomass Production through Semantic Web an...Monika Solanki
814 views37 Folien
Building Ontologies for Algal Biomass Operations 2012 von
Building Ontologies for Algal Biomass Operations 2012Building Ontologies for Algal Biomass Operations 2012
Building Ontologies for Algal Biomass Operations 2012Monika Solanki
733 views28 Folien
Querying Linked Data and Büchi automata von
Querying Linked Data and Büchi automataQuerying Linked Data and Büchi automata
Querying Linked Data and Büchi automataKonstantinos Giannakis
1.3K views22 Folien

Destacado(20)

Infographic - Food and Beverage Barcode Labeling von Loftware
Infographic - Food and Beverage Barcode LabelingInfographic - Food and Beverage Barcode Labeling
Infographic - Food and Beverage Barcode Labeling
Loftware392 views
The Role of Technology in Food Processing Compliance and Traceability von Blytheco
The Role of Technology in Food Processing Compliance and TraceabilityThe Role of Technology in Food Processing Compliance and Traceability
The Role of Technology in Food Processing Compliance and Traceability
Blytheco2K views
Open Knowledge Repositories: Enablers of Data Integration across Business Col... von Monika Solanki
Open Knowledge Repositories: Enablers of Data Integration across Business Col...Open Knowledge Repositories: Enablers of Data Integration across Business Col...
Open Knowledge Repositories: Enablers of Data Integration across Business Col...
Monika Solanki627 views
Realising the Potential of Algal Biomass Production through Semantic Web an... von Monika Solanki
Realising the Potential of Algal Biomass Production   through Semantic Web an...Realising the Potential of Algal Biomass Production   through Semantic Web an...
Realising the Potential of Algal Biomass Production through Semantic Web an...
Monika Solanki814 views
Building Ontologies for Algal Biomass Operations 2012 von Monika Solanki
Building Ontologies for Algal Biomass Operations 2012Building Ontologies for Algal Biomass Operations 2012
Building Ontologies for Algal Biomass Operations 2012
Monika Solanki733 views
From Biomass to Energy via Semantic Web and Linked data von Monika Solanki
From Biomass to Energy via Semantic Web and Linked dataFrom Biomass to Energy via Semantic Web and Linked data
From Biomass to Energy via Semantic Web and Linked data
Monika Solanki805 views
The potential role of open data in supply chain integration von Christopher Brewster
The potential role of open data in supply chain integrationThe potential role of open data in supply chain integration
The potential role of open data in supply chain integration
The Internet of Lettuces: Legibility, Data and Alternative Food Networks von Christopher Brewster
The Internet of Lettuces: Legibility, Data and Alternative Food NetworksThe Internet of Lettuces: Legibility, Data and Alternative Food Networks
The Internet of Lettuces: Legibility, Data and Alternative Food Networks
Representing Supply Chain Events on the Web of Data von Monika Solanki
Representing Supply Chain Events on the Web of DataRepresenting Supply Chain Events on the Web of Data
Representing Supply Chain Events on the Web of Data
Monika Solanki2K views
The curious case of Blockchain Technology von Ritesh Mehrotra
The curious case of Blockchain TechnologyThe curious case of Blockchain Technology
The curious case of Blockchain Technology
Ritesh Mehrotra1.7K views
Linked data driven EPCIS Event-based Traceability across Supply chain busine... von Monika Solanki
Linked data driven EPCIS Event-based Traceability across  Supply chain busine...Linked data driven EPCIS Event-based Traceability across  Supply chain busine...
Linked data driven EPCIS Event-based Traceability across Supply chain busine...
Monika Solanki1.1K views
Detecting EPCIS exceptions in linked traceability streams across supply cha... von Monika Solanki
Detecting   EPCIS exceptions in linked traceability streams across supply cha...Detecting   EPCIS exceptions in linked traceability streams across supply cha...
Detecting EPCIS exceptions in linked traceability streams across supply cha...
Monika Solanki1.7K views
Consuming Linked data in Supply Chains: Enabling data visibility via Linked P... von Monika Solanki
Consuming Linked data in Supply Chains: Enabling data visibility via Linked P...Consuming Linked data in Supply Chains: Enabling data visibility via Linked P...
Consuming Linked data in Supply Chains: Enabling data visibility via Linked P...
Monika Solanki1.3K views
Global Supply Chain Innovation Summit, Shanghai von Deborah Weinswig
Global Supply Chain Innovation Summit, ShanghaiGlobal Supply Chain Innovation Summit, Shanghai
Global Supply Chain Innovation Summit, Shanghai
Deborah Weinswig490 views
Semantic web and Linked Data von Hyun Namgoong
Semantic web and Linked DataSemantic web and Linked Data
Semantic web and Linked Data
Hyun Namgoong679 views
Linked data driven EPCIS Event based Traceability across Supply chain busine... von Monika Solanki
Linked data driven EPCIS Event based Traceability across  Supply chain busine...Linked data driven EPCIS Event based Traceability across  Supply chain busine...
Linked data driven EPCIS Event based Traceability across Supply chain busine...
Monika Solanki1.5K views
Linking transformations in EPCIS governing supply chain business processes von Monika Solanki
Linking transformations in EPCIS governing supply chain business processesLinking transformations in EPCIS governing supply chain business processes
Linking transformations in EPCIS governing supply chain business processes
Monika Solanki1.1K views
How Blockchain Can Help Retailers Fight Fraud, Boost Margins and Build Brands von Cognizant
How Blockchain Can Help Retailers Fight Fraud, Boost Margins and Build BrandsHow Blockchain Can Help Retailers Fight Fraud, Boost Margins and Build Brands
How Blockchain Can Help Retailers Fight Fraud, Boost Margins and Build Brands
Cognizant1.1K views
EPCIS Event-Based Traceability in Pharmaceutical Supply Chains via Automated ... von Monika Solanki
EPCIS Event-Based Traceability in Pharmaceutical Supply Chains via Automated ...EPCIS Event-Based Traceability in Pharmaceutical Supply Chains via Automated ...
EPCIS Event-Based Traceability in Pharmaceutical Supply Chains via Automated ...
Monika Solanki3.4K views

Similar a Extending Tables with Data from over a Million Websites

Data Search and Search Joins (Universität Heidelberg 2015) von
Data Search and Search Joins (Universität Heidelberg 2015)Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)Chris Bizer
423 views57 Folien
Unevenly Distributed von
Unevenly DistributedUnevenly Distributed
Unevenly DistributedC4Media
487 views40 Folien
Asp Aug2008 von
Asp Aug2008Asp Aug2008
Asp Aug2008FaseehAarar
207 views49 Folien
Diadem 1.0 von
Diadem 1.0Diadem 1.0
Diadem 1.0Giorgio Orsi
880 views87 Folien
INTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe Rocher von
INTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe RocherINTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe Rocher
INTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe Rocherapidays
20 views35 Folien
Joey gonzalez, graph lab, m lconf 2013 von
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013MLconf
3.2K views44 Folien

Similar a Extending Tables with Data from over a Million Websites(20)

Data Search and Search Joins (Universität Heidelberg 2015) von Chris Bizer
Data Search and Search Joins (Universität Heidelberg 2015)Data Search and Search Joins (Universität Heidelberg 2015)
Data Search and Search Joins (Universität Heidelberg 2015)
Chris Bizer423 views
Unevenly Distributed von C4Media
Unevenly DistributedUnevenly Distributed
Unevenly Distributed
C4Media487 views
INTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe Rocher von apidays
INTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe RocherINTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe Rocher
INTERFACE by apidays 2023 - API Green Score, Yannick Tremblais, Groupe Rocher
apidays20 views
Joey gonzalez, graph lab, m lconf 2013 von MLconf
Joey gonzalez, graph lab, m lconf 2013Joey gonzalez, graph lab, m lconf 2013
Joey gonzalez, graph lab, m lconf 2013
MLconf3.2K views
Mining a Large Web Corpus von Robert Meusel
Mining a Large Web CorpusMining a Large Web Corpus
Mining a Large Web Corpus
Robert Meusel12.5K views
Text Mining with RapidMiner von ertekg
Text Mining with RapidMinerText Mining with RapidMiner
Text Mining with RapidMiner
ertekg1.7K views
Text Mining with Rapid Miner. von Gurdal Ertek
Text Mining with Rapid Miner.Text Mining with Rapid Miner.
Text Mining with Rapid Miner.
Gurdal Ertek149 views
MWLUG 2014: Modern Domino (workshop) von Peter Presnell
MWLUG 2014: Modern Domino (workshop)MWLUG 2014: Modern Domino (workshop)
MWLUG 2014: Modern Domino (workshop)
Peter Presnell1.9K views
RuleML Challenge: Extracting Data from the Deep Web with Global-as-View Media... von doenz
RuleML Challenge: Extracting Data from the Deep Web with Global-as-View Media...RuleML Challenge: Extracting Data from the Deep Web with Global-as-View Media...
RuleML Challenge: Extracting Data from the Deep Web with Global-as-View Media...
doenz399 views
Scalable Distributed Real-Time Clustering for Big Data Streams von Antonio Severien
Scalable Distributed Real-Time Clustering for Big Data StreamsScalable Distributed Real-Time Clustering for Big Data Streams
Scalable Distributed Real-Time Clustering for Big Data Streams
Antonio Severien3.1K views
Interactive Latency in Big Data Visualization von bigdataviz_bay
Interactive Latency in Big Data VisualizationInteractive Latency in Big Data Visualization
Interactive Latency in Big Data Visualization
bigdataviz_bay1.2K views
Introduction to Big Data von Albert Bifet
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Albert Bifet1.5K views
Math 141 Exam Final Exam Name____________________.docx von endawalling
Math 141 Exam Final Exam             Name____________________.docxMath 141 Exam Final Exam             Name____________________.docx
Math 141 Exam Final Exam Name____________________.docx
endawalling19 views
Math 141 Exam Final Exam Name____________________.docx von wkyra78
Math 141 Exam Final Exam             Name____________________.docxMath 141 Exam Final Exam             Name____________________.docx
Math 141 Exam Final Exam Name____________________.docx
wkyra789 views
Daniel Egan Msdn Tech Days Oc von Daniel Egan
Daniel Egan Msdn Tech Days OcDaniel Egan Msdn Tech Days Oc
Daniel Egan Msdn Tech Days Oc
Daniel Egan1.3K views
ASSIGNMENT BOOKLET 2018 DIPLOMA IN IT 3 YEARS 1ST YEAR von Don Dooley
ASSIGNMENT BOOKLET 2018 DIPLOMA IN IT 3 YEARS 1ST YEARASSIGNMENT BOOKLET 2018 DIPLOMA IN IT 3 YEARS 1ST YEAR
ASSIGNMENT BOOKLET 2018 DIPLOMA IN IT 3 YEARS 1ST YEAR
Don Dooley4 views

Más de Chris Bizer

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration? von
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?Chris Bizer
39 views56 Folien
Integrating Product Data from the Semantic Web using Deep Learning Techniques von
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning TechniquesChris Bizer
35 views47 Folien
Using the Semantic Web as Training Data for Product Matching von
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product MatchingChris Bizer
552 views26 Folien
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web von
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebChris Bizer
873 views53 Folien
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the... von
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Chris Bizer
933 views58 Folien
Exploring the Application Potential of Relational Web Tables von
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web TablesChris Bizer
679 views25 Folien

Más de Chris Bizer(7)

GPT4 versus BERT: Which Foundation Model is better for Web Data Integration? von Chris Bizer
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
GPT4 versus BERT: Which Foundation Model is better for Web Data Integration?
Chris Bizer39 views
Integrating Product Data from the Semantic Web using Deep Learning Techniques von Chris Bizer
Integrating Product Data from the Semantic Web using Deep Learning TechniquesIntegrating Product Data from the Semantic Web using Deep Learning Techniques
Integrating Product Data from the Semantic Web using Deep Learning Techniques
Chris Bizer35 views
Using the Semantic Web as Training Data for Product Matching von Chris Bizer
Using the Semantic Web as Training Data for Product MatchingUsing the Semantic Web as Training Data for Product Matching
Using the Semantic Web as Training Data for Product Matching
Chris Bizer552 views
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web von Chris Bizer
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open WebJIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
JIST2019 Keynote: Completing Knowledge Graphs using Data from the Open Web
Chris Bizer873 views
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the... von Chris Bizer
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Schema.org Annotations and Web Tables: Underexploited Semantic Nuggets on the...
Chris Bizer933 views
Exploring the Application Potential of Relational Web Tables von Chris Bizer
Exploring the Application Potential of Relational Web TablesExploring the Application Potential of Relational Web Tables
Exploring the Application Potential of Relational Web Tables
Chris Bizer679 views
Extending Tables with Data from over a Million Websites

  • 1. Slide 1 Extending Tables with Data from over a Million Websites
       Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Kai Eckert, Heiko Paulheim, Christian Bizer
       International Semantic Web Conference, Riva del Garda, Italy, 22.10.2014
       Semantic Web Challenge – Big Data Track
  • 2. Slide 2 Goal: Extend a local table with additional columns using different types of Web data.
       Local table                +  Added columns
       Region      Unemployment      GDP per Capita  Population Growth
       Alsace      11 %              45.914 €        0,16 %
       Lorraine    12 %              51.233 €        -0,05 %
       Guadeloupe  28 %              19.810 €        1,34 %
       Centre      10 %              59.502 €        1,76 %
       Martinique  25 %              NULL            2,64 %
  • 3. Slide 3 Operation 1: Extend Local Table with Single Column
       Given a local table and keywords describing the extension column, add the extension column to the table and fill it with data from the Web.
       Keywords: "GDP per Capita"
       Local table                +  Extension column
       Region      Unemployment      GDP per Capita
       Alsace      11 %              45.914 €
       Lorraine    12 %              51.233 €
       Guadeloupe  28 %              19.810 €
       Centre      10 %              59.502 €
       Martinique  25 %              21,527 €
       …           …                 …
  • 4. Slide 4 Operation 2: Extend Local Table with Many Columns
       Given a local table, add all columns to the table that can be filled beyond a density threshold (here: density >= 0.8).
       Local table                +  Added columns
       Region      Unemp. Rate       GDP per Capita  Population Growth  Overseas departments  …
       Alsace      11 %              45.914 €        0,16 %             No                    …
       Lorraine    12 %              51.233 €        -0,05 %            No                    …
       Guadeloupe  28 %              19.810 €        1,34 %             Yes                   …
       Centre      10 %              59.502 €        NULL               NULL                  …
       Martinique  25 %              NULL            2,64 %             Yes                   …
       …           …                 …               …                  …
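The density threshold from Operation 2 can be illustrated with a minimal sketch. It assumes that density means the fraction of non-NULL cells in a candidate column, which the slide does not define explicitly. The function names and the "Sparse example" column are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Optional

Column = List[Optional[str]]


def column_density(values: Column) -> float:
    """Fraction of cells that are filled (not None). Assumed definition of density."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None)
    return filled / len(values)


def select_dense_columns(columns: Dict[str, Column], threshold: float = 0.8) -> Dict[str, Column]:
    """Keep only the candidate columns whose density reaches the threshold."""
    return {name: vals for name, vals in columns.items()
            if column_density(vals) >= threshold}


# Candidate columns for the five regions on the slide (None stands for NULL).
candidates: Dict[str, Column] = {
    "GDP per Capita": ["45.914 €", "51.233 €", "19.810 €", "59.502 €", None],
    "Population Growth": ["0,16 %", "-0,05 %", "1,34 %", None, "2,64 %"],
    "Overseas departments": ["No", "No", "Yes", None, "Yes"],
    # Hypothetical sparse column, not taken from the slides.
    "Sparse example": [None, None, "x", None, None],
}

dense = select_dense_columns(candidates, threshold=0.8)
print(sorted(dense))
```

With only five rows, a single NULL still leaves a column at exactly the 0.8 threshold, so all three columns shown on the slide are kept and only the sparse one is dropped.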
  • 5. Slide 5 Types of Web Data Used: Microdata (schema.org), Wiki Tables, HTML Tables, Linked Data
  • 6. Slide 6 Billion Triple Challenge Dataset 2014: 4 billion triples crawled from 47,000 websites.
  • 7. Slide 7 Web Data Commons - Microdata Corpus: 250 million triples from 463,000 websites.
       • Extracted from Common Crawl 2013 web corpus
       • 2.2 billion HTML pages from 12.8 million websites
       • Mostly using the schema.org vocabulary
       • Main topics
           • Products
           • Reviews
           • Organisations / LocalBusiness
           • Events
       Download: http://webdatacommons.org/structureddata/
  • 8. Slide 8 Web Data Commons – Web Tables Corpus: Around 1% of all HTML tables contain structured data.
       • We used 35 million English HTML tables
       • Extracted from the Common Crawl 2012 web corpus
       • Selected out of 11.2 billion raw tables
  • 9. Slide 9 Web Data Commons – Web Tables Corpus
       • Column Statistics
           Column        #Tables
           name          4,600,000
           price         3,700,000
           date          2,700,000
           artist        2,100,000
           location      1,200,000
           year          1,000,000
           manufacturer    375,000
           country         340,000
           isbn             99,000
           area             95,000
           population       86,000
       • Subject Column Values
           Value             #Rows
           usa               135,000
           germany            91,000
           greece             42,000
           new york           59,000
           london             37,000
           athens             11,000
           david beckham       3,000
           ronaldinho          1,200
           oliver kahn           710
           twist shout         2,000
           yellow submarine    1,400
       Download: http://webdatacommons.org/webtables/
  • 10. Slide 10 WikiTables: 1.4 million tables from English Wikipedia.
       • Extracted by Northwestern University
       • From the 2013 Wikipedia XML dump
       • Only tables, no infoboxes
       Download: http://downey-n1.cs.northwestern.edu/public/
  • 11. Slide 11 Internal Data Model: Entity-Attributes-Tables
       • One entity per row
       • Subject column = name of the entity
       • HTML tables: the subject column is the string column with the most unique values; ties are broken by taking the leftmost column.
           Rank  Film                   Studio      Director      Length
           1.    Star Wars – Episode 1  Lucasfilm   George Lucas  121 min
           2.    Alien                  Brandywine  Ridley Scott  117 min
           3.    Black Moon             NEF         Louis Malle   100 min
       • Table generation from Linked Data and Microdata
           • Generate one table per class and website
           • Subject column: rdfs:label, foaf:name, x:name
           • We exploit common vocabularies
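The subject column heuristic for HTML tables (most unique string column, leftmost on ties) can be sketched as follows. This is an illustration under assumptions, not the engine's code: a column counts as a string column here if at least one cell does not parse as a number, and values are compared case-insensitively. The function names are hypothetical.

```python
from typing import List, Optional


def _is_number(cell: str) -> bool:
    """True if the cell parses as a plain number (assumed test for non-string cells)."""
    try:
        float(cell)
        return True
    except ValueError:
        return False


def detect_subject_column(header: List[str], rows: List[List[str]]) -> Optional[int]:
    """Return the index of the string column with the most unique values.

    Ties are resolved in favour of the leftmost column because a later
    column only wins with a strictly larger unique count.
    """
    best_index: Optional[int] = None
    best_unique = -1
    for i in range(len(header)):
        column = [row[i] for row in rows]
        if all(_is_number(cell) for cell in column):
            continue  # purely numeric columns cannot be the subject column
        unique = len({cell.strip().lower() for cell in column})
        if unique > best_unique:
            best_index, best_unique = i, unique
    return best_index


header = ["Rank", "Film", "Studio", "Director", "Length"]
rows = [
    ["1.", "Star Wars – Episode 1", "Lucasfilm", "George Lucas", "121 min"],
    ["2.", "Alien", "Brandywine", "Ridley Scott", "117 min"],
    ["3.", "Black Moon", "NEF", "Louis Malle", "100 min"],
]

subject_index = detect_subject_column(header, rows)
print(header[subject_index])
```

In the film table every non-numeric column has three unique values, so the tie-break alone decides and the leftmost string column, Film, is chosen.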
  • 12. Slide 12 Indexed Tables
       • Selection conditions:
           1. Minimum size of 3 columns and 5 rows
           2. Subject column detection successful
       • Total # of tables: 36.3 million
       • Total # of PLDs: ~ 1.5 million
       • Total # of triples: 3.0 billion
  • 13. Slide 13 The Mannheim Search Joins Engine (MSJE)
       [Architecture diagram with three stages and their components]
       1. Table Indexing: Collection of tables, Table Normalization, Table Storage, Table Index
       2. Table Search: Input query table, Table Preprocessing, Search
       3. Data Consolidation: Top k Candidates, MultiJoin, Consolidation, User Preferences, Data collection
  • 14. Slide 14 The Search Operator
       The Search operator determines the set of relevant Web tables.
       • Table ranking
           • Subject column value overlap
           • Extended Jaccard similarity (FastJoin)
       • Select top-k tables
           • 1000 tables in the single column experiments
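Ranking tables by subject column value overlap can be sketched with the plain set-based Jaccard similarity. Note the swap: the slide names an extended Jaccard similarity (FastJoin), which this sketch does not reproduce; exact matching on lowercased values is used instead. The function names, table identifiers and the value "Bretagne" are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def normalize(value: str) -> str:
    """Simplistic normalization (assumption): trim whitespace and lowercase."""
    return value.strip().lower()


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Plain Jaccard similarity |A ∩ B| / |A ∪ B| over normalized values."""
    set_a = {normalize(v) for v in a}
    set_b = {normalize(v) for v in b}
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def search(query_subjects: List[str],
           indexed_tables: Dict[str, List[str]],
           k: int) -> List[Tuple[str, float]]:
    """Return the k tables whose subject columns overlap most with the query."""
    scored = [(table_id, jaccard(query_subjects, subjects))
              for table_id, subjects in indexed_tables.items()]
    relevant = [pair for pair in scored if pair[1] > 0.0]
    # Highest score first; the table id breaks ties deterministically.
    relevant.sort(key=lambda pair: (-pair[1], pair[0]))
    return relevant[:k]


query = ["Alsace", "Lorraine", "Guadeloupe", "Centre", "Martinique"]
index = {
    "t1": ["alsace", "lorraine", "centre", "Bretagne"],
    "t2": ["usa", "germany", "greece"],
    "t3": ["Alsace", "Lorraine", "Guadeloupe", "Centre", "Martinique"],
}

top_tables = search(query, index, k=2)
print(top_tables)
```

A full scan over all tables, as done here, would not scale to 36.3 million indexed tables; the sketch only shows the scoring and the top-k cut.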
  • 15. Slide 15 Multi-Join Operator
       The MultiJoin operator performs a series of left-outer joins between the query table and all tables in the input set.
       No.  Region      Unemploy  Unemploy  GDP       GDP per C
       1    Alsace      11 %      NULL      45.914 €  45.000 €
       2    Lorraine    12 %      NULL      51.233 €  NULL
       3    Guadeloupe  28 %      NULL      NULL      19.000 €
       4    Centre      10 %      9.4 %     NULL      59.500 €
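The series of left-outer joins can be sketched as below. The sketch assumes that each Web table has already been reduced to one value column keyed by its subject column, and that join keys match after lowercasing; which of the slide's columns come from which source table is also an assumption. The names multi_join and WebTable are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A Web table reduced to (column label, {subject value: cell value}).
WebTable = Tuple[str, Dict[str, str]]


def normalize(key: str) -> str:
    """Assumed join-key normalization: trim whitespace and lowercase."""
    return key.strip().lower()


def multi_join(query_subjects: List[str],
               web_tables: List[WebTable]) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Left-outer join the query table with every Web table on the subject column.

    Every query row is kept; a missing match yields None (NULL on the slide).
    """
    header = ["Region"] + [label for label, _ in web_tables]
    lookups = [{normalize(k): v for k, v in cells.items()} for _, cells in web_tables]
    rows: List[List[Optional[str]]] = []
    for subject in query_subjects:
        key = normalize(subject)
        rows.append([subject] + [lookup.get(key) for lookup in lookups])
    return header, rows


regions = ["Alsace", "Lorraine", "Guadeloupe", "Centre"]
tables: List[WebTable] = [
    ("Unemploy", {"Alsace": "11 %", "Lorraine": "12 %", "Guadeloupe": "28 %", "Centre": "10 %"}),
    ("Unemploy", {"Centre": "9.4 %"}),
    ("GDP", {"Alsace": "45.914 €", "Lorraine": "51.233 €"}),
    ("GDP per C", {"alsace": "45.000 €", "guadeloupe": "19.000 €", "centre": "59.500 €"}),
]

joined_header, joined_rows = multi_join(regions, tables)
for row in joined_rows:
    print(row)
```

The result deliberately keeps duplicate labels such as the two Unemploy columns; merging them is left to the consolidation step.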
  • 16. Slide 16 Consolidation Operator
       The consolidation operator merges corresponding columns and fuses values in order to return a concise result table.
       • Column matching
           • Combination of label- and instance-based techniques
       • Conflict resolution
           • Strings: majority vote
           • Numeric values: average, median, clustering and vote
       No  Region      Unemploy  GDP
       1   Alsace      11 %      45.914 €
       2   Lorraine    12 %      51.233 €
       3   Guadeloupe  28 %      19.000 €
       4   Centre      10 %      59.500 €
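Two of the conflict resolution strategies named above, majority vote for strings and median for numeric values, can be sketched as follows. The sketch assumes column matching has already grouped the competing values for one cell, and that NULLs are ignored during fusion. The function names and the example values are hypothetical and are not meant to reproduce the numbers on the slide.

```python
from collections import Counter
from statistics import median
from typing import List, Optional


def fuse_strings(values: List[Optional[str]]) -> Optional[str]:
    """Majority vote over the non-NULL values; the first seen value wins ties."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return Counter(present).most_common(1)[0][0]


def fuse_numbers(values: List[Optional[float]]) -> Optional[float]:
    """Median of the non-NULL values, one of the numeric strategies on the slide."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    return median(present)


# Competing values for one cell, collected from matched columns (illustrative).
overseas = fuse_strings(["Yes", "No", "Yes", None])
unemployment = fuse_numbers([10.0, 9.4, 10.2])
gdp = fuse_numbers([45914.0, 45000.0, None])
print(overseas, unemployment, gdp)
```

The median is less sensitive to a single outlying source than the average, which is one reason to offer both; which strategy the engine applies per column is not stated on the slide.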
  • 18. Slide 18 Result: Extend with Single Column
  • 21. Slide 21 Evaluation Results
       [Bar chart: coverage and precision per extension attribute, y-axis 0% to 100%, attributes grouped by entity type]
       Entity type    Attribute    Coverage  Precision
       Book           Author       93%       96%
       Company        Headquarter  94%       96%
       Company        Industry     94%       94%
       Country        Area         100%      95%
       Country        Capital      100%      100%
       Country        Code         100%      94%
       Country        Currency     94%       96%
       Country        Population   100%      64%
       Drug           Ingredient   87%       89%
       Film           Cast         94%       85%
       Film           Director     97%       97%
       Film           Genre        97%       86%
       Film           Year         96%       97%
       Song           Artist       99%       95%
       Soccer Player  Team         88%       67%
       Coverage: Percentage of entities for which a value was found.
       Precision: Manually evaluated using Wikipedia, IMDB, Amazon.
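The two measures can be written down as a short sketch. Coverage follows the definition on the slide. Precision is assumed here to be the share of found values that a manual check judged correct, which the slide implies but does not spell out. The function names and the sample data are hypothetical.

```python
from typing import List, Optional


def coverage(values: List[Optional[str]]) -> float:
    """Percentage of entities (rows) for which a value was found."""
    if not values:
        return 0.0
    found = sum(1 for v in values if v is not None)
    return 100.0 * found / len(values)


def precision(judgements: List[bool]) -> float:
    """Percentage of found values judged correct (assumed definition)."""
    if not judgements:
        return 0.0
    return 100.0 * sum(judgements) / len(judgements)


# Illustrative data: four entities, one without a value, and manual
# correctness judgements for four found values.
example_coverage = coverage(["a", None, "b", "c"])
example_precision = precision([True, True, False, True])
print(example_coverage, example_precision)
```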
  • 22. Slide 22 Result: Extend with Many Columns. 505 columns are added and filled with data from 2071 tables.
  • 24. Slide 24 Provenance Details for “area (sq. km)”
  • 25. Slide 25 Conclusion
       Search Joins bring together Web Search and DB Joins.
       • The prototype shows that simple queries are feasible.
       • The Web is one application domain for search joins, corporate intranets are the other.
       • The overlooked Big Data Vs: Variety and Veracity