Slide 1 
International Semantic Web Conference 
Riva del Garda, Italy, 22.10.2014 
Semantic Web Challenge – Big Data Track...
Slide 2 
Extend a local table with additional columns 
using different types of Web data. 
Region 
Un-employment 
Alsace 1...
Slide 3 
Operation 1: Extend Local Table with Single Column 
Given a local table and keywords describing the extension 
co...
Slide 4 
Operation 2: Extend Local Table with Many Columns 
Given a local table, add all columns to the table that can 
be...
Slide 5 
Types of Web Data Used 
Microdata 
(schema.org) 
Wiki Tables 
HTML Tables 
Linked Data
Slide 6 
Billion Triple Challenge Dataset 2014 
4 billion triples crawled from 47,000 websites.
Slide 7 
Web Data Commons - Microdata Corpus 
250 million triples from 463,000 websites. 
 Extracted from Common Crawl 20...
Slide 8 
Web Data Commons – Web Tables Corpus 
Around 1% of all HTML tables contain structured data. 
 we used 35 million...
Slide 9 
Web Data Commons – Web Tables Corpus 
 Column Statistics 
Column #Tables 
name 4,600,000 
price 3,700,000 
date ...
Slide 10 
WikiTables 
1.4 million tables from English Wikipedia. 
 extracted by Northwestern University 
 from the 2013 ...
Slide 11 
Internal Data Model: Entity-Attributes-Tables 
 One entity per row 
 Subject Column = Name of the entity 
 HT...
Slide 12 
Indexed Tables 
 Selection Conditions: 
1. Minimum size of 3 columns and 5 rows 
2. Subject column detection su...
Slide 13 
The Mannheim Search Joins Engine (MSJE) 
Collection of tables Table Normalization 
Table Storage Table Index 
1....
Slide 14 
The Search Operator 
The Search operator determines the set of relevant Web tables. 
 Table Ranking 
 subject ...
Slide 15 
Multi-Join Operator 
The MultiJoin operator performs a series of left-outer joins 
between the query table and a...
Slide 16 
Consolidation Operator 
The consolidation operator merges corresponding columns 
and fuses values in order to re...
Slide 17 
http://searchjoins.webdatacommons.org
Slide 18 
Result: Extend with Single Column
Slide 19 
Provenance Summary
Slide 20 
Provenance Details
Slide 21 
Evaluation Results 
Author Head‐quarter 
Industry Area Capital Code Currency Popu‐lation 
Ingre‐dient 
Cast Dire...
Slide 22 
Result: Extend with Many Columns 
505 columns are added 
and filled with data from 2071 tables.
Slide 23 
Provenance Summary
Slide 24 
Provenance Details for “area (sq. km)”
Search Joins bring together Web Search and DB Joins. 
Slide 25 
Conclusion 
 The prototype shows that simple queries are ...
Nächste SlideShare
Wird geladen in …5
×

Extending Tables with Data from over a Million Websites

2.171 Aufrufe

Veröffentlicht am

The slideset describes the Mannheim Search Join Engine and was used to present our submission to the Semantic Web Challenge 2014.

More information about the Semantic Web Challenge:
http://challenge.semanticweb.org/

Paper about the Mannheim Search Join Engine:
http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/Lehmberg-Ritze-Ristoski-Eckert-Paulheim-Bizer-TableExtension-SemanticWebChallenge-ISWC2014-Paper.pdf

Abstract:

This Big Data Track submission demonstrates how the BTC 2014 dataset, Microdata annotations from thousands of websites, as well as millions of HTML tables are used to extend local tables with additional columns. Table extension is a useful operation within a wide range of application scenarios: Image you are an analyst having a local table describing companies and you want to extend this table with the headquarter of each company. Or imagine you are a film lover and want to extend a table describing films with attributes like director, genre, and release date of each film. The Mannheim SearchJoin Engine automatically performs such table extension operations based on a large data corpus gathered from over a million websites that publish structured data in various formats. Given a local table, the SearchJoin Engine searches the corpus for additional data describing the entities of the input table. The discovered data are then joined with the local table and their content is consolidated using schema matching and data fusion methods. As result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. Our experiments show that the Mannheim SearchJoin Engine achieves a coverage close to 100% and a precision of around 90% within different application scenarios.

Veröffentlicht in: Internet
0 Kommentare
0 Gefällt mir
Statistik
Notizen
  • Als Erste(r) kommentieren

  • Gehören Sie zu den Ersten, denen das gefällt!

Keine Downloads
Aufrufe
Aufrufe insgesamt
2.171
Auf SlideShare
0
Aus Einbettungen
0
Anzahl an Einbettungen
11
Aktionen
Geteilt
0
Downloads
20
Kommentare
0
Gefällt mir
0
Einbettungen 0
Keine Einbettungen

Keine Notizen für die Folie

Extending Tables with Data from over a Million Websites

  1. 1. Slide 1 International Semantic Web Conference Riva del Garda, Italy, 22.10.2014 Semantic Web Challenge – Big Data Track Extending Tables with Data from over a Million Websites Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Kai Eckert, Heiko Paulheim, Christian Bizer
  2. 2. Slide 2 Extend a local table with additional columns using different types of Web data. Region Un-employment Alsace 11 % Lorraine 12 % Guadeloupe 28 % Centre 10 % Martinique 25 % GDP per Capita Population Growth 45.914 € 0,16 % 51.233 € -0,05 % 19.810 € 1,34 % 59.502 € 1,76 % NULL 2,64 % + Goal
  3. 3. Slide 3 Operation 1: Extend Local Table with Single Column Given a local table and keywords describing the extension column, add the extension column to the table and fill it with data from the Web. „GDP per Capita“ Region Unemployment Alsace 11 % Lorraine 12 % Guadeloupe 28 % Centre 10 % Martinique 25 % … … GDP per Capita 45.914 € 51.233 € 19.810 € 59.502 € 21,527 € … +
  4. 4. Slide 4 Operation 2: Extend Local Table with Many Columns Given a local table, add all columns to the table that can be filled beyond a density threshold. Region Unemp. Rate Alsace 11 % Lorraine 12 % Guadeloupe 28 % Centre 10 % Martinique 25 % … … GDP per Capita Population Growth Overseas departments … 45.914 € 0,16 % No … 51.233 € -0,05 % No … 19.810 € 1,34 % Yes … 59.502 € NULL NULL … NULL 2,64 % Yes … … … … + density >= 0.8
  5. 5. Slide 5 Types of Web Data Used Microdata (schema.org) Wiki Tables HTML Tables Linked Data
  6. 6. Slide 6 Billion Triple Challenge Dataset 2014 4 billion triples crawled from 47,000 websites.
  7. 7. Slide 7 Web Data Commons - Microdata Corpus 250 million triples from 463,000 websites.  Extracted from Common Crawl 2013 web corpus  2.2 billion HTML pages from 12.8 million websites  Mostly using the schema.org vocabulary  Main topics  Products  Reviews  Organisations / LocalBusiness  Events Download: http://webdatacommons.org/structureddata/
  8. 8. Slide 8 Web Data Commons – Web Tables Corpus Around 1% of all HTML tables contain structured data.  we used 35 million English HTML tables.  extracted from the Common Crawl 2012 web corpus  selected out of 11.2 billion raw tables
  9. 9. Slide 9 Web Data Commons – Web Tables Corpus  Column Statistics Column #Tables name 4,600,000 price 3,700,000 date 2,700,000 artist 2,100,000 location 1,200,000 year 1,000,000 manufacturer 375,000 counrty 340,000 isbn 99,000 area 95,000 population 86,000  Subject Column Values Value #Rows usa 135,000 germany 91,000 greece 42,000 new york 59,000 london 37,000 athens 11,000 david beckham 3,000 ronaldinho 1,200 oliver kahn 710 twist shout 2,000 yellow submarine 1,400 Download: http://webdatacommons.org/webtables/
  10. 10. Slide 10 WikiTables 1.4 million tables from English Wikipedia.  extracted by Northwestern University  from the 2013 Wikipedia XML dump  only tables, no infoboxes Download: http://downey-n1.cs.northwestern.edu/public/
  11. 11. Slide 11 Internal Data Model: Entity-Attributes-Tables  One entity per row  Subject Column = Name of the entity  HTML tables: Most unique string column, break ties by taking leftmost. Rank Film Studio Director Length 1. Star Wars –Episode 1 Lucasfilm George Lucas 121 min 2. Alien Brandwine Ridley Scott 117 min 3. Black Moon NEF Louis Malle 100 min  Table generation from Linked Data and Microdata  generate one table per class and website  subject column: rdfs:label, foaf:name, x:name  we exploit common vocabularies
  12. 12. Slide 12 Indexed Tables  Selection Conditions: 1. Minimum size of 3 columns and 5 rows 2. Subject column detection successful  Total # of tables: 36.3 million  Total # of PLDs: ~ 1.5 million  Total # of triples: 3.0 billion
  13. 13. Slide 13 The Mannheim Search Joins Engine (MSJE) Collection of tables Table Normalization Table Storage Table Index 1. Table Indexing Input query table Table Preprocessing Search 2. Table Search 3. Data Consolidation Data collection User Preferences Consolidation MultiJoin Top k Candidates
  14. 14. Slide 14 The Search Operator The Search operator determines the set of relevant Web tables.  Table Ranking  subject column value overlap  extended Jaccard Similarity (FastJoin)  Select TopK Tables  1000 tables in the single column experiments Relevant
  15. 15. Slide 15 Multi-Join Operator The MultiJoin operator performs a series of left-outer joins between the query table and all tables in the input set. No. Region 1 Alsace 2 Lorraine 3 Guadeloupe 4 Centre Unemploy 11 % 12 % 28 % 10 % Unemploy NULL NULL NULL 9.4 % GDP 45.914 € 51.233 € NULL NULL GDP per C 45.000 € NULL 19.000 € 59.500 €
  16. 16. Slide 16 Consolidation Operator The consolidation operator merges corresponding columns and fuses values in order to return a concise result table.  Column Matching  Combination of label- and instance-based techniques  Conflict Resolution  Strings: majority vote  Numeric values: average, median, clustering and vote No Region Unemploy GDP 1 Alsace 11 % 45.914 € 2 Lorraine 12 % 51.233 € 3 Guadelo upe 28 % 19.000 € 4 Centre 10 % 59.500 €
  17. 17. Slide 17 http://searchjoins.webdatacommons.org
  18. 18. Slide 18 Result: Extend with Single Column
  19. 19. Slide 19 Provenance Summary
  20. 20. Slide 20 Provenance Details
  21. 21. Slide 21 Evaluation Results Author Head‐quarter Industry Area Capital Code Currency Popu‐lation Ingre‐dient Cast Director Genre Year Artist Team Book Company Country Drug Film Song Soccer Player 100% 80% 60% 40% 20% 0% coverage 93% 94% 94% 100% 100% 100% 94% 100% 87% 94% 97% 97% 96% 99% 88% precision 96% 96% 94% 95% 100% 94% 96% 64% 89% 85% 97% 86% 97% 95% 67% Coverage: Percentage of entities for which a value was found. Precision: Manually evaluated using Wikipedia, IMDB, Amazon.
  22. 22. Slide 22 Result: Extend with Many Columns 505 columns are added and filled with data from 2071 tables.
  23. 23. Slide 23 Provenance Summary
  24. 24. Slide 24 Provenance Details for “area (sq. km)”
  25. 25. Search Joins bring together Web Search and DB Joins. Slide 25 Conclusion  The prototype shows that simple queries are feasible.  The Web is one application domain for search joins, corporate intranets are the other.  The overlooked Big Data Vs: Variety and Veracity

×