Linked Data for Information Extraction Challenge 2014
Tasks and Results
Robert Meusel and Heiko Paulheim
The Linked Data for Information Extraction challenge aims at extracting structured data from Web pages. It is based on a subset of the Web Data Commons Microformats dataset.

For the challenge, the original annotated pages are provided, as well as the triples extracted from them. Based on that information, participants have to design an information extraction system that extracts the same information from other web pages. In this year's challenge, we focus on hCard data, i.e., information about persons. A use case for such a system could be the assembly of a large database of person data.

The systems are evaluated on a test set of annotated web pages, from which all annotations have been removed. The participants have to extract triples from those pages and send in their resulting triple files. The submitted files are evaluated against the gold standard of the original triples, ranking the solutions by F-measure.

  1. Linked Data for Information Extraction Challenge 2014: Tasks and Results. Robert Meusel and Heiko Paulheim
  2. Task: Creation of an information extraction system that scrapes structured information from HTML web sites.
     • The training dataset was created from HTML pages annotated using Microformats hCard.
     • The data is a subset of the WebDataCommons Microformats Dataset.
     • The original data is provided by the Common Crawl Foundation, the largest publicly available collection of web crawls.
  3. The Common Crawl Foundation (CC)
     • Non-profit foundation dedicated to building and maintaining an open crawl of the Web
     • 9 crawl corpora available so far, from 2008 to 2014
     • Crawling strategies:
       - Earlier crawls used BFS (with link discovery), seeded with a large list of seeds ranked by PageRank; current crawls are gathered using a >6 billion URL seed list from the blekko search index
       - Hence, all crawls represent the popular part of the Web
     • Data availability:
       - CC provides three different datasets for each crawl
       - All data can be freely downloaded from AWS S3
  4. The WebDataCommons Project: Extraction of Structured Data from the Common Crawl Corpora
     • Extracts information annotated with the markup languages Microformats, Microdata, and RDFa
     • So far, three different datasets have been gathered, from the crawls of 2010, 2012, and 2013
  5. Extracting the Data
     • Webmasters mark up their information directly within the HTML page using one of the three markup languages
     • Using Any23 (http://any23.apache.org/), this information is extracted as RDF triples:
       1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
       2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
       3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
       4. _:node1 <http://schema.org/Offer/price> "€ 219,95"@de .
       5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
       6. …
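The extraction step above can be sketched in a few lines of Python. This is a minimal, illustrative reimplementation of the idea, not Any23 itself: the vocabulary URI, the property list, and the class names are assumptions, and real hCard parsing handles many more cases.

```python
# Minimal sketch of microformat-to-triples extraction, loosely in the spirit
# of what Any23 does for hCard. Illustrative only: the vcard vocabulary URI
# and the property set below are assumptions, not the challenge's code.
from html.parser import HTMLParser

HCARD = "http://www.w3.org/2006/vcard/ns#"  # assumed vocabulary URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

class HCardExtractor(HTMLParser):
    """Collects text of elements whose class attribute names an hCard property."""
    PROPS = {"fn", "org", "tel", "email"}

    def __init__(self):
        super().__init__()
        self.triples = []
        self._stack = []  # per open element: list of hCard property names
        self._node = 0    # blank-node counter, one node per vcard

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if "vcard" in classes:
            self._node += 1
            self.triples.append((f"_:n{self._node}", RDF_TYPE, HCARD + "VCard"))
        self._stack.append([c for c in classes if c in self.PROPS])

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to every hCard property currently open.
        for props in self._stack:
            for p in props:
                self.triples.append((f"_:n{self._node}", HCARD + p, f'"{text}"'))

parser = HCardExtractor()
parser.feed('<div class="vcard"><span class="fn">Alice Example</span>'
            '<span class="email">alice@example.org</span></div>')
for s, p, o in parser.triples:
    obj = o if o.startswith('"') else f"<{o}>"
    print(f"{s} <{p}> {obj} .")
```

Running this prints three N-Triples lines: one `rdf:type` statement for the vcard node plus one triple each for the `fn` and `email` properties.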
  6. The Original Dataset of 2013
     • Over 1.7 million domains using at least one markup language
     • Over 17 billion quads with over 4 billion records (typed entities)
     • hCard is the most dominant format among domains
  7. Extraction of the Challenge Dataset
     • Selected a subset of over 10k web pages from the corpus, including over 450k extracted triples (annotated with MF hCard)
       - Training: 9 877 web pages / 373 501 triples
       - Test: 2 379 web pages / 85 248 triples
  8. Creation of the Gold Standard
     • Input: annotated HTML pages & triples (extracted with Any23)
     • After extraction of the triples, all hCard tags are replaced:
       - replacement by randomly generated tags
       - stable per page, but different across pages
       - comments are replaced as well, since CMS systems like to add comments such as <!-- here is the name of the company -->
     • Output:
       - Training: annotated HTML page, cleaned HTML page, triples
       - Testing: cleaned HTML page, triples (not public)
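The anonymization step described above can be sketched as follows. This is an assumed reconstruction, not the challenge's actual cleaning code: the class list, the token scheme, and the use of a page-id hash to make tokens stable per page but different across pages are all illustrative choices.

```python
# Sketch of the gold-standard cleaning step: hCard class names are replaced
# by random-looking tokens that are stable within one page but differ across
# pages, and HTML comments are stripped. HCARD_CLASSES and the hashing
# scheme are assumptions for illustration.
import hashlib
import re

HCARD_CLASSES = ["vcard", "fn", "org", "tel", "email", "adr"]

def clean_page(html: str, page_id: str) -> str:
    """Replace hCard class names with deterministic, page-stable tokens."""
    out = html
    for cls in HCARD_CLASSES:
        # Deriving the token from (page_id, class) makes the mapping
        # stable per page but different across pages.
        token = "c" + hashlib.sha1(f"{page_id}:{cls}".encode()).hexdigest()[:8]
        # Word-boundary replacement; a real implementation would restrict
        # this to class attributes rather than the whole document.
        out = re.sub(rf"\b{cls}\b", token, out)
    # Strip HTML comments, which may leak hints about the annotated content.
    out = re.sub(r"<!--.*?-->", "", out, flags=re.DOTALL)
    return out
```

With this scheme, `clean_page(html, "page1")` always yields the same result for the same page, while `"page2"` yields different tokens, matching the "stable per page, different across pages" property on the slide.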
  9. Overview: Dataset Creation and Evaluation Process
  10. Evaluation
      • Methodology: a triple from a submission and a triple extracted by Any23 from the original test HTML page are considered equal if they have the same predicate and object for the same page.
      • Baseline: each page has at least one statement declaring that there is one vCard: _:1 rdf:type hcard:Vcard .
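The matching rule above compares triples by predicate and object only, since blank-node subjects are not comparable across systems. A minimal sketch of such an evaluation, assuming micro-averaging over pages and set semantics for triples (both assumptions, not stated on the slide):

```python
# Sketch of the per-page (predicate, object) evaluation described above.
# Assumptions: triples are deduplicated per page (set semantics) and
# counts are micro-averaged across all pages.
def evaluate(submission, gold):
    """submission/gold: dict page_url -> list of (subject, predicate, object)."""
    tp = fp = fn = 0
    for page in set(submission) | set(gold):
        sub = {(p, o) for _, p, o in submission.get(page, [])}
        ref = {(p, o) for _, p, o in gold.get(page, [])}
        tp += len(sub & ref)   # matched on predicate + object
        fp += len(sub - ref)   # submitted but not in the gold standard
        fn += len(ref - sub)   # in the gold standard but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this scheme the baseline on the slide (one `rdf:type hcard:Vcard` statement per page) gets high precision whenever pages contain a vCard, but low recall, which is why a submission is ranked by F-measure.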
  11. Challenge Results
      • We received one submission (which you will learn about in a few minutes)
      • The submission outperforms the baseline in recall and F-measure
      • The gold standard is not perfect: within the data, we also find names and other attributes without a type (wherever webmasters did not model one), so even a perfect extraction system would not reach a precision of 1.
  12. Outlook: LD4IE Challenge 2015
      • Include more classes (e.g. Microdata and/or RDFa)
      • Add negative examples to create a more realistic setting:
        - today, systems can assume there is something to extract in every test sample
        - the challenge is making sure that the negative examples contain no unmarked data
      • Improve the representativity of the challenge dataset:
        - widespread CMS systems automatically allow marking up articles, posts, etc.
        - eliminate such bias, if present, for the next challenges