The Linked Data for Information Extraction challenge aims at extracting structured data from Web pages. It is based on a subset of the Web Data Commons Microformats dataset.
For the challenge, the original annotated pages are provided, as well as the triples extracted from them. Based on that information, participants have to design an information extraction system that extracts the same kind of information from other web pages. In this year's challenge, we focus on hCard data, i.e., information about persons. One use case for such a system is the assembly of a large database of person data.
The systems are evaluated on a test set of annotated web pages, from which all annotations have been removed. The participants have to extract triples from those pages and send in their resulting triple files. The submitted files are evaluated against the gold standard of the original triples, ranking the solutions by F-measure.
Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014
1. Linked Data for Information Extraction
Challenge 2014
Tasks and Results
Robert Meusel and Heiko Paulheim
2.
Task
Creation of an information extraction system that scrapes
structured information from HTML web sites.
The training dataset was created from HTML pages that are
annotated using Microformats hCard.
The data is a subset of the WebDataCommons Microformats
Dataset.
The original data is provided by the Common Crawl Foundation,
which maintains the largest publicly available collection of web crawls.
Linked Data for Information Extraction Challenge 2014 - Task and Results
3.
The Common Crawl Foundation (CC)
Non-profit foundation dedicated to building and maintaining
an open crawl of the Web
9 crawl corpora from 2008 to 2014 available so far
Crawling Strategies:
• Earlier crawls used BFS (with link discovery), seeded with a large list of
PageRank-ranked seeds; current crawls are gathered using a seed list of
more than 6 billion URLs from the blekko search index
• As a result, all crawls represent the popular part of the Web
Data availability
• CC provides three different datasets for each crawl
• All data can be freely downloaded from AWS S3
4.
The WebDataCommons Project
Extraction of Structured Data from the Common Crawl Corpora
Extracts information annotated with the markup languages
Microformats, Microdata and RDFa
So far, three datasets have been gathered, from the crawls of 2010,
2012, and 2013
RDFa
Microdata
Microformats
5.
Extracting the Data
Webmasters mark up their information directly within the HTML page
using one of the three markup languages
Using Any23 (http://any23.apache.org/), this information is
extracted as RDF triples
Any23
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "€ 219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
…
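To make the step from markup to triples more concrete, here is a minimal Python sketch of a toy Microdata extractor. It is not how Any23 works internally. It only handles flat, non-nested items with plain text values, and it assumes that a predicate is formed as itemtype + "/" + itemprop, which matches the shape of the predicates in the example above. The names ToyMicrodataExtractor and extract_triples are hypothetical.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class ToyMicrodataExtractor(HTMLParser):
    """Collects (subject, predicate, object) triples from flat Microdata items."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self._item_count = 0
        self._subject = None    # blank node of the current item
        self._item_type = None  # itemtype URL of the current item
        self._open_prop = None  # itemprop whose text is currently collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and attrs.get("itemtype"):
            # A new item starts: emit its rdf:type statement.
            self._item_count += 1
            self._subject = f"_:node{self._item_count}"
            self._item_type = attrs["itemtype"]
            self.triples.append((self._subject, RDF_TYPE, self._item_type))
        elif "itemprop" in attrs and self._subject is not None:
            self._open_prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._open_prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: property elements contain no nested tags,
        # so the first closing tag ends the property value.
        if self._open_prop is not None:
            predicate = f"{self._item_type}/{self._open_prop}"
            value = "".join(self._text).strip()
            self.triples.append((self._subject, predicate, value))
            self._open_prop = None


def extract_triples(html):
    parser = ToyMicrodataExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples


page = (
    '<div itemscope itemtype="http://schema.org/Product">'
    '<span itemprop="name">Predator Instinct FG</span>'
    "</div>"
)
for triple in extract_triples(page):
    print(triple)
```

A real extractor also has to deal with nested items, attribute values such as content or href, and language tags, all of which this sketch leaves out.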
6.
The Original Dataset of 2013
Over 1.7 million domains using at least one markup language
Over 17 billion quads with over 4 billion records (typed entities)
hCard is the dominant format by number of domains
7.
Extraction of Challenge Dataset
Selected a subset of over 10k web pages from the corpus
including over 450k extracted triples (annotated with MF hCard)
• Training: 9 877 web pages / 373 501 triples
• Test: 2 379 web pages / 85 248 triples
8.
Creation of the Gold Standard
Input: Annotated HTML Pages & Triples (extracted with Any23)
After extraction of the triples, all hCard tags are replaced
• Replacement by randomly generated tags
• Stable per page, but different across pages
• Replacement of comments, as CMS systems tend to add comments such as
<!-- here is the name of the company -->
Output
• Training:
• Annotated HTML Page
• Cleaned HTML Page
• Triples
• Testing:
• Cleaned HTML Page
• Triples (not public)
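The cleaning step described above could be sketched roughly as follows in Python. This is an illustration under several assumptions, not the organizers' actual tooling: the set HCARD_CLASSES is a small assumed subset of hCard class names, the function clean_page is hypothetical, class attributes are matched with a simple regular expression, and comments are dropped entirely although the slides only say that they are replaced.

```python
import random
import re
import string

# Assumed subset of hCard class names; the full list used for the challenge is not given.
HCARD_CLASSES = {"vcard", "fn", "n", "org", "adr", "tel", "email", "url"}

CLASS_ATTR = re.compile(r'class="([^"]*)"')
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def clean_page(html, rng=None):
    """Replace hCard class names by random tokens and drop HTML comments."""
    rng = rng or random.Random()
    mapping = {}  # built per call: stable within one page, fresh for the next page

    def random_token():
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(8))

    def replace_classes(match):
        tokens = []
        for name in match.group(1).split():
            if name in HCARD_CLASSES:
                if name not in mapping:
                    mapping[name] = random_token()
                name = mapping[name]
            tokens.append(name)
        return 'class="' + " ".join(tokens) + '"'

    # Assumption: comments are removed rather than replaced by other text.
    without_comments = COMMENT.sub("", html)
    return CLASS_ATTR.sub(replace_classes, without_comments)


page = (
    '<div class="vcard box"><!-- here is the name of the company -->'
    '<span class="fn">ACME</span><span class="fn">ACME Ltd.</span></div>'
)
print(clean_page(page))
```

Because the mapping lives inside a single call, both fn spans receive the same random token on this page, while another page would get different tokens.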
9.
Overview: Dataset Creation and Evaluation Process
10.
Evaluation
Methodology: We consider a triple from the extracted
statements of a submission and a triple from the reference
statements (extracted with Any23 from the original test HTML
pages) as equal if they belong to the same page and have the
same predicate and object.
Baseline: Each page has at least one statement declaring that there
is one VCard
_:1 rdf:type hcard:Vcard .
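The matching rule and the baseline can be illustrated with a short Python sketch. Several details here are assumptions, since the slides do not spell them out: statements are deduplicated per page, scores are micro-averaged over all pages, and the F-measure is taken to be the balanced F1. The functions evaluate and baseline, as well as the predicate hcard:fn in the usage example, are hypothetical.

```python
def evaluate(submitted, gold):
    """Compare triples per page, ignoring subjects.

    Both arguments map a page id to an iterable of
    (subject, predicate, object) triples.
    Returns (precision, recall, f1).
    """

    def statements(triples_by_page):
        # Two triples count as equal if page, predicate and object agree.
        return {
            (page, predicate, obj)
            for page, triples in triples_by_page.items()
            for _subject, predicate, obj in triples
        }

    sub = statements(submitted)
    ref = statements(gold)
    true_positives = len(sub & ref)
    precision = true_positives / len(sub) if sub else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def baseline(page_ids):
    """Emit the single VCard type statement for every page."""
    return {page: [("_:1", "rdf:type", "hcard:Vcard")] for page in page_ids}


gold = {
    "page1": [("_:a", "rdf:type", "hcard:Vcard"), ("_:a", "hcard:fn", "Alice")],
    "page2": [("_:b", "rdf:type", "hcard:Vcard"), ("_:b", "hcard:fn", "Bob")],
}
print(evaluate(baseline(gold), gold))
```

On this toy gold standard the baseline is fully precise but only recovers half of the statements, which shows why a submission can beat it on recall and F-measure.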
11.
Challenge Results
We received one submission (which you will learn about in a few
minutes)
The submission outperforms the baseline for Recall and F-Measure
The Gold Standard is not perfect: within the data, we also
find names and other attributes without a given type
(whenever webmasters did not model this). Even a perfect
extraction system would therefore not reach a precision of 1.
12.
Outlook: LD4IE Challenge 2015
Include more classes (e.g. Microdata and/or RDFa)
Add negative examples to generate a more realistic setting
• As of today, systems can assume that every page in the test sample contains something to extract
• The challenge is making sure that the negative examples contain no unmarked
data
Improve the representativeness of the challenge dataset
• Widespread CMS systems automatically allow marking up articles, posts, etc.
• Eliminate such bias, if present, in future challenges
[Figure: sample HTML pages, some annotated with MF:hCard, Microdata, or RDFa, and some without annotations]