
Focused Crawling for Structured Data


  1. Focused Crawling for Structured Data
     Robert Meusel, Peter Mika, and Roi Blanco
  2. Markup Languages in HTML Pages
     HTML pages directly embed markup languages to annotate items using different vocabularies.
     Example page (Microdata):
       <html> … <body> …
       <div id="main-section" class="performance left" data-sku="M17242_580" itemscope itemtype="http://schema.org/Product">
         <h1 itemprop="name">Predator Instinct FG Fußballschuh</h1>
         <div itemscope itemtype="http://schema.org/Offer" itemprop="offers">
           <meta itemprop="priceCurrency" content="EUR">
           <span itemprop="price" data-sale-price="219.95">219,95</span>
       … </body> </html>
     Extracted statements:
       1. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
       2. _:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
       3. _:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
       4. _:node1 <http://schema.org/Offer/price> "219,95"@de .
       5. _:node1 <http://schema.org/Offer/priceCurrency> "EUR" .
       6. …
     Meusel, Mika, Blanco: Focused Crawling for Structured Data @ CIKM 2014, Shanghai
  3. Deployment of Markup Languages
     • 14% of all sites use markup languages to annotate their data (status 2013) [Meusel2014]
     • Broad topical variation, from Articles to Products to Recipes [Bizer2013]
     • Multiple strong drivers pushing the deployment
       • Search engine companies' initiative on Schema.org
       • Open Graph Protocol used by Facebook
  4. Motivation
     • Existing datasets/crawls do not focus on structured data
       • Common Crawl Foundation uses PageRank and Breadth-First Search
       • Datasets such as the WebDataCommons corpus, which are extracted from these corpora, are likely to miss large amounts of data [Meusel2014]
     • Structured information
     • Hundreds of millions of pages
     • Up-to-date information
     • Publicly available
  5. Main Idea
     • Adapting the idea of focused crawling
     • Similarities:
       • Evaluation of content based on an objective function
     • Differences:
       • Typically focused on topic, not on the quality/amount of data collected
       • Because of that, typically no direct feedback about crawled pages is available
     Possibility to incorporate feedback directly into our system to improve the classification of newly discovered URLs.
  6. Online Learning for Focused Crawling
     • Capability to incorporate real-time feedback
       • Improves performance
       • Adapts to concept drift
     • Possible features
       • URL-based features; mainly tokens from the URL string itself
       • Features describing information from the parent(s) of the URL
       • Features describing information from the siblings of the URL
     • Free open-source software available (e.g. Massive Online Analysis library by Bifet et al.)
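The feature groups listed on slide 6 could be assembled roughly as follows. This is a minimal sketch, not the authors' feature extractor: the function names `url_tokens` and `url_features`, the feature keys, and the choice to summarize parents and siblings as a ratio of relevant pages are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url: str) -> list[str]:
    """Split host, path, and query of a URL into lowercase alphanumeric tokens."""
    parts = urlsplit(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def relevant_ratio(labels: list[bool]) -> float:
    """Share of already crawled pages labeled relevant (0.0 if none are known)."""
    return sum(labels) / len(labels) if labels else 0.0


def url_features(url: str,
                 parent_relevant: list[bool],
                 sibling_relevant: list[bool]) -> dict[str, float]:
    """Build a sparse feature dict for one newly discovered URL.

    parent_relevant / sibling_relevant hold the known labels of crawled
    parents and siblings (hypothetical inputs, assumed for this sketch).
    """
    features = {f"tok={t}": 1.0 for t in url_tokens(url)}
    features["parent_relevant_ratio"] = relevant_ratio(parent_relevant)
    features["sibling_relevant_ratio"] = relevant_ratio(sibling_relevant)
    return features
```

A dictionary of named features like this can be fed to most online learners after hashing or vectorizing the keys.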
  7. Exploration vs. Exploitation
     Selecting the page with the highest confidence for supporting our objective might not always be the best choice.
     • Decision/classification is based on gathered knowledge
     • Knowledge can be incomplete
       • Crawled too few pages
     • Knowledge can become invalid
       • Reaching a part of the Web with different behavior
  8. Bandit-Based Selection
     • Bin each URL to the host it belongs to
       • Each host represents one bandit
     • Calculate the expected score for each bandit based on a scoring function
     • Select the degree of randomness λ
       • λ between 0 and 1
     • For each turn draw a random number z
       • z > λ: select the bandit with the highest score
       • else: select a random bandit
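The selection rule on slide 8 can be sketched as below. The helper names (`host_of`, `bin_by_host`, `select_host`) are hypothetical, the per-host scores are assumed to be computed elsewhere by one of the scoring functions, and z is assumed to be drawn uniformly from [0, 1).

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the host part of a URL; each host acts as one bandit."""
    return urlsplit(url).netloc.lower()


def bin_by_host(urls: list[str]) -> dict[str, list[str]]:
    """Group queued URLs by the host they belong to."""
    bins: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        bins[host_of(url)].append(url)
    return dict(bins)


def select_host(scores: dict[str, float], lam: float, rng: random.Random) -> str:
    """Pick the next host to crawl from.

    Draw z; if z > lam exploit the best-scoring host,
    otherwise explore by picking a host uniformly at random.
    """
    z = rng.random()
    if z > lam:
        return max(scores, key=scores.get)
    return rng.choice(list(scores))
```

With λ = 0 the rule always exploits the best-scoring host, and with λ = 1 it always picks a host at random.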
  9. Scoring Functions
     Incorporate knowledge in score calculation for bandit/host:
     • Best Score (pure classification-based selection)
     • Negative Absolute Bad
     • Success Rate
     • Absolute Good · Best Score
     • Success Rate · Best Score
     • Thompson Sampling
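The slide only names the scoring functions, so the following sketch reads plausible definitions off the names; the exact formulas are assumptions, not taken from the slides. In particular, "good" and "bad" are assumed to count the relevant and irrelevant pages crawled from a host so far, "best score" is assumed to be the highest classifier confidence among the host's queued URLs, and Thompson Sampling is assumed to draw from a Beta(good + 1, bad + 1) posterior. The `HostStats` class is hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class HostStats:
    """Per-host (per-bandit) bookkeeping; field names are assumptions."""
    good: int = 0   # crawled pages that met the objective
    bad: int = 0    # crawled pages that did not
    queued_confidences: list[float] = field(default_factory=list)


def best_score(h: HostStats) -> float:
    """Highest classifier confidence among the host's queued URLs."""
    return max(h.queued_confidences, default=0.0)


def negative_absolute_bad(h: HostStats) -> float:
    """Penalize hosts by the number of irrelevant pages seen."""
    return -float(h.bad)


def success_rate(h: HostStats) -> float:
    """Fraction of crawled pages from this host that were relevant."""
    total = h.good + h.bad
    return h.good / total if total else 0.0


def absolute_good_times_best(h: HostStats) -> float:
    return h.good * best_score(h)


def success_rate_times_best(h: HostStats) -> float:
    return success_rate(h) * best_score(h)


def thompson_sample(h: HostStats, rng: random.Random) -> float:
    """Sample a plausible success rate from a Beta posterior (uniform prior)."""
    return rng.betavariate(h.good + 1, h.bad + 1)
```

Any of these functions can supply the per-host score that the bandit-based selection compares.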
  10. System Workflow
     [Diagram] Components: Online Classifier, Bandits, Crawler, URL Parser, Semantic Parser. Flow labels: Seeds, Classified URL, URL, HTML Page, URLs, Feedback.
  11. Setup for Experiments
     • Data originates from the Common Crawl Corpus 2012
       • Including over 3.5 billion HTML pages
     • Extracted a subset of 5.5 million linked pages
       • Including 450k different hosts
     • Identified all pages within the subset containing at least one markup language (using the WebDataCommons corpus)
       • 27.5% of all pages
  12. Experiment Description
     Measure: Number of relevant pages retrieved within the first 1 million pages crawled.
     1. Online vs. batch-based classification with 100K, 250K, and 1M pages
     2. Pure online classification vs. enhanced with bandit-based selection (λ=0)
     3. Improvements with different λ
     4. Improvements with decaying λ
  13. Results: Online vs. Offline
     • Both methods outperform Breadth-First Search (BFS)
     • Static approach: 340K
     • Adaptive approach: 539K
     [Plot: percentage of relevant pages vs. fetched web pages]
  14. Results: Pure Online Classification vs. +Bandit-based
     • Success-rate-based scoring functions show the most promising results
     • Negative absolute bad scoring performs like BFS
     • Success rate function: 628K
     • Pure online classification: 539K
     [Plot: percentage of relevant pages vs. fetched web pages]
  15. Results: λ > 0
     • Including randomness does not seem to have an effect
     • A beneficial effect of λ > 0 is shown, e.g., for the success rate function within the first 400K crawled pages
     [Plot: percentage of relevant pages vs. fetched web pages]
  16. Results: Decaying λ
     Decaying λ over time means reducing the randomness as more pages are crawled.
     • Success rate function with decaying λ = 0.5: 673K
     • Static λ: 628K
     [Plot: percentage of relevant pages vs. fetched web pages]
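The slides do not state how λ is decayed, so the schedule below is purely illustrative: it assumes an exponential decay that halves λ every `half_life` crawled pages. Both the function name `decayed_lambda` and the default half-life are made up for this sketch.

```python
def decayed_lambda(initial: float, pages_crawled: int, half_life: int = 100_000) -> float:
    """Reduce the randomness λ as the crawl progresses.

    Assumed schedule (not from the slides): λ halves every
    `half_life` crawled pages, starting from `initial`.
    """
    return initial * 0.5 ** (pages_crawled / half_life)
```

The returned value would replace the fixed λ in the bandit selection step at each turn, so early turns explore more and later turns mostly exploit.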
  17. Adaptation to a More Specific Objective
     • General objective is narrowed down to pages that:
       • Make use of the markup language Microdata and
       • Include at least five marked-up statements
     • Example:
       1. A page including information about a movie
       2. The movie has the name Se7en
       3. with a rating of 8.7 out of 10
       4. and it was released in 1995
       5. This information is maintained by imdb.com
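A rough way to test a fetched page against this narrower objective is sketched below using Python's standard `html.parser`. It approximates one "statement" by one `itemprop` attribute, which is an assumption; a full Microdata extractor would also emit type statements and resolve nested items. The names `MicrodataCounter` and `meets_objective` are hypothetical.

```python
from html.parser import HTMLParser


class MicrodataCounter(HTMLParser):
    """Count Microdata items (itemscope) and properties (itemprop) in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.items = 0
        self.properties = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        names = {name for name, _ in attrs}
        if "itemscope" in names:
            self.items += 1
        if "itemprop" in names:
            self.properties += 1


def meets_objective(html: str, min_statements: int = 5) -> bool:
    """True if the page uses Microdata and has at least `min_statements` properties."""
    counter = MicrodataCounter()
    counter.feed(html)
    counter.close()
    return counter.items > 0 and counter.properties >= min_statements
```

In a crawler, the boolean result would serve as the feedback label for the page that was just fetched.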
  18. Results: Adaptation to a More Specific Objective
     • 3.5% of pages include such information
     • In general: beneficial effects are observed using our approach
     • Static λ = 0.2: 120K
     • Decaying λ = 0.5: 108K
     [Plot: percentage of relevant pages vs. fetched web pages]
  19. Conclusion
     • Improvement of 26% over the pure online classification-based selection strategy for the general objective
     • Improvement of 66% for the more specific objective
     • Success-rate-based scoring functions show the most promising results for both objectives
  20. Open Challenges
     • Expand the approach to exploit results from one bandit for the other bandits (contextual bandits)
     • Introduce a more fine-grained grading of the crawled pages (multi-class problem)
     • Take into account the quality of the gathered information (besides richness)
     • Adapt the process to traditional topical focused crawling
     • Publish code and data to the community
  21. More Information
     • Paper accepted at the ACM International Conference on Information and Knowledge Management in Shanghai, China
       • ACM Digital Library: Focused Crawling for Structured Data
     • Detailed descriptions and source code:
       • Anthelion Webpage
     • Datasets:
       • Common Crawl Foundation Corpora
       • WebDataCommons Corpora
