Large Scale Crawling with Apache Nutch and Friends

Julien Nioche
julien@digitalpebble.com
LUCENE/SOLR REVOLUTION EU 2013

About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

 Strong focus on Open Source & Apache ecosystem
 VP Apache Nutch
 User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments


Nutch?
 “Distributed framework for large scale web crawling”
(but does not have to be large scale at all)

 Apache TLP since May 2010
 Based on Apache Hadoop

 Indexing and Search by


A bit of history
 2002/2003 : Started by Doug Cutting & Mike Cafarella
 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache

 May 2010 : Nutch becomes a TLP at Apache
 Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

Recent Releases

[Figure: release timeline, 06/09 to 06/13]
trunk : 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7
2.x : 2.0, 2.1, 2.2.1

Why use Nutch?
 Usual reasons
– Open source with a business-friendly license, mature, community, ...

 Scalability
– Tried and tested on very large scale
– Standard Hadoop

 Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

Use cases
 Crawl for search
– Generic or vertical
– Index and Search with SOLR et al.
– Single node to large clusters on Cloud

 … but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML

 with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)

Customer cases
[Figure axes: Specificity (Verticality), Size]

BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

CommonCrawl
http://commoncrawl.org/
 Open repository of web crawl data
 2012 dataset : 3.83 billion docs
 ARC files on Amazon S3
 Using Nutch 1.7
 A few modifications to Nutch code
– https://github.com/Aloisius/nutch

 Next release imminent


Installation
 http://nutch.apache.org/downloads.html
 1.7 => src and bin distributions
 2.2.1 => src only
 'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

 Binary distribution for 1.x == runtime/local


Configuration and resources
 Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

 Specify configuration in nutch-site.xml
– Leave nutch-default alone!

 At least :
<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>
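
As a rough illustration, the override above could also be generated by a script rather than edited by hand. The sketch below is Python; the helper name `build_nutch_site` is made up for this example, and it assumes the file uses the usual Hadoop-style `<configuration>` root element around the `<property>` entries.

```python
import xml.etree.ElementTree as ET

def build_nutch_site(overrides):
    """Return the text of a nutch-site.xml holding the given overrides.

    `build_nutch_site` is a hypothetical helper, not part of Nutch.
    """
    root = ET.Element("configuration")
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

# Only the agent name is overridden; everything else stays in nutch-default.
xml_text = build_nutch_site(
    {"http.agent.name": "WhateverNameDescribesMyMightyCrawler"}
)
print(xml_text)
```

Writing `xml_text` to conf/nutch-site.xml (or runtime/local/conf in local mode) is left out on purpose.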


Running it!
 bin/crawl script : typical sequence of steps
 bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ….

 Local mode : great for testing and debugging
 Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters


Monitor Crawl with MapReduce UI



Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
1) Inject → populates CrawlDB from seed list
2) Generate → selects URLs to fetch in segment
3) Fetch → fetches URLs from segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds Webgraph
7) Index → sends docs to [SOLR | ES | CloudSearch | … ]

 Repeat steps 2 to 7
 Or use the all-in-one crawl script
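
To make the batch cycle concrete, here is a toy in-memory model in Python. It is not Nutch code: `TOY_WEB`, the function names and the status strings are all made up for the illustration, and the InvertLinks and Index steps are left out. It only shows how Generate, Fetch, Parse and UpdateDB feed each other round after round.

```python
# Hypothetical stand-in for the web: URL -> outlinks found when parsing it.
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

def inject(crawldb, seeds):
    # New URLs enter the CrawlDB as "unfetched" (simplified status names).
    for url in seeds:
        crawldb.setdefault(url, "unfetched")

def generate(crawldb, top_n):
    # The "segment" is just the list of URLs selected for this round.
    return [u for u, s in crawldb.items() if s == "unfetched"][:top_n]

def fetch(segment):
    # Pretend fetch: a URL succeeds if it exists in TOY_WEB.
    return {u: (u in TOY_WEB) for u in segment}

def parse(fetched):
    # Extract outlinks from every successfully fetched page.
    return {u: TOY_WEB[u] for u, ok in fetched.items() if ok}

def updatedb(crawldb, fetched, parsed):
    # Record new statuses, then add newly discovered URLs.
    for url, ok in fetched.items():
        crawldb[url] = "fetched" if ok else "gone"
    for outlinks in parsed.values():
        for link in outlinks:
            crawldb.setdefault(link, "unfetched")

def crawl(seeds, rounds, top_n=10):
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break
        fetched = fetch(segment)
        updatedb(crawldb, fetched, parse(fetched))
    return crawldb

db = crawl(["http://a.example/"], rounds=2)
for url, status in db.items():
    print(status, url)
```

After two rounds the seed and its direct outlinks are fetched, while the URL discovered in round two is still waiting for the next Generate.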

Main steps from a data perspective
[Diagram: Seed List, CrawlDB, Segment, LinkDB]

Segment contains:
/crawl_generate/
/crawl_fetch/
/content/
/crawl_parse/
/parse_data/
/parse_text/

Frontier expansion
 Manual “discovery”
– Adding new URLs by hand, “seeding”

 Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction

[Figure: frontier growing outward from the seed over iterations i=1, i=2, i=3]

[Slide courtesy of A. Bialecki]

An extensible framework
 Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints

 Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)

Features
 Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limit the number of URLs per round of fetching
– Default values are polite but can be made more aggressive

 Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters

 Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
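
The per-host queueing mentioned under Fetcher can be sketched as follows. This is a single-threaded Python simulation, not the Nutch fetcher: the function names are invented, queues are keyed by hostname only, and the 5 second delay is an arbitrary value chosen for the example. It computes when each URL would be requested so that two requests to the same host are never closer than the delay.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit

def build_queues(urls):
    """Group URLs into one FIFO queue per hostname."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    return queues

def schedule(urls, delay=5.0):
    """Return (time, url) pairs; requests to one host are `delay` apart."""
    queues = build_queues(urls)
    next_allowed = {host: 0.0 for host in queues}
    now, plan = 0.0, []
    while any(queues.values()):
        ready = [h for h, q in queues.items() if q and next_allowed[h] <= now]
        if not ready:
            # Every non-empty queue is waiting: jump to the earliest slot.
            now = min(next_allowed[h] for h, q in queues.items() if q)
            continue
        for host in ready:
            plan.append((now, queues[host].popleft()))
            next_allowed[host] = now + delay
    return plan

plan = schedule([
    "http://a.example/1", "http://a.example/2",
    "http://b.example/1", "http://a.example/3",
])
for when, url in plan:
    print(when, url)
```

Different hosts are interleaved freely, which is why a crawl spread over many hosts finishes much faster than one concentrated on a few.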


Features (cont.)
 Protocols
– Http, file, ftp, https
– Respects robots.txt directives

 Scheduling
– Fixed or adaptive

 URL filters
– Regex, FSA, TLD, prefix, suffix

 URL normalisers
– Default, regex
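
A regex URL filter can be pictured as an ordered list of accept / reject patterns. The Python sketch below assumes first-match-wins semantics with "+" and "-" prefixes and rejects URLs that match no rule; the rules themselves and the function names are hypothetical examples, not a shipped configuration.

```python
import re

# Hypothetical rules: "-" rejects, "+" accepts, "#" starts a comment.
RULES = r"""
# skip some static resources
-\.(gif|jpg|png|css|js|zip)$
# skip URLs containing query-like characters
-[?*!@=]
# accept anything else
+.
"""

def load_rules(text):
    """Parse rule lines into (accept?, compiled pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accept(url, rules):
    """The first rule whose pattern is found in the URL decides."""
    for keep, pattern in rules:
        if pattern.search(url):
            return keep
    return False  # no rule matched: reject

rules = load_rules(RULES)
for url in ("http://a.example/page.html",
            "http://a.example/logo.png",
            "http://a.example/search?q=x"):
    print(accept(url, rules), url)
```

Because order matters, the catch-all "+." has to come last; moving it to the top would accept everything.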


Features (cont.)
 Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well

 Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

 Pluggable indexing
– SOLR | ES etc...


Indexing
 Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377

 ElasticSearch
– Version 0.90.1

 AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

 Easy to build your own
– Text, DB, etc...


Typical Nutch document
 Some of the fields (IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type

 Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata



NUTCH 2.x
 2.0 released in July 2012
 2.2.1 in July 2013
 Features in common with 1.x
– MapReduce, Tika, delegation to SOLR, etc...

 Moved to 'big table'-like architecture
– Wealth of NoSQL projects in last few years

 Abstraction over storage layer → Apache GORA

Apache GORA
 http://gora.apache.org/
 ORM for NoSQL databases
– and limited SQL support + file based storage

 Current version 0.3
 DataStore implementations
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)

AVRO Schema => Java code
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"] },
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "prevFetchTime", "type": "long"},
   {"name": "fetchInterval", "type": "int"},
   {"name": "retriesSinceFetch", "type": "int"},
   {"name": "modifiedTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus",
     "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]
   }},
   […]

Mapping file (backend specific – HBase)
<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/>
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">

    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

DataStore operations
 Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

 Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
– GoraInputFormat | GoraOutputFormat
– GoraRecordReader | GoraRecordWriter
– GoraMapper | GoraReducer
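
The shape of these operations can be illustrated with a toy in-memory store. GORA itself is a Java library; the Python class below only mirrors the operation names on this slide, uses a plain predicate function in place of a Query object, and stores made-up records whose `status` and `fetchTime` fields are borrowed from the WebPage schema with arbitrary values.

```python
class InMemoryDataStore:
    """Toy key/value store mirroring the operation names on the slide."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        return self._rows.get(key)

    def put(self, key, obj):
        self._rows[key] = obj

    def delete(self, key):
        # True if something was actually removed.
        return self._rows.pop(key, None) is not None

    def execute(self, query):
        """Return (key, obj) pairs accepted by the predicate, in key order."""
        return [(k, v) for k, v in sorted(self._rows.items()) if query(k, v)]

    def delete_by_query(self, query):
        """Delete every row accepted by the predicate; return the count."""
        doomed = [k for k, _ in self.execute(query)]
        for k in doomed:
            del self._rows[k]
        return len(doomed)

store = InMemoryDataStore()
# Status codes 1 and 2 are arbitrary placeholders, not Nutch constants.
store.put("http://a.example/", {"status": 1, "fetchTime": 100})
store.put("http://b.example/", {"status": 2, "fetchTime": 200})
store.put("http://c.example/", {"status": 1, "fetchTime": 300})

status_one = store.execute(lambda key, page: page["status"] == 1)
removed = store.delete_by_query(lambda key, page: page["fetchTime"] < 150)
print([key for key, _ in status_one], removed)
```

The point of the abstraction is that code written against these few operations does not need to know which backend sits underneath.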


GORA in Nutch
 AVRO schema provided and java code pre-generated
 Mapping files provided for backends
– can be modified if necessary
 Need to rebuild to get dependencies for backend
– hence source only distribution of Nutch 2.x
 http://wiki.apache.org/nutch/Nutch2Tutorial


Benefits
 Storage still distributed and replicated
 … but one big table
– status, metadata, content, text → one place
– no more segments

 Resume-able fetch and parse steps
 Easier interaction with other resources
– Third-party code just needs to use GORA and schema

 Simplify the Nutch code
 Potentially faster (e.g. update step)

Drawbacks
 More stuff to install and configure
– Higher hardware requirements

 Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2+HBase : 2.7x slower than 1.x
– N2+Cassandra : 4.4x slower than 1.x
– due mostly to GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!

 Not as stable as Nutch 1.x


2.x Work in progress
 Stabilise backend implementations
– GORA-Hbase most reliable

 Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

 Filter enabled scans
– GORA-119
• => don't need to de-serialize the whole dataset



Future
 1.x and 2.x to coexist in parallel
– 2.x not yet a replacement of 1.x

 New functionalities
– Support for SOLRCloud
– Sitemap (from CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)

 Move to new MapReduce API
– Use Nutch on Hadoop 2.x


More delegation
 Great deal done in recent years (SOLR, Tika)
 Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

 PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain


Longer term
 Hadoop 2.x & YARN
 Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

 End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

 See https://github.com/DigitalPebble/storm-crawler

Where to find out more?
 Project page : http://nutch.apache.org/
 Wiki : http://wiki.apache.org/nutch/
 Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

 Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

 Support / consulting :
– http://wiki.apache.org/nutch/Support
