This talk will give an overview of Apache Nutch, its main components, how it fits with other Apache projects and its latest developments.
Apache Nutch was started exactly 10 years ago and was the starting point for what later became Apache Hadoop and also Apache Tika. Nutch is nowadays the reference tool for large-scale web crawling.
In this talk I will give an overview of Apache Nutch and describe its main components and how Nutch fits with other Apache projects such as Hadoop, SOLR or Tika.
The second part of the presentation will be focused on the latest developments in Nutch and the changes introduced by the 2.x branch with the use of Apache GORA as a front end to various NoSQL datastores.
Large scale crawling with Apache Nutch
1. Large Scale Crawling with Apache Nutch
Julien Nioche
julien@digitalpebble.com
ApacheCon Europe 2012
2. About myself
DigitalPebble Ltd, Bristol (UK)
Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Data Mining
Strong focus on Open Source & Apache ecosystem
Apache Nutch VP
Apache Tika committer
User | Contributor
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth
3. Objectives
Overview of the project
Nutch in a nutshell
Nutch 2.x
Future developments
4. Nutch?
“Distributed framework for large scale web crawling”
– but does not have to be large scale at all
– or even on the web (file-protocol)
Apache TLP since May 2010
Based on Apache Hadoop
Indexing and Search
5. Short history
2002/2003 : Started by Doug Cutting & Mike Cafarella
2004 : sub-project of Lucene @Apache
2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache
2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache
May 2010 : TLP project at Apache
June 2012 : Nutch 1.5.1
Oct 2012 : Nutch 2.1
7. Community
6 active committers / PMC members
– 4 within the last 18 months
Constant stream of new contributions & bug reports
Steady numbers of mailing list subscribers and traffic
Nutch is a very healthy 10-year-old
8. Why use Nutch?
Usual reasons
– Mature, business-friendly license, community, ...
Scalability
– Tried and tested on very large scale
– Hadoop cluster : installation and skills
Features
– e.g. Index with SOLR
– PageRank implementation
– Can be extended with plugins
9. Not the best option when ...
Hadoop based == batch processing == high latency
– No guarantee that a page will be fetched / parsed / indexed within X minutes|hours
Javascript / Ajax not supported (yet)
10. Use cases
Crawl for IR
– Generic or vertical
– Index and Search with SOLR
– Single node to large clusters on Cloud
… but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
11. Customer cases
Use case : BetterJobs.com (specificity / verticality)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~1M pages total
– Feeds SOLR index
Use case : SimilarPages.com (scale)
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved
12. Typical Nutch Steps
Same in 1.x and 2.x
Sequence of batch operations
1) Inject → populates CrawlDB from seed list
2) Generate → Selects URLs to fetch into a segment
3) Fetch → Fetches URLs from segment
4) Parse → Parses content (text + metadata)
5) UpdateDB → Updates CrawlDB (new URLs, new status...)
6) InvertLinks → Builds the web graph (LinkDB)
7) SOLRIndex → Send docs to SOLR
8) SOLRDedup → Remove duplicate docs based on signature
Repeat steps 2 to 8
Or use the all-in-one crawl script (one round is sketched in Java below)
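To make the sequence concrete, here is a rough Java sketch of one round driven through the Nutch 1.x tool classes via Hadoop's ToolRunner — the same classes the bin/nutch shell wrappers call. The paths (crawl/crawldb, urls, ...) and the -topN value are illustrative assumptions; check the exact tool arguments against your Nutch version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDb;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.crawl.LinkDb;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.parse.ParseSegment;
import org.apache.nutch.util.NutchConfiguration;

public class OneCrawlRound {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    String crawlDb = "crawl/crawldb", segments = "crawl/segments", linkDb = "crawl/linkdb";

    // 1) Inject: populate the CrawlDB from a directory of seed URL lists
    ToolRunner.run(conf, new Injector(), new String[] { crawlDb, "urls" });
    // 2) Generate: select the top 1000 URLs to fetch into a new, timestamped segment
    ToolRunner.run(conf, new Generator(), new String[] { crawlDb, segments, "-topN", "1000" });

    // Find the segment that Generate just created (segment names sort by timestamp)
    FileSystem fs = FileSystem.get(conf);
    Path segment = null;
    for (FileStatus s : fs.listStatus(new Path(segments))) {
      if (segment == null || s.getPath().getName().compareTo(segment.getName()) > 0) {
        segment = s.getPath();
      }
    }

    // 3) Fetch and 4) Parse the segment
    ToolRunner.run(conf, new Fetcher(), new String[] { segment.toString() });
    ToolRunner.run(conf, new ParseSegment(), new String[] { segment.toString() });
    // 5) UpdateDB: merge new URLs and fetch status back into the CrawlDB
    ToolRunner.run(conf, new CrawlDb(), new String[] { crawlDb, segment.toString() });
    // 6) InvertLinks: build the LinkDB (web graph) from the parsed outlinks
    ToolRunner.run(conf, new LinkDb(), new String[] { linkDb, segment.toString() });
    // SOLRIndex / SOLRDedup would follow, then steps 2 to 8 are repeated
  }
}
```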
13. Main steps
[Diagram: Seed List → CrawlDB → Segment (crawl_generate/, crawl_fetch/, content/, crawl_parse/, parse_data/, parse_text/) → LinkDB]
14. Frontier expansion
Manual “discovery”
– Adding new URLs by hand, “seeding”
Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful – control seed
– Requires content parsing and link extraction
[Diagram: frontier expanding over successive iterations i=1, i=2, i=3]
[Slide courtesy of A. Bialecki]
15. An extensible framework
Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints (a URLFilter example is sketched below)
Endpoints
– Protocol
– Parser
– HtmlParseFilter (ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
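As a concrete example of an endpoint, here is a minimal sketch of a URLFilter plugin for 1.x. The interface (org.apache.nutch.net.URLFilter) returns the URL to keep it, possibly rewritten, or null to reject it; the filtering policy shown is just an example, not part of Nutch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

// Hypothetical filter: drop URLs carrying a query string, keep everything else
public class NoQueryStringFilter implements URLFilter {

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // Return null to reject the URL, or the (possibly rewritten) URL to keep it
    return urlString.contains("?") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}
```

The class is then declared in the plugin's plugin.xml against the URLFilter extension point and the plugin id added to 'plugin.includes'.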
16. Features
Fetcher
– Multi-threaded fetcher
– Follows robots.txt
– Groups URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive (see the configuration sketch below)
Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom scoring plugins
Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
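To give an idea of how the politeness defaults above are adjusted, here is a minimal sketch setting a few standard 1.x properties programmatically; in practice these go into conf/nutch-site.xml, and the values shown are only examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class PolitenessConfig {
  public static void main(String[] args) {
    // Normally set in conf/nutch-site.xml; done in code here purely for illustration
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "MyCrawler");     // mandatory: identify your crawler
    conf.setInt("fetcher.threads.fetch", 10);     // number of fetcher threads
    conf.setFloat("fetcher.server.delay", 5.0f);  // seconds between requests to the same server
    // Lowering the delay or raising the thread count makes the crawl more aggressive
    System.out.println("crawling as " + conf.get("http.agent.name"));
  }
}
```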
18. Features (cont.)
Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well
Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata
Indexing to SOLR
– Bespoke schema
19. Data Structures in 1.x
MapReduce jobs => I/O : Hadoop [Sequence|Map]Files
CrawlDB => status of known pages
MapFile : <Text,CrawlDatum>
  byte status;  [fetched? unfetched? failed? redir?]
  long fetchTime;
  byte retries;
  int fetchInterval;
  float score = 1.0f;
  byte[] signature = null;
  long modifiedTime;
  org.apache.hadoop.io.MapWritable metaData;
Input of : generate - index
Output of : inject - update
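As an illustration of this layout, here is a minimal sketch that reads the records back outside of MapReduce. It assumes a single part-00000 MapFile under crawl/crawldb/current/, which is the usual 1.x layout; the bin/nutch readdb command does the same thing more conveniently.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class DumpCrawlDb {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    // Assumed layout: one part file under crawldb/current/ holding <Text,CrawlDatum> entries
    MapFile.Reader reader =
        new MapFile.Reader(fs, "crawl/crawldb/current/part-00000", conf);
    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    while (reader.next(url, datum)) {
      // One record per known URL: status, fetch time, score, signature, metadata...
      System.out.println(url + "\t" + CrawlDatum.getStatusName(datum.getStatus())
          + "\t" + datum.getScore());
    }
    reader.close();
  }
}
```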
20. Data Structures 1.x
Segment => round of fetching
Identified by a timestamp
Segment
/crawl_generate/ → SequenceFile<Text,CrawlDatum>
/crawl_fetch/ → MapFile<Text,CrawlDatum>
/content/ → MapFile<Text,Content>
/crawl_parse/ → SequenceFile<Text,CrawlDatum>
/parse_data/ → MapFile<Text,ParseData>
/parse_text/ → MapFile<Text,ParseText>
Can have multiple versions of a page in different segments
21. Data Structures – 1.x
linkDB => storage for Web Graph
MapFile : <Text,Inlinks>
  Inlinks : HashSet<Inlink>
  Inlink :
    String fromUrl
    String anchor
Output of : invertlinks
Input of : SOLRIndex
22. NUTCH 2.x
2.0 released in July 2012
2.1 in October 2012
Same feature set as 1.x
– delegation to SOLR, TIKA, MapReduce etc...
Moved to table-based architecture
– Wealth of NoSQL projects in last few years
Abstraction over storage layer → Apache GORA
23. Apache GORA
http://gora.apache.org/
ORM for NoSQL databases
– and limited SQL support + file based storage
0.2.1 released in August 2012
DataStore implementations
● Accumulo
● Avro
● Cassandra
● DynamoDB (soon)
● HBase
● SQL
Serialization with Apache AVRO
Object-to-datastore mappings (backend-specific)
27. GORA in Nutch
AVRO schema provided and Java code pre-generated
Mapping files provided for backends
– can be modified if necessary
Need to rebuild to get dependencies for backend
– No binary distribution of Nutch 2.x
http://wiki.apache.org/nutch/Nutch2Tutorial
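As a rough illustration of what this abstraction looks like from Java, here is a sketch using GORA's DataStoreFactory with the WebPage class generated from Nutch's AVRO schema. The factory signature, the reversed-URL key format and the field accessors are assumptions to be checked against the Gora and Nutch 2.x versions in use; the backend actually used is configured in gora.properties and the mapping file.

```java
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.NutchConfiguration;

public class GoraExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Obtain a store for Nutch's generated WebPage class; which backend is used
    // (HBase, Cassandra, ...) is decided by gora.properties and the mapping file
    DataStore<String, WebPage> store =
        DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

    // Nutch 2.x keys pages by a reversed URL, e.g. "org.apache.nutch:http/" (example key)
    WebPage page = store.get("org.apache.nutch:http/");
    if (page != null) {
      System.out.println("title: " + page.getTitle());
    }
    store.close();
  }
}
```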
28. Benefits
Storage still distributed and replicated
but one big table
– status, metadata, content, text → one place
Simplified logic in Nutch
– Simpler code for updating / merging information
More efficient (?)
– No need to read / write entire structure to update records
– No comparison available yet + early days for GORA
Easier interaction with other resources
– Third-party code just needs to use GORA and the schema
29. Drawbacks
More stuff to install and configure :-)
Not as stable as Nutch 1.x
Dependent on success of Gora
30. 2.x Work in progress
Stabilise backend implementations
– GORA-HBase most reliable
Synchronize features with 1.x
– e.g. has ElasticSearch but missing LinkRank equivalent
Filter enabled scans (GORA-119)
– Don't need to de-serialize the whole dataset
31. Future
Both 1.x and 2.x in parallel
– but more frequent releases for 2.x
New functionalities
– Support for SOLRCloud
– Sitemap (from Crawler Commons library)
– Canonical tag
– More indexers (e.g. ElasticSearch) + pluggable indexers?
32. More delegation
Great deal done in recent years (SOLR, Tika)
Share code with crawler-commons
(http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– Robots.txt parsing
– URL normalisation / filtering
PageRank-like computations to graph library
– e.g. Apache Giraph
– Should be more efficient as well
33. Where to find out more?
Project page : http://nutch.apache.org/
Wiki : http://wiki.apache.org/nutch/
Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org
Chapter in 'Hadoop the Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...
Support / consulting :
– http://wiki.apache.org/nutch/Support
A few words about myself before I start... What I mean by Text Engineering is a variety of activities ranging from .... What makes the identity of DigitalPebble is ... The main projects I am involved in are …
Note that I mention crawling and not web search → Nutch is used not only for search. It used to do indexing and search with Lucene but now delegates this to SOLR.
Endpoints are called in various places: URL filters and normalisers in a lot of places, same for Scoring Filters.
Main steps in Nutch. More actions available; shell wrappers around Hadoop commands.