Introduction to Apache Nutch

Sung
sungrem@sigmoid.com
Sigmoid

Quick Introduction
Web crawling / web scraping
Web crawlers, robots, spiders, etc.

Web as a directed graph
Nodes (vertices): URLs as unique identifiers
Edges (links): hyperlinks like <a href="targetUrl"/>
Edge labels: <a href="..">anchor text</a>
Often represented as adjacency (neighbor) lists
Traversal: follow the edges breadth-first, depth-first, or at random
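
The graph view above can be sketched in a few lines of Java. This is an illustrative sketch only, not Nutch code: the class name `WebGraphSketch`, the method `breadthFirst`, and the `example` URLs are all made up for the slide. The web graph is an adjacency list (URL to outgoing links), and the traversal is breadth-first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch: the web as a directed graph stored as adjacency lists.
class WebGraphSketch {

    /** Returns every URL reachable from the seed, in breadth-first discovery order. */
    static List<String> breadthFirst(Map<String, List<String>> outlinks, String seed) {
        Set<String> seen = new LinkedHashSet<>();   // remembers discovery order
        Queue<String> frontier = new ArrayDeque<>(); // URLs waiting to be expanded
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty()) {
            String url = frontier.remove();
            // Follow each outgoing edge; enqueue only URLs not seen before.
            for (String target : outlinks.getOrDefault(url, List.of())) {
                if (seen.add(target)) {
                    frontier.add(target);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Made-up URLs; each key maps to the hyperlinks found on that page.
        Map<String, List<String>> outlinks = Map.of(
            "http://a.example/", List.of("http://b.example/", "http://c.example/"),
            "http://b.example/", List.of("http://d.example/"),
            "http://c.example/", List.of("http://a.example/"),
            "http://d.example/", List.of());
        System.out.println(breadthFirst(outlinks, "http://a.example/"));
    }
}
```

Swapping the queue for a stack would give a depth-first traversal; a real crawler additionally has to fetch pages, respect robots rules, and handle politeness delays, none of which is shown here.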

Apache Nutch
Open source web crawler written in Java
Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
Apache project since 2004 (sub-project of Lucene)

Features at a glance
Plugin architecture, highly modular
− Most behavior can be changed via plugins
Multi-protocol, multi-threaded, distributed crawler
Plugin-based content processing (parsing, filtering)
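
To make the plugin idea concrete, here is a minimal sketch of a filter chain. It assumes a hypothetical `UrlFilterPlugin` interface invented for this slide; it is not Nutch's actual extension-point API, and `PluginChainSketch`, `defaultChain`, and `applyAll` are illustrative names. Each plugin either passes a URL on or rejects it, and behavior changes by swapping plugins in the list.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a plugin chain; not the real Nutch plugin API.
class PluginChainSketch {

    /** Assumed extension point: return the URL to keep it, or empty to reject it. */
    interface UrlFilterPlugin {
        Optional<String> filter(String url);
    }

    /** Two example plugins: one keeps only HTTP(S) URLs, one drops image files. */
    static List<UrlFilterPlugin> defaultChain() {
        UrlFilterPlugin schemeFilter = url -> {
            if (url.startsWith("http://") || url.startsWith("https://")) {
                return Optional.of(url);
            }
            return Optional.empty();
        };
        UrlFilterPlugin suffixFilter = url -> {
            if (url.endsWith(".png") || url.endsWith(".jpg")) {
                return Optional.empty();
            }
            return Optional.of(url);
        };
        return List.of(schemeFilter, suffixFilter);
    }

    /** Runs the URL through every plugin in order; the first rejection wins. */
    static Optional<String> applyAll(List<UrlFilterPlugin> plugins, String url) {
        Optional<String> current = Optional.of(url);
        for (UrlFilterPlugin plugin : plugins) {
            current = current.flatMap(plugin::filter);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilterPlugin> chain = defaultChain();
        System.out.println(applyAll(chain, "http://example.com/index.html"));
        System.out.println(applyAll(chain, "ftp://example.com/file"));
        System.out.println(applyAll(chain, "http://example.com/logo.png"));
    }
}
```

The design point is that the crawler core only knows the interface; which filters run, and in what order, is configuration rather than code.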

Some more features...
Scalable data processing framework
− MapReduce processing
Full-text indexer & search engine
− Using Solr or Elasticsearch
− Support for distributed search
Robust API and integration options
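
As a rough illustration of the MapReduce style of processing, the sketch below inverts a link graph in memory: for each page it collects the pages that link to it. It is a single-machine toy with explicit map, shuffle, and reduce phases, not Hadoop or Nutch code; `LinkInversionSketch` and `invert` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of a MapReduce-style job: link inversion.
class LinkInversionSketch {

    /** Turns "page -> outgoing links" into "page -> incoming links". */
    static Map<String, List<String>> invert(Map<String, List<String>> outlinks) {
        // Map phase: emit one (target, source) pair per hyperlink.
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : outlinks.entrySet()) {
            for (String target : entry.getValue()) {
                pairs.add(new String[] {target, entry.getKey()});
            }
        }
        // Shuffle phase: group the emitted pairs by key (the target URL).
        Map<String, List<String>> inlinks = new TreeMap<>();
        for (String[] pair : pairs) {
            inlinks.computeIfAbsent(pair[0], key -> new ArrayList<>()).add(pair[1]);
        }
        // Reduce phase: sort each group so the output is deterministic.
        for (List<String> sources : inlinks.values()) {
            Collections.sort(sources);
        }
        return inlinks;
    }

    public static void main(String[] args) {
        // Made-up page names standing in for URLs.
        Map<String, List<String>> outlinks = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c"),
            "c", List.of("a"));
        System.out.println(invert(outlinks));
    }
}
```

In a distributed setting the map and reduce phases run in parallel across machines and the framework performs the shuffle; the per-record logic stays about this small.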

Nutch Project Status & Conclusion
Two actively developed branches (Nutch 1.x and Nutch 2.x)
Key difference (2.x vs. 1.x): in 2.x, storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent-store mappings.
Check out the source code to try it:
https://github.com/apache/nutch
