3. Web as a directed graph
Nodes (vertices): URLs as unique identifiers
Edges (links): hyperlinks like <a href="targetUrl">
Edge labels: the anchor text, as in <a href="..">anchor text</a>
Often represented as adjacency (neighbor) lists
Traversal: follow the edges breadth-first, depth-first, or in random order
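
The graph model above can be sketched in a few lines. This is a minimal illustration, not how Nutch stores or walks its link graph: the URLs, the WEB_GRAPH dictionary, and the function names bfs_order and dfs_order are all made up for the example. Each URL maps to an adjacency list of (target URL, anchor text) pairs, and the two functions return the order in which pages would be visited.

```python
from collections import deque

# Hypothetical example graph: each URL (node) maps to its adjacency list of
# outgoing edges, stored as (target URL, anchor text) pairs.
WEB_GRAPH = {
    "http://example.com/": [
        ("http://example.com/a", "Page A"),
        ("http://example.com/b", "Page B"),
    ],
    "http://example.com/a": [("http://example.com/c", "Page C")],
    "http://example.com/b": [("http://example.com/c", "Page C again")],
    "http://example.com/c": [("http://example.com/", "Home")],
}


def bfs_order(graph, seed):
    """Return URLs in breadth-first order, starting from the seed URL."""
    visited = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for target, _anchor in graph.get(url, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


def dfs_order(graph, seed):
    """Return URLs in depth-first (pre-order) order, starting from the seed URL."""
    visited = set()
    stack = [seed]
    order = []
    while stack:
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        # Push links in reverse so the first listed link is explored first.
        for target, _anchor in reversed(graph.get(url, [])):
            if target not in visited:
                stack.append(target)
    return order
```

The visited set is what keeps the traversal from looping forever on cycles such as the link from page C back to the home page.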
4. Apache Nutch
Open source web crawler written in Java
Founded in 2003 by Doug Cutting, the Lucene creator, and Mike Cafarella
Apache project since 2004 (sub-project of Lucene)
5. Features at a glance
Plugin architecture, highly modular
− Most behavior can be changed via plugins
Multi-protocol, multi-threaded, distributed crawler
Plugin-based content processing (parsing, filtering)
6. Some more features...
Scalable data processing framework
− MapReduce processing
Full-text indexer & search engine
− Using Solr or Elasticsearch
− Support for distributed search
Robust API and integration options
19. Nutch Project Status & Conclusion
Two actively developed branches (Nutch 1.x and Nutch 2.x)
Key difference (2.x vs. 1.x): in 2.x, storage is abstracted away from any specific underlying data store by using Apache Gora to handle object-to-persistent mappings.
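
To illustrate what "abstracted away" means here, the sketch below shows the general idea of coding against a storage interface instead of a concrete data store. It is not Gora's API (Gora is a Java library): PageStore, InMemoryPageStore, and record_fetch are hypothetical names invented for this example.

```python
from abc import ABC, abstractmethod


class PageStore(ABC):
    """Hypothetical storage interface; NOT Apache Gora's actual API."""

    @abstractmethod
    def put(self, url, page):
        """Persist the page record under its URL."""

    @abstractmethod
    def get(self, url):
        """Return the page record for the URL, or None if absent."""


class InMemoryPageStore(PageStore):
    """One possible backend; a database-backed class could replace it."""

    def __init__(self):
        self._pages = {}

    def put(self, url, page):
        self._pages[url] = dict(page)

    def get(self, url):
        return self._pages.get(url)


def record_fetch(store, url, status):
    """Crawler-side code that depends only on the PageStore interface."""
    page = store.get(url) or {"url": url, "fetch_count": 0}
    page["status"] = status
    page["fetch_count"] += 1
    store.put(url, page)
    return page
```

Because record_fetch only calls put and get, swapping the in-memory backend for another store would not require changing the crawler-side code, which is the property the 2.x design aims for.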
Check out the source code to try it:
https://github.com/apache/nutch