Large Scale Crawling with Apache Nutch and friends...

Julien Nioche
julien@digitalpebble.com
LUCENE/SOLR REVOLUTION EU 2013
About myself
 DigitalPebble Ltd, Bristol (UK)
 Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning

 Strong focus on Open Source & Apache ecosystem
 VP Apache Nutch
 User | Contributor | Committer
– Tika
– SOLR, Lucene
– GATE, UIMA
– Mahout
– Behemoth

2 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

3 / 43
Nutch?
 “Distributed framework for large scale web crawling”
(but does not have to be large scale at all)

 Apache TLP since May 2010
 Based on Apache Hadoop

 Indexing and Search by

4 / 43
A bit of history
 2002/2003 : Started by Doug Cutting & Mike Cafarella
 2005 : MapReduce implementation in Nutch
– 2006 : Hadoop sub-project of Lucene @Apache

 2006/7 : Parser and MimeType in Tika
– 2008 : Tika sub-project of Lucene @Apache

 May 2010 : TLP project at Apache
 Sept 2010 : Storage abstraction in Nutch 2.x
– 2012 : Gora TLP @Apache

5 / 43
Recent Releases

[Timeline figure, 06/09 to 06/13. trunk (1.x): 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7. 2.x branch: 2.0, 2.1, 2.2.1]

6 / 43
Why use Nutch?
 Usual reasons
– Open source with a business-friendly license, mature, community, ...

 Scalability
– Tried and tested on very large scale
– Standard Hadoop

 Features
– Index with SOLR / ES / CloudSearch
– PageRank implementation
– Loads of existing plugins
– Can easily be extended / customised

7 / 43
Use cases
 Crawl for search
– Generic or vertical
– Index and Search with SOLR et al.
– Single node to large clusters on Cloud

 … but also
– Data Mining
– NLP (e.g. Sentiment Analysis)
– ML

 with
– MAHOUT / UIMA / GATE
– Use Behemoth as glueware
(https://github.com/DigitalPebble/behemoth)

8 / 43
Customer cases
[Chart axes: Specificity (Verticality) vs. Size]

BetterJobs.com (CareerBuilder)
– Single server
– Aggregates content from job portals
– Extracts and normalizes structure (description, requirements, locations)
– ~2M pages total
– Feeds SOLR index

SimilarPages.com
– Large cluster on Amazon EC2 (up to 400 nodes)
– Fetched & parsed 3 billion pages
– 10+ billion pages in crawlDB (~100TB data)
– 200+ million lists of similarities
– No indexing / search involved

9 / 43
CommonCrawl
http://commoncrawl.org/
 Open repository of web crawl data
 2012 dataset : 3.83 billion docs
 ARC files on Amazon S3
 Using Nutch 1.7
 A few modifications to Nutch code
– https://github.com/Aloisius/nutch

 Next release imminent
10 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

11 / 43
Installation
 http://nutch.apache.org/downloads.html
 1.7 => src and bin distributions
 2.2.1 => src only
 'ant clean runtime'
– runtime/local => local mode (test and debug)
– runtime/deploy => job jar for Hadoop + scripts

 Binary distribution for 1.x == runtime/local

12 / 43
Configuration and resources
 Changes in $NUTCH_HOME/conf
– Need recompiling with 'ant runtime'
– Local mode => can be made directly in runtime/local/conf

 Specify configuration in nutch-site.xml
– Leave nutch-default alone!

 At least :
<property>
  <name>http.agent.name</name>
  <value>WhateverNameDescribesMyMightyCrawler</value>
</property>
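
As a quick sanity check before launching a crawl, the property above can be verified programmatically. The sketch below is only an illustration: the helper names `read_properties` and `agent_name_is_set` are hypothetical, and the file content is an assumed minimal nutch-site.xml wrapping the slide's property in the usual Hadoop-style <configuration> element.

```python
import xml.etree.ElementTree as ET

# Assumed minimal nutch-site.xml; only the <property> element comes from the slide.
NUTCH_SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>WhateverNameDescribesMyMightyCrawler</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping each property name to its (stripped) value."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def agent_name_is_set(xml_text):
    """True if http.agent.name is present and non-empty."""
    return bool(read_properties(xml_text).get("http.agent.name"))


if __name__ == "__main__":
    print(agent_name_is_set(NUTCH_SITE_XML))
```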

13 / 43
Running it!
 bin/crawl script : typical sequence of steps
 bin/nutch : individual Nutch commands
– Inject / generate / fetch / parse / update ...

 Local mode : great for testing and debugging
 Recommended : deploy + Hadoop (pseudo-)distributed mode
– Parallelism
– MapReduce UI to monitor crawl, check logs, counters

14 / 43
Monitor Crawl with MapReduce UI

15 / 43
Counters and logs

16 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

17 / 43
Typical Nutch Steps
 Same in 1.x and 2.x
 Sequence of batch operations
1) Inject → populates CrawlDB from seed list
2) Generate → selects URLs to fetch in segment
3) Fetch → fetches URLs from segment
4) Parse → parses content (text + metadata)
5) UpdateDB → updates CrawlDB (new URLs, new status...)
6) InvertLinks → builds Webgraph
7) Index → sends docs to [SOLR | ES | CloudSearch | … ]

 Repeat steps 2 to 7
 Or use the all-in-one crawl script
18 / 43
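The batch cycle on the previous slide (inject, then repeated generate / fetch / parse / update) can be sketched as a toy in-memory loop. This is not Nutch code: `WEB` is a hypothetical link graph standing in for the real web, the CrawlDB is a plain dict, all function names are made up for illustration, and the InvertLinks and Index steps are left out.

```python
# Hypothetical link graph: URL -> outlinks found on that page.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def inject(crawldb, seeds):
    """Populate the CrawlDB from the seed list."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs; this is the segment's fetch list."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch(segment):
    """'Fetch' each URL; the content is its outlink list, or None if unknown."""
    return {url: WEB.get(url) for url in segment}


def parse(fetched):
    """Extract the outlinks from the fetched content."""
    return {url: content or [] for url, content in fetched.items()}


def update_db(crawldb, fetched, parsed):
    """Record the new status of fetched URLs and add newly discovered ones."""
    for url, content in fetched.items():
        crawldb[url] = "fetched" if content is not None else "gone"
    for links in parsed.values():
        for link in links:
            crawldb.setdefault(link, "unfetched")


def crawl(seeds, rounds, top_n=2):
    crawldb = {}
    inject(crawldb, seeds)
    for _ in range(rounds):
        segment = generate(crawldb, top_n)
        if not segment:
            break  # nothing left to fetch
        fetched = fetch(segment)
        parsed = parse(fetched)
        update_db(crawldb, fetched, parsed)
    return crawldb


if __name__ == "__main__":
    for url, status in sorted(crawl(["http://a.example/"], rounds=3).items()):
        print(status, url)
```

Each round only touches the URLs selected by the generate step, which is what makes the process a sequence of batch operations rather than a continuous crawl.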
Main steps from a data perspective
[Diagram of the data structures involved: Seed List, CrawlDB, Segment, LinkDB]

Segment subdirectories:
/crawl_generate/
/crawl_fetch/
/content/
/crawl_parse/
/parse_data/
/parse_text/
19 / 43
Frontier expansion
 Manual “discovery”
– Adding new URLs by hand, “seeding”

 Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful - control
– Requires content parsing and link extraction

[Diagram: frontier expanding outwards from the seed over iterations i=1, i=2, i=3]

[Slide courtesy of A. Bialecki]

20 / 43
An extensible framework
 Plugins
– Activated with parameter 'plugin.includes'
– Implement one or more endpoints

 Endpoints
– Protocol
– Parser
– HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
– ScoringFilter (used in various places)
– URLFilter (ditto)
– URLNormalizer (ditto)
– IndexingFilter
– IndexWriter (NEW IN 1.7!)

21 / 43
Features
 Fetcher
– Multi-threaded fetcher
– Queues URLs per hostname / domain / IP
– Limits the number of URLs per round of fetching
– Default values are polite but can be made more aggressive

 Crawl Strategy
– Breadth-first but can be depth-first
– Configurable via custom ScoringFilters

 Scoring
– OPIC (On-line Page Importance Calculation) by default
– LinkRank
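
The idea of queueing URLs per hostname so that each server is hit politely can be sketched as follows. This is a simulation with a made-up clock and no network access, not the Nutch fetcher; the function names and the 5-second delay are assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def build_queues(urls):
    """Group URLs into one FIFO queue per hostname."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    return queues


def schedule(urls, delay=5.0):
    """Plan fetch times so that two requests to the same host are at least
    `delay` seconds apart, while different hosts proceed in parallel.

    Returns a sorted list of (simulated_time, url) pairs.
    """
    plan = []
    for queue in build_queues(urls).values():
        t = 0.0
        while queue:
            plan.append((t, queue.popleft()))
            t += delay
    return sorted(plan)


if __name__ == "__main__":
    urls = [
        "http://a.example/1",
        "http://a.example/2",
        "http://b.example/1",
    ]
    for when, url in schedule(urls):
        print(f"t={when:>4.1f}s  {url}")
```

Lowering the delay makes the crawl more aggressive towards individual hosts; the total throughput otherwise comes from having many hosts in the fetch list.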

22 / 43
Features (cont.)
 Protocols
– Http, file, ftp, https
– Respects robots.txt directives

 Scheduling
– Fixed or adaptive

 URL filters
– Regex, FSA, TLD, prefix, suffix

 URL normalisers
– Default, regex
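
A regex URL filter and a basic normaliser can be sketched in a few lines. The rules below are hypothetical and merely follow the common convention of '+' (accept) and '-' (reject) prefixed patterns where the first matching rule decides; the `normalize` helper is a simplified stand-in, not the behaviour of the Nutch plugins.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules: (sign, pattern). First match wins; no match means reject.
RULES = [
    ("-", r"\.(gif|jpg|png|css|js)$"),   # skip images and static assets
    ("-", r"[?*!@=]"),                   # skip URLs that look like queries
    ("+", r"^https?://([a-z0-9-]+\.)*example\.org/"),
]

DEFAULT_PORTS = {"http": 80, "https": 443}


def accept(url, rules=RULES):
    """Return True if the first rule matching the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


def normalize(url):
    """Lowercase scheme and host, drop default ports and fragments."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port is not None and DEFAULT_PORTS.get(parts.scheme) != parts.port:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path or "/", parts.query, ""))


if __name__ == "__main__":
    for raw in ["HTTP://WWW.Example.ORG:80/page.html#top",
                "http://www.example.org/logo.png"]:
        url = normalize(raw)
        print(url, accept(url))
```

Normalising before filtering means the rules only have to deal with one canonical spelling of each URL.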

23 / 43
Features (cont.)
 Parsing with Apache Tika
– Hundreds of formats supported
– But some legacy parsers as well

 Other plugins
– CreativeCommons
– Feeds
– Language Identification
– Rel tags
– Arbitrary Metadata

 Pluggable indexing
– SOLR | ES etc...

24 / 43
Indexing
 Apache SOLR
– schema.xml in conf/
– SOLR 3.4
– JIRA issue for SOLRCloud
• https://issues.apache.org/jira/browse/NUTCH-1377

 ElasticSearch
– Version 0.90.1

 AWS CloudSearch
– WIP : https://issues.apache.org/jira/browse/NUTCH-1517

 Easy to build your own
– Text, DB, etc...

25 / 43
Typical Nutch document
 Some of the fields (IndexingFilters in plugins or core code)
– url
– content
– title
– anchor
– site
– boost
– digest
– segment
– host
– type

 Configurable ones
– meta tags (keywords, description etc...)
– arbitrary metadata

26 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

27 / 43
NUTCH 2.x
 2.0 released in July 2012
 2.2.1 in July 2013
 Features in common with 1.x
– MapReduce, Tika, delegation to SOLR, etc...

 Moved to 'big table'-like architecture
– Wealth of NoSQL projects in last few years

 Abstraction over storage layer → Apache GORA
28 / 43
Apache GORA
 http://gora.apache.org/
 ORM for NoSQL databases
– and limited SQL support + file based storage

 Current version 0.3
 DataStore implementations
● Accumulo
● Cassandra
● HBase
● Avro
● DynamoDB
● SQL (broken)

 Serialization with Apache AVRO
 Object-to-datastore mappings (backend-specific)
29 / 43
AVRO Schema => Java code
{"name": "WebPage",
 "type": "record",
 "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"] },
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "prevFetchTime", "type": "long"},
   {"name": "fetchInterval", "type": "int"},
   {"name": "retriesSinceFetch", "type": "int"},
   {"name": "modifiedTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus",
     "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]
   }},
   […]
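
To make the shape of such a record concrete, the sketch below checks a plain Python dict against a shortened copy of the schema. It deliberately does not use the Avro library: `conforms` is a hypothetical helper covering only the constructs seen on the slide (records, arrays, unions and a few primitive types), the schema is closed off after protocolStatus because the remaining fields are elided on the slide, and the sample values are invented.

```python
import json

# Shortened copy of the slide's schema; fields after protocolStatus are omitted.
SCHEMA = json.loads("""
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"]},
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "protocolStatus", "type": {
     "name": "ProtocolStatus", "type": "record",
     "namespace": "org.apache.nutch.storage",
     "fields": [
       {"name": "code", "type": "int"},
       {"name": "args", "type": {"type": "array", "items": "string"}},
       {"name": "lastModified", "type": "long"}
     ]}}
 ]}
""")


def conforms(value, schema):
    """Return True if value matches the (subset of) Avro schema given."""
    if isinstance(schema, list):                     # union: any branch may match
        return any(conforms(value, branch) for branch in schema)
    if isinstance(schema, dict):
        kind = schema["type"]
        if kind == "record":
            return isinstance(value, dict) and all(
                field["name"] in value
                and conforms(value[field["name"]], field["type"])
                for field in schema["fields"])
        if kind == "array":
            return isinstance(value, list) and all(
                conforms(item, schema["items"]) for item in value)
        return conforms(value, kind)
    if schema == "null":
        return value is None
    if schema == "string":
        return isinstance(value, str)
    if schema in ("int", "long"):
        return isinstance(value, int) and not isinstance(value, bool)
    return False


# Invented sample record, for illustration only.
PAGE = {
    "baseUrl": "http://www.example.org/",
    "status": 2,
    "fetchTime": 1383000000000,
    "protocolStatus": {"code": 1, "args": [], "lastModified": 0},
}

if __name__ == "__main__":
    print(conforms(PAGE, SCHEMA))
```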

30 / 43
Mapping file (backend specific – HBase)
<gora-orm>
  <table name="webpage">
    <family name="p" maxVersions="1"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f" maxVersions="1"/>
    <family name="s" maxVersions="1"/>
    <family name="il" maxVersions="1"/>
    <family name="ol" maxVersions="1"/>
    <family name="h" maxVersions="1"/>
    <family name="mtdt" maxVersions="1"/>
    <family name="mk" maxVersions="1"/>
  </table>
  <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>

31 / 43
DataStore operations
 Basic operations
– get(K key)
– put(K key, T obj)
– delete(K key)

 Querying
– execute(Query<K, T> query) → Result<K,T>
– deleteByQuery(Query<K, T> query)

 Wrappers for Apache Hadoop
– GoraInput|OutputFormat
– GoraRecordReader|Writer
– GoraMapper|Reducer
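
The operations listed above can be mimicked with a toy in-memory store to show how they fit together. This is an analogy only, not the GORA API: the class `InMemoryDataStore` is hypothetical, a query is reduced to a Python predicate, and the keys and rows are invented.

```python
class InMemoryDataStore:
    """Toy key/value store offering get, put, delete and predicate queries."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        """Return the object stored under key, or None if absent."""
        return self._rows.get(key)

    def put(self, key, obj):
        """Insert or overwrite the object stored under key."""
        self._rows[key] = obj

    def delete(self, key):
        """Remove key; return True if it existed."""
        if key in self._rows:
            del self._rows[key]
            return True
        return False

    def execute(self, predicate):
        """Return the (key, obj) pairs matching the predicate, sorted by key."""
        return [(k, v) for k, v in sorted(self._rows.items()) if predicate(k, v)]

    def delete_by_query(self, predicate):
        """Delete every row matching the predicate; return how many were removed."""
        keys = [k for k, v in self._rows.items() if predicate(k, v)]
        for k in keys:
            del self._rows[k]
        return len(keys)


if __name__ == "__main__":
    store = InMemoryDataStore()
    store.put("http://a.example/", {"status": "fetched"})
    store.put("http://b.example/", {"status": "unfetched"})
    print(store.execute(lambda key, row: row["status"] == "unfetched"))
```

In a real backend the predicate would ideally be evaluated by the datastore itself rather than after loading every row, which is the point of the filtered scans discussed later in the deck.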

32 / 43
GORA in Nutch
 AVRO schema provided and java code pre-generated
 Mapping files provided for backends
– can be modified if necessary
 Need to rebuild to get dependencies for backend
– hence source only distribution of Nutch 2.x
 http://wiki.apache.org/nutch/Nutch2Tutorial

33 / 43
Benefits
 Storage still distributed and replicated
 … but one big table
– status, metadata, content, text → one place
– no more segments

 Resume-able fetch and parse steps
 Easier interaction with other resources
– Third-party code just needs to use GORA and schema

 Simplify the Nutch code
 Potentially faster (e.g. update step)
34 / 43
Drawbacks
 More stuff to install and configure
– Higher hardware requirements

 Current performance :-(
– http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
– N2+HBase : 2.7x slower than 1.x
– N2+Cassandra : 4.4x slower than 1.x
– due mostly to GORA layer : not inherent to HBase or Cassandra
– https://issues.apache.org/jira/browse/GORA-119 → filtered scans
– Not all backends provide data locality!

 Not as stable as Nutch 1.x

35 / 43
2.x Work in progress
 Stabilise backend implementations
– GORA-HBase most reliable

 Synchronize features with 1.x
– e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
– No pluggable indexers yet (NUTCH-1568)

 Filter enabled scans
– GORA-119
• => don't need to de-serialize the whole dataset

36 / 43
Outline
 Overview
 Installation and setup
 Main steps
 Nutch 2.x
 Future developments

37 / 43
Future
 1.x and 2.x to coexist in parallel
– 2.x not yet a replacement of 1.x

 New functionalities
– Support for SOLRCloud
– Sitemap (from CrawlerCommons library)
– Canonical tag
– Generic deduplication (NUTCH-656)

 Move to new MapReduce API
– Use Nutch on Hadoop 2.x

38 / 43
More delegation
 Great deal done in recent years (SOLR, Tika)
 Share code with crawler-commons
(http://code.google.com/p/crawler-commons/)
– Fetcher / protocol handling
– URL normalisation / filtering

 PageRank-like computations to graph library
– Apache Giraph
– Should be more efficient + less code to maintain

39 / 43
Longer term
 Hadoop 2.x & YARN
 Convergence of batch and streaming
– Storm / Samza / Storm-YARN / …

 End of 100% batch operations ?
– Fetch and parse as streaming ?
– Always be fetching
– Generate / update / pagerank remain batch

 See https://github.com/DigitalPebble/storm-crawler
40 / 43
Where to find out more?
 Project page : http://nutch.apache.org/
 Wiki : http://wiki.apache.org/nutch/
 Mailing lists :
– user@nutch.apache.org
– dev@nutch.apache.org

 Chapter in 'Hadoop: The Definitive Guide' (T. White)
– Understanding Hadoop is essential anyway...

 Support / consulting :
– http://wiki.apache.org/nutch/Support

41 / 43
Questions

?
42 / 43
43 / 43

More Related Content

What's hot

An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm CrawlerJulien Nioche
 
Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitabial
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solrMike Frampton
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache NutchJulien Nioche
 
A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm CrawlerJulien Nioche
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst AgainVarun Thacker
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaKnoldus Inc.
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Sameer Tiwari
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msJodok Batlogg
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosJoe Stein
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosJoe Stein
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache MesosJoe Stein
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiSatoshi Nagayasu
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for HadoopJim Dowling
 
[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화NAVER D2
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Oleksiy Panchenko
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAmir Sedighi
 

What's hot (20)

An introduction to Storm Crawler
An introduction to Storm CrawlerAn introduction to Storm Crawler
An introduction to Storm Crawler
 
Nutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkitNutch - web-scale search engine toolkit
Nutch - web-scale search engine toolkit
 
Web scraping with nutch solr
Web scraping with nutch solrWeb scraping with nutch solr
Web scraping with nutch solr
 
Large scale crawling with Apache Nutch
Large scale crawling with Apache NutchLarge scale crawling with Apache Nutch
Large scale crawling with Apache Nutch
 
A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm Crawler
 
Meet Solr For The Tirst Again
Meet Solr For The Tirst AgainMeet Solr For The Tirst Again
Meet Solr For The Tirst Again
 
Harnessing the power of Nutch with Scala
Harnessing the power of Nutch with ScalaHarnessing the power of Nutch with Scala
Harnessing the power of Nutch with Scala
 
Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...Accessing external hadoop data sources using pivotal e xtension framework (px...
Accessing external hadoop data sources using pivotal e xtension framework (px...
 
You know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900msYou know, for search. Querying 24 Billion Documents in 900ms
You know, for search. Querying 24 Billion Documents in 900ms
 
Making Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache MesosMaking Apache Kafka Elastic with Apache Mesos
Making Apache Kafka Elastic with Apache Mesos
 
Get started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache MesosGet started with Developing Frameworks in Go on Apache Mesos
Get started with Developing Frameworks in Go on Apache Mesos
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Developing Frameworks for Apache Mesos
Developing Frameworks  for Apache MesosDeveloping Frameworks  for Apache Mesos
Developing Frameworks for Apache Mesos
 
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 TaipeiPostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
PostgreSQL 9.4, 9.5 and Beyond @ COSCUP 2015 Taipei
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화[2D1]Elasticsearch 성능 최적화
[2D1]Elasticsearch 성능 최적화
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
Elasticsearch, Logstash, Kibana. Cool search, analytics, data mining and more...
 
SphinxSE with MySQL
SphinxSE with MySQLSphinxSE with MySQL
SphinxSE with MySQL
 
An introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoopAn introduction to Big-Data processing applying hadoop
An introduction to Big-Data processing applying hadoop
 

Viewers also liked

Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash courseTommaso Teofili
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platformmteutelink
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01David Smiley
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-endgagravarr
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache TikaPaolo Mottadelli
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content TransformationAlfresco Software
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrLucidworks (Archived)
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wildJulien Nioche
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaLucidworks
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDavid Gil Sánchez
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaPaolo Mottadelli
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrIván Campaña Naranjo
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolToni de la Fuente
 

Viewers also liked (20)

Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Open source enterprise search and retrieval platform
Open source enterprise search and retrieval platformOpen source enterprise search and retrieval platform
Open source enterprise search and retrieval platform
 
Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01Populate your Search index, NEST 2016-01
Populate your Search index, NEST 2016-01
 
Apache Tika end-to-end
Apache Tika end-to-endApache Tika end-to-end
Apache Tika end-to-end
 
Content Analysis with Apache Tika
Content Analysis with Apache TikaContent Analysis with Apache Tika
Content Analysis with Apache Tika
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
ProjectHub
ProjectHubProjectHub
ProjectHub
 
Search engine
Search engineSearch engine
Search engine
 
Metadata Extraction and Content Transformation
Metadata Extraction and Content TransformationMetadata Extraction and Content Transformation
Metadata Extraction and Content Transformation
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14Giraph+Gora in ApacheCon14
Giraph+Gora in ApacheCon14
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, ClouderaReal-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
Real-Time Analytics with Solr: Presented by Yonik Seeley, Cloudera
 
Drupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsquedaDrupal + Solr Mejorando la experiencia de búsqueda
Drupal + Solr Mejorando la experiencia de búsqueda
 
Content analysis for ECM with Apache Tika
Content analysis for ECM with Apache TikaContent analysis for ECM with Apache Tika
Content analysis for ECM with Apache Tika
 
Mejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache SolrMejorando la búsqueda Web con Apache Solr
Mejorando la búsqueda Web con Apache Solr
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 

Similar to Large Scale Crawling with Apache Nutch and Friends

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormJulien Nioche
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraReal-Time Log Analysis with Apache Mesos, Kafka and Cassandra
Real-Time Log Analysis with Apache Mesos, Kafka and CassandraJoe Stein
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Design and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerDesign and Implementation of a High- Performance Distributed Web Crawler
Design and Implementation of a High- Performance Distributed Web CrawlerGeorge Ang
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...
Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, ...DataStax Academy
 
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ...javier ramirez
 
Hadoop cluster configuration
Hadoop cluster configurationHadoop cluster configuration
Hadoop cluster configurationprabakaranbrick
 
Leveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesLeveraging Hadoop in Polyglot Architectures
Leveraging Hadoop in Polyglot ArchitecturesThanigai Vellore
 
Introduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperIntroduction to Kafka and Zookeeper
Introduction to Kafka and ZookeeperRahul Jain
 
Transformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigTransformation Processing Smackdown; Spark vs Hive vs Pig
Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin
 
Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)Lucas Jellema
 
Digital Pebble Behemoth
Digital Pebble BehemothDigital Pebble Behemoth
Digital Pebble BehemothSteve Loughran
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOVirtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OO
Virtuoso RDF Triple Store Analysis Benchmark & mapping tools RDF / OOPaolo Cristofaro
 
'Scalable Logging and Analytics with LogStash'
'Scalable Logging and Analytics with LogStash''Scalable Logging and Analytics with LogStash'
'Scalable Logging and Analytics with LogStash'Cloud Elements
 

Large Scale Crawling with Apache Nutch and Friends

  • 1. Large Scale Crawling with Apache Nutch and friends...
      Julien Nioche, julien@digitalpebble.com
      LUCENE/SOLR REVOLUTION EU 2013
  • 2. About myself
      - DigitalPebble Ltd, Bristol (UK)
      - Specialised in Text Engineering
          - Web Crawling
          - Natural Language Processing
          - Information Retrieval
          - Machine Learning
      - Strong focus on Open Source & Apache ecosystem
      - VP Apache Nutch
      - User | Contributor | Committer
          - Tika
          - SOLR, Lucene
          - GATE, UIMA
          - Mahout
          - Behemoth
  • 3. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 4. Nutch?
      - “Distributed framework for large scale web crawling” (but does not have to be large scale at all)
      - Apache TLP since May 2010
      - Based on Apache Hadoop
      - Indexing and Search by [logos omitted]
  • 5. A bit of history
      - 2002/2003 : Started by Doug Cutting & Mike Cafarella
      - 2005 : MapReduce implementation in Nutch
          - 2006 : Hadoop sub-project of Lucene @Apache
      - 2006/7 : Parser and MimeType in Tika
          - 2008 : Tika sub-project of Lucene @Apache
      - May 2010 : TLP project at Apache
      - Sept 2010 : Storage abstraction in Nutch 2.x
          - 2012 : Gora TLP @Apache
  • 6. Recent Releases (timeline figure, 06/09 to 06/13)
      - trunk (1.x) : 1.0, 1.1, 1.2, 1.3, 1.4, 1.5.1, 1.6, 1.7
      - 2.x : 2.0, 2.1, 2.2.1
  • 7. Why use Nutch?
      - Usual reasons
          - Open source with a business-friendly license, mature, community, ...
      - Scalability
          - Tried and tested on very large scale
          - Standard Hadoop
      - Features
          - Index with SOLR / ES / CloudSearch
          - PageRank implementation
          - Loads of existing plugins
          - Can easily be extended / customised
  • 8. Use cases
      - Crawl for search
          - Generic or vertical
          - Index and Search with SOLR et al.
          - Single node to large clusters on Cloud
      - … but also
          - Data Mining
          - NLP (e.g. Sentiment Analysis)
          - ML
      - with
          - MAHOUT / UIMA / GATE
          - Use Behemoth as glueware (https://github.com/DigitalPebble/behemoth)
  • 9. Customer cases (chart axes: Specificity (Verticality) vs Size)
      - BetterJobs.com (CareerBuilder)
          - Single server
          - Aggregates content from job portals
          - Extracts and normalizes structure (description, requirements, locations)
          - ~2M pages total
          - Feeds SOLR index
      - SimilarPages.com
          - Large cluster on Amazon EC2 (up to 400 nodes)
          - Fetched & parsed 3 billion pages
          - 10+ billion pages in crawlDB (~100TB data)
          - 200+ million lists of similarities
          - No indexing / search involved
  • 10. CommonCrawl
      - http://commoncrawl.org/
      - Open repository of web crawl data
      - 2012 dataset : 3.83 billion docs
      - ARC files on Amazon S3
      - Using Nutch 1.7
      - A few modifications to Nutch code
          - https://github.com/Aloisius/nutch
      - Next release imminent
  • 11. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 12. Installation
      - http://nutch.apache.org/downloads.html
      - 1.7 => src and bin distributions
      - 2.2.1 => src only
      - 'ant clean runtime'
          - runtime/local => local mode (test and debug)
          - runtime/deploy => job jar for Hadoop + scripts
      - Binary distribution for 1.x == runtime/local
  • 13. Configuration and resources
      - Changes in $NUTCH_HOME/conf
          - Need recompiling with 'ant runtime'
          - Local mode => can be made directly in runtime/local/conf
      - Specify configuration in nutch-site.xml
          - Leave nutch-default.xml alone!
      - At least :
          <property>
            <name>http.agent.name</name>
            <value>WhateverNameDescribesMyMightyCrawler</value>
          </property>
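A minimal sketch of producing such an override file programmatically, assuming the usual Hadoop-style layout with a `<configuration>` root element. The helper name `build_nutch_site` and the agent name `MyTestCrawler` are placeholders for illustration, not values from the talk.

```python
import xml.etree.ElementTree as ET


def build_nutch_site(properties):
    """Build a <configuration> element holding one <property> per entry."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


# Placeholder agent name: choose one that identifies your own crawler.
site = build_nutch_site({"http.agent.name": "MyTestCrawler"})
xml_text = ET.tostring(site, encoding="unicode")
print(xml_text)
```

The resulting text would be saved as nutch-site.xml in the conf directory; only the properties that differ from the defaults need to appear in it.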
  • 14. Running it!
      - bin/crawl script : typical sequence of steps
      - bin/nutch : individual Nutch commands
          - Inject / generate / fetch / parse / update ...
      - Local mode : great for testing and debugging
      - Recommended : deploy + Hadoop (pseudo) distributed mode
          - Parallelism
          - MapReduce UI to monitor crawl, check logs, counters
  • 15. Monitor Crawl with MapReduce UI
  • 17. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 18. Typical Nutch Steps
      - Same in 1.x and 2.x
      - Sequence of batch operations
          1) Inject → populates CrawlDB from seed list
          2) Generate → selects URLs to fetch in segment
          3) Fetch → fetches URLs from segment
          4) Parse → parses content (text + metadata)
          5) UpdateDB → updates CrawlDB (new URLs, new status...)
          6) InvertLinks → builds Webgraph
          7) Index → sends docs to [SOLR | ES | CloudSearch | … ]
      - Repeat steps 2 to 7
      - Or use the all-in-one crawl script
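The batch cycle above can be sketched as a toy in-memory loop. This is not Nutch code: the `WEB` dictionary stands in for the real web, the status strings are simplified, and fetch and parse are merged into one function for brevity.

```python
# Toy model of the inject / generate / fetch / parse / updatedb cycle.
# WEB maps each page to its outlinks and replaces real HTTP fetching.
WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
}


def inject(crawldb, seeds):
    """Add seed URLs to the crawl database without touching known ones."""
    for url in seeds:
        crawldb.setdefault(url, "unfetched")


def generate(crawldb, top_n):
    """Select up to top_n unfetched URLs: this plays the role of a segment."""
    return sorted(u for u, s in crawldb.items() if s == "unfetched")[:top_n]


def fetch_and_parse(segment):
    """Return {url: outlinks}; an unknown page yields None."""
    return {url: WEB.get(url) for url in segment}


def updatedb(crawldb, parsed):
    """Record the new status of fetched URLs and add newly found outlinks."""
    for url, outlinks in parsed.items():
        crawldb[url] = "fetched" if outlinks is not None else "gone"
        for link in outlinks or []:
            crawldb.setdefault(link, "unfetched")


crawldb = {}
inject(crawldb, ["http://a.example/"])
rounds = 0
while True:
    segment = generate(crawldb, top_n=2)
    if not segment:
        break
    updatedb(crawldb, fetch_and_parse(segment))
    rounds += 1
print(rounds, crawldb)
```

Each pass of the loop corresponds to one repetition of the generate-to-update steps; link inversion and indexing are left out of the sketch.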
  • 19. Main steps from a data perspective (diagram)
      - Seed List
      - CrawlDB
      - Segment : /crawl_generate/ /crawl_fetch/ /content/ /crawl_parse/ /parse_data/ /parse_text/
      - LinkDB
  • 20. Frontier expansion
      - Manual “discovery”
          - Adding new URLs by hand, “seeding”
      - Automatic discovery of new resources (frontier expansion)
          - Not all outlinks are equally useful - control
          - Requires content parsing and link extraction
      - Diagram : frontier growing outwards from the seed over iterations i=1, i=2, i=3
      [Slide courtesy of A. Bialecki]
  • 21. An extensible framework
      - Plugins
          - Activated with parameter 'plugin.includes'
          - Implement one or more endpoints
      - Endpoints
          - Protocol
          - Parser
          - HtmlParseFilter (a.k.a. ParseFilter in Nutch 2.x)
          - ScoringFilter (used in various places)
          - URLFilter (ditto)
          - URLNormalizer (ditto)
          - IndexingFilter
          - IndexWriter (NEW IN 1.7!)
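As a rough illustration of how a 'plugin.includes' value selects plugins, the sketch below treats the value as a regular expression matched against whole plugin names. Both the pattern and the list of available plugins are illustrative assumptions; the real default lives in nutch-default.xml of your release.

```python
import re

# Illustrative value only, not copied from any particular Nutch release.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|parse-(html|tika)"
    r"|index-(basic|anchor)|indexer-solr|scoring-opic"
)

# Hypothetical set of plugin directories found on disk.
available = [
    "protocol-http", "protocol-ftp", "parse-tika", "parse-html",
    "indexer-solr", "indexer-elastic", "scoring-opic",
    "urlfilter-regex", "urlfilter-prefix",
]

pattern = re.compile(PLUGIN_INCLUDES)
# Assumption: a plugin is active only if its whole name matches the pattern.
active = [name for name in available if pattern.fullmatch(name)]
print(active)
```

Swapping `indexer-solr` for another indexer name in the pattern is the kind of change that switches the indexing backend without touching code.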
  • 22. Features
      - Fetcher
          - Multi-threaded fetcher
          - Queues URLs per hostname / domain / IP
          - Limit the number of URLs per round of fetching
          - Default values are polite but can be made more aggressive
      - Crawl Strategy
          - Breadth-first but can be depth-first
          - Configurable via custom ScoringFilters
      - Scoring
          - OPIC (On-line Page Importance Calculation) by default
          - LinkRank
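The per-host queueing idea can be sketched as follows. This is a simplified model, not the Nutch fetcher: it queues by hostname only, assumes a fixed delay between requests to the same host, and returns a plan instead of issuing requests. The function name `schedule` and the 5-second delay are assumptions made for the example.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def schedule(urls, delay=5.0):
    """Return sorted (time, url) pairs: hosts run in parallel,
    requests to the same host are spaced `delay` seconds apart."""
    queues = defaultdict(deque)
    for url in urls:
        queues[urlsplit(url).hostname].append(url)
    plan = []
    for queue in queues.values():
        t = 0.0
        while queue:
            plan.append((t, queue.popleft()))
            t += delay
    return sorted(plan)


plan = schedule([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
])
for when, url in plan:
    print(when, url)
```

Lowering `delay` is the "more aggressive" direction mentioned above; the trade-off is more load on each target server.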
  • 23. Features (cont.)
      - Protocols
          - Http, file, ftp, https
          - Respects robots.txt directives
      - Scheduling
          - Fixed or adaptive
      - URL filters
          - Regex, FSA, TLD, prefix, suffix
      - URL normalisers
          - Default, regex
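A minimal sketch of a regex URL filter, assuming rules written one per line with a leading `+` (accept) or `-` (reject), where the first rule that matches anywhere in the URL decides and unmatched URLs are rejected. The rule set and the helper names `load_rules` and `accept` are made up for this example.

```python
import re

# Example rules, one per line: '+' accepts, '-' rejects, '#' starts a comment.
RULES = r"""
# skip common static resources
-\.(gif|jpg|png|css|js|zip)$
# skip URLs containing characters typical of queries
-[?*!@=]
# accept anything else
+.
"""


def load_rules(text):
    """Parse rule lines into (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accept(url, rules):
    """First matching rule decides; no match means the URL is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


rules = load_rules(RULES)
for url in ("http://a.example/page.html",
            "http://a.example/logo.png",
            "http://a.example/search?q=1"):
    print(url, accept(url, rules))
```

Because the first match wins, the order of the rules matters: the catch-all `+.` has to come last.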
  • 24. Features (cont.)
      - Parsing with Apache Tika
          - Hundreds of formats supported
          - But some legacy parsers as well
      - Other plugins
          - CreativeCommons
          - Feeds
          - Language Identification
          - Rel tags
          - Arbitrary Metadata
      - Pluggable indexing
          - SOLR | ES etc...
  • 25. Indexing
      - Apache SOLR
          - schema.xml in conf/
          - SOLR 3.4
          - JIRA issue for SOLRCloud : https://issues.apache.org/jira/browse/NUTCH-1377
      - ElasticSearch
          - Version 0.90.1
      - AWS CloudSearch
          - WIP : https://issues.apache.org/jira/browse/NUTCH-1517
      - Easy to build your own
          - Text, DB, etc...
  • 26. Typical Nutch document
      - Some of the fields (IndexingFilters in plugins or core code)
          - url
          - content
          - title
          - anchor
          - site
          - boost
          - digest
          - segment
          - host
          - type
      - Configurable ones
          - meta tags (keywords, description etc...)
          - arbitrary metadata
  • 27. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 28. NUTCH 2.x
      - 2.0 released in July 2012
      - 2.2.1 in July 2013
      - Same features as 1.x
          - MapReduce, Tika, delegation to SOLR, etc...
      - Moved to 'big table'-like architecture
          - Wealth of NoSQL projects in last few years
      - Abstraction over storage layer → Apache GORA
  • 29. Apache GORA
      - http://gora.apache.org/
      - ORM for NoSQL databases
          - and limited SQL support + file based storage
      - Current version 0.3
      - DataStore implementations
          - Accumulo
          - Cassandra
          - HBase
          - Avro
          - DynamoDB
          - SQL (broken)
      - Serialization with Apache AVRO
      - Object-to-datastore mappings (backend-specific)
  • 30. AVRO Schema => Java code
      {"name": "WebPage",
       "type": "record",
       "namespace": "org.apache.nutch.storage",
       "fields": [
         {"name": "baseUrl", "type": ["null", "string"] },
         {"name": "status", "type": "int"},
         {"name": "fetchTime", "type": "long"},
         {"name": "prevFetchTime", "type": "long"},
         {"name": "fetchInterval", "type": "int"},
         {"name": "retriesSinceFetch", "type": "int"},
         {"name": "modifiedTime", "type": "long"},
         {"name": "protocolStatus", "type": {
           "name": "ProtocolStatus",
           "type": "record",
           "namespace": "org.apache.nutch.storage",
           "fields": [
             {"name": "code", "type": "int"},
             {"name": "args", "type": {"type": "array", "items": "string"}},
             {"name": "lastModified", "type": "long"}
           ]
         }},
         […]
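To make the structure of the schema easier to see, here is a small sketch that loads a cut-down copy of it with the standard `json` module and summarises each field's type. Only fields visible on the slide are used, the elided remainder is left out, and the helper `describe` is hypothetical rather than part of Avro or Gora.

```python
import json

# Cut-down copy of the schema above; the fields elided on the slide stay omitted.
SCHEMA = json.loads("""
{"name": "WebPage", "type": "record", "namespace": "org.apache.nutch.storage",
 "fields": [
   {"name": "baseUrl", "type": ["null", "string"]},
   {"name": "status", "type": "int"},
   {"name": "fetchTime", "type": "long"},
   {"name": "protocolStatus", "type": {
      "name": "ProtocolStatus", "type": "record",
      "namespace": "org.apache.nutch.storage",
      "fields": [
        {"name": "code", "type": "int"},
        {"name": "args", "type": {"type": "array", "items": "string"}},
        {"name": "lastModified", "type": "long"}
      ]}}
 ]}
""")


def describe(avro_type):
    """Render an Avro type (primitive, union, array or record) as short text."""
    if isinstance(avro_type, str):
        return avro_type
    if isinstance(avro_type, list):
        return "union<" + ", ".join(describe(t) for t in avro_type) + ">"
    if avro_type["type"] == "array":
        return "array<" + describe(avro_type["items"]) + ">"
    return avro_type["type"] + " " + avro_type["name"]


summary = {field["name"]: describe(field["type"]) for field in SCHEMA["fields"]}
for name, kind in summary.items():
    print(name, "->", kind)
```

The union with "null" on `baseUrl` is how Avro expresses an optional field, and the nested record shows that a field's type can itself be a full schema.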
  • 31. Mapping file (backend specific – HBase)
      <gora-orm>
        <table name="webpage">
          <family name="p" maxVersions="1"/>
          <!-- This can also have params like compression, bloom filters -->
          <family name="f" maxVersions="1"/>
          <family name="s" maxVersions="1"/>
          <family name="il" maxVersions="1"/>
          <family name="ol" maxVersions="1"/>
          <family name="h" maxVersions="1"/>
          <family name="mtdt" maxVersions="1"/>
          <family name="mk" maxVersions="1"/>
        </table>
        <class table="webpage" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
          <!-- fetch fields -->
          <field name="baseUrl" family="f" qualifier="bas"/>
          <field name="status" family="f" qualifier="st"/>
          <field name="prevFetchTime" family="f" qualifier="pts"/>
          <field name="fetchTime" family="f" qualifier="ts"/>
          <field name="fetchInterval" family="f" qualifier="fi"/>
          <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  • 32. DataStore operations
      - Basic operations
          - get(K key)
          - put(K key, T obj)
          - delete(K key)
      - Querying
          - execute(Query<K, T> query) → Result<K,T>
          - deleteByQuery(Query<K, T> query)
      - Wrappers for Apache Hadoop
          - GoraInput|OutputFormat
          - GoraRecordReader|Writer
          - GoraMapper|Reducer
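The shape of these operations can be mimicked with a small in-memory store. This is a conceptual sketch only: `InMemoryStore` and its key-range query arguments are hypothetical stand-ins, not the Gora API, and the host-reversed keys are used here simply so that pages from one host sort next to each other.

```python
class InMemoryStore:
    """Dictionary-backed stand-in for a key/value datastore."""

    def __init__(self):
        self._rows = {}

    def get(self, key):
        return self._rows.get(key)

    def put(self, key, obj):
        self._rows[key] = obj

    def delete(self, key):
        """Return True if the key existed and was removed."""
        return self._rows.pop(key, None) is not None

    def execute(self, start_key=None, end_key=None):
        """Yield (key, obj) pairs in key order within [start_key, end_key]."""
        for key in sorted(self._rows):
            if start_key is not None and key < start_key:
                continue
            if end_key is not None and key > end_key:
                continue
            yield key, self._rows[key]

    def delete_by_query(self, start_key=None, end_key=None):
        """Delete every row in the key range and return how many were removed."""
        keys = [key for key, _ in self.execute(start_key, end_key)]
        for key in keys:
            del self._rows[key]
        return len(keys)


store = InMemoryStore()
store.put("com.example.a:http/", {"status": 2})
store.put("com.example.a:http/1", {"status": 1})
store.put("com.example.b:http/", {"status": 1})

# Range scan over a single host, relying on the sorted, host-reversed keys.
host_rows = list(store.execute("com.example.a:", "com.example.a:\uffff"))
removed = store.delete_by_query("com.example.a:", "com.example.a:\uffff")
print(len(host_rows), removed, store.get("com.example.b:http/"))
```

A real backend would push the key range down to the datastore instead of scanning every row, which is the point of the filtered-scan work mentioned later in the talk.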
  • 33. GORA in Nutch
      - AVRO schema provided and Java code pre-generated
      - Mapping files provided for backends
          - can be modified if necessary
      - Need to rebuild to get dependencies for backend
          - hence source only distribution of Nutch 2.x
      - http://wiki.apache.org/nutch/Nutch2Tutorial
  • 34. Benefits
      - Storage still distributed and replicated
      - … but one big table
          - status, metadata, content, text → one place
          - no more segments
      - Resumable fetch and parse steps
      - Easier interaction with other resources
          - Third-party code just needs to use GORA and schema
      - Simplifies the Nutch code
      - Potentially faster (e.g. update step)
  • 35. Drawbacks
      - More stuff to install and configure
          - Higher hardware requirements
      - Current performance :-(
          - http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
          - N2+HBase : 2.7x slower than 1.x
          - N2+Cassandra : 4.4x slower than 1.x
          - due mostly to GORA layer : not inherent to HBase or Cassandra
          - https://issues.apache.org/jira/browse/GORA-119 → filtered scans
          - Not all backends provide data locality!
      - Not as stable as Nutch 1.x
  • 36. 2.x Work in progress
      - Stabilise backend implementations
          - GORA-HBase most reliable
      - Synchronize features with 1.x
          - e.g. missing LinkRank equivalent (GSOC 2013 – use Apache Giraph)
          - No pluggable indexers yet (NUTCH-1568)
      - Filter enabled scans
          - GORA-119 => don't need to de-serialize the whole dataset
  • 37. Outline
      - Overview
      - Installation and setup
      - Main steps
      - Nutch 2.x
      - Future developments
  • 38. Future
      - 1.x and 2.x to coexist in parallel
          - 2.x not yet a replacement of 1.x
      - New functionalities
          - Support for SOLRCloud
          - Sitemap (from CrawlerCommons library)
          - Canonical tag
          - Generic deduplication (NUTCH-656)
      - Move to new MapReduce API
          - Use Nutch on Hadoop 2.x
  • 39. More delegation
      - Great deal done in recent years (SOLR, Tika)
      - Share code with crawler-commons (http://code.google.com/p/crawler-commons/)
          - Fetcher / protocol handling
          - URL normalisation / filtering
      - PageRank-like computations to graph library
          - Apache Giraph
          - Should be more efficient + less code to maintain
  • 40. Longer term
      - Hadoop 2.x & YARN
      - Convergence of batch and streaming
          - Storm / Samza / Storm-YARN / …
      - End of 100% batch operations ?
          - Fetch and parse as streaming ?
          - Always be fetching
          - Generate / update / pagerank remain batch
      - See https://github.com/DigitalPebble/storm-crawler
  • 41. Where to find out more?
      - Project page : http://nutch.apache.org/
      - Wiki : http://wiki.apache.org/nutch/
      - Mailing lists :
          - user@nutch.apache.org
          - dev@nutch.apache.org
      - Chapter in 'Hadoop: The Definitive Guide' (T. White)
          - Understanding Hadoop is essential anyway...
      - Support / consulting :
          - http://wiki.apache.org/nutch/Support