A quick introduction to Storm Crawler
Julien Nioche 
julien@digitalpebble.com 
@digitalpebble 
ApacheCon EU 2014 - Budapest
About myself
• DigitalPebble Ltd, Bristol (UK)
• Specialised in Text Engineering
  – Web Crawling
  – Natural Language Processing
  – Information Retrieval
  – Machine Learning
• Strong focus on Open Source & Apache ecosystem
• PMC Chair Apache Nutch
• User | Contributor | Committer
  – Tika
  – SOLR, Lucene
  – GATE, UIMA
  – Mahout
  – Behemoth
What is it?
• Collection of resources (SDK) for building web crawlers on Apache Storm
• https://github.com/DigitalPebble/storm-crawler
• Artefacts available from Maven Central
• Apache License v2
• Scalable
• Low latency
• Easily extensible
What it is not
• A ready-to-use, feature-complete, recursive web crawler
  – Might become something like that later, as a separate project built on S/C
• e.g. no PageRank or explicit ranking of pages
  – Build your own
• No fancy UI, dashboards, etc.
  – Build your own
Comparison with Nutch
• Nutch is batch-driven: little control over when URLs are fetched
  – Potential issue for use cases that need sessions
  – latency++
• Fetching is only one of the steps in Nutch
  – SC: 'always be fetching' (Ken Krugler); better use of resources
• Make it even more flexible
  – Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
• Not as ready-to-use as Nutch: it's an SDK
• Would not have existed without it
  – Borrowed code and concepts
Overview of resources 
https://www.flickr.com/photos/dipster1/1403240351/
FetcherBolt
• Multi-threaded
• Polite
  – Puts incoming tuples into internal queues based on IP/domain/hostname
  – Sets delay between requests from same queue
  – Respects robots.txt
• Protocol-neutral
  – Protocol implementations are pluggable
  – HTTP implementation taken from Nutch
• Output
  – String URL
  – byte[] content
  – HashMap<String, String[]> metadata
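The politeness behaviour described for the FetcherBolt can be pictured with a small, self-contained sketch. This is not the FetcherBolt source: the class name PoliteQueuesSketch, its methods, and the choice of the hostname as queue key are all assumptions made for illustration. It only shows the idea of one queue per key plus a minimum delay between two requests taken from the same queue.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only (hypothetical class, not storm-crawler code).
class PoliteQueuesSketch {
    private final long delayMillis; // minimum gap between two requests to the same key
    private final Map<String, Queue<String>> queues = new LinkedHashMap<>();
    private final Map<String, Long> nextAllowed = new LinkedHashMap<>();

    PoliteQueuesSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Queue key: the hostname here; an IP or domain could be used the same way.
    static String keyFor(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    void add(String url) {
        queues.computeIfAbsent(keyFor(url), k -> new ArrayDeque<>()).add(url);
    }

    // Returns a URL from the first queue that may be fetched at time `now`,
    // or null when every non-empty queue is still inside its delay window.
    String poll(long now) {
        for (Map.Entry<String, Queue<String>> e : queues.entrySet()) {
            long allowedAt = nextAllowed.getOrDefault(e.getKey(), 0L);
            if (!e.getValue().isEmpty() && now >= allowedAt) {
                nextAllowed.put(e.getKey(), now + delayMillis);
                return e.getValue().poll();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PoliteQueuesSketch q = new PoliteQueuesSketch(1000);
        q.add("http://example.com/a");
        q.add("http://example.com/b");
        q.add("http://example.org/c");
        System.out.println(q.poll(0));    // first URL queued for example.com
        System.out.println(q.poll(0));    // example.com must wait, so example.org is served
        System.out.println(q.poll(0));    // nothing is ready yet
        System.out.println(q.poll(1000)); // delay elapsed, second example.com URL
    }
}
```

A real fetcher would also consult robots.txt and run several fetch threads over these queues; both are left out to keep the sketch short.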
ParserBolt
• Based on Apache Tika
• Supports most commonly used document formats
  – HTML, PDF, DOC, etc.
• Calls ParseFilters on document
  – e.g. scrape info with XPathFilter
• Calls URLFilters on outlinks
  – e.g. normalize and/or blacklist URLs based on regular expressions
• Output
  – String URL
  – byte[] content
  – HashMap<String, String[]> metadata
  – String text
  – Set<String> outlinks
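As a rough illustration of what a URL filter applied to outlinks might do, the sketch below normalizes a URL and then rejects it if it matches a blacklist of regular expressions. The class RegexUrlFilterSketch and its methods are hypothetical and do not reproduce the storm-crawler URLFilter interface; the normalization steps (trim, drop the fragment, resolve dot segments) are assumptions chosen for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the storm-crawler URLFilter interface.
class RegexUrlFilterSketch {
    private final List<Pattern> blacklist = new ArrayList<>();

    RegexUrlFilterSketch(List<String> regexes) {
        for (String regex : regexes) {
            blacklist.add(Pattern.compile(regex));
        }
    }

    // Assumed normalization: trim, drop the #fragment, resolve "." and ".." segments.
    // URI.create throws IllegalArgumentException for malformed input.
    static String normalize(String url) {
        String s = url.trim();
        int hash = s.indexOf('#');
        if (hash >= 0) {
            s = s.substring(0, hash);
        }
        return URI.create(s).normalize().toString();
    }

    // Returns the normalized URL, or null when a blacklist pattern matches it.
    String filter(String url) {
        String normalized = normalize(url);
        for (Pattern p : blacklist) {
            if (p.matcher(normalized).find()) {
                return null;
            }
        }
        return normalized;
    }

    public static void main(String[] args) {
        RegexUrlFilterSketch f = new RegexUrlFilterSketch(
                Arrays.asList("\\.(jpg|png|gif)$", "^mailto:"));
        System.out.println(f.filter("http://example.com/a/../b#top")); // kept, normalized
        System.out.println(f.filter("http://example.com/logo.png"));   // rejected
    }
}
```

Returning null for a rejected URL lets several such filters be chained: the first filter that returns null stops the chain.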
Other resources
• ElasticSearchBolt
  – Sends fields to ElasticSearch for indexing
  – (deprecated by resources in elasticsearch-hadoop?)
• URLPartitionerBolt
  – Generates a key based on the hostname / domain / IP of URL
  – Output:
    • String URL
    • String key
    • String metadata
  – Useful for fieldsGrouping
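To see why a per-host key is useful for grouping, here is a minimal sketch under stated assumptions: the class UrlPartitionerSketch is hypothetical, the key is simply the lower-cased hostname, and taskFor uses a plain hash modulo as a stand-in for whatever Storm does internally when grouping on a field. The point is only that URLs sharing a key always land on the same downstream task, which is what per-host politeness needs.

```java
import java.net.URI;
import java.util.Locale;

// Illustrative sketch only (hypothetical class, not the URLPartitionerBolt source).
class UrlPartitionerSketch {

    // Partition key: the lower-cased hostname (domain or IP would work similarly).
    static String keyFor(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    // Stand-in for grouping on the key field: same key, same task index.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int tasks = 4;
        String[] urls = {
            "http://Example.com/a",
            "http://example.com/b",
            "http://example.org/c"
        };
        for (String url : urls) {
            String key = keyFor(url);
            System.out.println(url + " -> key=" + key + ", task=" + taskFor(key, tasks));
        }
    }
}
```

In an actual topology the routing would be declared rather than computed by hand, for example by subscribing the fetcher to the partitioner with Storm's fieldsGrouping on the "key" field; the component wiring is not shown here.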
Other resources
• ConfigurableTopology
  – Overrides config with local YAML file
  – Simple switch for running in local mode
  – Abstract class to be extended
• Simple Spouts (for testing)
  – FileSpout / RandomURLSpout
• Various Metrics-related stuff
  – Including a MetricsConsumer for https://www.librato.com/
• FetchQueue package
  – BlockingURLSpout and ShardedQueue abstraction
Integrate it!
• Write your own Spout for your use case
  – Will work fine with existing resources as long as it generates URL, metadata
• Typical scenario
  – Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
  – Write a Spout for it and throttle with topology.max.spout.pending
  – So that politeness can be enforced without tuples timing out (→ fail)
  – Parse and extract
  – Send new URLs to queues
• Can use various forms of persistence for URLs
  – ElasticSearch, DynamoDB, HBase, etc.
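The throttling idea behind topology.max.spout.pending can be simulated without Storm. In the sketch below, the class ThrottledSpoutSketch and its methods are hypothetical; in a real topology the framework enforces the limit by not asking the spout for more tuples, whereas here the check is modelled inside the class for simplicity. It shows that no more than maxPending URLs are in flight at once, so slow (polite) fetching downstream holds back emission instead of letting tuples pile up and time out.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Illustrative simulation only; not a Storm spout implementation.
class ThrottledSpoutSketch {
    private final Queue<String> source = new ArrayDeque<>(); // stands in for an external queue
    private final Set<String> pending = new HashSet<>();     // emitted but not yet acked or failed
    private final int maxPending;

    ThrottledSpoutSketch(int maxPending, List<String> urls) {
        this.maxPending = maxPending;
        this.source.addAll(urls);
    }

    // Emits the next URL, or null when the pending limit is reached or the source is empty.
    String nextTuple() {
        if (pending.size() >= maxPending || source.isEmpty()) {
            return null;
        }
        String url = source.poll();
        pending.add(url);
        return url;
    }

    // Downstream processing finished: free one slot.
    void ack(String url) {
        pending.remove(url);
    }

    // Processing failed or timed out: free the slot and re-queue the URL.
    void fail(String url) {
        if (pending.remove(url)) {
            source.add(url);
        }
    }

    public static void main(String[] args) {
        ThrottledSpoutSketch spout =
                new ThrottledSpoutSketch(2, Arrays.asList("u1", "u2", "u3"));
        System.out.println(spout.nextTuple()); // u1
        System.out.println(spout.nextTuple()); // u2
        System.out.println(spout.nextTuple()); // null: two tuples already pending
        spout.ack("u1");
        System.out.println(spout.nextTuple()); // u3
    }
}
```

The limit is a trade-off: a low value keeps tuples from timing out while the fetcher waits between requests to the same host, at the cost of lower overall throughput.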
Some use cases (prototype stage)
• Processing of streams of data (natural fit for Storm)
  – http://www.weborama.com
• Monitoring of finite set of URLs
  – http://www.ontopic.io (more on them later)
  – http://www.shopstyle.com : scraping + indexing
• One-off non-recursive crawling
  – http://www.stolencamerafinder.com/ : scraping + indexing
• Recursive crawler
  – WIP
What's next?
• All-in-one crawler project built on SC
  – Also a good example of how to use SC
• Additional Parse/URLFilters
• More tests and documentation
• A nice logo (this is an invitation)
• A better name?
Questions?
