Low latency scalable web crawling on Apache Storm
Julien Nioche
julien@digitalpebble.com
Berlin Buzzwords 01/06/2015
Storm Crawler
digitalpebble
2
About myself
• DigitalPebble Ltd, Bristol (UK)
• Specialised in Text Engineering
– Web Crawling
– Natural Language Processing
– Information Retrieval
– Machine Learning
• Strong focus on Open Source & Apache ecosystem
– Nutch
– Tika
– GATE, UIMA
– SOLR, Elasticsearch
– Behemoth
3
Storm-Crawler : what is it?
• Collection of resources (SDK) for building web crawlers on Apache Storm
• https://github.com/DigitalPebble/storm-crawler
• Apache License v2
• Artefacts available from Maven Central
• Active and growing fast
• Version 0.5 just released!
=> Can scale
=> Can do low latency
=> Easy to use and extend
=> Nice features (politeness, scraping, sitemaps etc.)
4
What it is not
• A well-packaged application
– It's an SDK : it requires some minimal programming and configuration
• No global processing of the pages, e.g. PageRank
• No fancy UI, dashboards, etc.
– Build your own or integrate into your existing ones
– But can use the Storm UI and send metrics to various backends (Librato, ElasticSearch)
5
Comparison with Apache Nutch
6
StormCrawler vs Nutch
• Nutch is batch driven : little control over when URLs are fetched
– Potential issue for use cases that need sessions
– latency++
• Fetching is only one of the steps in Nutch
– SC : 'always fetching' ; better use of resources
• More flexible
– Typical case : a few custom classes (at least a Topology) ; the rest are just dependencies and standard SC components
– Logical crawls : multiple crawlers with their own scheduling are easier with SC via queues
• Not as ready-to-use as Nutch : it's an SDK
• Would not have existed without it
– Borrowed some code and concepts
– Contributed some stuff back
7
Apache Storm
8
Apache Storm
• Distributed real time computation system
• http://storm.apache.org/
• Scalable, fault-tolerant, polyglot and fun!
• Implemented in Clojure + Java
• “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
9
Topology = spouts + bolts + tuples + streams
10
What's in Storm Crawler?
https://www.flickr.com/photos/dipster1/1403240351/
11
Resources: core vs external
• Core
– Fetcher(s) Bolts
– JSoupParserBolt
– SitemapParserBolt
– ...
• External
– Metrics-related : incl. connector for Librato
– ElasticSearch : Indexer, Spout, StatusUpdater, connector for metrics
– Tika-based parser bolt
• User-maintained external resources
• Generic Storm resources (e.g. spouts)
12
FetcherBolt
• Multi-threaded
• Polite
– Puts incoming tuples into internal queues based on IP/domain/hostname
– Sets delay between requests from the same queue
– Internal fetch threads
– Respects robots.txt
• Protocol-neutral
– Protocol implementations are pluggable
– Default HTTP implementation based on Apache HttpClient
• There is also a SimpleFetcherBolt
– Politeness then needs to be handled elsewhere (spout?)
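The politeness rules above (one queue per host, a delay between requests from the same queue) can be sketched as follows. This is a hypothetical illustration in plain Java, not StormCrawler's FetcherBolt code: the class PolitenessSketch, its method names and the fixed delay are assumptions, and the queue key is simply the hostname.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of StormCrawler: per-host politeness bookkeeping. */
final class PolitenessSketch {
    private final long delayMillis;
    // queue key (hostname here) -> earliest time the next request may start
    private final Map<String, Long> nextAllowed = new HashMap<>();

    PolitenessSketch(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Queue key for a URL: the lower-cased hostname (empty if there is none). */
    static String queueKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /**
     * Returns the time (ms) at which the URL may be fetched and reserves that slot,
     * so the next URL from the same queue starts at least delayMillis later.
     */
    long schedule(String url, long nowMillis) {
        String key = queueKey(url);
        long start = Math.max(nowMillis, nextAllowed.getOrDefault(key, nowMillis));
        nextAllowed.put(key, start + delayMillis);
        return start;
    }
}
```

URLs from different hosts are never delayed by each other; only requests that share a queue key wait.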
13
JSoupParserBolt
• HTML only, but there is a Tika-based one in external
• Extracts text and outlinks
• Calls URLFilters on outlinks
– normalize and / or blacklist URLs
• Calls ParseFilters on document
– e.g. scrape info with XpathFilter
– Enrich metadata content
14
ParseFilter
• Extracts information from document
– Called by *ParserBolt(s)
– Configured via JSON file
– Interface
  • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
– com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
  • Xpath expressions
  • Info extracted stored in metadata
  • Used later for indexing
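As a rough illustration of the XPath-based extraction described above, here is a sketch using only the JDK's javax.xml.xpath API. It is not the StormCrawler class: XPathFilterSketch, the rules map and the use of a plain Map in place of Metadata are assumptions, and the method takes any DOM Node rather than the DocumentFragment shown in the interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for an XPath-based ParseFilter. */
final class XPathFilterSketch {
    // metadata key -> XPath expression (assumed to come from a JSON config file)
    private final Map<String, String> rules;

    XPathFilterSketch(Map<String, String> rules) {
        this.rules = rules;
    }

    /** Evaluates every rule against the parsed document and stores the matches in metadata. */
    void filter(String url, byte[] content, Node doc, Map<String, List<String>> metadata) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            NodeList hits;
            try {
                hits = (NodeList) xpath.evaluate(rule.getValue(), doc, XPathConstants.NODESET);
            } catch (XPathExpressionException e) {
                throw new IllegalArgumentException("bad XPath: " + rule.getValue(), e);
            }
            for (int i = 0; i < hits.getLength(); i++) {
                String text = hits.item(i).getTextContent().trim();
                if (!text.isEmpty()) {
                    metadata.computeIfAbsent(rule.getKey(), k -> new ArrayList<>()).add(text);
                }
            }
        }
    }

    /** Convenience for trying the sketch: parses well-formed XML/XHTML into a DOM. */
    static Node parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }
}
```

The values collected under each key are what an indexer could later pick up from the metadata.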
15
URLFilter
• Controls crawl expansion
• Deletes or rewrites URLs
• Configured in JSON file
• Interface
– String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
• Available filters
– Basic
– MaxDepth
– Host
– RegexURLFilter
– RegexURLNormalizer
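A filter chain along these lines could look like the sketch below. It mirrors the signature from the slide but is not StormCrawler code: the interface UrlFilterSketch, the two example filters and applyAll are hypothetical, Metadata is replaced by a plain Map, and returning null to drop a URL is an assumption made for the illustration.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.Map;

/** Mirrors the signature on the slide; Metadata is replaced by a plain Map (assumption). */
interface UrlFilterSketch {
    /** Returns the URL to keep (possibly rewritten), or null to drop it. */
    String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter);
}

final class UrlFilters {
    /** Normalizer-style filter: strips the #fragment part. */
    static final UrlFilterSketch STRIP_FRAGMENT = (src, md, url) -> {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    };

    /** Host-style filter: keeps only outlinks on the same host as the source page. */
    static final UrlFilterSketch SAME_HOST = (src, md, url) -> {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null; // unparsable outlinks are dropped
        }
        return host != null && host.equalsIgnoreCase(src.getHost()) ? url : null;
    };

    /** Runs the filters in order and stops as soon as one of them drops the URL. */
    static String applyAll(List<UrlFilterSketch> filters, String sourceUrl,
                           Map<String, String> sourceMetadata, String urlToFilter) {
        URL source;
        try {
            source = URI.create(sourceUrl).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad source URL: " + sourceUrl, e);
        }
        String current = urlToFilter;
        for (UrlFilterSketch f : filters) {
            current = f.filter(source, sourceMetadata, current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }
}
```

Ordering matters in such a chain: normalizing first means later filters see the cleaned-up URL.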
16
Basic Crawl Topology
(1) Some Spout : pull URLs from an external source
(2) URLPartitioner : group per host / domain / IP
(3) Fetcher : fetch URLs
(4) Parser : parse content
(5) Indexer : index text and metadata
[Diagram: components chained on the default stream]
17
Basic Topology : Tuple perspective
Some Spout → (1) → URLPartitioner → (2) → Fetcher → (3) → Parser → (4) → Indexer (default stream)
(1) <String url, Metadata m>
(2) <String url, String key, Metadata m>
(3) <String url, byte[] content, Metadata m>
(4) <String url, byte[] content, Metadata m, String text>
18
Basic scenario
• Spout : simple queue e.g. RabbitMQ, AWS SQS, etc.
• Indexer : any form of storage but often an index e.g. SOLR, ElasticSearch
• Components in grey : default ones from SC
• Use case : non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
• “Great! What about recursive crawls and / or failure handling?”
19
Frontier expansion
• Manual “discovery”
– Adding new URLs by hand, “seeding”
• Automatic discovery of new resources (frontier expansion)
– Not all outlinks are equally useful : control
– Requires content parsing and link extraction
[Figure: link graph growing outwards from the seed over iterations i = 1, i = 2, i = 3]
[Slide courtesy of A. Bialecki]
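Frontier expansion can be pictured as a breadth-first walk over the link graph. The sketch below is a toy, in-memory version under stated assumptions: FrontierSketch and its parameters are hypothetical, outlinks are given as a ready-made map instead of being extracted from fetched pages, and the only "control" is a maximum depth.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FrontierSketch {
    /**
     * Breadth-first expansion from the seeds. Returns each discovered URL with the
     * iteration (depth) at which it was first seen; seeds are at depth 0.
     */
    static Map<String, Integer> expand(Map<String, List<String>> outlinks,
                                       List<String> seeds, int maxDepth) {
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Deque<String> frontier = new ArrayDeque<>();
        for (String seed : seeds) {
            if (depthOf.putIfAbsent(seed, 0) == null) {
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            int depth = depthOf.get(url);
            if (depth == maxDepth) {
                continue; // control: do not follow links any deeper
            }
            for (String link : outlinks.getOrDefault(url, Collections.emptyList())) {
                // only URLs never seen before join the frontier
                if (depthOf.putIfAbsent(link, depth + 1) == null) {
                    frontier.add(link);
                }
            }
        }
        return depthOf;
    }
}
```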
20
Recursive Crawl Topology
[Diagram: Some Spout, URLPartitioner, Fetcher, Parser and Indexer on the default stream, plus a status stream with a Switch and a Status Updater writing to a DB of unique URLs]
21
Status stream
• Used for handling errors and updating info about URLs
• Newly discovered URLs from parser
public enum Status {
    DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;
}
• StatusUpdater : writes to storage
• Can/should extend AbstractStatusUpdaterBolt
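To make the role of a status updater more concrete, here is a toy in-memory store driven by status tuples. It is a sketch, not AbstractStatusUpdaterBolt: StatusStoreSketch is a hypothetical class, the map stands in for the external storage, and the rule that DISCOVERED never overwrites an already known URL is an assumption about how a "unique URLs" store would behave.

```java
import java.util.HashMap;
import java.util.Map;

final class StatusStoreSketch {
    /** Same values as the enum on the slide. */
    enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR }

    // stand-in for the external storage: one entry per unique URL
    private final Map<String, Status> byUrl = new HashMap<>();

    /** Applies one status-stream tuple. */
    void update(String url, Status status) {
        if (status == Status.DISCOVERED) {
            // a rediscovered URL keeps whatever status it already has
            byUrl.putIfAbsent(url, status);
        } else {
            byUrl.put(url, status);
        }
    }

    Status get(String url) {
        return byUrl.get(url);
    }

    int size() {
        return byUrl.size();
    }
}
```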
22
External : ElasticSearch
• IndexerBolt
– Indexes URL / metadata / text for search
– extends AbstractIndexerBolt
• StatusUpdaterBolt
– extends AbstractStatusUpdaterBolt
– URL / status / metadata / nextFetchDate in status index
• ElasticSearchSpout
– Reads from status index
– Sends URL + Metadata tuples down topology
• MetricsConsumer
– indexes Storm metrics for displaying e.g. with Kibana
23
How to use it?
24
How to use it?
• Write your own Topology class (or hack the example one)
• Put resources in src/main/resources
– URLFilters, ParseFilters etc.
• Build uber-jar with Maven
– mvn clean package
• Set the configuration
– External YAML file (e.g. crawler-conf.yaml)
• Call Storm
storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
25
Topology in code
[Code screenshot not recoverable; callout: “Grouping! Politeness and data locality”]
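The point of the grouping callout is that all URLs sharing a partition key should reach the same fetcher task, so per-host delays can be enforced in one place. In Storm this is what a fields grouping on the key field provides. The sketch below only imitates the idea in plain Java: FieldsGroupingSketch is hypothetical, and hashing the key modulo the number of tasks is an assumption about the mechanics, not Storm's exact implementation.

```java
final class FieldsGroupingSketch {
    /**
     * Picks the fetcher task for a partition key (e.g. a hostname).
     * Equal keys always map to the same task index in [0, numTasks).
     */
    static int taskFor(String key, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(key.hashCode(), numTasks);
    }
}
```

With a shuffle-style distribution instead, two URLs from the same host could land on different tasks, and neither task would know about the other's last request time.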
26
27
Case Studies
• 'No follow' crawl #1: existing list of URLs only – one off
– http://www.stolencamerafinder.com/ : [RabbitMQ + Elasticsearch]
• 'No follow' crawl #2: Streams of URLs
– http://www.weborama.com : [queues? + HBase]
• Monitoring of finite set of URLs / non recursive crawl
– http://www.shopstyle.com : scraping + indexing [DynamoDB + AWS SQS]
– http://www.ontopic.io : [Redis + Kafka + ElasticSearch]
– http://www.careerbuilder.com : [RabbitMQ + Redis + ElasticSearch]
• Full recursive crawler
– http://www.shopstyle.com : discovery of product pages [DynamoDB]
28
What's next?
• Just released 0.5
– Improved wiki documentation but can always do better
• All-in-one crawler based on ElasticSearch
– Also a good example of how to use SC
– Separate project
– Most resources already available in external
• Additional Parse/URLFilters
– Basic URL Normalizer #120
• Parse -> generate output for more than one document #117
– Change to the ParseFilter interface
• Selenium-based protocol implementation
– Handle AJAX pages
29
Resources
• https://storm.apache.org/
• https://github.com/DigitalPebble/storm-crawler/wiki
• http://nutch.apache.org/
• “Storm Applied” http://www.manning.com/sallen/
30
Questions?
31

Low latency scalable web crawling on Apache Storm

  • 1. Low latency scalable web crawling on Apache Storm
      Julien Nioche (julien@digitalpebble.com), DigitalPebble
      Berlin Buzzwords, 01/06/2015
      Storm Crawler
  • 2. About myself
      • DigitalPebble Ltd, Bristol (UK)
      • Specialised in Text Engineering
          – Web Crawling
          – Natural Language Processing
          – Information Retrieval
          – Machine Learning
      • Strong focus on Open Source & Apache ecosystem
          – Nutch
          – Tika
          – GATE, UIMA
          – SOLR, Elasticsearch
          – Behemoth
  • 3. Storm-Crawler: what is it?
      • Collection of resources (SDK) for building web crawlers on Apache Storm
      • https://github.com/DigitalPebble/storm-crawler
      • Apache License v2
      • Artefacts available from Maven Central
      • Active and growing fast
      • Version 0.5 just released!
      => Can scale
      => Can do low latency
      => Easy to use and extend
      => Nice features (politeness, scraping, sitemaps etc.)
  • 4. What it is not
      • A well-packaged application
          – It's an SDK: it requires some minimal programming and configuration
      • No global processing of the pages, e.g. PageRank
      • No fancy UI, dashboards, etc.
          – Build your own or integrate into your existing ones
          – But can use Storm UI + metrics to various backends (Librato, ElasticSearch)
  • 6. StormCrawler vs Nutch
      • Nutch is batch driven: little control over when URLs are fetched
          – Potential issue for use cases that need sessions
          – latency++
      • Fetching is only one of the steps in Nutch
          – SC: 'always fetching'; better use of resources
      • More flexible
          – Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard SC components
          – Logical crawls: multiple crawlers with their own scheduling are easier with SC via queues
      • Not as ready-to-use as Nutch: it's an SDK
      • Would not have existed without it
          – Borrowed some code and concepts
          – Contributed some stuff back
  • 8. Apache Storm
      • Distributed real time computation system
      • http://storm.apache.org/
      • Scalable, fault-tolerant, polyglot and fun!
      • Implemented in Clojure + Java
      • “Realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more”
  • 9. Topology = spouts + bolts + tuples + streams
  • 10. What's in Storm Crawler?
      [Image: https://www.flickr.com/photos/dipster1/1403240351/]
  • 11. Resources: core vs external
      • Core
          – Fetcher(s) Bolts
          – JSoupParserBolt
          – SitemapParserBolt
          – ...
      • External
          – Metrics-related: inc. connector for Librato
          – ElasticSearch: Indexer, Spout, StatusUpdater, connector for metrics
          – Tika-based parser bolt
      • User-maintained external resources
      • Generic Storm resources (e.g. spouts)
  • 12. FetcherBolt
      • Multi-threaded
      • Polite
          – Puts incoming tuples into internal queues based on IP/domain/hostname
          – Sets delay between requests from same queue
          – Internal fetch threads
          – Respects robots.txt
      • Protocol-neutral
          – Protocol implementations are pluggable
          – Default HTTP implementation based on Apache HttpClient
      • Also have a SimpleFetcherBolt
          – Need to handle politeness elsewhere (spout?)
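The politeness mechanism on this slide (one internal queue per host, with a delay between requests from the same queue) can be sketched in plain Java without any Storm dependency. This is only an illustration, not StormCrawler's FetcherBolt code: the class `PoliteQueues`, its methods and the fixed delay are hypothetical, and the current time is passed in explicitly so that the behaviour is deterministic.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of per-host politeness queues (not the real FetcherBolt). */
class PoliteQueues {
    private final long delayMillis;
    // One FIFO queue of URLs per key; the key here is the hostname.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Earliest time (in ms) at which each queue may be fetched from again.
    private final Map<String, Long> nextAllowed = new LinkedHashMap<>();

    PoliteQueues(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Queues a URL under its hostname (the slide also mentions IP or domain as keys). */
    void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, k -> new ArrayDeque<>()).add(url);
    }

    /** Returns a URL from a queue that may be fetched at time 'now', or null if none may. */
    String poll(long now) {
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            long allowedAt = nextAllowed.getOrDefault(e.getKey(), 0L);
            if (!e.getValue().isEmpty() && now >= allowedAt) {
                // Block this queue until the politeness delay has elapsed.
                nextAllowed.put(e.getKey(), now + delayMillis);
                return e.getValue().poll();
            }
        }
        return null;
    }
}
```

In a real fetcher several threads would call something like `poll` concurrently, so the maps would need synchronisation; that is left out here to keep the sketch short.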
  • 13. JSoupParserBolt
      • HTML only, but there is a Tika-based one in external
      • Extracts text and outlinks
      • Calls URLFilters on outlinks
          – normalizes and/or blacklists URLs
      • Calls ParseFilters on document
          – e.g. scrape info with XpathFilter
          – Enrich metadata content
  • 14. ParseFilter
      • Extracts information from document
          – Called by *ParserBolt(s)
          – Configured via JSON file
          – Interface
              • filter(String URL, byte[] content, DocumentFragment doc, Metadata metadata)
          – com.digitalpebble.storm.crawler.parse.filter.XpathFilter.java
              • Xpath expressions
              • Info extracted stored in metadata
              • Used later for indexing
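The idea behind an XPath-based ParseFilter (evaluate expressions against the parsed document and store the results in the metadata) can be sketched with the JDK's own XML and XPath APIs. This is an assumption-laden illustration rather than the project's filter: `XPathScrapeSketch` and `scrape` are hypothetical names, a plain `Map<String, String>` stands in for the crawler's Metadata class, and the input is assumed to be well-formed XHTML, whereas the real bolt parses arbitrary HTML with JSoup first.

```java
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

/** Hypothetical sketch of XPath-based scraping into a metadata map. */
class XPathScrapeSketch {

    /**
     * Evaluates each rule (metadata key -> XPath expression) against the document
     * and returns the non-empty results keyed by metadata name.
     */
    static Map<String, String> scrape(byte[] content, Map<String, String> rules) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(content));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> metadata = new LinkedHashMap<>();
            for (Map.Entry<String, String> rule : rules.entrySet()) {
                // evaluate(String, Object) returns the string value of the first match, or "".
                String value = xpath.evaluate(rule.getValue(), doc).trim();
                if (!value.isEmpty()) {
                    metadata.put(rule.getKey(), value);
                }
            }
            return metadata;
        } catch (Exception e) {
            // Parsing or XPath errors are wrapped so that callers need no checked exceptions.
            throw new IllegalStateException("Could not scrape document", e);
        }
    }
}
```

The rules map plays the role of the JSON configuration mentioned on the slide: each entry names a metadata key and the expression that fills it.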
  • 15. URLFilter
      • Controls crawl expansion
      • Deletes or rewrites URLs
      • Configured in JSON file
      • Interface
          – String filter(URL sourceUrl, Metadata sourceMetadata, String urlToFilter)
      • Basic
      • MaxDepth
      • Host
      • RegexURLFilter
      • RegexURLNormalizer
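A URLFilter either rewrites a URL or removes it. The sketch below mirrors the interface signature from the slide, with two assumptions that are not stated there: a `Map<String, String>` stands in for the Metadata class, and returning `null` is taken to mean "discard this URL". The names `UrlFilterSketch`, `FragmentRemover`, `SameHostFilter` and `UrlFilters` are hypothetical and do not correspond to the filters listed on the slide.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.Map;

/** Stand-in for the URLFilter interface; Map replaces the crawler's Metadata class. */
interface UrlFilterSketch {
    /** Returns the (possibly rewritten) URL, or null to discard it (assumed convention). */
    String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter);
}

/** Rewrites a URL by dropping its #fragment part. */
class FragmentRemover implements UrlFilterSketch {
    @Override
    public String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter) {
        int hash = urlToFilter.indexOf('#');
        return hash < 0 ? urlToFilter : urlToFilter.substring(0, hash);
    }
}

/** Keeps only URLs on the same host as the page they were found on. */
class SameHostFilter implements UrlFilterSketch {
    @Override
    public String filter(URL sourceUrl, Map<String, String> sourceMetadata, String urlToFilter) {
        try {
            String host = URI.create(urlToFilter).getHost();
            return host != null && host.equalsIgnoreCase(sourceUrl.getHost()) ? urlToFilter : null;
        } catch (IllegalArgumentException e) {
            return null; // unparsable URLs are discarded
        }
    }
}

/** Helpers: applies a chain of filters in order and builds URLs without checked exceptions. */
class UrlFilters {
    static String applyAll(List<UrlFilterSketch> filters, URL source,
                           Map<String, String> metadata, String url) {
        for (UrlFilterSketch f : filters) {
            url = f.filter(source, metadata, url);
            if (url == null) {
                return null; // stop at the first filter that rejects the URL
            }
        }
        return url;
    }

    static URL url(String s) {
        try {
            return URI.create(s).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Chaining matters for ordering: normalising before filtering means the host check sees the cleaned-up URL.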
  • 16. Basic Crawl Topology
      Some Spout -> URLPartitioner -> Fetcher -> Parser -> Indexer (default stream)
      (1) Some Spout: pull URLs from an external source
      (2) URLPartitioner: group per host / domain / IP
      (3) Fetcher: fetch URLs
      (4) Parser: parse content
      (5) Indexer: index text and metadata
  • 17. Basic Topology: Tuple perspective
      Some Spout -(1)-> URLPartitioner -(2)-> Fetcher -(3)-> Parser -(4)-> Indexer (default stream)
      (1) <String url, Metadata m>
      (2) <String url, String key, Metadata m>
      (3) <String url, byte[] content, Metadata m>
      (4) <String url, byte[] content, Metadata m, String text>
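The `key` that the URLPartitioner adds in tuple (2) is what allows all URLs of one host to reach the same fetcher instance. A minimal sketch of that idea, assuming the key is the hostname and that downstream tasks are chosen by hashing the key: `UrlPartitionSketch`, `keyFor` and `taskFor` are hypothetical names, and the hashing only imitates the effect of grouping tuples on a field, not Storm's actual implementation.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical sketch of how a partition key could be derived and used. */
class UrlPartitionSketch {

    /** Partition key for a URL: here its lower-cased hostname ("" if there is none). */
    static String keyFor(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /** Maps a key onto one of numTasks downstream fetcher instances. */
    static int taskFor(String key, int numTasks) {
        // floorMod keeps the result in [0, numTasks) even for negative hash codes.
        return Math.floorMod(key.hashCode(), numTasks);
    }
}
```

Because the task index depends only on the key, two URLs from the same host always land on the same fetcher, which is what makes per-host politeness enforceable there.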
  • 18. Basic scenario
      • Spout: simple queue, e.g. RabbitMQ, AWS SQS, etc.
      • Indexer: any form of storage, but often an index, e.g. SOLR, ElasticSearch
      • Components in grey: default ones from SC
      • Use case: non-recursive crawls (i.e. no links discovered), URLs known in advance. Failures don't matter too much.
      • “Great! What about recursive crawls and / or failure handling?”
  • 19. Frontier expansion
      • Manual “discovery”
          – Adding new URLs by hand, “seeding”
      • Automatic discovery of new resources (frontier expansion)
          – Not all outlinks are equally useful: control
          – Requires content parsing and link extraction
      [Figure: frontier growing outwards from the seed over iterations i = 1, i = 2, i = 3. Slide courtesy of A. Bialecki]
  • 21. Status stream
      • Used for handling errors and updating info about URLs
      • Newly discovered URLs from parser
          public enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR;}
      • StatusUpdater: writes to storage
      • Can/should extend AbstractStatusUpdaterBolt
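What a status updater does with the tuples on the status stream can be sketched with an in-memory map in place of real storage. The enum values are the ones from the slide; everything else is hypothetical, in particular the class `InMemoryStatusStore` and the rule that a DISCOVERED update never overwrites a URL that is already known, which is an assumption about sensible behaviour rather than something the slide states.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Status values as listed on the slide. */
enum Status { DISCOVERED, FETCHED, FETCH_ERROR, REDIRECTION, ERROR }

/** Hypothetical in-memory stand-in for the storage behind a status updater. */
class InMemoryStatusStore {
    private final Map<String, Status> statuses = new LinkedHashMap<>();

    /** Records the latest status for a URL. */
    void update(String url, Status status) {
        if (status == Status.DISCOVERED) {
            // Assumption: rediscovering a known URL must not reset its status.
            statuses.putIfAbsent(url, status);
        } else {
            statuses.put(url, status);
        }
    }

    /** Returns the stored status, or null if the URL has never been seen. */
    Status get(String url) {
        return statuses.get(url);
    }
}
```

A spout reading from the same storage would then select URLs to emit based on their status, which closes the loop needed for recursive crawls.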
  • 22. External: ElasticSearch
      • IndexerBolt
          – Indexes URL / metadata / text for search
          – extends AbstractIndexerBolt
      • StatusUpdaterBolt
          – extends AbstractStatusUpdaterBolt
          – URL / status / metadata / nextFetchDate in status index
      • ElasticSearchSpout
          – Reads from status index
          – Sends URL + Metadata tuples down topology
      • MetricsConsumer
          – indexes Storm metrics for displaying, e.g. with Kibana
  • 24. How to use it?
      • Write your own Topology class (or hack the example one)
      • Put resources in src/main/resources
          – URLFilters, ParseFilters etc.
      • Build uber-jar with Maven
          – mvn clean package
      • Set the configuration
          – External YAML file (e.g. crawler-conf.yaml)
      • Call Storm
          storm jar target/storm-crawler-core-0.5-SNAPSHOT-jar-with-dependencies.jar com.digitalpebble.storm.crawler.CrawlTopology -conf crawler-conf.yaml
  • 27. Case Studies
      • 'No follow' crawl #1: existing list of URLs only, one-off
          – http://www.stolencamerafinder.com/ : [RabbitMQ + Elasticsearch]
      • 'No follow' crawl #2: streams of URLs
          – http://www.weborama.com : [queues? + HBase]
      • Monitoring of finite set of URLs / non-recursive crawl
          – http://www.shopstyle.com : scraping + indexing [DynamoDB + AWS SQS]
          – http://www.ontopic.io : [Redis + Kafka + ElasticSearch]
          – http://www.careerbuilder.com : [RabbitMQ + Redis + ElasticSearch]
      • Full recursive crawler
          – http://www.shopstyle.com : discovery of product pages [DynamoDB]
  • 28. What's next?
      • Just released 0.5
          – Improved wiki documentation, but can always do better
      • All-in-one crawler based on ElasticSearch
          – Also a good example of how to use SC
          – Separate project
          – Most resources already available in external
      • Additional Parse/URLFilters
          – Basic URL Normalizer #120
      • Parse -> generate output for more than one document #117
          – Change to the ParseFilter interface
      • Selenium-based protocol implementation
          – Handle AJAX pages
  • 29. Resources
      • https://storm.apache.org/
      • https://github.com/DigitalPebble/storm-crawler/wiki
      • http://nutch.apache.org/
      • “Storm Applied” http://www.manning.com/sallen/