Introduction to Storm Crawler (https://github.com/DigitalPebble/storm-crawler), a collection of resources for building low-latency, large-scale web crawlers on Apache Storm, available under the Apache License.
Fast Feather talk given at ApacheCon EU 2014 in Budapest.
A quick introduction to Storm Crawler
1. A quick introduction to Storm Crawler
Julien Nioche
julien@digitalpebble.com
@digitalpebble
ApacheCon EU 2014 - Budapest
2. About myself
▪ DigitalPebble Ltd, Bristol (UK)
▪ Specialised in Text Engineering
  – Web Crawling
  – Natural Language Processing
  – Information Retrieval
  – Machine Learning
▪ Strong focus on Open Source & the Apache ecosystem
▪ PMC Chair of Apache Nutch
▪ User | Contributor | Committer
  – Tika
  – SOLR, Lucene
  – GATE, UIMA
  – Mahout
  – Behemoth
3. What is it?
▪ Collection of resources (SDK) for building web crawlers on Apache Storm
▪ https://github.com/DigitalPebble/storm-crawler
▪ Artefacts available from Maven Central
▪ Apache License v2
▪ Scalable
▪ Low latency
▪ Easily extensible
4. What it is not
▪ A ready-to-use, feature-complete, recursive web crawler
  – Something like that might be built later as a separate project using S/C
▪ No PageRank or explicit ranking of pages, for example
  – Build your own
▪ No fancy UI, dashboards, etc.
  – Build your own
5. Comparison with Nutch
▪ Nutch is batch-driven: little control over when URLs are fetched
  – A potential issue for use cases that need sessions
  – latency++
▪ Fetching is only one of the steps in Nutch
  – S/C: 'always be fetching' (Ken Krugler); better use of resources
▪ S/C is even more flexible
  – Typical case: a few custom classes (at least a Topology); the rest are just dependencies and standard S/C components
▪ Not as ready-to-use as Nutch: it's an SDK
▪ Would not have existed without Nutch
  – Borrowed code and concepts
6. Overview of resources
https://www.flickr.com/photos/dipster1/1403240351/
7. FetcherBolt
▪ Multi-threaded
▪ Polite
  – Puts incoming tuples into internal queues based on IP / domain / hostname
  – Sets a delay between requests from the same queue
  – Respects robots.txt
▪ Protocol-neutral
  – Protocol implementations are pluggable
  – HTTP implementation taken from Nutch
▪ Output
  – String URL
  – byte[] content
  – HashMap<String, String[]> metadata
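To make the politeness mechanism concrete, here is a minimal, self-contained Java sketch of the idea: incoming URLs are bucketed into per-host queues, and a delay is enforced between requests drawn from the same queue. Class and method names are hypothetical; this is an illustration of the technique, not FetcherBolt's actual code.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of FetcherBolt-style politeness queues:
// one FIFO queue per hostname, with a per-queue crawl delay.
public class PolitenessQueues {
    private final Map<String, Deque<String>> queues = new HashMap<>();
    private final Map<String, Long> nextFetchTime = new HashMap<>();
    private final long delayMs;

    public PolitenessQueues(long delayMs) { this.delayMs = delayMs; }

    // Bucket a URL into the queue for its hostname.
    public void add(String url) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (MalformedURLException e) {
            return; // skip unparseable URLs
        }
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Return a URL whose host is currently allowed to be fetched, or null
    // if every non-empty queue is still within its politeness delay.
    public String poll(long now) {
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            long next = nextFetchTime.getOrDefault(e.getKey(), 0L);
            if (now >= next && !e.getValue().isEmpty()) {
                nextFetchTime.put(e.getKey(), now + delayMs);
                return e.getValue().poll();
            }
        }
        return null;
    }
}
```

With a 1-second delay, two URLs on the same host can never be returned less than a second apart, while URLs on different hosts are fetched independently, which is the point of keying the queues by IP / domain / hostname.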
8. ParserBolt
▪ Based on Apache Tika
▪ Supports most commonly used document formats
  – HTML, PDF, DOC, etc.
▪ Calls ParseFilters on the document
  – e.g. scrape info with XPathFilter
▪ Calls URLFilters on outlinks
  – e.g. normalize and/or blacklist URLs based on regular expressions
▪ Output
  – String URL
  – byte[] content
  – HashMap<String, String[]> metadata
  – String text
  – Set<String> outlinks
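The regex-based filtering of outlinks can be sketched as follows. This is a toy stand-in for what a URLFilter does, not the project's actual implementation: normalization rules rewrite the URL, and blacklist rules reject it by returning null (the pattern and rules below are made up for illustration).

```java
import java.util.regex.Pattern;

// Toy regex-based URL filter in the spirit of S/C's URLFilters:
// returns a normalized URL, or null if the URL should be dropped.
public class SimpleURLFilter {
    // Hypothetical blacklist: skip static assets we don't want to crawl.
    private static final Pattern BLACKLIST =
            Pattern.compile("\\.(jpg|png|gif|css|js)$", Pattern.CASE_INSENSITIVE);

    public static String filter(String url) {
        // Normalization example: strip the fragment part of the URL.
        String normalized = url.replaceAll("#.*$", "");
        if (BLACKLIST.matcher(normalized).find()) {
            return null; // blacklisted: outlink is discarded
        }
        return normalized;
    }
}
```

Chaining several such filters over every outlink emitted by the parser is what keeps a crawl from wandering into content you never intend to fetch.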
9. Other resources
▪ ElasticSearchBolt
  – Sends fields to ElasticSearch for indexing
  – (deprecated by resources in elasticsearch-hadoop?)
▪ URLPartitionerBolt
  – Generates a key based on the hostname / domain / IP of the URL
  – Output:
    • String URL
    • String key
    • String metadata
  – Useful for fieldsGrouping
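The key-generation idea behind URLPartitionerBolt can be sketched in a few lines: derive a stable partition key from each URL so that a fieldsGrouping on that key routes all URLs for the same host to the same downstream FetcherBolt task, keeping the politeness queues effective. This is a simplified, host-only version (the real bolt can also key by domain or IP), and the class name is hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Sketch of partition-key generation: same host => same key,
// so a Storm fieldsGrouping on "key" sends all of a host's URLs
// to the same downstream task.
public class PartitionKey {
    public static String byHost(String url) {
        try {
            return new URL(url).getHost().toLowerCase();
        } catch (MalformedURLException e) {
            return ""; // unparseable URLs share a catch-all bucket
        }
    }
}
```

In a topology this key is what the grouping is applied to, e.g. `builder.setBolt("fetch", new FetcherBolt(), 4).fieldsGrouping("partitioner", new Fields("key"))` using the standard Storm API (component names here are made up).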
10. Other resources
▪ ConfigurableTopology
  – Overrides config with a local YAML file
  – Simple switch for running in local mode
  – Abstract class to be extended
▪ Simple Spouts (for testing)
  – FileSpout / RandomURLSpout
▪ Various Metrics-related resources
  – Including a MetricsConsumer for https://www.librato.com/
▪ FetchQueue package
  – BlockingURLSpout and ShardedQueue abstraction
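As an illustration of the YAML overrides, this is the sort of file one might pass to a ConfigurableTopology. The key names below are assumptions mixing project settings of the time with Storm's own `topology.*` settings; check the crawler's default configuration file for the authoritative names.

```yaml
# Hypothetical override file for a ConfigurableTopology run;
# verify key names against the project's default configuration.
config:
  http.agent.name: "my-crawler"        # identify your crawler to webmasters
  fetcher.threads.number: 10           # FetcherBolt thread pool size
  topology.workers: 1                  # standard Storm setting
  topology.max.spout.pending: 100      # standard Storm back-pressure knob
```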
11. Integrate it!
▪ Write the Spout for your use case
  – Works fine with the existing resources as long as it emits URL + metadata
▪ Typical scenario
  – Group URLs to fetch into separate external queues based on host or domain (AWS SQS, Apache Kafka)
  – Write a Spout for it and throttle with topology.max.spout.pending
  – This enforces politeness without Tuples timing out and failing
  – Parse and extract
  – Send new URLs to the queues
▪ Can use various forms of persistence for URLs
  – ElasticSearch, DynamoDB, HBase, etc.
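The throttling step relies on Storm's `topology.max.spout.pending` setting: the spout stops being asked for tuples once that many emitted tuples are in flight and un-acked. Here is a self-contained sketch of that back-pressure mechanism (plain Java mimicking the spout contract, not actual Storm code; names are made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the back-pressure behind topology.max.spout.pending:
// a source may only have a bounded number of un-acked items in flight.
public class ThrottledSource {
    private final Deque<String> urls = new ArrayDeque<>();
    private final int maxPending;
    private int pending = 0;

    public ThrottledSource(int maxPending) { this.maxPending = maxPending; }

    public void queue(String url) { urls.add(url); }

    // Mimics nextTuple(): emit only while below the pending cap.
    public String nextTuple() {
        if (pending >= maxPending || urls.isEmpty()) return null;
        pending++;
        return urls.poll();
    }

    // Mimics ack(): a completed tuple frees a slot for the next emit.
    public void ack() { pending--; }
}
```

Keeping the cap low means the external queue, not the topology, holds the backlog, so politeness delays in the fetcher never cause Tuples to time out and fail.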
12. Some use cases (prototype stage)
▪ Processing streams of data (a natural fit for Storm)
  – http://www.weborama.com
▪ Monitoring a finite set of URLs
  – http://www.ontopic.io (more on them later)
  – http://www.shopstyle.com : scraping + indexing
▪ One-off non-recursive crawling
  – http://www.stolencamerafinder.com/ : scraping + indexing
▪ Recursive crawling
  – Work in progress
13. What's next?
▪ An all-in-one crawler project built on S/C
  – Also a good example of how to use S/C
▪ Additional ParseFilters / URLFilters
▪ More tests and documentation
▪ A nice logo (this is an invitation)
▪ A better name?