3. “Next click” problem
Raymie Stata (CTO, Yahoo):
“With the paths that go through Hadoop [at Yahoo!], the
latency is about fifteen minutes. … [I]t will never be true
real-time. It will never be what we call “next click,” where
I click and by the time the page loads, the semantic
implication of my decision is reflected in the page.”
4. “Next click” problem
[Diagram: a timeline of HTTP request/response cycles, each with a max latency of 80 ms, hitting the web server. A real time layer collects and processes data alongside, delivering either a realtime response within the same request or a near realtime response on the next request.]
5. Example problems
• Realtime statistics - counting, trends, moving average
• Read Twitter stream and output images that are
trending in the last 10 minutes
• CTR calculation - read ad clicks/ad impressions and
calculate new click through rate
• ETL - transform format, filter duplicates / bot traffic,
enrich from static data, persist
• Search advertising
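To make the CTR case above concrete, here is a minimal plain-Python sketch of a streaming click-through-rate calculator (not Storm code; the (ad_id, kind) event format is an assumption for illustration):

```python
# Minimal streaming CTR sketch (plain Python, not the Storm API).
# Assumes each event is a tuple (ad_id, kind) where kind is
# "impression" or "click".
from collections import defaultdict

class CtrCalculator:
    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def process(self, ad_id, kind):
        # Update counters for one incoming event, return the new CTR.
        if kind == "impression":
            self.impressions[ad_id] += 1
        elif kind == "click":
            self.clicks[ad_id] += 1
        return self.ctr(ad_id)

    def ctr(self, ad_id):
        # Click-through rate = clicks / impressions.
        imps = self.impressions[ad_id]
        return self.clicks[ad_id] / imps if imps else 0.0

calc = CtrCalculator()
for event in [("ad1", "impression"), ("ad1", "impression"), ("ad1", "click")]:
    rate = calc.process(*event)
# ad1: 1 click over 2 impressions, CTR 0.5
```

In a real stream the same per-key counting would be spread across many parallel consumers; that is exactly the kind of workload the frameworks below target.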
6. Pick your framework...
• S4 - Yahoo, “real time map reduce”, actor model
• Storm - Twitter
• MapReduce Online - Yahoo
• Cloud Map Reduce - Accenture
• HStreaming - Startup, based on Hadoop
• Brisk - DataStax, Cassandra
7. System requirements
• Fault tolerance - system keeps running when a node
fails
• Horizontal scalability - should be easy, just add a
node
• Low latency
• Reliable - does not lose data
• High availability - well, if it's down for an hour it's not
realtime
8. Storm in a nutshell
• Written by Backtype (acquired by Twitter)
• Open Source, Github
• Runs on JVM
• Clojure, Python, Zookeeper, ZeroMQ
• Currently used by Twitter for real time statistics
9. Programming model
• Tuple - name/value list
• Stream - unbounded sequence of Tuples
• Spout - source of Streams
• Bolt - consumer / producer of Streams
• Topology - network of Streams, Spouts and Bolts
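The five concepts above can be mimicked in a few lines of plain Python; this is an illustrative toy to show how the pieces fit together, not the Storm API:

```python
# Toy model of Storm's programming-model concepts (not the real API).
# Tuple    -> a plain Python tuple of values
# Stream   -> a generator of Tuples (unbounded in real Storm)
# Spout    -> a generator function that emits Tuples
# Bolt     -> a function that consumes and produces Tuples
# Topology -> the wiring of Spouts to Bolts

def sentence_spout():
    # Source of a Stream: emits sentence Tuples.
    for sentence in ["a b c", "a b d"]:
        yield (sentence,)

def splitter_bolt(stream):
    # Consumer / producer: emits one word Tuple per word.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def topology():
    # Network of Streams, Spouts and Bolts.
    return splitter_bolt(sentence_spout())

words = [t[0] for t in topology()]
# words == ["a", "b", "c", "a", "b", "d"]
```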
14. Task
Tasks are the parallel processing units inside Spouts and Bolts.
Each Spout / Bolt runs a fixed number of Tasks.
[Diagram: a Spout and a Bolt, each running several parallel Tasks.]
15. Stream grouping
Which Task does a Tuple go to?
• shuffle grouping - distribute randomly
• field grouping - partition by field value
• all grouping - send to all Tasks
• custom grouping - implement your own logic
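Field grouping is essentially hash partitioning: Tuples with the same field value always land on the same Task. A minimal sketch of the three built-in groupings (illustrative only, not Storm internals):

```python
# Sketch of stream groupings routing a Tuple to Task indices.
import random

NUM_TASKS = 4

def shuffle_grouping(tuple_):
    # Distribute randomly across Tasks.
    return [random.randrange(NUM_TASKS)]

def field_grouping(tuple_, field_index=0):
    # Partition by field value: same value -> same Task.
    return [hash(tuple_[field_index]) % NUM_TASKS]

def all_grouping(tuple_):
    # Send to every Task.
    return list(range(NUM_TASKS))

# The same word is always routed to the same Task, regardless of
# the other fields:
assert field_grouping(("a", 1)) == field_grouping(("a", 99))
```

Field grouping is what makes stateful per-key operations (like the word count below) possible: each Task only ever sees its own share of the key space.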
16. Word count example
[Diagram: the Sentence Spout emits ("a b c a b d"); the Splitter Bolt emits one Tuple per word: ("a"), ("b"), ("c"), ("a"), ("b"), ("d"); the Word Count Bolt emits the running counts ("a", 2), ("b", 2), ("c", 1), ("d", 1).]
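The Word Count Bolt in the example keeps a running count per word and emits the updated count for every incoming Tuple. A plain-Python sketch of that behavior (not the Storm API):

```python
# Sketch of a word-count bolt: emits ("word", running_count)
# for every incoming word Tuple (plain Python, not Storm).
from collections import defaultdict

def word_count_bolt(words):
    counts = defaultdict(int)
    for (word,) in words:
        counts[word] += 1
        yield (word, counts[word])

stream = [("a",), ("b",), ("c",), ("a",), ("b",), ("d",)]
emitted = list(word_count_bolt(stream))
# Final counts match the slide: ("a", 2), ("b", 2), ("c", 1), ("d", 1)
```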
17. Guaranteed processing
[Diagram: the tuple tree for ("a b c a b d"): the Spout Tuple fans out into the word Tuples ("a"), ("b"), ("c"), ("a"), ("b"), ("d"), which in turn produce the count Tuples ("a", 2), ("b", 2), ("c", 1), ("d", 1).]
The Topology has a timeout for processing of the tuple tree.
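Storm tracks the tuple tree with an XOR trick: every Tuple gets a random 64-bit id, the acker XORs ids in as Tuples are anchored and XORs them out as they are acked, and when the value returns to zero the whole tree is done. A simplified sketch of that idea:

```python
# Simplified sketch of Storm's acker XOR trick (not the real acker).
import os

class Acker:
    def __init__(self):
        self.ack_val = 0

    def anchor(self):
        # New Tuple in the tree: XOR its random 64-bit id in.
        tuple_id = int.from_bytes(os.urandom(8), "big")
        self.ack_val ^= tuple_id
        return tuple_id

    def ack(self, tuple_id):
        # Tuple fully processed: XOR its id out.
        self.ack_val ^= tuple_id

    def tree_done(self):
        # Zero means every anchored Tuple was acked.
        return self.ack_val == 0

acker = Acker()
ids = [acker.anchor() for _ in range(6)]  # six word Tuples
for tid in ids:
    acker.ack(tid)
assert acker.tree_done()
```

If the tree is not fully acked before the timeout, the Spout Tuple is replayed, which is what gives at-least-once processing.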
19. Reliability
• Nimbus / Supervisor are SPOF
• both are stateless, easy to restart without data loss
• Failure of master node (?)
• Running Topologies should not be affected!
• Failed Workers are restarted
• Guaranteed message processing
20. Administration
• Nimbus / Supervisor / Zookeeper need monitoring
and supervision (e.g. Monit)
• Cluster nodes can be added at runtime
• But: existing Topologies are not rebalanced (there is a
ticket)
• Administration web GUI
21. Community
• Source is on Github - https://github.com/nathanmarz/storm.git
• Wiki - https://github.com/nathanmarz/storm/wiki
• Nice documentation
• Google Group
• People are starting to build add-ons: JRuby integration,
adapters for JMS and AMQP
22. Storm summary
• Nice programming model
• Easy to deploy new topologies
• Horizontal scalability
• Low latency
• Fault tolerance
• Easy to setup on EC2