3. “Next click” problem
Raymie Stata (CTO, Yahoo):
“With the paths that go through Hadoop [at Yahoo!], the
latency is about fifteen minutes. … [I]t will never be true
real-time. It will never be what we call “next click,” where
I click and by the time the page loads, the semantic
implication of my decision is reflected in the page.”
4. “Next click” problem
[Diagram: a timeline of HTTP request/response cycles, each with a max latency of 80 ms, hitting the web server. A real time layer collects and processes data alongside, delivering either a realtime response within the same request or a near realtime response on the next request.]
5. Example problems
• Realtime statistics - counting, trends, moving average
• Read Twitter stream and output images that are
trending in the last 10 minutes
• CTR calculation - read ad clicks/ad impressions and
calculate new click through rate
• ETL - transform format, filter duplicates / bot traffic,
enrich from static data, persist
• Search advertising
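To make the CTR case above concrete, here is a minimal plain-Python sketch of a streaming click-through-rate calculator (not Storm code; the (ad_id, kind) event format is an assumption for illustration):

```python
# Minimal streaming CTR sketch (plain Python, not the Storm API).
# Assumes each event is a tuple (ad_id, kind) where kind is
# "impression" or "click".
from collections import defaultdict

class CtrCalculator:
    def __init__(self):
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def process(self, ad_id, kind):
        # Update counters for one incoming event, return the new CTR.
        if kind == "impression":
            self.impressions[ad_id] += 1
        elif kind == "click":
            self.clicks[ad_id] += 1
        return self.ctr(ad_id)

    def ctr(self, ad_id):
        # Click-through rate = clicks / impressions.
        imps = self.impressions[ad_id]
        return self.clicks[ad_id] / imps if imps else 0.0

calc = CtrCalculator()
for event in [("ad1", "impression"), ("ad1", "impression"), ("ad1", "click")]:
    rate = calc.process(*event)
# ad1: 1 click over 2 impressions, CTR 0.5
```

In a real stream the same per-key counting would be spread across many parallel consumers; that is exactly the kind of workload the frameworks below target.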
6. Pick your framework...
• S4 - Yahoo, “real time map reduce”, actor model
• Storm - Twitter
• MapReduce Online - Yahoo
• Cloud Map Reduce - Accenture
• HStreaming - Startup, based on Hadoop
• Brisk - DataStax, Cassandra
7. System requirements
• Fault tolerance - system keeps running when a node
fails
• Horizontal scalability - should be easy, just add a
node
• Low latency
• Reliable - does not lose data
• High availability - well, if it's down for an hour it's not
realtime
8. Storm in a nutshell
• Written by Backtype (acquired by Twitter)
• Open Source, Github
• Runs on JVM
• Clojure, Python, Zookeeper, ZeroMQ
• Currently used by Twitter for real time statistics
9. Programming model
• Tuple - name/value list
• Stream - unbounded sequence of Tuples
• Spout - source of Streams
• Bolt - consumer / producer of Streams
• Topology - network of Streams, Spouts and Bolts
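The five concepts above can be mimicked in a few lines of plain Python; this is an illustrative toy to show how the pieces fit together, not the Storm API:

```python
# Toy model of Storm's programming-model concepts (not the real API).
# Tuple    -> a plain Python tuple of values
# Stream   -> a generator of Tuples (unbounded in real Storm)
# Spout    -> a generator function that emits Tuples
# Bolt     -> a function that consumes and produces Tuples
# Topology -> the wiring of Spouts to Bolts

def sentence_spout():
    # Source of a Stream: emits sentence Tuples.
    for sentence in ["a b c", "a b d"]:
        yield (sentence,)

def splitter_bolt(stream):
    # Consumer / producer: emits one word Tuple per word.
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def topology():
    # Network of Streams, Spouts and Bolts.
    return splitter_bolt(sentence_spout())

words = [t[0] for t in topology()]
# words == ["a", "b", "c", "a", "b", "d"]
```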
14. Task
Tasks are the parallel processing units inside Spouts and Bolts.
Each Spout / Bolt runs a fixed number of Tasks.
[Diagram: a Spout and a Bolt, each running several parallel Tasks.]
15. Stream grouping
Which Task does a Tuple go to?
• shuffle grouping - distribute randomly
• field grouping - partition by field value
• all grouping - send to all Tasks
• custom grouping - implement your own logic
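Field grouping is essentially hash partitioning: Tuples with the same field value always land on the same Task. A minimal sketch of the three built-in groupings (illustrative only, not Storm internals):

```python
# Sketch of stream groupings routing a Tuple to Task indices.
import random

NUM_TASKS = 4

def shuffle_grouping(tuple_):
    # Distribute randomly across Tasks.
    return [random.randrange(NUM_TASKS)]

def field_grouping(tuple_, field_index=0):
    # Partition by field value: same value -> same Task.
    return [hash(tuple_[field_index]) % NUM_TASKS]

def all_grouping(tuple_):
    # Send to every Task.
    return list(range(NUM_TASKS))

# The same word is always routed to the same Task, regardless of
# the other fields:
assert field_grouping(("a", 1)) == field_grouping(("a", 99))
```

Field grouping is what makes stateful per-key operations (like the word count below) possible: each Task only ever sees its own share of the key space.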
16. Word count example
[Diagram: the Sentence Spout emits ("a b c a b d"); the Splitter Bolt emits one Tuple per word: ("a"), ("b"), ("c"), ("a"), ("b"), ("d"); the Word Count Bolt emits the running counts ("a", 2), ("b", 2), ("c", 1), ("d", 1).]
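The Word Count Bolt in the example keeps a running count per word and emits the updated count for every incoming Tuple. A plain-Python sketch of that behavior (not the Storm API):

```python
# Sketch of a word-count bolt: emits ("word", running_count)
# for every incoming word Tuple (plain Python, not Storm).
from collections import defaultdict

def word_count_bolt(words):
    counts = defaultdict(int)
    for (word,) in words:
        counts[word] += 1
        yield (word, counts[word])

stream = [("a",), ("b",), ("c",), ("a",), ("b",), ("d",)]
emitted = list(word_count_bolt(stream))
# Final counts match the slide: ("a", 2), ("b", 2), ("c", 1), ("d", 1)
```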
17. Guaranteed processing
[Diagram: the tuple tree for ("a b c a b d"): the Spout Tuple fans out into the word Tuples ("a"), ("b"), ("c"), ("a"), ("b"), ("d"), which in turn produce the count Tuples ("a", 2), ("b", 2), ("c", 1), ("d", 1).]
The Topology has a timeout for processing of the tuple tree.
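Storm tracks the tuple tree with an XOR trick: every Tuple gets a random 64-bit id, the acker XORs ids in as Tuples are anchored and XORs them out as they are acked, and when the value returns to zero the whole tree is done. A simplified sketch of that idea:

```python
# Simplified sketch of Storm's acker XOR trick (not the real acker).
import os

class Acker:
    def __init__(self):
        self.ack_val = 0

    def anchor(self):
        # New Tuple in the tree: XOR its random 64-bit id in.
        tuple_id = int.from_bytes(os.urandom(8), "big")
        self.ack_val ^= tuple_id
        return tuple_id

    def ack(self, tuple_id):
        # Tuple fully processed: XOR its id out.
        self.ack_val ^= tuple_id

    def tree_done(self):
        # Zero means every anchored Tuple was acked.
        return self.ack_val == 0

acker = Acker()
ids = [acker.anchor() for _ in range(6)]  # six word Tuples
for tid in ids:
    acker.ack(tid)
assert acker.tree_done()
```

If the tree is not fully acked before the timeout, the Spout Tuple is replayed, which is what gives at-least-once processing.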
19. Reliability
• Nimbus / Supervisor are SPOF
• both are stateless, easy to restart without data loss
• Failure of master node (?)
• Running Topologies should not be affected!
• Failed Workers are restarted
• Guaranteed message processing
20. Administration
• Nimbus / Supervisor / Zookeeper need monitoring
and supervision (e.g. Monit)
• Cluster nodes can be added at runtime
• But: existing Topologies are not rebalanced (there is a
ticket)
• Administration web GUI
21. Community
• Source is on Github - https://github.com/nathanmarz/storm.git
• Wiki - https://github.com/nathanmarz/storm/wiki
• Nice documentation
• Google Group
• People are starting to build add-ons: JRuby integration,
adapters for JMS and AMQP
22. Storm summary
• Nice programming model
• Easy to deploy new topologies
• Horizontal scalability
• Low latency
• Fault tolerance
• Easy to setup on EC2