4. Motivation
• Queues – Workers paradigm
• Scaling is hard
• System is not robust
• Coding is not fun!
– No abstractions
– Low-level message passing
– Intermediate message brokers
5. Use cases
• Stream processing
– Consume a stream, update DBs, etc.
• Distributed RPC
– Computationally intense functions run on top of Storm
• Continuous computation
– e.g. computing music trends on Twitter
7. Elements
• Streams
– Unbounded sequences of tuples
• Spouts
– Sources of streams
• Bolts
– Application logic
– Functions
– Streaming aggregations, joins, DB operations
13. Trident
● Higher-level abstraction on top of Storm
● Batch processing
● Keeps state in your persistence store, e.g. DBs, Memcached, etc.
● Exactly-once semantics
● Tuples can be replayed!
● API similar to Pig / Cascading
16. Trident State
● Solid API for reading from / writing to stateful sources
● State updates are idempotent
● Different kinds of fault tolerance depending on the Spout implementation
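The idempotence guarantee is typically achieved by storing the batch transaction id alongside each value: if a batch is replayed under a txid that was already applied, the update is skipped. A library-free sketch of that idea (`IdempotentCounter` and its method names are illustrative, not Trident's API, and real transactional state stores the txid per value in the backing store):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical store keeping (txid, count) per key, so a replayed
// batch with an already-seen txid does not double-count.
class IdempotentCounter {
    private final Map<String, long[]> store = new HashMap<>(); // key -> {lastTxid, count}

    void applyBatch(String key, long txid, long delta) {
        long[] cell = store.computeIfAbsent(key, k -> new long[]{-1L, 0L});
        if (cell[0] == txid) return;   // batch already applied: skip the replay
        cell[0] = txid;
        cell[1] += delta;
    }

    long count(String key) {
        long[] cell = store.get(key);
        return cell == null ? 0L : cell[1];
    }
}
```

This is why replayed tuples (next slide's "Tuples can be replayed!") are safe: re-running a batch against the same txid is a no-op.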
18. Trident Gender
1. Stream of incoming tweets
2. Filter out tweets not relevant to the topic
3. Determine gender from the author's first name
4. Update either the male or the female counter
19. Input (Spout impl.)
● Receives the public stream (~1% of all tweets) and emits the tweets into the system

List<Object> tweets;

public void emitBatch(long batchId, TridentCollector collector) {
    for (Object o : tweets)
        collector.emit(new Values(o));
}
20. Filter
Implement a Filter class called FilterWords:
.each(new Fields("status"), new FilterWords(interestingWords))

String[] words = {"instagram", "flickr", "pinterest", "picasa"};

public boolean isKeep(TridentTuple tuple) {
    Tweet t = (Tweet) tuple.getValue(0);
    // is this tweet an interesting one?
    for (String word : words)
        if (t.getText().toLowerCase().contains(word))
            return true;
    return false;
}
23. Function
Implement a function class:
.each(new Fields("status"), new ExpandName(), new Fields("name"))

Tuple before:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"}]

Tuple after:
[{"fullname": "Iris HappyWorker", "text": "Having the freedom to choose your work location feels great. This week is London. pic.twitter.com/BHZq86o6"},
"Iris"]
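The slides show the tuple before and after but not ExpandName's body. A library-free sketch of the extraction step it would perform (`ExpandNameLogic` and `firstName` are illustrative names, not code from the talk; in Trident the class would extend BaseFunction and emit the result via collector.emit(new Values(firstName))):

```java
// Hypothetical helper isolating the work ExpandName's execute() would do:
// take the author's full name from the status tuple and return the first token.
class ExpandNameLogic {
    static String firstName(String fullName) {
        // "Iris HappyWorker" -> "Iris"; split on any run of whitespace
        return fullName.trim().split("\\s+")[0];
    }
}
```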
24. State Query
Implement a QueryFunction to query the persistent storage:
.stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))

public List<String> batchRetrieve(GenderDB state, List<TridentTuple> tuples) {
    List<String> batchToQuery = new ArrayList<String>();
    for (TridentTuple t : tuples) {
        String name = t.getStringByField("name");
        batchToQuery.add(name);
    }
    return state.getGenders(batchToQuery);
}
31. Aggregators (general case)
● Run the init() function before processing the batch
● Aggregate over a number of tuples (usually grouped by a field first) and emit one or more results from the aggregate method

public interface Aggregator<T> extends Operation {
    T init(Object batchId, TridentCollector collector);
    void aggregate(T state, TridentTuple tuple, TridentCollector collector);
    void complete(T state, TridentCollector collector);
}
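The lifecycle above can be sketched without any Storm dependencies. The interface below is a simplified stand-in (no TridentCollector, tuples as plain Objects; `MiniAggregator` and `CountAgg` are illustrative names, not Trident classes), but the call order matches: init() creates per-batch state, aggregate() runs once per tuple, complete() runs when the batch or group is exhausted.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

// Simplified stand-in for Trident's Aggregator lifecycle.
interface MiniAggregator<T> {
    T init(Object batchId);               // per-batch state
    void aggregate(T state, Object tuple); // once per tuple
    T complete(T state);                   // after the last tuple
}

// Counting aggregator over the simplified interface.
class CountAgg implements MiniAggregator<AtomicLong> {
    public AtomicLong init(Object batchId) { return new AtomicLong(0); }
    public void aggregate(AtomicLong state, Object tuple) { state.incrementAndGet(); }
    public AtomicLong complete(AtomicLong state) { return state; }
}
```

A real Trident Aggregator would additionally emit results through the collector passed to complete().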
32. Combiner Aggregator
● Run init(TridentTuple t) on every tuple
● Run the combine method on pairs of values until a single value remains, then return it

public class Count implements CombinerAggregator<Long> {
    public Long init(TridentTuple tuple) {
        return 1L;
    }
    public Long combine(Long val1, Long val2) {
        return val1 + val2;
    }
    public Long zero() {
        return 0L;
    }
}
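How the three methods fit together can be shown as a plain fold, with no Storm classes involved (`CombinerDemo` is an illustrative name): each tuple is mapped through init(), the results are folded with combine(), and zero() covers an empty partition. Because combine() is associative, Trident can compute partial sums before tuples cross the network, which is the main advantage of a CombinerAggregator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Library-free sketch of how a CombinerAggregator such as Count is applied.
class CombinerDemo {
    static long count(List<String> tuples) {
        long acc = 0L;                 // zero()
        for (String t : tuples) {
            long mapped = 1L;          // init(tuple)
            acc = acc + mapped;        // combine(acc, mapped)
        }
        return acc;
    }
}
```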
33. Reducer Aggregator
● Run init() to get an initial value
● Iterate over the tuples, folding each one into the accumulated value, and emit a single result

public interface ReducerAggregator<T>
        extends Serializable {
    T init();
    T reduce(T curr, TridentTuple tuple);
}
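The slide shows only the interface; a counting implementation in this style might look as follows (a sketch: TridentTuple is replaced by Object so the example is self-contained, and `MiniReducerAggregator` / `ReduceCount` are illustrative names):

```java
import java.util.Arrays;

// Simplified stand-in for Trident's ReducerAggregator.
interface MiniReducerAggregator<T> {
    T init();                        // initial accumulator value
    T reduce(T curr, Object tuple);  // fold one tuple into the accumulator
}

// Count implemented reducer-style: start at 0, add 1 per tuple.
class ReduceCount implements MiniReducerAggregator<Long> {
    public Long init() { return 0L; }
    public Long reduce(Long curr, Object tuple) { return curr + 1; }
}
```

Unlike a CombinerAggregator, a reducer folds tuples one at a time against a single accumulator, so Trident cannot pre-combine partial results before the network transfer.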
35. Back to the example
● For each gender batch, run the Count() aggregator
● Not only aggregate, but also store the value in memory
● Why? To keep a count that accumulates over time

persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
36. Putting it all together

TridentState genderDB = topology.newStaticState(new GenderDBFactory());
Stream gender = topology.newStream("spout", spout)
    .each(new Fields("status"), new FilterWords(topicWords))
    .each(new Fields("status"), new ExpandName(), new Fields("name"))
    .parallelismHint(4)
    .stateQuery(genderDB, new Fields("name"), new QueryGender(), new Fields("gender"))
    .parallelismHint(10)
    .groupBy(new Fields("gender"))
    .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
    .newValuesStream();
38. Some minuses
• Hard debugging
➢ pseudo-distributed mode helps, but still...
• Object serialization
➢ an issue when using 3rd-party libraries
➢ register your own serializers (e.g. with Kryo) for better performance
39. I didn't tackle
• Reliability
– Guaranteed message processing
• Distributed RPC example
• The storm-deploy companion project
– One-click automated deploy of a Storm cluster, e.g. on EC2