Work presented at ScalaByTheBay2016 conference.
Objective:
To implement a statistical "reactive" summariser (COUNT, TOPK, SET MEMBERSHIP, CARDINALITY)
How:
1. Sketching data structures.
2. Akka-stream to make sure the system is back-pressure compliant.
3. Akka-stream's dynamic stream handling to make sure the system is dynamic.
4. The whole system can be broken down into smaller pieces which we wire up using akka-tcp.
1. Implement a scalable statistical
aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security@Symantec Singapore
2. The system
Provides a service to answer time-series analytical questions such as
COUNT, TOPK, SET MEMBERSHIP and CARDINALITY on a dynamic set
of data streams, using a statistical approach.
3. Motivation
The system collects data from multiple sources in streaming log
format
Some common questions in Email Anti-Abuse system
Most frequent Items (IP, domain, sender, etc.)
Number of unique items
Have we seen an item before?
=> Need to be able to answer such questions in a timely manner
4. Data statistics
6K email logs/second
One email log is flattened out into subevents:
IP, sender, sender domain, etc.
Time periods (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
Total ~200K messages/second (each log fans out to roughly 30 subevents
across field and time-window combinations)
5. Challenges
Our system needs to be
Responsive
Space efficient
Reactive
Extensible
Scalable
Resilient
6. Sketching data structures
How many times have we seen a certain IP?
Count Min Sketch (CMS): Counting things + TopK
How many unique senders have we seen yesterday?
HyperLogLog (HLL): Set cardinality
Did we see a certain IP last month?
Bloom Filter (BF): Set membership
Trade-off: space/speed versus exact answers
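To make the CMS idea concrete, here is a minimal Count-Min Sketch in plain Scala. This is an illustrative toy, not the talk's implementation; the system uses the production versions from streamlib / Twitter Algebird, and the class name and parameters here are our own.

```scala
import scala.util.hashing.MurmurHash3

// Toy Count-Min Sketch: a depth x width table of counters.
// Each row hashes the item with a different seed; a point query takes the
// minimum over rows, so it can over-estimate but never under-estimate.
final class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) // row index doubles as hash seed
    ((h % width) + width) % width             // map hash into [0, width)
  }

  def add(item: String, count: Long = 1L): Unit =
    (0 until depth).foreach(row => table(row)(bucket(item, row)) += count)

  // Estimated frequency: min over all rows (classic CMS point query).
  def frequency(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

A query like "how many times have we seen IP 1.2.3.4?" becomes `cms.frequency("1.2.3.4")`, using memory proportional to `width * depth` rather than to the number of distinct IPs.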
7. What is available vs. what we try to solve
What is available: data structures for finding cardinality (i.e. counting
things), set membership and top-k elements; already solved by
streamlib / Twitter Algebird.
What we try to solve: a dynamic, reactive, distributed system for
answering cardinality, set-membership and top-k queries.
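The set-membership primitive those libraries provide can be sketched with a toy Bloom filter in plain Scala. Again this is a hypothetical stand-in for the streamlib / Algebird implementations, only meant to show the shape of the API.

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter answering "did we see this item before?".
// A `false` answer is definitive; a `true` answer may be a false positive.
final class BloomFilter(bits: Int, hashes: Int) {
  private val set = new java.util.BitSet(bits)

  private def idx(item: String, seed: Int): Int = {
    val h = MurmurHash3.stringHash(item, seed)
    ((h % bits) + bits) % bits
  }

  def add(item: String): Unit =
    (0 until hashes).foreach(i => set.set(idx(item, i)))

  def mightContain(item: String): Boolean =
    (0 until hashes).forall(i => set.get(idx(item, i)))
}
```

The memory cost is fixed by `bits`, independent of how many IPs are inserted, which is the space-efficiency the slides refer to.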
16. Splitter Hub
Splits the stream based on event type to a dynamic set of
downstream consumers.
Consumers are actors which implement the CMS, BF, HLL, etc. logic.
Not available out of the box in akka-stream.
17. Splitter Hub API
Similar to the built-in akka-stream BroadcastHub; differs in its
back-pressure implementation.
A [[SplitterHub]].source can be supplied with a predicate/selector function
to return a filtered subset of the data.
19. Splitter Hub
The [[Source]] can be materialized any number of times; each
materialization creates a new consumer which is registered with the
hub and then receives the items matching its selector function from
upstream.
Consumers can be added at run time.
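The routing behavior described above can be modeled in a few lines of plain Scala. This is not the real SplitterHub (which is an akka-stream stage with per-consumer back-pressure); it is a simplified, single-threaded model of selector-based routing with run-time registration, and all names in it are ours.

```scala
import scala.collection.mutable

// Simplified model of the SplitterHub idea: consumers register at run time
// with a selector predicate, and each published event is forwarded only to
// the consumers whose predicate matches it.
final class SplitterHubModel[T] {
  private val consumers = mutable.ListBuffer.empty[(T => Boolean, T => Unit)]

  // Corresponds to materialising a new Source with a selector function.
  def register(selector: T => Boolean)(consume: T => Unit): Unit =
    consumers += ((selector, consume))

  def publish(event: T): Unit =
    consumers.foreach { case (selector, consume) =>
      if (selector(event)) consume(event)
    }
}
```

Registering a consumer after events have already flowed mirrors the "consumers can be added at run time" property of the hub.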
20. Consumers
Can be either local or remote.
Managed by a coordination actor.
Each implements a specific data structure (CMS/BF/HLL) for a particular
event type over a specific time range.
Responsibilities:
Answering a specific query (e.g. a COUNT query).
Regularly persisting a serialized snapshot of the internal data structure
(count-min table, etc.).
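The snapshot responsibility can be sketched as follows. A real consumer would persist a serialized CMS table through Akka Persistence; here we use a plain count map and a trivial text encoding, purely for illustration.

```scala
// Sketch of a consumer that owns a count table, answers count queries, and
// can serialize/restore its internal state (the "snapshot" responsibility).
final class CountConsumer {
  private var counts = Map.empty[String, Long]

  def handle(event: String): Unit =
    counts = counts.updated(event, counts.getOrElse(event, 0L) + 1L)

  def query(event: String): Long = counts.getOrElse(event, 0L)

  // Serialize the internal table; a real worker would persist a CMS table.
  def snapshot: String =
    counts.map { case (k, v) => s"$k\t$v" }.mkString("\n")

  def restore(data: String): Unit =
    counts =
      if (data.isEmpty) Map.empty
      else data.linesIterator.map { line =>
        val Array(k, v) = line.split('\t')
        k -> v.toLong
      }.toMap
}
```

A restored consumer answers the same queries as the original, which is what makes fail-over (slide 27) possible.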
24. Akka stream TCP
Back-pressure and reliability are handled by the kernel's TCP stack.
For each worker, we create a source for each message type it is
responsible for, using the SplitterHub source() API.
Each source is connected to a TCP connection and streamed to its worker.
Back-pressure is maintained across the network.
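The back-pressure guarantee can be illustrated without Akka at all: a bounded buffer forces a fast producer to block until the slow consumer catches up, which is the same contract akka-stream's TCP transport gives across the network via the TCP window. The names below are ours.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Back-pressure in miniature: the producer blocks on put() whenever the
// bounded buffer is full, so it can never outrun the consumer.
object BackpressureDemo {
  def run(events: Seq[Int], bufferSize: Int): List[Int] = {
    val queue = new ArrayBlockingQueue[Int](bufferSize)
    val received = scala.collection.mutable.ListBuffer.empty[Int]
    val consumer = new Thread(() => {
      var taken = 0
      while (taken < events.size) {
        received += queue.take() // take() blocks while the buffer is empty
        taken += 1
      }
    })
    consumer.start()
    events.foreach(queue.put)    // put() blocks while the buffer is full
    consumer.join()
    received.toList
  }
}
```

No events are dropped and order is preserved, regardless of how much faster the producer is than the consumer.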
26. Master Failover
The Coordinator is a single point of failure.
Run multiple Coordinator actors as a Cluster Singleton.
Workers communicate with the master (heartbeat) using Cluster Client.
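A minimal model of the heartbeat mechanism, with time passed in explicitly so the logic stays deterministic. The real system uses Cluster Client and Cluster Singleton; this class and its timeout parameter are illustrative assumptions.

```scala
// The master tracks the last heartbeat from each worker and considers a
// worker alive only if it has been heard from within the timeout window.
final class HeartbeatMonitor(timeoutMillis: Long) {
  private var lastSeen = Map.empty[String, Long]

  def heartbeat(worker: String, now: Long): Unit =
    lastSeen += worker -> now

  def aliveWorkers(now: Long): Set[String] =
    lastSeen.collect { case (w, t) if now - t <= timeoutMillis => w }.toSet
}
```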
27. Worker Failover
Each worker persists all events to a DB journal plus snapshots,
using Akka Persistence.
Redis stores the journal and snapshots.
When a worker goes down, its keys are re-distributed.
The master then redirects traffic to the other workers.
CMS actors are restored on the new worker from snapshot + journal.
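The snapshot-plus-journal recovery model can be sketched in plain Scala: the restored state is the last snapshot with every journalled event replayed on top, which is the same model Akka Persistence uses (here with in-memory collections standing in for the Redis-backed journal).

```scala
// State is a count table; applying an event produces the next state.
final case class CounterState(counts: Map[String, Long]) {
  def applyEvent(event: String): CounterState =
    copy(counts = counts.updated(event, counts.getOrElse(event, 0L) + 1L))
}

object Recovery {
  // Recovery = snapshot + replay of all events journalled after it.
  def recover(snapshot: CounterState, journal: Seq[String]): CounterState =
    journal.foldLeft(snapshot)(_.applyEvent(_))
}
```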
28. Benchmark
Akka-stream on a single node: 100K+ msg/second (one msg-type)
Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
Akka-stream on a remote node (remote TCP) with Akka persistent journal:
2000+ msg/second (one msg-type)
29. Conclusion
Our system is
Responsive
Reactive
Scalable
Resilient
Future work:
Make workers metric-agnostic
Scale out the master
Exactly-once delivery for workers
More flexible filtering using SplitterHub