Work presented at ScalaByTheBay2016 conference.
Objective:
To implement a statistical "reactive" summariser (COUNT, TOPK, SET MEMBERSHIP, CARDINALITY)
How:
1. Sketching data structures.
2. Akka-stream to make sure the system is back-pressure compliant.
3. Akka-stream's dynamic stream handling to make sure the system is dynamic.
4. The whole system can be broken down into smaller pieces which we wire up using akka-tcp.
1. Implement a scalable statistical
aggregation system using Akka
Scala by the Bay, 12 Nov 2016
Stanley Nguyen, Vu Ho
Email Security@Symantec Singapore
2. The system
Provides a service to answer time-series analytical questions such as
COUNT, TOPK, SET MEMBERSHIP and CARDINALITY on a dynamic set
of data streams, using a statistical approach.
3. Motivation
The system collects data from multiple sources in streaming log
format
Some common questions in Email Anti-Abuse system
Most frequent Items (IP, domain, sender, etc.)
Number of unique items
Have we seen an item before?
=> Need to be able to answer such questions in a timely manner
4. Data statistics
6K email logs/second
One email log is flattened out into subevents:
IP, sender, sender domain, etc.
Time periods (last 5 minutes, 1 hour, 4 hours, 1 day, 1 week, etc.)
Total ~200K messages/second (each log fans out to roughly 30 subevents
across field and time-window combinations)
5. Challenges
Our system needs to be
Responsive
Space efficient
Reactive
Extensible
Scalable
Resilient
6. Sketching data structures
How many times have we seen a certain IP?
Count Min Sketch (CMS): Counting things + TopK
How many unique senders have we seen yesterday?
HyperLogLog (HLL): Set cardinality
Did we see a certain IP last month?
Bloom Filter (BF): Set membership
Trade-off: space/speed versus exact answers
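To make the CMS idea concrete, here is a minimal Count-Min Sketch in plain Scala. This is an illustrative toy, not the talk's implementation; the system uses the production versions from streamlib / Twitter Algebird, and the class name and parameters here are our own.

```scala
import scala.util.hashing.MurmurHash3

// Toy Count-Min Sketch: a depth x width table of counters.
// Each row hashes the item with a different seed; a point query takes the
// minimum over rows, so it can over-estimate but never under-estimate.
final class CountMinSketch(width: Int, depth: Int) {
  private val table = Array.ofDim[Long](depth, width)

  private def bucket(item: String, row: Int): Int = {
    val h = MurmurHash3.stringHash(item, row) // row index doubles as hash seed
    ((h % width) + width) % width             // map hash into [0, width)
  }

  def add(item: String, count: Long = 1L): Unit =
    (0 until depth).foreach(row => table(row)(bucket(item, row)) += count)

  // Estimated frequency: min over all rows (classic CMS point query).
  def frequency(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

A query like "how many times have we seen IP 1.2.3.4?" becomes `cms.frequency("1.2.3.4")`, using memory proportional to `width * depth` rather than to the number of distinct IPs.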
7. What is available vs. what we try to solve
What is available: data structures for finding cardinality (i.e. counting
things), set membership and top-k elements; already solved by
streamlib / Twitter Algebird.
What we try to solve: a dynamic, reactive, distributed system for
answering cardinality, set-membership and top-k queries.
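The set-membership primitive those libraries provide can be sketched with a toy Bloom filter in plain Scala. Again this is a hypothetical stand-in for the streamlib / Algebird implementations, only meant to show the shape of the API.

```scala
import scala.util.hashing.MurmurHash3

// Toy Bloom filter answering "did we see this item before?".
// A `false` answer is definitive; a `true` answer may be a false positive.
final class BloomFilter(bits: Int, hashes: Int) {
  private val set = new java.util.BitSet(bits)

  private def idx(item: String, seed: Int): Int = {
    val h = MurmurHash3.stringHash(item, seed)
    ((h % bits) + bits) % bits
  }

  def add(item: String): Unit =
    (0 until hashes).foreach(i => set.set(idx(item, i)))

  def mightContain(item: String): Boolean =
    (0 until hashes).forall(i => set.get(idx(item, i)))
}
```

The memory cost is fixed by `bits`, independent of how many IPs are inserted, which is the space-efficiency the slides refer to.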
16. Splitter Hub
Splits the stream based on event type to a dynamic set of
downstream consumers.
Consumers are actors which implement the CMS, BF, HLL, etc. logic.
Not available out of the box in akka-stream.
17. Splitter Hub API
Similar to the built-in akka-stream BroadcastHub; differs in its
back-pressure implementation.
A [[SplitterHub]].source can be supplied with a predicate/selector function
to return a filtered subset of the data.
19. Splitter Hub
The [[Source]] can be materialized any number of times; each
materialization creates a new consumer which is registered with the
hub and then receives the items matching its selector function from
upstream.
Consumers can be added at run time.
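The routing behavior described above can be modeled in a few lines of plain Scala. This is not the real SplitterHub (which is an akka-stream stage with per-consumer back-pressure); it is a simplified, single-threaded model of selector-based routing with run-time registration, and all names in it are ours.

```scala
import scala.collection.mutable

// Simplified model of the SplitterHub idea: consumers register at run time
// with a selector predicate, and each published event is forwarded only to
// the consumers whose predicate matches it.
final class SplitterHubModel[T] {
  private val consumers = mutable.ListBuffer.empty[(T => Boolean, T => Unit)]

  // Corresponds to materialising a new Source with a selector function.
  def register(selector: T => Boolean)(consume: T => Unit): Unit =
    consumers += ((selector, consume))

  def publish(event: T): Unit =
    consumers.foreach { case (selector, consume) =>
      if (selector(event)) consume(event)
    }
}
```

Registering a consumer after events have already flowed mirrors the "consumers can be added at run time" property of the hub.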
20. Consumers
Can be either local or remote.
Managed by a coordination actor.
Each implements a specific data structure (CMS/BF/HLL) for a particular
event type over a specific time range.
Responsibilities:
Answering a specific query (e.g. a COUNT query).
Regularly persisting a serialized snapshot of the internal data structure
(count-min table, etc.).
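The snapshot responsibility can be sketched as follows. A real consumer would persist a serialized CMS table through Akka Persistence; here we use a plain count map and a trivial text encoding, purely for illustration.

```scala
// Sketch of a consumer that owns a count table, answers count queries, and
// can serialize/restore its internal state (the "snapshot" responsibility).
final class CountConsumer {
  private var counts = Map.empty[String, Long]

  def handle(event: String): Unit =
    counts = counts.updated(event, counts.getOrElse(event, 0L) + 1L)

  def query(event: String): Long = counts.getOrElse(event, 0L)

  // Serialize the internal table; a real worker would persist a CMS table.
  def snapshot: String =
    counts.map { case (k, v) => s"$k\t$v" }.mkString("\n")

  def restore(data: String): Unit =
    counts =
      if (data.isEmpty) Map.empty
      else data.linesIterator.map { line =>
        val Array(k, v) = line.split('\t')
        k -> v.toLong
      }.toMap
}
```

A restored consumer answers the same queries as the original, which is what makes fail-over (slide 27) possible.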
24. Akka stream TCP
Back-pressure and reliability are handled by the kernel's TCP stack.
For each worker, we create a source for each message type it is
responsible for, using the SplitterHub source() API.
Each source is connected to a TCP connection and streamed to its worker.
Back-pressure is maintained across the network.
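The back-pressure guarantee can be illustrated without Akka at all: a bounded buffer forces a fast producer to block until the slow consumer catches up, which is the same contract akka-stream's TCP transport gives across the network via the TCP window. The names below are ours.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Back-pressure in miniature: the producer blocks on put() whenever the
// bounded buffer is full, so it can never outrun the consumer.
object BackpressureDemo {
  def run(events: Seq[Int], bufferSize: Int): List[Int] = {
    val queue = new ArrayBlockingQueue[Int](bufferSize)
    val received = scala.collection.mutable.ListBuffer.empty[Int]
    val consumer = new Thread(() => {
      var taken = 0
      while (taken < events.size) {
        received += queue.take() // take() blocks while the buffer is empty
        taken += 1
      }
    })
    consumer.start()
    events.foreach(queue.put)    // put() blocks while the buffer is full
    consumer.join()
    received.toList
  }
}
```

No events are dropped and order is preserved, regardless of how much faster the producer is than the consumer.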
26. Master Failover
The Coordinator is a single point of failure.
Run multiple Coordinator actors as a Cluster Singleton.
Workers communicate with the master (heartbeat) using Cluster Client.
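A minimal model of the heartbeat mechanism, with time passed in explicitly so the logic stays deterministic. The real system uses Cluster Client and Cluster Singleton; this class and its timeout parameter are illustrative assumptions.

```scala
// The master tracks the last heartbeat from each worker and considers a
// worker alive only if it has been heard from within the timeout window.
final class HeartbeatMonitor(timeoutMillis: Long) {
  private var lastSeen = Map.empty[String, Long]

  def heartbeat(worker: String, now: Long): Unit =
    lastSeen += worker -> now

  def aliveWorkers(now: Long): Set[String] =
    lastSeen.collect { case (w, t) if now - t <= timeoutMillis => w }.toSet
}
```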
27. Worker Failover
Each worker persists all events to a DB journal plus snapshots,
using Akka Persistence.
Redis stores the journal and snapshots.
When a worker goes down, its keys are re-distributed.
The master then redirects traffic to the other workers.
CMS actors are restored on the new worker from snapshot + journal.
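The snapshot-plus-journal recovery model can be sketched in plain Scala: the restored state is the last snapshot with every journalled event replayed on top, which is the same model Akka Persistence uses (here with in-memory collections standing in for the Redis-backed journal).

```scala
// State is a count table; applying an event produces the next state.
final case class CounterState(counts: Map[String, Long]) {
  def applyEvent(event: String): CounterState =
    copy(counts = counts.updated(event, counts.getOrElse(event, 0L) + 1L))
}

object Recovery {
  // Recovery = snapshot + replay of all events journalled after it.
  def recover(snapshot: CounterState, journal: Seq[String]): CounterState =
    journal.foldLeft(snapshot)(_.applyEvent(_))
}
```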
28. Benchmark
Akka-stream on a single node: 100K+ msg/second (one msg-type)
Akka-stream on a remote node (remote TCP): 15-20K msg/second (one msg-type)
Akka-stream on a remote node (remote TCP) with Akka persistent journal:
2000+ msg/second (one msg-type)
29. Conclusion
Our system is
Responsive
Reactive
Scalable
Resilient
Future work:
Make workers metric-agnostic
Scale out the master
Exactly-once delivery for workers
More flexible filtering using SplitterHub