14. Can we use declarative languages to specify these
stream processing logics?
15. Complex event processing
● Combines data from multiple sources to infer events or patterns that suggest
more complicated circumstances
● CEP is used across many industries for various use cases, including:
○ Finance: Trade analysis, fraud detection
○ Airlines: Operations monitoring
○ Healthcare: Claims processing, patient monitoring
○ Energy and Telecommunications: Outage detection
● CEP uses declarative rule/query language to specify event processing logic
16. WSO2/Siddhi: Complex event processing engine
● Lightweight, extensible, open source, released as a Java library
● Features supported
○ Filter
○ Join
○ Aggregation
○ Group by
○ Window
○ Pattern processing
○ Sequence processing
○ Event tables
○ Event-time processing
○ UDF
○ Extensions
○ Declarative query language: SiddhiQL
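As a sketch of what a windowed, grouped Siddhi query computes, the following Python simulates a per-key tumbling-window average. The stream, field names, and the SiddhiQL shown in the comment are illustrative, not taken from the deck.

```python
from collections import defaultdict

# Roughly what a SiddhiQL query like this computes (names hypothetical):
#   from TripStream#window.timeBatch(60 sec)
#   select city, avg(fare) as avgFare
#   group by city
#   insert into AvgFareStream;

def tumbling_window_avg(events, window_sec=60):
    """Group (timestamp, key, value) events into tumbling windows, average per key."""
    windows = defaultdict(lambda: defaultdict(list))  # window_start -> key -> values
    for ts, key, value in events:
        window_start = ts - ts % window_sec
        windows[window_start][key].append(value)
    return {
        w: {k: sum(v) / len(v) for k, v in keys.items()}
        for w, keys in windows.items()
    }

events = [(5, "sf", 10.0), (30, "sf", 20.0), (70, "sf", 30.0), (10, "nyc", 8.0)]
result = tumbling_window_avg(events)
# result[0]["sf"] == 15.0, result[60]["sf"] == 30.0
```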
18. How Siddhi works
● Query is parsed at runtime into an execution plan runtime
● As events flow in, the execution plan runtime processes them inside the CEP
engine according to the query logic
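The parse-once, process-many flow can be sketched as a toy compiler: a declarative query is compiled into a callable plan at runtime, and events then stream through that plan. The mini filter syntax is invented for illustration.

```python
# Toy "parse query -> execution plan -> process events" loop, loosely
# mirroring how a CEP engine compiles a declarative query once and then
# evaluates each incoming event against it.

def compile_query(query):
    """Compile a 'field > value' filter into a callable execution plan."""
    field, op, value = query.split()
    assert op == ">"
    threshold = float(value)
    return lambda event: event.get(field, float("-inf")) > threshold

plan = compile_query("latency > 500")          # parsed once, at runtime
events = [{"latency": 120}, {"latency": 900}]
alerts = [e for e in events if plan(e)]        # events stream through the plan
# alerts == [{"latency": 900}]
```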
20. Apache Samza
● A distributed stream processing framework
○ Distributed and Scalable
○ Built-in State management
○ Built-in fault tolerance
○ At-least-once message processing
○ Infrastructure support at Uber
21. How can we make the stream processing output
useful?
22. Actions
● Generalize a set of common action templates to make it easy for
micro-services and humans to harness the power of real-time stream
processing
● Currently we support
○ Make an RPC call
○ Invoke a Webhook endpoint
○ Index to ElasticSearch
○ Index to Cassandra
○ Kafka
○ Statsd
○ Chat service
○ Email
○ Push notification
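One way to read "action templates" is a registry mapping each action type to a handler, so a use case only declares the action name and its parameters. This is a hypothetical sketch, not the deck's implementation; the handler bodies are stand-ins for real RPC/HTTP/mail calls.

```python
# Registry of action handlers keyed by template name (names hypothetical).
ACTION_HANDLERS = {}

def action(name):
    def register(fn):
        ACTION_HANDLERS[name] = fn
        return fn
    return register

@action("webhook")
def invoke_webhook(output, url):
    return f"POST {url}: {output}"   # stand-in for a real HTTP call

@action("email")
def send_email(output, to):
    return f"mail {to}: {output}"    # stand-in for a real mailer

def run_action(spec, output):
    """Dispatch query output to the handler named in the action spec."""
    handler = ACTION_HANDLERS[spec["type"]]
    return handler(output, **spec["params"])

result = run_action({"type": "webhook", "params": {"url": "http://x"}}, "alert!")
```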
27. Partitioner
● Re-shuffle events based on key
● Support predicate pushdown through query analysis
● Support column pruning through query analysis (WIP)
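The benefit of predicate pushdown in the partitioner is that the query's filter runs before the key-based re-shuffle, so fewer events cross the wire. A minimal sketch, with a hypothetical key function and predicate:

```python
def partition(events, key_fn, num_partitions, predicate=None):
    """Hash-partition events by key, filtering first if a predicate was pushed down."""
    if predicate is not None:
        events = [e for e in events if predicate(e)]   # pushdown: drop events early
    partitions = [[] for _ in range(num_partitions)]
    for e in events:
        partitions[hash(key_fn(e)) % num_partitions].append(e)
    return partitions

events = [{"city": "sf", "fare": 5}, {"city": "sf", "fare": 50}]
parts = partition(events, lambda e: e["city"], 4, predicate=lambda e: e["fare"] > 10)
# only the fare=50 event is shuffled
```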
28. Query processor
● Parse Siddhi queries into execution plan runtime
● Process events in Siddhi execution plan runtime
● Checkpoint state regularly to ensure recovery upon crash/restart using
RocksDB
29. Action processor
● Execute actions upon the query processing output
● Support various kinds of actions for easy integration
● Implement action retry mechanism using RocksDB to provide at-least-once
delivery
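The at-least-once retry pattern can be sketched as: persist a pending action before executing it, delete it only after success, and replay whatever remains after a crash (so an action may run twice, but never zero times). A plain dict stands in for RocksDB here; the structure is hypothetical.

```python
store = {}   # stand-in for a RocksDB store of pending actions

def enqueue(action_id, payload):
    store[action_id] = payload          # persist before attempting

def process_pending(execute, max_retries=3):
    """Execute pending actions; failed ones stay in the store for retry."""
    done = []
    for action_id, payload in list(store.items()):
        for _ in range(max_retries):
            try:
                execute(payload)
                del store[action_id]    # delete only after success
                done.append(action_id)
                break
            except Exception:
                continue                # still in store -> retried later

    return done

calls = []
enqueue("a1", "notify rider")
completed = process_pending(lambda p: calls.append(p))
# completed == ["a1"]; the store is empty again
```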
30. How do we translate a query into a physical plan that runs?
31. DAG (Directed Acyclic Graph) generation
● Analyze Siddhi query to automatically generate the stream processing DAG in
Samza using the processors
○ Filter, transformation
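The DAG generation step above can be sketched as mapping clauses of the analyzed query to an ordered chain of the processors described earlier. The parsed-query shape is hypothetical; the stage names mirror the partitioner, query processor, and action processor slides.

```python
def build_dag(parsed_query):
    """Map query clauses to an ordered list of processor stages (a linear DAG)."""
    stages = []
    if parsed_query.get("group_by"):
        stages.append("partitioner")      # re-shuffle events by the group-by key
    stages.append("query processor")      # run the Siddhi execution plan runtime
    if parsed_query.get("actions"):
        stages.append("action processor") # fire actions on the query output
    return stages

dag = build_dag({"group_by": "city", "actions": ["email"]})
# dag == ["partitioner", "query processor", "action processor"]
```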
35. REST API backend
● All queries and actions are stored externally in a database
● RESTful API for CRUD operations
● If query/action logic changes
○ Redeploy the Samza DAG if needed
○ Otherwise, the updated queries/actions will be loaded at runtime w/o interruption
36. Unified management and monitoring
● Every use case
○ shares the same set of processors
○ uses queries and actions to describe its processing logic
● A single monitoring template can be reused across different use cases
37. Production status
● In production for >1.5 years
● 120+ production use cases
● 30+ billion messages processed per day
40. Out-of-order event handling
● Not a big concern
○ Events of the same rider/partner are usually seconds apart
● K-slack extension in Siddhi for out-of-order event processing
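K-slack buffers each event and releases it only after an event at least K time units newer has been seen, emitting in timestamp order. Siddhi's extension follows this idea; the sketch below is a minimal stand-alone version.

```python
import heapq

def k_slack(events, k):
    """Reorder a (timestamp, payload) stream that is out of order by at most k."""
    buffer, max_ts, out = [], float("-inf"), []
    for ts, payload in events:
        heapq.heappush(buffer, (ts, payload))
        max_ts = max(max_ts, ts)
        # Safe to emit anything older than the newest timestamp minus the slack.
        while buffer and buffer[0][0] <= max_ts - k:
            out.append(heapq.heappop(buffer))
    while buffer:                       # flush remaining events at end of stream
        out.append(heapq.heappop(buffer))
    return out

ordered = k_slack([(1, "a"), (3, "b"), (2, "c"), (4, "d")], k=2)
# ordered == [(1, "a"), (2, "c"), (3, "b"), (4, "d")]
```

The trade-off is latency: each event waits up to K time units before it can be emitted, which is acceptable when out-of-orderness is bounded and small.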
41. Auto-scaling
● Manually re-partition Kafka topics to increase parallelism
● Manually tune container memory if needed
● Future
○ Use CPU/memory/IO stats to automate the process
43. Large checkpointing state
● Samza uses Kafka to log state changes
● Siddhi engine snapshot can be large
● Kafka's message size limit is 1 MB by default
● Solution: we built logic to slice state into smaller pieces and checkpoint
them
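The slicing idea can be sketched as chunking the snapshot under the message limit and tagging each chunk with its index and total count so the state can be reassembled on restore. The chunk-key format here is invented for illustration.

```python
MAX_CHUNK = 1_000_000  # stay under Kafka's default 1 MB message limit

def slice_state(snapshot: bytes):
    """Split a large snapshot into (key, chunk) messages under the size limit."""
    chunks = [snapshot[i:i + MAX_CHUNK] for i in range(0, len(snapshot), MAX_CHUNK)]
    return [(f"chunk-{i}/{len(chunks)}", c) for i, c in enumerate(chunks)]

def restore_state(messages):
    """Reassemble the snapshot by numeric chunk index."""
    index = lambda key: int(key.split("-")[1].split("/")[0])
    return b"".join(c for _, c in sorted(messages, key=lambda m: index(m[0])))

snapshot = b"x" * 2_500_000
messages = slice_state(snapshot)        # 3 messages, each <= 1 MB
assert restore_state(messages) == snapshot
```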
44. Synchronous checkpointing
● Samza checkpointing is synchronous with message processing
● If state is large, checkpointing can take long and cause processing lag
● Incremental state checkpointing
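Incremental checkpointing can be sketched as tracking dirty keys and committing only the entries changed since the last checkpoint, which shrinks the synchronous pause per commit. The state layout below is hypothetical.

```python
class IncrementalState:
    """State store that checkpoints only keys modified since the last commit."""

    def __init__(self):
        self.state, self.dirty = {}, set()

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)       # remember what changed

    def checkpoint(self):
        """Return only the changed entries and clear the dirty set."""
        delta = {k: self.state[k] for k in self.dirty}
        self.dirty.clear()
        return delta

s = IncrementalState()
s.put("rider:1", 10)
s.put("rider:2", 20)
first = s.checkpoint()     # both keys: full initial delta
s.put("rider:1", 11)
second = s.checkpoint()    # only the re-written key
```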
45. Exactly-once state processing?
● Cannot commit state and offset atomically
● No exactly-once state processing
46. Custom business logic
● Common logic implemented as Siddhi extensions
● Ad-hoc logic implemented as UDFs in JavaScript or Scala, inline with the
query
47. Intermediate Kafka messages
● Samza uses Kafka as message queue for intermediate processing output
○ Each stage is independent of each other
○ This can create a large load on Kafka if a heavy topic is re-shuffled multiple times
■ Encode the intermediate messages to reduce footprint
48. Upgrading Samza jobs
● Upgrading Samza jobs requires a full restart, and can take minutes due to
○ Offset checkpointing topic too large → set retention to hours or enable compaction
○ Changelog topic too large → set retention / enable compaction in Kafka, or use host affinity
● To minimize the interruption during upgrade, it would be nice to have
○ Rolling restart
○ Per container restart
49. Our solution: non-interrupted handoff
● For critical jobs, we use replication during upgrade
○ Start a shadow job
○ Upgrade shadow
○ Switch primary and shadow
○ Upgrade primary
○ Switch back
● Downside: requires 2x capacity during upgrade