Complex event processing platform handling millions of users - Krzysztof Zarzycki, GetInData

Complex Event Processing platform handling
millions of users
Krzysztof Zarzycki - CTO @ Getindata

© Copyright. All rights reserved. Not to be reproduced without prior written consent.
About us
Founded in 2014 by
ex-Spotify engineers.
Focus only on Big Data and
Cloud (from day 1)
Community builders (Big
Data Tech Warsaw
organizers)
50+ Big Data engineers

What is it?
The application logic, analytics, and queries exist continuously, and data ﬂows through them continuously.
Stream Processing

Why is it important for business?
Stream Processing

Actionable insights
Stream Processing

Stream Processing
Why is it important for engineering?

It’s NOT only about real-time
It’s just natural - data comes continuously.
Stream Processing

Stream Processing
User sessions spanning minutes, hours, or days
Batch boundaries are often artiﬁcial.
♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪ ♪
[9:00 - 10:00) [10:00-11:00)

Complex Event Processing
● Analyze patterns,relations, cause-and-effect
○ If A & B then C
● Infer business-relevant events from raw technical stream
● .. and cascade extraction of even higher-level events
● Alerts, triggers, workﬂow automation

Complex Event Processing
● behavioral marketing
● product analytics
● business activity monitoring
● technical monitoring and anomaly detection
● IoT
● fraud detection

ESP vs CEP
● Difference is blurry and diminishing
● Traditional CEP
○ Complex proc, low latency, single-machine
○ high-level language like SQL
● Traditional ESP
○ straightforward, high-throughput, distributed
○ Broader, more generic and low-level
● NOW: Best of both!
○ Often called “Streaming Analytics”

The need
● Streaming model and real-time
● On par with batch
○ Enrichment, Joins
○ Aggregation
○ Reprocessing of historical data
○ Machine Learning scoring, inference
○ Complex Event Processing
● large scale, high-throughput
● correctness and fault tolerance

The solution
Apache Flink
open-source stateful processor over massive data streams

Who uses Flink

We use Flink!
Banks Telcos Automotive Adtech
Commiters to Flink

Late & Out-of-sequence
events
Breaks correctness.
Often handled with very tedious user code.
Or solved in batch by “waiting enough” and “processing twice”.

Late & Out-of-sequence
events
Handled by Flink in the framework
Based on watermarks heuristics, that marks the progress of event time in the stream.
Asserts that all earlier events have probably arrived.

Local State
Operational state obligatory for analytics
Used for accumulators, windows, source offsets, tracking patterns, ...
6
sum
1
3
2
4
1
1

Local State
Persistent operational state local to computation
Maximize performance with millions of updates per second & core.
Enable out-of-core (more than RAM) processing, with RocksDb
State
Task 1
Logic State
Task N
Logic

Fault tolerance
(Checkpointing)
State survives abrupt crashes or just maintenance
Checkpointed regularly to resilient external storage.
Accurate - keeps stream offsets, accumulators or windows in perfect sync, consistency.
Efﬁcient - almost no impact on the processing.

Flink Cluster

APIs
Java/Scala
high-level Table API
mid-level dataﬂow
low-level advanced for tricky cases
Developer Data Scientist Analyst
SQL
Incl. analytical functions
MATCH_RECOGNIZE
and UDF extensibility
Python
Based on Table API

Big picture

Assisting Millions of User in Real-Time
Kcell case

About Kcell
Kcell has a strong software
development team and lots
of experience in building
services and products
We like innovations
> 10 000 000 subscribers
Largest GSM operator
in Kazakhstan
4G (40%), 3G (73%), 2G
(96%) population
Great network
coverage
There is the ongoing
process of company digital
transformation
Not only telco

Business needs
Assisting Millions of Users in Real-Time
SMS events
Voice usage events
Data usage events
Roaming events
Location events
Input Process Actions

Use Cases
Use case scenarios. Just few of many.
Case
If subscriber top-ups her balance too often in
short period of time. We can offer her a less
expensive tariff or auto-payment services.
Balance Top Up Case
Trigger UI

Roaming
Fraud
Trigger to Marketing Platform if subscriber
visited X country OR/AND registered in Y
visited mobile network and his device's type
is Z
Roaming case
Send an email to the anti-fraud unit if
subscriber registered in roaming but his
balance at the moment is equal to 0.
This situation is impossible in standard case.
Fraud case in roaming

Personalized Notiﬁcations
Business Automation
Regulatory

Future Work
We have already done a lot. But more great things are coming.
2020 Q3 2020 Q4 2021 Q1 Bright Future
More Data Sources
More Triggers
Geolocation data
Equipment logs
Commoditize Machine Learning
Extract value from ML company-wide!
Enable easy ML training and productization
of models in real-time
Real-time BI
Intraday view on business and
operations
Monetize valuable insights from
our combined rich data sources.
Data Monetization
Predictive maintenance
Network Optimization
To lower operational costs
And make better investments
And many more...
Create behavioral proﬁle of the
customers for better
personalised serving
Customer 360 view

Old System
Why did we start to look for the new solution?
External Vendor
Solution
Blackbox Solution
Scalability issues
Not reliable
1
2
3
Kcell Developers can’t ﬁx, tweak or optimize it
Limited to ~2000 events / sec
Can’t support all needed data sources
Multiple accidents which took too much time to resolve

Scale
Required system throughput
500K
Events / second
10M
Subscribers
40
TB / month

New Solution
Real-time Stream Processing
ingestion outgestion
events
hub
events
processing
HTTP
push/pull
FTP
NFS
MQ
HTTP
push/pull
FTP
MQ

New Solution
flink
events
hub
events
processing
HTTP
push/pull
FTP
NFS
MQ
HTTP
push/pull
FTP
MQ
flink flink

New Solution (Operations)
Web UI, Monitoring, Security
flink
events
hub
events
processing
HTTP
push/pull
FTP
NFS
MQ
HTTP
push/pull
FTP
MQ
Admin UI
(Triggers workbench)
Monitoring
Loki - logs
Prometheus/Grafana -
metrics
Security
FreeIPA
Kerberos
LDAP/AD
API (kafka based)
flink flink

New Solution (Data Lake)
Data Lake and Sub-second OLAP Analytics
flink
events
hub
events
processing
HTTP
push/pull
FTP
NFS
MQ
HTTP
push/pull
FTP
MQ
Data Lake
Historical Storage (HDFS)
Batch (Spark) SQL (Hive)
Keep history, Report, Explore
Column-oriented
Data store
OLAP (Druid)
Interactive BI
flink flink

Processing Flow
raw call events
data usage events
transform
transformed events
transform
transformed events
local state
RocksDB
control topic
Admin UI
HTTP
calls
notiﬁcation
events
outgestion
ingestion
ingestion
submit/stop
triggers

Dynamic Rules
Design
Some treats for Squirrels

Dynamic Rules Design
Key Points
● We want to run 100s of triggers/business rules
● A typical approach: job per rule
● Won’t work in our case:
○ Run 100s of topologies/jobs = multiplied resources cost
○ Pull data from Kafka 100s of times
○ State (user features) replicated 100s times
○ Starting rule requires deployment of the job

Key Points
● Our approach: One job to run all triggers/rules
○ And to consume all the sources
● Trigger “templates” still coded with java
● adding/removing rules without restarting application
● 100s of rules running efficiently

The Overview
billing events
roaming
Sort by time
control topic
notiﬁcation
events
Deduplicate Router
Late events
Trigger 1
Trigger 2
State
Updater
Apply Triggers (CoProcess Function)
Keyed by User

Pros and Cons
Shared resources and costs
● CPU, RAM, state, shufﬂe
● Pulling data from Kafka
One bad rule affects whole system
● Watermarks are shared
● Failures are shared
No job restart on start of new rules
● Rules started by business, no IT
involved
Still need to code rule template in
the job
● No way to use SQL, Table API, CEP
Sharing of state
● Build customer features, that can
be seen by all rules
Can be tricky to debug
● Code is shared
● Code paths enabled externally

Issue: lagging sources slow down all rules
Source A:
highly unordered, late
Source B:
Ordered, low latency
Late notifications
Low latency
notifications
Triggers
Triggers
Group 1
Triggers
Group 2
Source A:
highly unordered, late
Source B:
Ordered, low latency
Triggers
Group
Late notifications
Problem Solution

Flink Changes Wishlist
What could be even better?
attach new branch to existing topology
that receives the same data
Dynamic Topologies
Cheaper topologies
● Graph of topologies that pass
data locally in Flink
● Other words: Local
Proxying/fan-out of Kafka trafﬁc
Share inputs between topologies
Dynamic SQL
SQL
{ }

Decisions made
Some decisions our team made before or during project implementation
Streaming-ﬁrst approach
Apache Kafka for event hub
Apache Flink
Powerful Real-Time Analytics

Apache Avro
Keep state local to the process
Ingest reference data for local joins and
enrichment
● No need to query external systems
while processing
● Data time correlation correctness
Performance
transformed
events
transformed
events
Subscriber proﬁle data
(events)
Local State
Not at >100K
events / sec

Niﬁ for data ingestion (no coding)
● but not for CEP
Web UI for conﬁguring triggers
Ease of Use

Flink on YARN, with HDFS
HA for redundancy and running ~24/7
Prometheus & Grafana for monitoring &
alerting
Loki for logs collection and aggregation
Reliability and battle-tested techniques
Kerberos and AD thanks to FreeIPA
Apache Ranger for authorization
Security

One platform for the whole Enterprise
Batch (adhoc) queries too
● Spark, Hive/Presto
Online analytics
● OLAP
Extensiveness
HDP
Open-source technologies
HDP as a licence-free distribution
Just start with a bunch of servers
Cost-Efﬁciency

Testing
def "should notify when user's balance drops below threshold"() {
given:
BalanceDropTrigger trigger =balanceDropTrigger()
.threshold(
50.0)
.outgestionSystem(
'campaignSystem')
.build()
admin.createsTrigger(trigger)
and:
user.withBalance(
60.0)
when:
user.makesCall(
phoneCall().amountSpent(
20.0))
then:
wait(allowedEventLateness)
and:
List<Notification> actualNotifications =
campaignSystem.getNotifications(
user, trigger)
and:
actualNotifications.size() ==1
assertThat(actualNotifications.first())
.hasMsisdn(
user.msisdn)
.hasBalanceAfter(
40.0)
cleanup:
admin.deletesTrigger(trigger)
}
ﬂink
events
hub
events
processing
Fake
Campaign
System
HTTP
push/pull
FTP
NFS
MQ
Test event generators
Preprod environment

Our Collaboration
Two heads are better than one
Joint development team
Not a vendor solution
Development as one team
Code quality
Code review and
automated tools for
code quality control
Agile Practices
Distant geographic
locations, but
everyday standups
Go live quickly!
<4 months to ﬁrst
production case
running 24/7!
Deliver
DevOps/Automation
Knowledge sharing
Constant knowledge
exchange in areas of
expertise
Testing
Separate testing
environment
Automated Unit/E2E tests

Q&A

Complex event processing platform handling millions of users - Krzysztof Zarzycki, GetInData

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Complex event processing platform handling millions of users - Krzysztof Zarzycki, GetInData

Ähnlich wie Complex event processing platform handling millions of users - Krzysztof Zarzycki, GetInData (20)

Mehr von GetInData

Mehr von GetInData (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Complex event processing platform handling millions of users - Krzysztof Zarzycki, GetInData