Kcell, a large telecommunications company in Kazakhstan, implemented a complex event processing platform using Apache Flink to enable real-time processing of millions of user events per second. The system handles SMS, call, data usage, location and other events to trigger personalized notifications, detect fraud and optimize business operations and analytics. Key to the solution was using a single Flink job to run hundreds of dynamic rules via control topics to update triggers, while keeping state local for high performance.
25. About Kcell
Kcell has a strong software
development team and lots
of experience in building
services and products
We like innovations
> 10 000 000 subscribers
Largest GSM operator
in Kazakhstan
4G (40%), 3G (73%), 2G
(96%) population
Great network
coverage
There is the ongoing
process of company digital
transformation
Not only telco
26. Business needs
Assisting Millions of Users in Real-Time
SMS events
Voice usage events
Data usage events
Roaming events
Location events
Input Process Actions
27. Use Cases
Use case scenarios. Just few of many.
Case
If subscriber top-ups her balance too often in
short period of time. We can offer her a less
expensive tariff or auto-payment services.
Balance Top Up Case
Trigger UI
28. Roaming
Fraud
Trigger to Marketing Platform if subscriber
visited X country OR/AND registered in Y
visited mobile network and his device's type
is Z
Roaming case
Send an email to the anti-fraud unit if
subscriber registered in roaming but his
balance at the moment is equal to 0.
This situation is impossible in standard case.
Fraud case in roaming
30. Future Work
We have already done a lot. But more great things are coming.
2020 Q3 2020 Q4 2021 Q1 Bright Future
More Data Sources
More Triggers
Geolocation data
Equipment logs
Commoditize Machine Learning
Extract value from ML company-wide!
Enable easy ML training and productization
of models in real-time
Real-time BI
Intraday view on business and
operations
Monetize valuable insights from
our combined rich data sources.
Data Monetization
Predictive maintenance
Network Optimization
To lower operational costs
And make better investments
And many more...
Create behavioral profile of the
customers for better
personalised serving
Customer 360 view
31. Old System
Why did we start to look for the new solution?
External Vendor
Solution
Blackbox Solution
Scalability issues
Not reliable
1
2
3
Kcell Developers can’t fix, tweak or optimize it
Limited to ~2000 events / sec
Can’t support all needed data sources
Multiple accidents which took too much time to resolve
39. Dynamic Rules Design
Key Points
● We want to run 100s of triggers/business rules
● A typical approach: job per rule
● Won’t work in our case:
○ Run 100s of topologies/jobs = multiplied resources cost
○ Pull data from Kafka 100s of times
○ State (user features) replicated 100s times
○ Starting rule requires deployment of the job
40. Dynamic Rules Design
Key Points
● Our approach: One job to run all triggers/rules
○ And to consume all the sources
● Trigger “templates” still coded with java
● adding/removing rules without restarting application
● 100s of rules running efficiently
41. Dynamic Rules Design
The Overview
billing events
roaming
Sort by time
control topic
notification
events
Deduplicate Router
Late events
Trigger 1
Trigger 2
State
Updater
Apply Triggers (CoProcess Function)
Keyed by User
42. Dynamic Rules Design
Pros and Cons
Shared resources and costs
● CPU, RAM, state, shuffle
● Pulling data from Kafka
One bad rule affects whole system
● Watermarks are shared
● Failures are shared
No job restart on start of new rules
● Rules started by business, no IT
involved
Still need to code rule template in
the job
● No way to use SQL, Table API, CEP
Sharing of state
● Build customer features, that can
be seen by all rules
Can be tricky to debug
● Code is shared
● Code paths enabled externally
43. Dynamic Rules Design
Issue: lagging sources slow down all rules
Source A:
highly unordered, late
Source B:
Ordered, low latency
Late notifications
Low latency
notifications
Triggers
Triggers
Group 1
Triggers
Group 2
Source A:
highly unordered, late
Source B:
Ordered, low latency
Triggers
Group
Late notifications
Problem Solution
44. Flink Changes Wishlist
What could be even better?
attach new branch to existing topology
that receives the same data
Dynamic Topologies
Cheaper topologies
● Graph of topologies that pass
data locally in Flink
● Other words: Local
Proxying/fan-out of Kafka traffic
Share inputs between topologies
Dynamic SQL
SQL
{ }
45. Decisions made
Some decisions our team made before or during project implementation
Streaming-first approach
Apache Kafka for event hub
Apache Flink
Powerful Real-Time Analytics
46. Apache Avro
Keep state local to the process
Ingest reference data for local joins and
enrichment
● No need to query external systems
while processing
● Data time correlation correctness
Performance
transformed
events
transformed
events
Subscriber profile data
(events)
Local State
Not at >100K
events / sec
47. Nifi for data ingestion (no coding)
● but not for CEP
Web UI for configuring triggers
Ease of Use
48. Flink on YARN, with HDFS
HA for redundancy and running ~24/7
Prometheus & Grafana for monitoring &
alerting
Loki for logs collection and aggregation
Reliability and battle-tested techniques
Kerberos and AD thanks to FreeIPA
Apache Ranger for authorization
Security
49. One platform for the whole Enterprise
Batch (adhoc) queries too
● Spark, Hive/Presto
Online analytics
● OLAP
Extensiveness
HDP
Open-source technologies
HDP as a licence-free distribution
Just start with a bunch of servers
Cost-Efficiency
51. Our Collaboration
Two heads are better than one
Joint development team
Not a vendor solution
Development as one team
Code quality
Code review and
automated tools for
code quality control
Agile Practices
Distant geographic
locations, but
everyday standups
Go live quickly!
<4 months to first
production case
running 24/7!
Deliver
DevOps/Automation
Knowledge sharing
Constant knowledge
exchange in areas of
expertise
Testing
Separate testing
environment
Automated Unit/E2E tests