Auto-scaling is the flagship selling point of many Data Engineering technologies. Among today's tools, Kafka-Streams stands out. With its tight integration with the Apache Kafka message bus, it is designed as a distributed framework capable of scaling out. In practice, however, using it on its own is limiting, and orchestrating these applications is usually necessary.
In this talk, we cover containerization, orchestration and monitoring, the key elements that let us take full advantage of the scalability of Kafka-Streams applications, built around technologies such as Kubernetes and Stackdriver.
By Loic Divad, Data Engineer at Xebia
@Xebiconfr #Xebicon18 @LoicMDivad
Kafka-Streams and the consumer protocol
[Diagram: a single APP instance consuming topic-partition-0 through topic-partition-N]
● Every topic in Kafka is split into one or more
partitions
● All the streaming tasks are executed by one or
more threads of the same instance
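The task-to-thread mapping above can be sketched as follows; this is a simplified illustration (the task naming and the round-robin spread are assumptions, not the exact Kafka-Streams assignor):

```python
# Hypothetical sketch: one stream task per input partition, tasks spread
# across the threads of a single Kafka-Streams instance.
def assign_tasks_to_threads(num_partitions, num_threads):
    tasks = [f"task-0_{p}" for p in range(num_partitions)]
    threads = {t: [] for t in range(num_threads)}
    for i, task in enumerate(tasks):
        # spread tasks over the available threads of this instance
        threads[i % num_threads].append(task)
    return threads

print(assign_tasks_to_threads(4, 2))
```

With 4 partitions and 2 threads, each thread ends up running two tasks.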
Kafka-Streams and the consumer protocol
[Diagram: two APP instances sharing the partitions of the same topic]
● Consumers from the same consumer group
cooperate to consume data from topics.
● Each instance that joins the group triggers a
partition rebalance.
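The effect of a rebalance can be sketched with a toy assignment function; this round-robin view is a simplification (the real consumer protocol uses pluggable partition assignors):

```python
def rebalance(partitions, members):
    # Simplified round-robin view of a consumer-group rebalance:
    # each time the member list changes, partitions are redistributed.
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment

parts = [f"topic-partition-{i}" for i in range(4)]
print(rebalance(parts, ["app-1"]))           # one instance owns every partition
print(rebalance(parts, ["app-1", "app-2"]))  # a joining member takes over half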
Kafka-Streams and the consumer protocol
[Diagram: four APP instances, one per partition]
● The maximum parallelism is determined by the
number of partitions of the input topic(s).
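A small helper makes the parallelism limit concrete: since a partition is consumed by at most one instance of the group, any instance beyond the partition count stays idle (helper name is illustrative):

```python
def idle_instances(num_partitions, num_instances):
    # A partition is consumed by at most one instance of a consumer group,
    # so instances beyond the partition count receive no work.
    return max(0, num_instances - num_partitions)

print(idle_instances(4, 6))  # with 4 partitions, 2 of 6 instances sit idle
```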
K8s: Horizontal Pod Autoscaler
- A Kubernetes resource
- Periodically adjusts the number of replicas
- Based on CPU usage in autoscaling/v1
- Memory and custom metrics are covered by
autoscaling/v2beta1
- Uses the metrics.k8s.io API through a metrics server
➔ Source: Kubernetes.io Documentation
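A minimal HPA manifest for such a deployment could look like this; the resource names (`kafka-streams-app`) are placeholders, and `maxReplicas` is capped at the input topic's partition count, per the parallelism limit above:

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: kafka-streams-app-hpa   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-streams-app     # hypothetical deployment
  minReplicas: 1
  maxReplicas: 4                # bounded by the input topic partition count
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
```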
CONCLUSION
State migration, changelog compaction, topology
upgrades and the adoption of k8s StatefulSets are the
next challenges to ease auto-scaling
BUILD THE FUTURE
1. Kafka-Streams exposes relevant metrics
related to stream processing
2. Consumer lag is one of the key metrics to
monitor in real-time applications
3. The cloud-native trend brings a set of
powerful tools on which the Kafka
community keeps a close eye
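Consumer lag, the key metric of point 2, can be sketched as a per-partition difference (the function and offset names are illustrative, not a Kafka API):

```python
def consumer_lag(log_end_offsets, committed_offsets):
    # Lag per partition: latest offset in the log minus the group's last
    # committed offset (a partition with no commit counts from offset 0).
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

print(consumer_lag({"p0": 120, "p1": 45}, {"p0": 100}))
```

A growing lag signals that the instances cannot keep up with the input rate, the typical trigger for scaling out.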
HPA & thrashing: “Should I stay or should I go?”
The Horizontal Pod Autoscaler algorithm depends on the current metric value and the current replica count:
desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
➢ A ratio of two doubles the number of instances, within the limit of maxReplicas
➢ With targetAverageValue, the metric is computed as the average of the given metric across all Pods
Because the metrics are dynamic, the number of replicas may fluctuate frequently; this is called thrashing
➢ --horizontal-pod-autoscaler-downscale-delay (default 5m0s)
➢ --horizontal-pod-autoscaler-upscale-delay (default 3m0s)
Note: both Kafka-Streams topology modifications and the HPA make rolling updates impossible
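The formula above translates directly into code; the `maxReplicas` clamp is included to mirror the HPA behaviour described in the bullets (function name is illustrative):

```python
import math

def desired_replicas(current_replicas, current_metric, desired_metric, max_replicas):
    # desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)],
    # clamped to maxReplicas as the HPA does
    return min(max_replicas,
               math.ceil(current_replicas * current_metric / desired_metric))

print(desired_replicas(2, 200, 100, 10))  # a ratio of two doubles the replicas → 4
print(desired_replicas(3, 200, 100, 4))   # the doubling is capped by maxReplicas → 4
```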
Apache Kafka now supports more than 200K partitions per cluster
https://www.confluent.io/blog/apache-kafka-supports-200k-partitions-per-cluster
Use Case - King Of Fighters: combo sessionization
[Diagram: Streaming App pipeline — Correlate, Flatten, Decode, Group, Produce Back]
Key => {"ts": 1542609460412, "machine": "903071", "zone": "AU"}
Value => {"bytes": ["c3ff8ab19d00d9e5", "e3ff8c72b600d9e5"]}
[{"impact": 0, "key": "X", "direction": "DOWN", "type": "Missed", "level": "Pro", "game": "Neowave"}, ...]
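Two of the pipeline stages (Flatten and Group) can be sketched on records shaped like the samples above; the logic is invented for illustration and is not the talk's actual Kafka-Streams topology:

```python
from collections import defaultdict

# Sample input shaped like the Key/Value records above.
records = [
    {"key": {"ts": 1542609460412, "machine": "903071", "zone": "AU"},
     "value": {"bytes": ["c3ff8ab19d00d9e5", "e3ff8c72b600d9e5"]}},
]

def flatten(records):
    # emit one event per raw frame, keyed by the machine that produced it
    for record in records:
        for frame in record["value"]["bytes"]:
            yield record["key"]["machine"], frame

def group_by_machine(events):
    # collect all frames of the same machine into one session candidate
    sessions = defaultdict(list)
    for machine, frame in events:
        sessions[machine].append(frame)
    return dict(sessions)

print(group_by_machine(flatten(records)))
```

Decoding each hex frame into a combo event (impact, key, direction, ...) would then happen per frame before the correlation step.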