Joins in Kafka Streams and ksqlDB are a killer feature for data processing, and basic join semantics are well understood. However, in a streaming world records are associated with timestamps that impact the semantics of joins: welcome to the fabulous world of _temporal_ join semantics. For joins, timestamps are as important as the actual data, and it is important to understand how they impact the join result.
In this talk we take a deep dive into the different types of joins, with a focus on their temporal aspects. Furthermore, we relate the individual join operators to the overall "time engine" of the Kafka Streams query runtime and explain its relationship to operator semantics. To allow developers to apply their knowledge of temporal join semantics, we provide best practices, tips and tricks to "bend" time, and configuration advice to get the desired join results. Last, we give an overview of recent developments, and an outlook on future ones, that improve joins even further.
2. 09.02.2021
Cloud-Native Kafka – KAFKA SUMMIT 2021
2
Sascha Holtbrügge
Big Data Architect
SVA System Vertrieb Alexander GmbH
sascha.holtbruegge@sva.de
Sascha Bleckmann
Big Data Engineer
SVA System Vertrieb Alexander GmbH
sascha.bleckmann@sva.de
3. CLOUD-NATIVE KAFKA
Big Data Analytics & IoT @ SVA
• more than 170 consultants
• broad range of backgrounds: engineers, physicists, mathematicians, computer scientists, statisticians, psychologists, …
• all required roles for an e2e big data solution
• Data Scientist – algorithms and statistics
• Data Engineer – development
• Architect – IT Infrastructure, Platform
• Managed Services – Operations
• DevOps – agility and methodology
• strong ecosystem (Confluent, Elastic, Splunk …)
5. 09.02.2021
Cloud native technologies empower
organizations to build and run scalable
applications in modern, dynamic environments
such as public, private, and hybrid clouds.
CNCF Cloud Native Definition v1.0
6. CLOUD-NATIVE KAFKA
What are the key points of cloud native technologies?
Best practices: containers, service meshes, microservices, "immutable infrastructure", and "declarative APIs"
Loosely coupled systems
Decoupling infrastructure and platform
Hardware and operating systems become transparent to the application and are treated as disposable resources
What would be a feasible approach to achieve this goal?
Cloud Native
7. CLOUD-NATIVE KAFKA
Building a platform on top of an infrastructure
How may a single container platform be operated on dedicated hardware resources?
9. CLOUD-NATIVE KAFKA
"Kubernetes, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications."
Open Source project within the Cloud Native Computing Foundation
Traces back to the principles of Google's "Borg", the system that powers the whole Google platform
Kubernetes enables complex orchestration and deployment scenarios
Declarative approach using "manifests" that describe the desired state of resources
Kubernetes introduces the "Pod" as its smallest unit: one container, or a group of containers, scheduled together on a "worker node"
The "kubelet" process runs the pods on their respective nodes
Storage, network, … are treated as disposable resources that Kubernetes manages for the pods
Kubernetes
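The declarative approach described on this slide can be illustrated with a minimal sketch. It builds a Pod manifest as a plain Python dictionary and prints it as JSON. The helper `pod_manifest`, the name `demo-app`, and the image reference are hypothetical and only serve as an illustration; they do not come from the talk.

```python
import json


def pod_manifest(name: str, image: str) -> dict:
    """Return the desired state of a single-container Pod as a manifest."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # A Pod may hold several containers; this sketch uses one.
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical application name and image, chosen for illustration only.
manifest = pod_manifest("demo-app", "registry.example.com/demo-app:1.0")
print(json.dumps(manifest, indent=2))
```

The manifest only states what should exist; how the pod gets scheduled and started is left to Kubernetes. The printed JSON could be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.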
11. CLOUD-NATIVE KAFKA
We introduced a shared platform by decoupling the infrastructure, using Kubernetes:
… processes to be run may be described declaratively!
Configuration of the application's environment can be prepared by the development team
Containers are already provided with their respective run-time environment
… the system is resilient to failure and embraces load balancing!
Resources are shared fairly between all services running on the platform
Node failure may be compensated by other nodes in the cluster
… new problems appear on the horizon.
Provisioning the Kubernetes platform
12. CLOUD-NATIVE KAFKA
What happens if …
… the sink is not reachable?
… the source transmits more data in a short period than the sink can handle?
… the data has an unexpected format?
… the number of participating sources and sinks increases further?
… the overall load of the system increases?
Data Integration between source and sink
[Diagram: data flowing from a source to a sink]
14. CLOUD-NATIVE KAFKA
Apache Kafka is a scalable, reliable, and distributed streaming platform, optimized for high data rates and throughput:
Out-of-the-box integrations, for example with legacy services, databases, and external services
Guarantees regarding delivery and order of messages
High throughput, even under very high message load
Loosely coupled sources and sinks
Transmission in real time
Horizontal scalability
Open source project, complemented by commercial features
Why Apache Kafka?
16. CLOUD-NATIVE KAFKA
Apache Kafka
Producers and consumers are completely decoupled
Producers write data as soon as it is available
Consumers read and process data at the pace they can handle, from the source they choose
Consumers are organized in consumer groups
Load is balanced across all members of a consumer group
Messages may be replayed, even if already processed
If needed, messages may even be stored indefinitely, for example using the Tiered Storage feature of Confluent Platform 6.0!
With the right architectural decisions, the system scales horizontally without a built-in limit
Kafka vs. Message Queue
Message Queue
Processing in the style of a "command queue"
Normally, replaying messages is not intended
Routing and processing/filtering of messages can happen in the MQ system itself
i.e. business logic has to be managed in the MQ system
A "subscription" must be registered before messages are delivered
Messages are pushed to the respective parties by the MQ
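The difference between the two columns above can be sketched with a small in-memory simulation. It is not the Kafka client API: `TopicLog`, `assign_partitions`, and the group names `billing` and `audit` are hypothetical and only model the ideas of a retained log, per-group offsets, replay, and load balancing inside a consumer group. The round-robin distribution is a simplification, not Kafka's actual assignment strategy.

```python
from collections import defaultdict


class TopicLog:
    """Append-only log: reading a message never removes it."""

    def __init__(self):
        self._messages = []
        # One read position per consumer group (assumption: single partition).
        self._offsets = defaultdict(int)

    def produce(self, message):
        # Producers append without knowing who will read.
        self._messages.append(message)

    def poll(self, group, max_records=10):
        # Each group pulls at its own pace from its own offset.
        start = self._offsets[group]
        batch = self._messages[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        # Moving the offset back replays already processed messages.
        self._offsets[group] = offset


def assign_partitions(partitions, members):
    """Spread partitions over group members round-robin (simplified)."""
    assignment = {member: [] for member in members}
    for index, partition in enumerate(partitions):
        assignment[members[index % len(members)]].append(partition)
    return assignment


log = TopicLog()
for event in ["a", "b", "c"]:
    log.produce(event)

first_read = log.poll("billing")              # reads everything so far
log.seek("billing", 0)
replayed = log.poll("billing")                # same messages again
slow_read = log.poll("audit", max_records=2)  # independent, slower group
shares = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
```

In a push-based message queue, the broker decides when a message is delivered and usually discards it afterwards; here the consumer groups decide what to read and when, which is what makes replay possible.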
17. CLOUD-NATIVE KAFKA
Operating a Kafka cluster, especially on Kubernetes, raises some distinct challenges:
Kafka's brokers are stateful applications
The broker ID assigned to a broker must not change
Each broker has its own data persistence layer, and that assignment must not change either
ZooKeeper is needed to operate a Kafka cluster
It is a stateful application as well
Used for coordinating the brokers and for metadata about topics, ACLs, …
Further components need to be configured along with the cluster
Kafka Connect, ksqlDB, …
Kafka on Kubernetes
18. 09.02.2021
How can a system of such complexity be managed efficiently on the Kubernetes platform?
Operator Pattern!
20. CLOUD-NATIVE KAFKA
Operators enable automated deployment of applications on Kubernetes:
Custom APIs are declared in Kubernetes by defining "Custom Resource Definitions" (CRDs)
Kubernetes provides the technical foundation: the API server handles resource lifecycle management
An operator is a dedicated process in the cluster that uses the Kubernetes API to watch and manage the state of the custom resources (the instances of a CRD)
Changes to these custom resources trigger a reconciliation
The operator manages dependent Kubernetes objects, such as Deployments, ConfigMaps, …
Operator Pattern
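The reconciliation step at the heart of the operator pattern can be sketched as a pure function that compares desired and observed state and returns the actions needed to converge. This is a simplified model under stated assumptions, not the code of any real operator: `TopicSpec`, `reconcile`, and the topic names are hypothetical, and a real operator would obtain both states through the Kubernetes API instead of from dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSpec:
    """Hypothetical desired state carried by a custom resource."""
    partitions: int
    replicas: int


def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions that move the observed state to the desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(("create", name))
        elif observed[name] != spec:
            actions.append(("update", name))
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state as declared in custom resources (illustrative values).
desired = {"orders": TopicSpec(6, 3), "payments": TopicSpec(3, 3)}
# State currently found in the cluster (illustrative values).
observed = {"orders": TopicSpec(3, 3), "legacy": TopicSpec(1, 1)}

actions = reconcile(desired, observed)
print(actions)
```

Running the same function again once the states match yields no actions, which is why reconciliation can safely be triggered on every change.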
21. CLOUD-NATIVE KAFKA
Strimzi Kafka Operator
Open Source project
https://github.com/strimzi/strimzi-kafka-operator
Kafka's components as Custom Resource Definitions
Kafka, KafkaConnect, KafkaConnector, …
Topics can be created and managed as custom resources
Kafka Operators for Kubernetes
Confluent Operator
Commercial product with enterprise platform
Confluent Platform may be used alongside Kafka
The current operator is based on a stacked Helm chart, but:
Confluent Operator 2.0 has reached the "Early Access" phase and is based entirely on Custom Resource Definitions!
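As a sketch of how a topic might be declared as a custom resource with Strimzi, the following builds a `KafkaTopic` manifest as a Python dictionary. The helper `kafka_topic`, the topic name, and the cluster name `my-cluster` are hypothetical. The `apiVersion` shown is an assumption that depends on the installed Strimzi release, so it should be checked against the CRDs in the target cluster.

```python
import json


def kafka_topic(name, cluster, partitions, replicas, retention_ms=None):
    """Build a Strimzi KafkaTopic custom resource as a plain dictionary."""
    config = {}
    if retention_ms is not None:
        config["retention.ms"] = retention_ms
    return {
        # Assumption: API version differs between Strimzi releases.
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # Label that tells the operator which Kafka cluster owns the topic.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": config,
        },
    }


# Hypothetical topic and cluster names, for illustration only.
topic = kafka_topic("orders", "my-cluster", partitions=6, replicas=3,
                    retention_ms=604800000)
print(json.dumps(topic, indent=2))
```

Because the topic is just another Kubernetes resource, it can be versioned, reviewed, and deleted together with the application that uses it.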
23. CLOUD-NATIVE KAFKA
The customer wants to operate a multi-tenant environment
Kubernetes platform and Kafka cluster shall be shared between all tenants
Each namespace is dedicated to a single tenant's environment
Deployment and operation of externally developed applications
Applications are delivered as OCI-compatible containers along with Helm charts
The subset of deployed applications differs for every tenant
Still, every application communicates via Kafka as the central data hub
Test environments should be started and stopped on the fly
Topics should be created and deleted dynamically along with their environments
Kafka connectors also need to be created and stopped automatically
How does a Kafka operator support that?
24. CLOUD-NATIVE KAFKA
Continuous Delivery aims to accelerate the software delivery process while improving quality
Minimizing the duration of a single development cycle
Using deployment pipelines and well-defined processes to deliver the software product
Cloud native technologies support this intent
Declarative resource definitions and decoupled systems significantly decrease maintenance and administration overhead
Systems are less tightly coupled, which reduces overall complexity
ArgoCD is a Kubernetes operator that enables GitOps based on Kubernetes manifests
Deployment of the desired Kubernetes resources is itself triggered by Kubernetes resources
Continuous synchronization with the Git repository
GitOps and Continuous Delivery
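The idea that a deployment is itself triggered by a Kubernetes resource can be sketched with an ArgoCD `Application` manifest, again built as a Python dictionary. The helper `argocd_application`, the repository URL, the path, and the namespace names are hypothetical, and the sketch assumes ArgoCD is installed in a namespace called `argocd`.

```python
import json


def argocd_application(name, repo_url, path, namespace):
    """Build an ArgoCD Application that syncs a Git path into a namespace."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: ArgoCD runs in the "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "path": path,
                "targetRevision": "HEAD",
            },
            "destination": {
                # Address of the cluster ArgoCD itself runs in.
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            # Keep the cluster continuously in sync with the repository.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


# Hypothetical repository and tenant names, for illustration only.
app = argocd_application(
    "tenant-a-test",
    "https://git.example.com/platform/environments.git",
    "tenants/tenant-a/test",
    "tenant-a-test",
)
print(json.dumps(app, indent=2))
```

With automated sync enabled, removing a manifest from the Git path also removes the corresponding resource from the cluster, which fits the requirement of starting and stopping test environments on the fly.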
26. CLOUD-NATIVE KAFKA
Cloud Native technologies benefit from decoupled infrastructure
Kubernetes is a well-suited platform for this
Kafka simplifies and consolidates data streams
Data streams are connected to a single central data hub
Kafka Connect moves data in and out by connecting sources and sinks
Orchestrating complex applications on Kubernetes should be handled with the operator pattern
Kafka can be operated very well on Kubernetes this way
Strimzi Operator and Confluent Operator are able to manage the whole lifecycle of a Kafka instance
Summary