9. Agenda
History
Use cases
Producers and consumers
Topics, partitions and clusters
Streams, AdminClient and Connectors
Kafka in .NET and Cloud
External stream processing systems (Spark/Storm/Flink/Apex)
11. History
Developed at LinkedIn
Open sourced in 2011
Named after the writer Franz Kafka, because the system is optimized for writing
12. Kafka is basically:
Open source
Written in Scala and Java
Message broker
Stream processing platform
High throughput & low latency
Scalable
Designed as a distributed commit log
27. Producer API
Lets applications publish a stream of messages to one or more topics
Asynchronous and thread safe (in the original implementation)
Can deliver messages “at least once”, “at most once” or “exactly once”
Can batch messages
Can spread messages across partitions for load balancing
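As a rough illustration of the last two points, the sketch below models how a producer might pick a partition from a message key and group messages into batches. It is a toy model in plain Python, not the Kafka client API: `choose_partition`, `BatchingProducer` and the CRC32 hash are assumptions made for this example (the real Java client hashes keys with murmur2 by default).

```python
import zlib
from collections import defaultdict


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land on the same partition."""
    return zlib.crc32(key) % num_partitions


class BatchingProducer:
    """Toy producer: buffers messages per partition and flushes full batches."""

    def __init__(self, num_partitions: int, batch_size: int):
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self.buffers = defaultdict(list)  # partition -> pending messages
        self.sent_batches = []            # (partition, [messages]) in send order

    def send(self, key: bytes, value: str) -> int:
        partition = choose_partition(key, self.num_partitions)
        self.buffers[partition].append(value)
        # A full buffer is "sent" as one batch instead of one request per message.
        if len(self.buffers[partition]) >= self.batch_size:
            self._flush_partition(partition)
        return partition

    def _flush_partition(self, partition: int) -> None:
        batch = self.buffers.pop(partition, [])
        if batch:
            self.sent_batches.append((partition, batch))

    def flush(self) -> None:
        """Send whatever is still buffered, for example before shutdown."""
        for partition in list(self.buffers):
            self._flush_partition(partition)
```

Because the partition depends only on the key, all messages for one key keep their relative order inside a single partition, while different keys spread the load across partitions.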
29. Consumer API
Lets applications subscribe to a topic and receive messages from it
Messages are pulled from the topic, so each consumer can process messages at its own pace
Supports long polling, so a consumer does not have to busy-wait when no new messages are available
Each consumer tracks its own position (offset)
No per-message acknowledgements; a consumer commits offsets instead and can rewind to any offset that is still retained
Supports consumer groups
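The pull model and consumer-owned offsets can be sketched as follows. This is a simplified in-memory model, not the Kafka consumer API: `PartitionLog`, `Consumer`, `poll` and `seek` are hypothetical names chosen for the example, and the log here never expires messages.

```python
class PartitionLog:
    """Toy append-only log; a message's offset is its index in the list."""

    def __init__(self):
        self.messages = []

    def append(self, value: str) -> int:
        self.messages.append(value)
        return len(self.messages) - 1  # offset of the appended message


class Consumer:
    """Toy pull-based consumer that tracks its own position in the log."""

    def __init__(self, log: PartitionLog):
        self.log = log
        self.position = 0  # offset of the next message to read

    def poll(self, max_records: int = 10) -> list:
        # The consumer asks for data; the log never pushes anything to it.
        records = self.log.messages[self.position:self.position + max_records]
        self.position += len(records)
        return records

    def seek(self, offset: int) -> None:
        """Rewind (or skip ahead) to an arbitrary offset."""
        self.position = offset


log = PartitionLog()
for message in ["a", "b", "c"]:
    log.append(message)

fast, slow = Consumer(log), Consumer(log)
fast.poll()               # reads all three messages
slow.poll(max_records=1)  # reads only the first one, at its own pace
fast.seek(1)              # rewinds; the next poll returns "b" and "c" again
```

Since the log keeps the messages and each consumer keeps only a position, a slow consumer never holds back a fast one.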
34. Topics
Each topic has a name, is partitioned and is multi-subscriber
Kafka persists each published message. The retention period is configurable.
Each consumer controls its own offset
A single partition must fit on one server, but a topic can be spread across multiple nodes through its partitions
Partitions are replicated across the cluster for fault tolerance; each partition has one leader replica
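To make replication and leadership more concrete, here is a minimal sketch of placing partition replicas on brokers and picking a new leader after a failure. It is a deliberate simplification: `assign_replicas` and `elect_leader` are hypothetical helpers, the round-robin placement is an assumption for the example, and real Kafka only elects leaders from replicas that are in sync.

```python
def assign_replicas(num_partitions: int, brokers: list, replication_factor: int) -> dict:
    """Return {partition: [broker, ...]}; the first broker in each list is the leader."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for partition in range(num_partitions):
        # Place replicas on consecutive brokers, wrapping around the broker list.
        assignment[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return assignment


def elect_leader(replicas: list, failed: set):
    """Pick the first surviving replica as the new leader, or None if all failed."""
    for broker in replicas:
        if broker not in failed:
            return broker
    return None


layout = assign_replicas(num_partitions=3, brokers=["b1", "b2", "b3"], replication_factor=2)
# layout == {0: ["b1", "b2"], 1: ["b2", "b3"], 2: ["b3", "b1"]}
new_leader = elect_leader(layout[0], failed={"b1"})
# new_leader == "b2"
```

With a replication factor of 2, every partition in this layout survives the loss of any single broker.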
37. Cluster
Kafka runs as a cluster
A cluster has multiple servers/nodes (brokers)
A cluster can span multiple datacenters
A cluster stores messages in partitioned topics
ZooKeeper coordinates the servers in the cluster (newer Kafka versions can run without it, using KRaft mode)
39. Streams
Acts as a stream processor
Consumes input from one or more topics and writes processed output to other topics
Works in (almost) real time
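The consume, transform, produce pattern described above can be sketched with plain Python lists standing in for topics. This is not the Kafka Streams API (which is a Java library); `process_stream` and the running word count are assumptions picked only to show a stateful transformation.

```python
from collections import Counter


def process_stream(input_topic: list, output_topic: list, state: Counter) -> None:
    """Read each input message, update the state, emit one result per word."""
    for line in input_topic:
        for word in line.lower().split():
            state[word] += 1
            # Each update is emitted immediately, so downstream readers
            # see running counts rather than a single final result.
            output_topic.append((word, state[word]))


sentences = ["Kafka streams", "kafka"]  # stands in for the input topic
counts_topic = []                       # stands in for the output topic
word_counts = Counter()                 # processor state kept between messages

process_stream(sentences, counts_topic, word_counts)
# counts_topic == [("kafka", 1), ("streams", 1), ("kafka", 2)]
```

The output is itself a stream of updates, which is why another processor or consumer can read it like any other topic.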
44. Kafka in .NET
Main library is confluent-kafka-dotnet
Supports Avro serialization/deserialization with schema registry
Easy to learn, hard to master
51. Kafka in Azure
Azure Event Hubs exposes a Kafka-compatible endpoint, so most Kafka-enabled applications work after changing only the connection configuration
You can set up a Kafka cluster in HDInsight (it’s not cheap)
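As an illustration of the "change only the connection configuration" point for Event Hubs, the sketch below builds the librdkafka-style client settings typically used with the Event Hubs Kafka endpoint. The helper name `event_hubs_kafka_config` and the placeholder values are hypothetical; check the exact keys against the client library you use (Java clients, for instance, pass the credentials through `sasl.jaas.config` instead).

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build Kafka client settings that point at an Event Hubs namespace."""
    return {
        # The Event Hubs Kafka endpoint listens on port 9093.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The username is the literal string "$ConnectionString";
        # the password is the Event Hubs connection string itself.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Placeholder values for illustration only.
config = event_hubs_kafka_config("my-namespace", "<event-hubs-connection-string>")
```

The rest of the producer or consumer code stays the same; an event hub plays the role of a topic.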
52. Kafka in AWS
Amazon Managed Streaming for Apache Kafka (Amazon MSK)
Amazon Kinesis has somewhat similar capabilities
59. Apache Apex
Platform for developing stream- and batch-oriented applications
Designed to process data in motion
Performant
Scalable
Fault tolerant
Lets developers write processing functions without thinking about the distributed environment
60. Apache Flink
Focused on parallel, pipelined processing of streams
Runs Java, Scala, Python and SQL code
Manages state
Great for data analysis and event correlation
61. Apache Spark
Analytics engine for big data processing
Data processing framework
Used for processing and transforming streams of data
Also used for training machine learning models
Great for ETL (extract, transform, load) processes
Supports Java, Scala, Python and R
62. Apache Storm
Distributed real-time computation system
Great for real-time analytics systems (for example, fraud detection)
Can handle massive amounts of data on the fly
Designed to be usable from any programming language