DevOps Fest 2020. Сергій Калінець. Building Data Streaming Platform with Apache Kafka

Building Data Streaming
Platform with Apache Kafka
Serhii Kalinets
System Architect

History of Kafka
Created at LinkedIn
Creators then founded Confluent
Why the name Kafka? Jay Kreps (Confluent CEO): "I thought that since
Kafka was a system optimized for writing, using a writer’s name
would make sense. I had taken a lot of lit classes in college and liked
Franz Kafka. Plus the name sounded cool for an open source project."

Kafka use cases
Message Broker
Logs
Commit log
Streaming

What is Kafka
A publish/subscribe messaging system that has an
interface typical of messaging systems
but a storage layer more like a log-aggregation system

Messaging System
Messages
Topics
Partitions
Producers
Consumers

Messages
Key / value pair; both can be null
Kafka treats both just as bytes
Serialization / deserialization happens on the clients
The Confluent broker can validate messages against a schema

https://kafka.apache.org/intro

How many partitions?
What is the throughput you expect to achieve for the topic?
What is the maximum throughput you expect to achieve when
consuming from a single partition?
Producer throughput can usually be ignored, as producers are typically
much faster than consumers
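The two throughput questions above reduce to a division. A minimal sketch, assuming illustrative figures; the function name and the numbers are hypothetical and not from the talk:

```python
import math

def min_partitions(target_mb_per_s: float, per_consumer_mb_per_s: float) -> int:
    """Smallest partition count that lets one consumer per partition keep up."""
    if target_mb_per_s <= 0 or per_consumer_mb_per_s <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_s / per_consumer_mb_per_s)

# Hypothetical figures: the topic should sustain 1000 MB/s and a single
# consumer can process about 50 MB/s from one partition.
print(min_partitions(1000, 50))
```

The result is a lower bound only; broker disk, network and memory limits still cap how many partitions are sensible.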

How many partitions?
Adding partitions later can be very challenging when messages are
partitioned by key
Consider the number of partitions you will place on each broker and
available disk space and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other
resources on the broker and will increase the time for leader elections.

Producers
Can specify the partition explicitly or implicitly (via partitioners)
The decision is made on the producer side
Different SDKs might have different default partitioners
Adding new partitions can change partition assignments
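Why adding partitions changes assignments can be seen with a toy key-based partitioner. This is a sketch only: it uses CRC32 from the Python standard library as a stand-in hash, whereas real clients use their own hash functions, and the key names are made up.

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Hash the key and take it modulo the partition count (stand-in hash)."""
    return zlib.crc32(key) % num_partitions

# Hypothetical keys; compare their partitions before and after growing the topic.
keys = [f"user-{i}".encode() for i in range(100)]
before = {k: partition_for(k, 3) for k in keys}
after = {k: partition_for(k, 4) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys changed partition after 3 -> 4")
```

Any key that moves ends up with its history split across two partitions, so per-key ordering no longer holds across the change.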

Producer guarantees
Kafka guarantees ordering within a partition for producers
Can be broken by retries if max.in.flight.requests.per.connection > 1
Idempotent producers (retries will not cause duplicates)
Transactions (messages sent within transactions will be available for
consumers only after transaction completes)
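The idea behind idempotent producers can be modelled with sequence numbers. The following is a simplified sketch, not Kafka's implementation: the class and field names are hypothetical, and a real broker tracks sequences per producer and per partition with more bookkeeping.

```python
class PartitionLog:
    """Toy partition that ignores retried records it has already appended."""

    def __init__(self):
        self.records = []
        self.last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        if seq <= self.last_seq.get(producer_id, -1):
            return False  # duplicate caused by a retry: drop it
        self.last_seq[producer_id] = seq
        self.records.append(value)
        return True

log = PartitionLog()
log.append(producer_id=1, seq=0, value="a")
log.append(producer_id=1, seq=1, value="b")
log.append(producer_id=1, seq=1, value="b")  # retry after a lost acknowledgement
print(log.records)  # ['a', 'b']
```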

Consumer Groups
Common group.id
One broker is the group coordinator; one consumer is the group leader
Poll loop
Simple for developer: while (true) { consumer.poll(); processMessages(); }
Complicated implementation: coordination, rebalancing, heartbeats etc.
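Part of what the group hides from the developer is how partitions are divided among members. A toy round-robin assignment, assuming hypothetical consumer names; Kafka ships several assignment strategies and this sketch mirrors none of them exactly:

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers one at a time, in sorted order."""
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment

# Two consumers share six partitions; a third one joining triggers a rebalance.
print(assign_round_robin(range(6), ["c1", "c2"]))
print(assign_round_robin(range(6), ["c1", "c2", "c3"]))
```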

Commits and offsets
Consumers commit their last offsets to Kafka
Automatic / manual commits
Sync / async commits
auto.offset.reset -- where to start reading when there is no committed
offset (earliest or latest)
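How the committed offset and auto.offset.reset interact can be sketched with an in-memory list standing in for a partition. The helper name and the sample messages are hypothetical; committing after processing, as done here, gives at-least-once behaviour.

```python
def start_offset(committed, log_start, log_end, auto_offset_reset="latest"):
    """Pick the position a consumer starts reading a partition from."""
    if committed is not None:
        return committed          # resume where the group left off
    if auto_offset_reset == "earliest":
        return log_start          # oldest record still retained
    if auto_offset_reset == "latest":
        return log_end            # only records produced from now on
    raise ValueError(f"unsupported auto.offset.reset: {auto_offset_reset}")

log = ["m0", "m1", "m2", "m3"]    # toy partition, offsets 0..3
committed = None                  # a brand-new group has nothing committed
processed = []

position = start_offset(committed, 0, len(log), "earliest")
while position < len(log):
    processed.append(log[position])  # process the record first...
    position += 1
    committed = position             # ...then commit the next offset to read
print(processed, committed)
```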

Datastore
Partitions
Replicas
Segments
Compaction

Default topic configuration
Replication factor = 3
min.insync.replicas = 2
In producers: acks = all
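The interplay of these three settings can be written as a single check. This is a sketch of the rule only, with a hypothetical function name; it does not model leader election or how replicas fall in and out of sync.

```python
def can_accept_write(in_sync_replicas: int, min_insync_replicas: int, acks: str) -> bool:
    """Would a produce request be accepted under the given settings?"""
    if in_sync_replicas < 1:
        return False  # no leader available for the partition
    if acks == "all":
        # min.insync.replicas is only enforced for acks=all
        return in_sync_replicas >= min_insync_replicas
    return True       # acks=0 or acks=1 only involve the leader

# Replication factor 3, min.insync.replicas 2:
print(can_accept_write(3, 2, "all"))  # all replicas in sync
print(can_accept_write(2, 2, "all"))  # one broker down, writes continue
print(can_accept_write(1, 2, "all"))  # two brokers down, writes are rejected
```

With these values the topic tolerates the loss of one broker without losing availability for writes or acknowledged data.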

Segments
Physical files with raw data
Kafka keeps open handles to all segments, including inactive ones
Writes go to the active segment
Retention and compaction are applied only to inactive segments

Retention
Kafka does not wait until all consumers read data
log.retention.ms -- retention by time
log.retention.bytes -- retention by size (per partition)
log.segment.bytes -- size at which the active segment is closed
log.roll.ms -- time after which the active segment is closed
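Segment rolling and size-based retention can be modelled with lists of byte strings. This is a toy model under simplified rules (the class name and sizes are hypothetical, and the broker's real accounting differs in detail), but it shows why the active segment is never deleted.

```python
class SegmentedLog:
    """Toy partition log: a list of segments, the last one being active."""

    def __init__(self, segment_bytes: int):
        self.segment_bytes = segment_bytes
        self.segments = [[]]

    def append(self, record: bytes) -> None:
        active = self.segments[-1]
        # Roll a new active segment once the size limit would be exceeded.
        if active and sum(map(len, active)) + len(record) > self.segment_bytes:
            self.segments.append([])
        self.segments[-1].append(record)

    def total_bytes(self) -> int:
        return sum(len(r) for segment in self.segments for r in segment)

    def apply_retention(self, retention_bytes: int) -> None:
        # Drop the oldest inactive segments while the log is over the limit.
        while len(self.segments) > 1 and self.total_bytes() > retention_bytes:
            self.segments.pop(0)

log = SegmentedLog(segment_bytes=10)
for i in range(5):
    log.append(b"rec%d" % i)      # five 4-byte records
print(len(log.segments))           # 3 segments: two closed, one active
log.apply_retention(retention_bytes=10)
print(log.segments)                # only the active segment remains
```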

Compaction
min.compaction.lag.ms -- minimum time a message stays uncompacted
To delete an event, send a new message with the same key and a null
value (tombstone)
delete.retention.ms -- how long a tombstone is kept before it can be
deleted (the default is 24 hours)
Compaction process is configurable (# of threads, resource
consumption, frequency etc.)
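The effect of compaction on a log can be sketched in a few lines. This is a simplified model with hypothetical sample records: it keeps the latest value per key and drops tombstoned keys immediately, whereas Kafka retains tombstones for delete.retention.ms and leaves the active segment alone.

```python
def compact(records):
    """Keep only the latest value per key; a None value deletes the key."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    # Surviving records keep their original relative order.
    return [(key, value) for _, key, value in survivors]

log = [("a", "1"), ("b", "2"), ("a", "3"), ("b", None)]
print(compact(log))  # [('a', '3')]
```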

Brokers
The cluster uses ZooKeeper to handle membership
One of the brokers is the controller (leader); it is responsible for
partition leader election
There are plans to get rid of ZooKeeper

Kafka guarantees
Durability and high availability
Message ordering in partition
At least once / exactly once
Transactions

Kafka Streams
High-level DSL for working with Kafka topics as streams
Currently JVM only (Java / Scala)
DSL is rather simple (kind of map / join / reduce)
Supports joins, filters, aggregations
Streams and tables
Handles all low level stuff

Kafka Connect
Is a framework for connecting Kafka with external systems such as
databases, key-value stores, search indexes, and file systems
Built on the Kafka producer and consumer APIs
Deployed as a cluster via operators / Helm charts
Configurable via REST endpoint

Add connector to MySQL
echo '{"name":"mysql-login-connector",
"config":{"connector.class":"JdbcSourceConnector",
"connection.url":"jdbc:mysql://127.0.0.1:3306/test?user=root",
"mode":"timestamp","table.whitelist":"login","validate.non.null":false,
"timestamp.column.name":"login_time","topic.prefix":"mysql."}}' |
curl -X POST -d @- http://localhost:8083/connectors \
--header "Content-Type: application/json"
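The same REST call can be prepared from a script. A minimal sketch using only the Python standard library and the connector settings from the slide; build_request is a hypothetical helper, and the request is built but deliberately not sent so the example runs without a Connect cluster.

```python
import json
import urllib.request

connector = {
    "name": "mysql-login-connector",
    "config": {
        "connector.class": "JdbcSourceConnector",
        "connection.url": "jdbc:mysql://127.0.0.1:3306/test?user=root",
        "mode": "timestamp",
        "table.whitelist": "login",
        "validate.non.null": False,
        "timestamp.column.name": "login_time",
        "topic.prefix": "mysql.",
    },
}

def build_request(base_url: str, connector: dict) -> urllib.request.Request:
    """Prepare the POST that registers a connector with the Connect REST API."""
    return urllib.request.Request(
        f"{base_url}/connectors",
        data=json.dumps(connector).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_request("http://localhost:8083", connector)
# urllib.request.urlopen(request) would submit it to a running Connect worker.
print(request.get_method(), request.full_url)
```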

ksqlDB
is an event streaming database
SQL on top of Kafka streams + materialized views

ksqlDB Components
Streams: immutable sequences of events
Tables: mutable collections of events (latest value per key)
Stream processing: transform, filter, aggregate and join
Push queries let you subscribe to a query's result as it changes in
real-time.
Pull queries allow you to fetch the current state of a materialized
view.

Creating tables
CREATE TABLE currentCarLocations (
    vehicleId VARCHAR,
    latitude DOUBLE(10, 2),
    longitude DOUBLE(10, 2)
) WITH (
    kafka_topic = 'locations',
    partitions = 3,
    key = 'vehicleId',
    value_format = 'json'
);

Queries
SELECT vehicleId,
       latitude,
       longitude
FROM currentCarLocations
WHERE ROWKEY = '6fd0fcdb'
EMIT CHANGES;

Advantages
Non-developers can write their own queries
Read from and write to many data sources
Much less code -- fewer bugs
Data exploration

Our Roadmap
Consumer / producer API
Kafka Streams / Connect ← we are here
ksqlDB

Thanks!
serhii.kalinets@pm.bet
@skalinets
