Apache Kafka is in the spotlight right now. More and more companies are starting to use it as a message bus. But Kafka can do much more than serve as a transport. Its real power and beauty show when Kafka becomes the central nervous system of your architecture. It is fast, reliable, and flexible enough for a wide range of use cases.
In this talk Serhii will share his experience of building a data streaming platform. We will talk about how Kafka works, how it should be configured, and what trouble you can get into when Kafka is used suboptimally.
2. History of Kafka
Created at LinkedIn
Its creators later founded Confluent
Why the name Kafka? Jay Kreps (Confluent CEO): "I thought that since
Kafka was a system optimized for writing, using a writer’s name
would make sense. I had taken a lot of lit classes in college and liked
Franz Kafka. Plus the name sounded cool for an open source project."
4. What is Kafka
A publish/subscribe messaging system that has an
interface typical of messaging systems
but a storage layer more like a log-aggregation system
6. Messages
Key / value pair; both can be null
Kafka treats both just as bytes
Serialization / deserialization happens on clients
Confluent's broker can validate messages against a schema
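Because the broker only sees bytes, turning objects into bytes and back is the client's job. A minimal sketch of that idea in plain Python, assuming JSON as the encoding; `serialize` and `deserialize` are hypothetical helper names, not part of any Kafka client:

```python
import json
from typing import Any, Optional


def serialize(obj: Optional[Any]) -> Optional[bytes]:
    """Encode a key or value as UTF-8 JSON bytes; None stays None (a null key or value)."""
    if obj is None:
        return None
    return json.dumps(obj, sort_keys=True).encode("utf-8")


def deserialize(data: Optional[bytes]) -> Optional[Any]:
    """Decode bytes produced by serialize(); None stays None."""
    if data is None:
        return None
    return json.loads(data.decode("utf-8"))


# What a broker would store in this model: two opaque byte strings (or nulls).
key_bytes = serialize("user-42")
value_bytes = serialize({"event": "login", "ok": True})
```

Producer and consumer have to agree on the encoding, which is the problem schema validation is meant to address.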
8. How many partitions?
What throughput do you expect to achieve for the topic?
What is the maximum throughput you expect to achieve when
consuming from a single partition?
Producer throughput per partition can usually be ignored, since producers are typically much faster than consumers
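The two questions above combine into a simple rule of thumb: divide the target topic throughput by what one consumer can read from one partition. A rough sketch; the function name and the numbers are illustrative assumptions, not measurements:

```python
import math


def min_partitions(target_mb_per_sec: float, consumer_mb_per_sec: float) -> int:
    """Smallest partition count at which consumers, one per partition, keep up."""
    if target_mb_per_sec <= 0 or consumer_mb_per_sec <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(target_mb_per_sec / consumer_mb_per_sec)


# Illustrative numbers only: a 1000 MB/s topic, 50 MB/s per consumer.
partitions = min_partitions(1000, 50)
```

With these numbers the estimate is 20 partitions; treat it as a lower bound to start from rather than a final answer.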
9. How many partitions?
Adding partitions later can be very challenging
Consider the number of partitions you will place on each broker and
available disk space and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other
resources on the broker and will increase the time for leader elections.
11. Producers
Can specify the partition explicitly or implicitly (via partitioners)
The decision is made on the producer side
Different SDKs might have different default partitioners
Adding new partitions can change partition assignments
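Why adding partitions changes assignments is easy to see with a hash-based partitioner. A toy sketch: `pick_partition` is a hypothetical name, and `zlib.crc32` is only a stand-in hash (the Java client's default partitioner uses murmur2 for keyed messages):

```python
import zlib


def pick_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition by hashing.

    zlib.crc32 is a stand-in; the Java client's default partitioner uses murmur2.
    """
    return zlib.crc32(key) % num_partitions


keys = [f"user-{i}".encode() for i in range(100)]
before = {k: pick_partition(k, 6) for k in keys}
after = {k: pick_partition(k, 8) for k in keys}

# Keys whose partition changed after growing the topic from 6 to 8 partitions.
moved = [k for k in keys if before[k] != after[k]]
```

Any key in `moved` would have its new messages land in a different partition than its old ones, so per-key ordering across the change is lost.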
12. Producers guarantees
Kafka guarantees ordering within a partition for producers
Ordering can be broken by retries if max.in.flight.requests.per.connection > 1
Idempotent producers (retries will not cause duplicates)
Transactions (messages sent within transactions will be available for
consumers only after transaction completes)
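The reordering risk with retries can be shown with a toy model. This is not client code: `send_with_retries` is a hypothetical function that sends batches in windows of `max_in_flight` and retries one failed batch after the rest of its window has been written.

```python
from typing import List


def send_with_retries(batches: List[str], fail_first_attempt: str, max_in_flight: int) -> List[str]:
    """Return the order in which batches land in the partition log.

    Toy model: up to max_in_flight batches are sent at once; the batch named
    fail_first_attempt fails on its first attempt and is retried afterwards.
    """
    log: List[str] = []
    pending = list(batches)
    failed_once = False
    while pending:
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries = []
        for batch in window:
            if batch == fail_first_attempt and not failed_once:
                failed_once = True
                retries.append(batch)  # will be re-sent after this window
            else:
                log.append(batch)
        pending = retries + pending
    return log


reordered = send_with_retries(["b1", "b2", "b3"], "b1", max_in_flight=2)
ordered = send_with_retries(["b1", "b2", "b3"], "b1", max_in_flight=1)
```

In this model two in-flight batches let `b2` overtake the retried `b1`, while a single in-flight batch keeps the original order.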
14. Consumer Groups
Common group.id
One broker acts as the group coordinator; one consumer is the group leader
Poll loop
Simple for developer: while (true) { consumer.poll(); processMessages(); }
Complicated implementation: coordination, rebalancing, heartbeats etc.
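Part of what hides behind "coordination" and "rebalancing" is deciding which group member reads which partition. A toy round-robin sketch; `assign` is a hypothetical function and does not reproduce any of Kafka's real assignors:

```python
from typing import Dict, List


def assign(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions over group members round-robin (toy assignor)."""
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    members = sorted(consumers)
    assignment: Dict[str, List[int]] = {c: [] for c in members}
    for i, partition in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]
three_members = assign(partitions, ["c1", "c2", "c3"])
# c3 stops sending heartbeats; a rebalance spreads its partitions over the rest.
two_members = assign(partitions, ["c1", "c2"])
```

Each partition always has exactly one owner within the group, before and after the rebalance.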
15. Commits and offsets
Consumers commit their last offsets to Kafka
Automatic / manual commits
Sync / async commits
auto.offset.reset -- where to start reading when there is no committed offset (earliest or latest)
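How committed offsets and auto.offset.reset interact can be sketched with an in-memory stand-in for one partition. `ToyPartition` and `consume` are hypothetical names; the commit here happens after processing, which corresponds to a manual, at-least-once style of committing.

```python
from typing import List, Optional


class ToyPartition:
    """In-memory stand-in for one partition plus the group's committed offset."""

    def __init__(self, messages: List[str]) -> None:
        self.messages = messages
        self.committed: Optional[int] = None  # next offset to read, once committed

    def start_offset(self, auto_offset_reset: str) -> int:
        """Where a consumer starts: the committed offset, else per auto.offset.reset."""
        if self.committed is not None:
            return self.committed
        return 0 if auto_offset_reset == "earliest" else len(self.messages)


def consume(partition: ToyPartition, auto_offset_reset: str, max_records: int) -> List[str]:
    """Poll up to max_records, process them, then commit (at-least-once order)."""
    start = partition.start_offset(auto_offset_reset)
    batch = partition.messages[start:start + max_records]
    processed = [m.upper() for m in batch]       # "processing"
    partition.committed = start + len(batch)     # manual commit after processing
    return processed


p = ToyPartition(["a", "b", "c", "d"])
first = consume(p, "earliest", max_records=3)    # no commit yet: starts at offset 0
second = consume(p, "earliest", max_records=3)   # resumes from the committed offset

fresh = ToyPartition(["a", "b"])
skipped = consume(fresh, "latest", max_records=3)  # no commit: starts at the end
```

If processing crashed before the commit line, the next run would re-read the same batch, which is why this ordering gives at-least-once delivery.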
20. Segments
Physical files with raw data
Kafka keeps open file handles to all segments, including inactive ones
Writes go to the active segment
Retention and compaction are applied only to inactive segments
21. Retention
Kafka does not wait until all consumers read data
log.retention.ms -- retention by time
log.retention.bytes -- retention by size (per partition)
log.segment.bytes -- size at which the active segment is closed
log.roll.ms -- age at which the active segment is closed (segment.ms at topic level)
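Segment rolling and size-based retention can be modelled in a few lines. This is a simplified sketch: `ToyLog` is a hypothetical class, sizes count only message bytes, and the broker's real deletion rule is more involved. It does show that only inactive segments are removed.

```python
from typing import List


class ToyLog:
    """One partition as a list of segments; the last segment is the active one."""

    def __init__(self, segment_bytes: int, retention_bytes: int) -> None:
        self.segment_bytes = segment_bytes
        self.retention_bytes = retention_bytes
        self.segments: List[List[bytes]] = [[]]

    @staticmethod
    def _size(segment: List[bytes]) -> int:
        return sum(len(m) for m in segment)

    def total_size(self) -> int:
        return sum(self._size(s) for s in self.segments)

    def append(self, message: bytes) -> None:
        """Write to the active segment; roll a new one when it would overflow."""
        active = self.segments[-1]
        if active and self._size(active) + len(message) > self.segment_bytes:
            self.segments.append([])
        self.segments[-1].append(message)

    def apply_retention(self) -> None:
        """Drop oldest inactive segments while the log exceeds retention_bytes."""
        while len(self.segments) > 1 and self.total_size() > self.retention_bytes:
            self.segments.pop(0)


log = ToyLog(segment_bytes=10, retention_bytes=16)
for i in range(6):
    log.append(b"msg%d" % i)          # six 4-byte messages
segments_before = len(log.segments)
log.apply_retention()
```

Here the six messages fill three segments, and retention removes the oldest one, whole, rather than trimming individual messages.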
23. Compaction
min.compaction.lag.ms -- minimum age of a message before it can be compacted
To delete a key, send a new message with that key and a null value
(tombstone)
delete.retention.ms -- how long a tombstone is kept before it can be deleted (the default is 24
hours)
Compaction process is configurable (# of threads, resource
consumption, frequency etc.)
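The effect of compaction and tombstones on a log can be sketched as follows. `compact` is a hypothetical function operating on an in-memory list; it ignores segments, lag settings, and timing, and shows only which records survive.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); value None is a tombstone


def compact(records: List[Record], drop_tombstones: bool = False) -> List[Record]:
    """Keep only the latest record per key, preserving log order.

    With drop_tombstones=True, tombstones are removed too, which is what
    eventually happens once delete.retention.ms has passed.
    """
    last_index: Dict[str, int] = {key: i for i, (key, _) in enumerate(records)}
    kept = [rec for i, rec in enumerate(records) if last_index[rec[0]] == i]
    if drop_tombstones:
        kept = [rec for rec in kept if rec[1] is not None]
    return kept


log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k2", None), ("k3", "v1")]
compacted = compact(log)
final = compact(log, drop_tombstones=True)
```

The tombstone for `k2` survives the first pass so that consumers still reading the log can observe the deletion before it disappears.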
24. Brokers
The cluster uses ZooKeeper to handle membership
One of the brokers is the controller (leader); it is responsible for partition
leader election
There are plans to get rid of ZooKeeper
26. Kafka Streams
High-level DSL for working with Kafka topics as streams
Currently JVM only (Java / Scala)
DSL is rather simple (kind of map / join / reduce)
Supports joins, filters, aggregations
Streams and tables
Handles all low level stuff
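The "streams and tables" idea is that aggregating a stream yields a table. Since the Kafka Streams DSL is JVM only, the following is just a plain-Python analogy of a flatMap / filter / groupBy / count pipeline; `word_count` is a hypothetical name and no Kafka API is involved:

```python
from collections import Counter
from typing import Dict, Iterable, List


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Fold a stream of text lines into a table of word counts."""
    table: Counter = Counter()
    for line in lines:                              # each stream record
        words = line.lower().split()                # flatMap: line -> words
        words = [w for w in words if len(w) > 2]    # filter: drop short words
        table.update(words)                         # groupBy word + count
    return dict(table)


stream: List[str] = ["Kafka is fast", "Kafka is a log", "the log is durable"]
counts = word_count(stream)
```

Every new record in the stream updates one row of the table, which is the relationship the DSL exposes between its stream and table abstractions.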
28. Kafka Connect
Is a framework for connecting Kafka with external systems such as
databases, key-value stores, search indexes, and file systems
Built on top of Kafka's producer and consumer clients
Deploys as a cluster via operators / Helm charts
Configurable via a REST API
29. Add connector to mysql
echo '{"name":"mysql-login-connector",
"config":{"connector.class":"JdbcSourceConnector",
"connection.url":"jdbc:mysql://127.0.0.1:3306/test?user=root",
"mode":"timestamp","table.whitelist":"login","validate.non.null":false,
"timestamp.column.name":"login_time","topic.prefix":"mysql."}}' |
curl -X POST -d @- http://localhost:8083/connectors \
--header "Content-Type: application/json"
31. ksqlDB
Is an event streaming database
SQL on top of Kafka Streams + materialized views
32. ksqlDB Components
Streams: immutable sequences of events
Tables: mutable collections of events
Stream processing: transform, filter, aggregate and join
Push queries let you subscribe to a query's result as it changes in
real-time.
Pull queries allow you to fetch the current state of a materialized
view.
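The difference between pull and push queries can be illustrated with a toy materialized view. `MaterializedView` and its methods are hypothetical names for this sketch; real ksqlDB queries are written in SQL and served by the ksqlDB server.

```python
from typing import Callable, Dict, List, Optional, Tuple


class MaterializedView:
    """Toy table of the latest value per key, with pull and push access."""

    def __init__(self) -> None:
        self._state: Dict[str, int] = {}
        self._subscribers: List[Callable[[str, int], None]] = []

    def pull(self, key: str) -> Optional[int]:
        """Pull query: return the current value for a key and finish."""
        return self._state.get(key)

    def push(self, callback: Callable[[str, int], None]) -> None:
        """Push query: register a callback invoked on every later change."""
        self._subscribers.append(callback)

    def apply(self, key: str, value: int) -> None:
        """Apply one event from the underlying stream to the view."""
        self._state[key] = value
        for notify in self._subscribers:
            notify(key, value)


view = MaterializedView()
seen: List[Tuple[str, int]] = []
view.apply("alice", 1)                        # happens before anyone subscribed
view.push(lambda k, v: seen.append((k, v)))   # subscribe to later changes
view.apply("alice", 2)
view.apply("bob", 7)
current = view.pull("alice")
```

The pull returns a single point-in-time answer, while the push subscriber keeps receiving every change made after it subscribed.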