1. The document summarizes 101* mistakes FINN.no has made with Kafka: the configuration and operational mistakes made when initially adopting and using Kafka, and the consequences of those mistakes.
2. Common mistakes included not considering backwards compatibility of Kafka versions, treating Kafka like a database, not properly defining schemas, and not understanding client-side rebalancing.
3. FINN.no's solutions included running multiple Kafka clusters during a migration, using fewer Kafka partitions as a default, and adopting configuration practices tailored to its use cases.
2. agenda
introduction to kafka
kafka @ finn.no
101* mistakes
questions
“From a certain point onward
there is no longer any turning
back. That is the point that
must be reached.”
― Franz Kafka, The Trial
3. FINN.no
2nd largest website in Norway
60 million pageviews per day
80 microservices
130 developers
900 deploys per week
6 minutes from commit to deploy
(median)
6. why use kafka
#notAnESB
what is a log
terminology
components
giant leap
“A First Sign of the Beginning of
Understanding is the Wish to Die.”
― Franz Kafka
https://commons.wikimedia.org/wiki/File:Kafka.jpg
7. Why use Kafka?
“Apache Kafka is publish-subscribe messaging
rethought as a distributed commit log.”
● Fast
● Scalable
● Durable
● Distributed by design
Sweet spot: High volume, low latency
Quora:
“Use Kafka if you have a fire hose of events (100k+/sec)
you need delivered in partitioned order 'at least once' with
a mix of online and batch consumers, you want to be able
to re-read messages”
“Use Rabbit if you have messages (20k+/sec) that need to
be routed in complex ways to consumers, you want
per-message delivery guarantees, you don't care about
ordered delivery”
8. #NotAnESB
“Based on conversations with the project
sponsors I began to suspect that at least the
introduction of the ESB was a case of RDD, i.e.
Resume-Driven Development, development in
which key choices are made with only one
question in mind: how good does it look on my
CV?
Talking to the developers I learned that the ESB
had introduced “nothing but pain.”
Was this really another case of architect’s dream,
developer’s nightmare?”
1. Are you integrating 3 or more applications/services? If
you only need to communicate between 2 applications,
using point-to-point integration is going to be easier.
2. Do you need to use more than one type of
communication protocol? If you are just using
HTTP/Web Services or just JMS, you’re not going to get
any of the benefits of cross-protocol messaging and
transformation that Mule provides.
3. Do you need message routing capabilities such as
forking and aggregating message flows, or
content-based routing? Many applications do not need
these capabilities.
Mule ESB
9. What is a log?
A log is perhaps the simplest possible storage abstraction.
It is an append-only, totally-ordered sequence of records ordered by time.
Records are appended to the end of the log; reads proceed left to right.
Each entry is assigned a unique, sequential log entry number.
The ordering of records defines a notion of "time", since entries to the left are
defined to be older than entries to the right.
This is a data log, not an application log (i.e. not log4j).
The two problems a log solves - ordering changes and distributing data - are even
more important in distributed data systems.
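A toy sketch of that abstraction in Java (invented for illustration; this is not how Kafka implements its log): appends go to the end and return a sequential entry number, and reads walk forward from a given number.

```java
import java.util.ArrayList;
import java.util.List;

// A toy append-only log: just the abstraction, not Kafka's implementation.
public class ToyLog<T> {
    private final List<T> entries = new ArrayList<>();

    // Appends a record and returns its unique, sequential entry number.
    public synchronized long append(T record) {
        entries.add(record);
        return entries.size() - 1;
    }

    // Reads proceed "left to right": everything from a given entry number on.
    public synchronized List<T> readFrom(long offset) {
        return new ArrayList<>(entries.subList((int) offset, entries.size()));
    }
}
```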
10. Changelog 101: Tables and Events are Dual
Duality: a log of changes and a table.
Accounting
log: credit and debit (events per key)
table: all current balances (i.e. state per key)
In a sense the log is the more fundamental data structure: in addition to creating the
original table you can also transform it to create all kinds of derived tables.
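A sketch of the duality in Java (account names and amounts are invented for the example): replaying the log of credits and debits rebuilds the table of current balances.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LogTableDuality {
    record Entry(String account, long amount) {}  // +credit, -debit

    public static void main(String[] args) {
        // the log: credits and debits per key, in order
        List<Entry> log = List.of(
            new Entry("alice", 100), new Entry("bob", 50), new Entry("alice", -30));

        // the table: current balance per key, derived by replaying the log
        Map<String, Long> table = new HashMap<>();
        for (Entry e : log) {
            table.merge(e.account(), e.amount(), Long::sum);
        }
        System.out.println(table);  // {bob=50, alice=70}
    }
}
```

Replaying the same log from the start always rebuilds the same table, which is what makes the log the more fundamental structure.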
11. producers write to brokers
consumers read from brokers
everything is distributed
data is stored in topics
topics are split into partitions,
which are replicated
[diagram: multiple producers writing to a kafka cluster; multiple consumers reading from it]
terminology
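To make the terminology concrete, a minimal sketch with the Java kafka-clients API (the modern client is shown for brevity; topic, group, and broker names are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TerminologyExample {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "broker1:9092");  // brokers in the cluster
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // a producer writes records to a topic; the key determines the partition
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("some-topic", "some-key", "some-value"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "broker1:9092");
        consumerProps.put("group.id", "some-group");  // consumers in a group share partitions
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // a consumer reads records from the topic's partitions
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("some-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.println(r.partition() + ":" + r.offset() + " " + r.value());
            }
        }
    }
}
```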
14. Giant leap?
In fact, persistent replicated messaging is such a giant leap in messaging architecture that it may be worthwhile to point out a few side effects:
a. Per-message acknowledgments have disappeared
b. Ordered delivery (within a partition)
c. The problem of mismatched consumer speed has disappeared. A slow consumer can peacefully co-exist with a fast consumer now
d. Need for difficult messaging semantics like delayed delivery, re-delivery etc. has disappeared. Now it is all up to the consumer to read whatever message whenever - the onus has shifted from broker to consumer
e. The holy grail of message delivery guarantee: at-least-once is the new reality - both Kafka and Azure Event Hub provide this guarantee. You still have to make your consumers and downstream systems idempotent so that recovering from a failure and processing the same message twice does not upset them too much, but hey - that has always been the case
http://blogs.msdn.com/b/opensourcemsft/archive/2015/08/08/choose-between-azure-event-hub-and-kafka-_2d00_-what-you-need-to-know.aspx
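Side effect (d) is visible directly in the newer (0.9+) Java consumer API: re-reading is just the consumer moving its own position, with no broker-side re-delivery involved. A minimal sketch (topic, partition, and offset are placeholders):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReReadExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("group.id", "re-reader");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("some-topic", 0);
            consumer.assign(List.of(tp));
            // no broker-side re-delivery needed: the consumer just moves its own position
            consumer.seek(tp, 100L);
            System.out.println(consumer.poll(Duration.ofSeconds(1)).count());
        }
    }
}
```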
18. timeline
2012 Decided to use RabbitMQ as message queue. Kafka was installed for a test
2013 Feb - Kafka PoC (“strømmen”, “the stream”)
Ad matching (“lagret søk”, “saved search”)
Ad indexed
2014
Product lifecycle - product paid, etc
2015 Feb -> May - 0.8.2. Dedicated cluster
27. why is it a mistake
0.7 -> 0.8: not backwards compatible
0.7 client does not work well with 0.8 cluster
0.8 -> 0.9: not backwards compatible
0.8 consumer does not work with 0.9 cluster
0.9 -> 1.0: ???
28. what is the consequence
kafka is a critical component for communication between applications
coordination of 10-15 teams with 30 services
migration process of 6-8 months
from decision to old cluster turned off
29. what is the correct solution
evaluate the maturity of
critical architecture
components before
everyone starts using them
30. what has finn.no done
0.7 -> 0.8
1) create additional 0.8 cluster
2) all clients consume from both clusters (0.7 and 0.8)
3) critical services (payment) migrate consumers and producers during nighttime,
with downtime
4) the rest of the services migrate their producers to 0.8 (the last mile takes a long time)
5) stop consuming from 0.7
6) turn off 0.7 cluster
31. what has finn.no done
0.7 -> 0.8
1) create additional 0.8 cluster
2) all clients consume from both clusters (0.7 and 0.8)
3) critical services migrate consumers and producers during nighttime with
downtime
4) the rest of the services migrate their producers to 0.8 (the last mile takes a long time)
5) stop consuming from 0.7
6) turn off 0.7 cluster
7) read blogpost stating that 0.9 is not backwards compatible with 0.8
33. why is it a mistake
everything that is published on Kafka is visible to any client that can access the cluster
34. what is the consequence
direct reads across services/domains are quite normal in legacy and/or enterprise
systems
this coupling makes it hard to make changes
Kafka has no security per topic - you must add that yourself
35. what is the correct solution
Data on the inside versus data on the outside
At least decide on a convention for what is private data and what is public data
(for example, a naming convention that marks topics as internal to one team versus shared)
37. why is it a mistake
schemas change differently from the code producing and consuming messages
data needs versioning
defining the schema in a java library makes it more difficult to access data from non-jvm
languages
a java code repository is not the easiest way to figure out the data format of a topic
38. what is the consequence
development speed outside the jvm has been slow
changing data requires a coordinated deployment
difficult to create tooling that needs to know the data format, like a data lake
39. what is the correct solution
confluent.io platform has a separate schema registry
rest interface
apache avro (see the sketch below)
multiple compatibility settings and evolution strategies
connect
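For illustration, a minimal sketch with plain Apache Avro (no schema registry involved; the AdEvent schema and its fields are invented for this example), showing the schema living as data instead of compiled Java code:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    // The schema is data (JSON), not compiled Java code, so non-JVM
    // consumers can read it too. Schema and fields are hypothetical.
    static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"AdEvent\",\"fields\":["
      + "{\"name\":\"adId\",\"type\":\"long\"},"
      + "{\"name\":\"state\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord event = new GenericData.Record(schema);
        event.put("adId", 42L);
        event.put("state", "PAID");

        // Serialize using the schema; a schema registry would let consumers
        // resolve the writer schema by id instead of shipping it around.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
        encoder.flush();
        System.out.println("serialized " + out.size() + " bytes");
    }
}
```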
40. what has finn.no done
still using the java library
confluent platform 2.0 is planned as the next step, rather than plain kafka 0.9
42. why is it a mistake
We used our normal ops scripts for kafka - if a config changes, restart automatically
If shutdown does not complete within 5 seconds, kill -9
A database needs to finish what it is doing before shutting down
A distributed database even more so
43. what is the consequence
At least a need for recovery at startup
Data loss
No convergence - you won’t come back up
44. what is the correct solution
Do not play with stored data - understand how and when to apply changes
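One concrete example of "how and when to apply changes": Kafka brokers support controlled shutdown, which moves partition leadership off a broker before the process exits, so a stopped broker does not need log recovery when it comes back. Giving the broker time for this, instead of kill -9 after 5 seconds, is exactly the missing piece above. The relevant broker settings (real config keys; the values shown are the documented defaults):
controlled.shutdown.enable=true
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000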
47. why is it a mistake
kafka has a clear algorithm for handling an increase or decrease in clients, to be able
to keep everything balanced.
all consumers are reconnected
this algorithm creates a lot of noise in the logs when you deploy all the time
our common java library had 4 consumer threads as default per application
48. what is the consequence
developers did not understand what happened
during a deploy
“kafka is unstable”
most service instances did not receive messages
each deploy of a service (typically 4 instances)
triggered 4 rebalances.
if a rebalance takes too long, the (at least our)
consumer dies.
“kafka is down”
49. what is the correct solution
1. For each topic T that C_i subscribes to
2. let P_T be all partitions producing topic T
3. let C_G be all consumers in the same group as C_i that consume topic T
4. sort P_T (so partitions on the same broker are clustered together)
5. sort C_G
6. let i be the index position of C_i in C_G, and let N = size(P_T) / size(C_G)
7. assign partitions from i*N to (i+1)*N - 1 to consumer C_i
8. remove current entries owned by C_i from the partition owner registry
9. add newly assigned partitions to the partition owner registry
(we may need to re-try this until the original partition owner releases its ownership)
all consumers in a group rebalance when a consumer arrives in or departs from the
group
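A minimal Java sketch of steps 4-7 above (simplified: it ignores the owner registry and assumes the partition count divides evenly; the real assignor gives the first consumers one extra partition each when it does not):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RangeAssignment {
    // Returns the partitions of one topic assigned to consumer ci,
    // following the sorted range algorithm above (simplified).
    static List<Integer> assign(List<Integer> partitions, List<String> consumers, String ci) {
        List<Integer> sortedPartitions = new ArrayList<>(partitions);
        Collections.sort(sortedPartitions);          // step 4
        List<String> sortedConsumers = new ArrayList<>(consumers);
        Collections.sort(sortedConsumers);           // step 5

        int i = sortedConsumers.indexOf(ci);         // step 6
        int n = sortedPartitions.size() / sortedConsumers.size();

        // step 7: consumer i gets partitions [i*n, (i+1)*n - 1]
        return sortedPartitions.subList(i * n, (i + 1) * n);
    }

    public static void main(String[] args) {
        List<Integer> partitions = List.of(0, 1, 2, 3, 4, 5);
        List<String> consumers = List.of("c1", "c2", "c3");
        // c2 is at index 1, n = 2, so it gets partitions 2 and 3
        System.out.println(assign(partitions, consumers, "c2"));
    }
}
```

Since every consumer runs this independently, any consumer arriving or departing changes C_G and forces the whole group to recompute, which is why each deploy triggered rebalance after rebalance.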
50. what has finn.no done
some consumers use 1 thread per instance
planning to rewrite the consumer library
read the kafka documentation
52. why is it a mistake
Historically - One Big Database with an Expensive License => One, Single Server
Database world - OLTP and OLAP
This changes with open source software and the cloud
We tried to simplify the developer's day with a single config
Kafka supports very high throughput and can be highly reliable
53. what is the consequence
Trade-off between throughput and degree of reliability
With a single shared configuration - the last commit wins (remember the 128 partitions?)
Either high throughput with a risk of loss - or potentially too slow
54. what is the correct solution
understand your use cases and their needs!
55. what has finn.no done
Defaults that are quite reliable
Exposing configuration variables in the client
Ask the questions:
● at least once delivery
● ordering - if you partition, what must have strict ordering?
● is 99% delivery enough?
● which level of throughput is needed?
56. Configuration
Configuration for production
● Partitions
● Replicas (default.replication.factor)
● Minimum ISR (min.insync.replicas)
● Wait for acknowledgement when producing messages (request.required.acks, block.on.buffer.full)
● Retries
● Leader election
Configuration for consumer
● Number of threads
● When to commit (autocommit.enable vs consumer.commitOffsets)
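As an illustration of the producer-side knobs above, a hedged sketch with the 0.8.2+ Java producer (request.required.acks was the old Scala producer's key; acks is the newer Java producer's equivalent; hosts and values here are examples, not FINN.no's settings):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReliableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");  // placeholder hosts
        props.put("acks", "all");   // wait for all in-sync replicas to acknowledge
        props.put("retries", "3");  // retry transient send failures
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", "key", "value"));
        }
    }
}
```

default.replication.factor and min.insync.replicas are broker/topic-level settings, so they live in the broker configuration rather than in this client; acks=all only gives the durability you expect when min.insync.replicas is at least 2.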
59. why is it a mistake
topics are created automatically every time someone tries to consume from or produce to a
topic name
60. what is the consequence
topic names from production:
Event.USER.blabla, testing42, testing2,
Event.GO_CLICK.asdf4133, Event.GO_CLICK.asdf7392, Event.GO_CLICK.asdf7532,
we are not able to control the number of topics
too many topics give too many partitions. ZooKeeper gets slow when handling
this.
no place to put topic config
61. what is the correct solution
small number of partitions as default
increase number of partitions for selected topics (see the broker settings below)
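On the broker side, the fix maps onto two real config keys (values here are illustrative): auto.create.topics.enable controls whether topics are created on first use, and num.partitions is the default partition count for auto-created topics.
auto.create.topics.enable=false
num.partitions=5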
62. what has finn.no done
5 partitions as default
2 topics have more than 5 partitions:
topics with lots of traffic
63. mistake:
deploy a proof-of-concept
hack in production; i.e.
why we had 8 zk nodes
https://flic.kr/p/6eoSgT
64. why is it a mistake
Kafka was set up by Ops for a test - not for hardened production use
By coincidence we had 8 nodes for kafka, and the same 8 nodes ran zookeeper
Zookeeper depends on a majority quorum and low latency between nodes
65. what is the consequence
Zookeeper recommends 3 nodes for normal usage, 5 for high load; any more is
questionable
More nodes leads to longer time for finding consensus, and more communication
If we get a split between data centers, there will be 4 in each - neither side has a majority
You should not run ZK across data centers, due to latency and outage
possibilities
66. what is the correct solution
Have an odd number of Zookeeper nodes - preferably 3, at most 5
Don’t cross data centers
Check the documentation before deploying serious production load
Don’t run a sensitive service (Zookeeper) on a server with 50 other services, 300%
overcommitted on RAM
69. why is it a mistake
in certain conditions unclean.leader.election.enable=true can lose messages
replication.factor = 3
min.insync.replicas = 1
[diagram: replica1 holds offsets 100-101, replica2 holds 100-101, replica3 holds only 100]
70. why is it a mistake
in certain conditions unclean.leader.election.enable=true can lose messages
replication.factor = 3
min.insync.replicas = 1
replica3 dies
[diagram: replica1 (leader) and replica2 hold offsets 100-101; replica3, now dead, held only 100]
71. why is it a mistake
in certain conditions unclean.leader.election.enable=true can lose messages
replication.factor = 3
min.insync.replicas = 1
replica2 dies
[diagram: replica1 (leader) has moved on to offsets 100-104; replica2 (dead) held 100-101; replica3 (dead) held 100]
72. why is it a mistake
in certain conditions unclean.leader.election.enable=true can lose messages
replication.factor = 3
min.insync.replicas = 1
replica1 dies
[diagram: all replicas are now down; replica1 held offsets 100-104, replica2 held 100-101, replica3 held 100]
73. why is it a mistake
in certain conditions unclean.leader.election.enable=true can lose messages
replication.factor = 3
min.insync.replicas = 1
which replica comes
online first?
[diagram: replica1 held offsets 100-104, replica2 held 100-101, replica3 held only 100; with unclean leader election, whichever replica returns first becomes leader, and any messages past its log are gone]
74. what is the consequence
messages might be lost forever
without errors in the client
https://upload.wikimedia.org/wikipedia/commons/d/d4/George-W-Bush.jpeg
75. what is the correct solution
replication.factor = 3
min.insync.replicas = 2 (with producers asking for acks from all in-sync replicas)
unclean.leader.election.enable=false
(unless you are worried about what
happens when replica1 (the leader) is dead for
a long time)
76. what has finn.no done
replication.factor = 3
min.insync.replicas = 1 (2 for selected topics)
unclean.leader.election.enable=true
78. why is it a mistake
partitions are kafka’s way of scaling consumers; 128 partitions can handle 128
consumer processes
0.8 clusters could not reduce the number of partitions without deleting data
the highest number of consumers today is 20
79. what is the consequence
the 0.8 cluster was configured with 128 partitions as default, for all topics
(500 topics * 128 partitions = 64,000 partitions)
many partitions and many topics create many datapoints that must be coordinated
zookeeper must coordinate all this
a rebalance must balance all clients across all partitions
zookeeper and kafka went down (may 2015)
80. what is the correct solution
small number of partitions as default
increase number of partitions for selected topics (see the command below)
understand your use case
reduce length of transactions on consumer side
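Adding partitions to a selected topic can be done with the stock admin tool; a sketch using the 0.8-era --zookeeper syntax (host and topic names are placeholders):
bin/kafka-topics.sh --zookeeper zk1:2181 --alter --topic some-hot-topic --partitions 10
Note that adding partitions changes the key-to-partition mapping, so topics that rely on strict per-key ordering need care here.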
81. what has finn.no done
5 partitions as default
2 topics have more than 5 partitions:
topics with lots of traffic
86. “It's only because
of their stupidity
that they're able
to be so sure of
themselves.”
― Franz Kafka,
The Trial
Audun Fauchald Strand
@audunstrand
Henning Spjelkavik
@spjelkavik
http://www.finn.no/apply-here