This document summarizes a talk on using Kafka and Kafka Streams to build a functional architecture for exporting car listing data from AutoScout24 to other marketplaces. It covers the business requirements, including letting dealers enable and disable exports and keeping exported listings up to date, then gives an overview of Kafka and Kafka Streams and how they enable a scalable, fault-tolerant streaming data pipeline modeled as a composition of stateless functions. Key learnings: functions should map between domains, and consistency holds on a per-partition basis.
3. WHAT TO EXPECT?
● To meet ScoutWorks :)
● Tales about business requirements
● A brief introduction to some Kafka & Kafka Streams conventions
● See how we designed our architecture
● Talk about resilience in a functional architecture
4. AUTOSCOUT24
● Platform for selling cars & motorbikes
● 8 countries + 10 language versions
● 55,000+ dealers
● 2.4+ million listings
● 3+ billion page impressions per month
● 10+ million active users per month
5. OUR DOMAIN
● The core of our domain is listings
● Images are one of the main sources of information in a listing
● Dealers want to export those listings to other marketplaces
6. OUR PRODUCT
A system able to export dealers’ high-quality listings
to other marketplaces to improve their visibility on the market.
7. BUSINESS REQUIREMENTS
● A dealer is capable of enabling and disabling the export process
● All active listings of a dealer will be exported
● Exported listings that become inactive or are deleted should be hidden
on external marketplaces
8. MORE BUSINESS REQUIREMENTS
● It’s acceptable not to have the latest listing information exported in real time,
but it should eventually be updated
● It’s important to have all listings on external marketplaces ASAP to ensure
visibility
● The listings data format is dynamic, so it should be possible to reprocess a
listing and export it again
9. TECH REQUIREMENTS
● Load fluctuates during the day, so scaling up / down is mandatory
● Easy to add additional marketplaces
● Easy to monitor / trace any listing
12. WHAT IS KAFKA?
● Distributed streaming platform
● Records are published to topics, which are formed by partitions
● Each partition is an append-only, structured commit log
● Records consist of a partition key, a value and a timestamp, plus an assigned
offset, which is the record’s position in the log
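As an illustration, the log model above can be sketched as a minimal in-memory structure (a hypothetical model, not Kafka's actual implementation):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Record:
    key: str        # partition key
    value: str
    timestamp: int
    offset: int     # position of the record in the partition log

@dataclass
class Partition:
    """An append-only commit log: records are only ever appended, never mutated."""
    records: list = field(default_factory=list)

    def append(self, key: str, value: str, timestamp: int) -> int:
        # The offset is simply the next position in the log.
        record = Record(key, value, timestamp, offset=len(self.records))
        self.records.append(record)
        return record.offset

p = Partition()
p.append("listing-42", "created", 1000)   # offset 0
p.append("listing-42", "updated", 1005)   # offset 1
```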
13. KAFKA GUARANTEES
● Sharding of records based on partition key
● Replication of records depending on configuration
● Ordering of records within partition
● At-least-once delivery guarantee of records
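The sharding guarantee can be sketched as follows: records with the same key deterministically land on the same partition, which is exactly what makes per-key ordering possible (a simplified model; Kafka's default partitioner actually uses murmur2 hashing):

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    """Deterministic hash: the same key always maps to the same partition,
    so all records for one aggregate keep their relative order."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# All records for one listing share a partition, so their order is preserved.
assert partition_for("listing-42") == partition_for("listing-42")
```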
14. WHY KAFKA?
Kafka is often used for building real-time streaming applications
that transform or react to the streams of data.
15. WHY KAFKA?
● Listings change propagation fits the Kafka streaming mindset very well
● Possibility to go back in time and reprocess records if needed
● Enables developers to design by thinking in a composition of small functions
16. KAFKA STREAMS
● Opinionated library to process streams of records
● Provides possibility to build elastic, scalable and fault-tolerant solutions
● Uses Kafka to store current offsets / intermediate state of processed data
● Supports stateless processing, stateful processing or windowing
operations, e.g. aggregates of records
● For stateless operations, lets microservices be seen as state-ignorant
pure functions, letting Kafka Streams take care of side effects
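The "state-ignorant pure function" view can be sketched like this: a stateless processor is just a function from input record to output record, with the streaming engine handling offsets and delivery (the domain fields below are illustrative, not the real AutoScout24 schema):

```python
def export_listing(listing: dict) -> dict:
    """Pure transformation: internal listing -> marketplace export format.
    No I/O, no shared state; trivially testable and safe to re-run."""
    return {
        "external_id": f"as24-{listing['id']}",
        "title": listing["title"],
        "active": listing["status"] == "active",
    }

def process(stream, fn):
    # Conceptually what a stateless Kafka Streams mapValues step does:
    # apply the pure function to every record, one by one.
    return [fn(record) for record in stream]
```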
18. STREAMING VS MESSAGING
● Very similar approaches, but...
● Who has the fish?
● Go back in time and re-process records?
● Ordered records for a single aggregate root
21. FUNCTIONS ARE
● Atomic: functions run once and completely; they cannot be interrupted
● Composable: functions can be chained, generating more abstract
and business-related algebras
● State-ignorant: state is shared as a parameter, avoiding mutable
state between functions
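Composability can be sketched as plain function composition: small functions chain into a single business-level pipeline (the function names are illustrative):

```python
from functools import reduce

def compose(*fns):
    """Chain functions left to right into a single pipeline function."""
    return lambda x: reduce(lambda acc, f: f(acc), fns, x)

# Three small, atomic, state-ignorant steps...
normalize = lambda listing: {**listing, "title": listing["title"].strip()}
enrich    = lambda listing: {**listing, "images": listing.get("images", [])}
to_export = lambda listing: {"id": listing["id"], "title": listing["title"]}

# ...composed into one business-related function.
pipeline = compose(normalize, enrich, to_export)
pipeline({"id": 1, "title": "  BMW 320d "})   # {'id': 1, 'title': 'BMW 320d'}
```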
24. AGGREGATE ROOT
● Is the boundary of consistency
● Is a set of records in a single topic with the same partition key
● Represents a single business object (for example, a Listing)
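Under this model, the current state of an aggregate root is the fold of all records sharing its partition key, replayed in offset order (a simplified sketch with a hypothetical Listing event set):

```python
def current_state(records, key):
    """Rebuild a single business object (e.g. a Listing) by replaying,
    in order, every record in the partition that has the same key."""
    state = {}
    for record_key, event in records:
        if record_key != key:
            continue
        state.update(event)   # last-write-wins per field
    return state

events = [
    ("listing-42", {"title": "BMW 320d", "status": "active"}),
    ("listing-7",  {"title": "Audi A4", "status": "active"}),
    ("listing-42", {"status": "inactive"}),
]
current_state(events, "listing-42")  # {'title': 'BMW 320d', 'status': 'inactive'}
```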
36. KAFKA
For every topic with a replication factor of N,
Kafka tolerates up to N-1 node failures.
37. KAFKA STREAMS
● One-node setup: after coming back, it picks up where processing stopped
● Multi-node setup: other nodes take over, but…
○ Stateless processor: continues working as soon as nodes are re-balanced
○ Stateful processor, simple setup: can take a while until state is rebuilt
○ Stateful processor, hot stand-by setup: local state is continuously built up, but records
are not actually processed until failover happens
38. LEARNINGS
● Function signatures should be unique (only one function should be
responsible for a single transformation)
● Functions, by design, should not pertain to a single domain, but
map between two domains
● The consistency boundary is a partition (or a single aggregate root)
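The "map two domains" learning can be sketched with explicit types: each function's signature names a source domain and a target domain, and exactly one function owns that mapping (the types below are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Listing:          # internal AutoScout24 domain
    id: int
    title: str
    active: bool

@dataclass(frozen=True)
class MarketplaceAd:    # external marketplace domain
    ref: str
    headline: str
    visible: bool

def listing_to_ad(listing: Listing) -> MarketplaceAd:
    """The single function responsible for the Listing -> MarketplaceAd mapping;
    its signature is unique in the codebase."""
    return MarketplaceAd(
        ref=f"as24-{listing.id}",
        headline=listing.title,
        visible=listing.active,
    )
```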
39. LEARNINGS
● A system can be seen as a composition of functions, but data needs
to be managed by an external system.
● As with any function, we should test transformations, not side effects.
● Adding a correlation id on data sources is really useful for tracing, but
boundaries should be chosen carefully.
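Testing the transformation rather than the side effect can be sketched as follows: the test exercises the pure mapping directly, with no broker and no mocks (the function under test is hypothetical):

```python
def hide_inactive(listing: dict) -> dict:
    """Pure mapping: an inactive listing becomes invisible on the marketplace."""
    return {**listing, "visible": listing["status"] == "active"}

def test_inactive_listing_is_hidden():
    # We assert on the transformation's output, not on any I/O.
    result = hide_inactive({"id": 1, "status": "inactive"})
    assert result["visible"] is False

test_inactive_listing_is_hidden()
```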
40. LEARNINGS
● Kafka Streams should not be used for external I/O. For example, if
you need a service that makes HTTP requests, use another streaming
engine for that (we used Akka Streams).
● Kafka Streams’ learning curve is really steep.
● Kafka Streams and Kafka are, by default, not there yet for medium-sized
messages (~50 KB). You will need to tweak and optimize the
configuration.
41. LEARNINGS
● Backpressure is a natural fit, as functions are pull-based.
● Unidirectional data flow is a mindset that needs to be learned and
practiced.
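Pull-based backpressure can be sketched with generators: the consumer asks for the next record only when it is ready, so a slow consumer naturally slows the producer (an illustrative model, not Kafka's polling API):

```python
def records():
    """Lazy producer: yields a record only when the consumer pulls one."""
    for i in range(1_000_000):
        yield {"offset": i}

def consume(stream, batch_size=3):
    # The consumer controls the pace: it pulls exactly what it can handle.
    return [next(stream) for _ in range(batch_size)]

stream = records()
consume(stream)   # first 3 records; the remaining ones are never produced
```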
42. THANK YOU
For questions or suggestions:
Kevin Mas Ruiz (@skmruiz)
kmas@ThoughtWorks.com
Alexey Gravanov (@gravanov)
alexey.gravanov@scout24.com