The Possibilities and Pitfalls of Writing Your Own State Stores. Building an event-driven system will inevitably lead you to exposing your data through APIs, to make it accessible to non-streaming solutions. At first glance, Kafka Streams provides state stores that we could use to build our APIs directly on top of Kafka. But all store implementations are key/value based, which is fine when you only retrieve information by key. APIs, however, require a bit more “searchability”.
Writing your own state store is certainly possible, but it is challenging. At KOR, we went through this process and implemented a state store on top of Nitrite Database, an embedded document database. This allows you not only to retrieve your documents by key, but also to search through the values in the store using a MongoDB-like API. On the surface, state stores seem straightforward, but the devil is certainly in the details. How does partitioning fit into this story, and how do you make sure everything keeps running smoothly, even after restarting or scaling your applications?
We made the project open source for everyone out there wanting to try this approach, but most of all we want to tell you about the dragons we encountered. Join me on a journey of ups and downs that starts with a simple requirement (host an API), continues through implementing a custom state store, and finishes off by describing the challenges we encountered getting our APIs deployed. Don’t expect all “roses and sunshine”: while hosting APIs on Kafka is possible, there are some consequences that we just couldn’t overcome … yet.
2. Setting The Scene
KOR Financial
Regulatory Reporting
Event Driven Organization
Event Driven Foundation
Based on Kafka
Long retention (40 years!)
Rethink common practices
6. What is a state store
Part of Kafka Streams
Embedded “cache”
Internal to the application
Local vs Global Statestores
Fault Tolerant through “changelog topics”
Queryable from outside through Interactive Query Service (IQ)
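To make the last two bullets concrete, here is a minimal sketch of reading a built-in key/value store from outside the topology via Interactive Queries. It assumes a running KafkaStreams instance called streams and a store named "word-counts"; both are illustrative, not taken from the talk.

import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// obtain a read-only view on the local state through IQ
ReadOnlyKeyValueStore<String, Long> view = streams.store(
    StoreQueryParameters.fromNameAndType("word-counts", QueryableStoreTypes.keyValueStore()));

// point lookup by key ...
Long count = view.get("kafka");

// ... or a full scan; either way, access is strictly key/value based
try (KeyValueIterator<String, Long> all = view.all()) {
    all.forEachRemaining(kv -> System.out.println(kv.key + " = " + kv.value));
}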
10. Challenge 1: Key Value Only
Fast!
GET, SCAN
RocksDB
In-Memory, overflow to disk
Tweak the default memory settings!
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#indexes-and-filter-blocks
https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-and-filter-blocks
Be smart with keys!
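Acting on the memory advice usually means plugging in a RocksDBConfigSetter, along the lines of the two wiki pages above. The sketch below is a hedged example rather than the configuration used at KOR; the class name is made up, and further tuning (block cache size, write buffers) would go in the same place.

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class BoundedRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        // keep index and filter blocks inside the block cache instead of unbounded off-heap memory
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setTableFormatConfig(tableConfig);
        // block cache size and write buffer sizes can be bounded here as well
    }

    @Override
    public void close(String storeName, Options options) {
        // nothing to clean up in this sketch
    }
}

// registered via:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedRocksDBConfig.class);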
11. Challenge 1: Key Value Only
But what if we had:
- Search capabilities
- Filter-like API
12. Challenge 1: Key Value Only
But what if we had:
- Search capabilities
- Filter-like API
Custom Statestores!
13. Challenge 1: Key Value Only
But what if we had:
- Search capabilities
- Filter-like API
Embedded databases:
- H2, HSQLDB, Derby
- Lucene
- NitriteDB -> https://github.com/nitrite/nitrite-java
14. Challenge 1: Key Value Only
NitriteDB
Document Store
Natural Filtering API
Supports Indexes
Cursor cursor = collection.find(
    // and clause
    and(
        // firstName == John
        eq("firstName", "John"),
        // elements of the data array are less than 4
        elemMatch("data", lt("$", 4)),
        // elements of the fruits list have one element matching orange
        elemMatch("fruits", regex("$", "orange")),
        // note field contains the string 'quick'
        text("note", "quick")
    )
);
for (Document document : cursor) {
    // process the document
}
15. NO2 keys are Integers and values are Documents
→ Use NO2 Documents as envelopes
→ No relation between NO2 keys and Kafka keys
→ But NO2 can search on values (doc.key == …)
Challenge 1: Key Value Only
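A sketch of that envelope idea; the field names "key", "title" and "year" are our own illustration here, not something NO2 prescribes, and the store described in the talk wires this into the Kafka Streams lifecycle rather than opening the database by hand.

import static org.dizitart.no2.filters.Filters.eq;
import org.dizitart.no2.Document;
import org.dizitart.no2.Nitrite;
import org.dizitart.no2.NitriteCollection;

Nitrite db = Nitrite.builder().openOrCreate();
NitriteCollection movies = db.getCollection("movies");

// wrap the Kafka key and value in an NO2 Document "envelope";
// NO2 keeps its own internal id, the Kafka key becomes an ordinary field
Document envelope = Document.createDocument("key", "movie-001")
        .put("title", "The Matrix")
        .put("year", 1999);
movies.insert(envelope);

// the Kafka key is now just another searchable value
Document found = movies.find(eq("key", "movie-001")).firstOrDefault();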
16. What about fault-tolerance?
→ Changelog topic (compacted)
→ On application Start
Data on FS → Load from FS
Data not on FS → Restore from changelog
Challenge 1: Key Value Only
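Inside the custom store that decision lives in init(): the store opens its NO2 file under the task's state directory and registers a restore callback, and Kafka Streams replays the compacted changelog only when the local data is missing or behind. The sketch below is simplified; openNitriteFile, deleteByKafkaKey and upsertEnvelope are hypothetical placeholders for the real store internals.

import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.StateStore;

@Override
public void init(ProcessorContext context, StateStore root) {
    // data on FS → simply reopen the NO2 file in the task's state directory
    this.db = openNitriteFile(context.stateDir(), name());

    // data not on FS → Kafka Streams replays the compacted changelog topic
    // and hands every record to this restore callback
    context.register(root, (byte[] key, byte[] value) -> {
        if (value == null) {
            deleteByKafkaKey(key);       // tombstone: remove the envelope
        } else {
            upsertEnvelope(key, value);  // rebuild the NO2 document from the changelog record
        }
    });

    this.open = true;
}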
17. Challenge 1: Key Value Only
Integrate into Kafka Streams Topology

Topology topology = new Topology()
    // read commands from the movie-events topic
    .addSource("sourceProcessor",
        Serdes.String().deserializer(), eventSerde.deserializer(), "movie-events")
    // add a processor to manipulate the NO2 store based on incoming events
    .addProcessor("commandHandler", MovieEventHandler::new, "sourceProcessor")
    // add the NO2 statestore itself, using "code" as the key field
    .addStateStore(
        DocumentStores.nitriteStore("movies", Serdes.String(), movieSerde, Movie.class, "code"),
        "commandHandler")
    // write out processing results back to the original movie-events topic
    .addSink("sinkProcessor", "movie-events",
        Serdes.String().serializer(), eventSerde.serializer(), "commandHandler");
18. Challenge 1: Key Value Only
Integrate into Kafka Streams Processors (Processor API)

// get the statestore from the processor context
DocumentStore<String, Movie, ObjectFilter> store = context.getStateStore("movies");

// retrieve all movies which contain "Matrix" as part of the title
QueryCursor<Movie> movies = store.find(and(ObjectFilters.regex("title", ".*Matrix.*")));
21. Challenge 2: IQ Querying Limitations
Access data in statestores
Bring-your-own-API
IQv2
IQ Metadata API
Keeps track of which data is located at which instance
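In practice "bring your own API" means using that metadata to decide where a query should run. A hedged sketch (method names as in recent Kafka Streams 3.x releases; streams is the running KafkaStreams instance, thisInstance a HostInfo describing the local endpoint, and the HTTP forwarding is left abstract):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StreamsMetadata;
import org.apache.kafka.streams.state.HostInfo;

// lookup by key: the metadata API tells us which instance owns the partition for this key
KeyQueryMetadata meta = streams.queryMetadataForKey(
        "movies", "movie-001", Serdes.String().serializer());
HostInfo owner = meta.activeHost();
if (owner.equals(thisInstance)) {
    // answer from the local NO2 store
} else {
    // forward the HTTP request to owner.host() + ":" + owner.port()
}

// filter-style search: there is no single owner, so fan out to every instance hosting the store
for (StreamsMetadata instance : streams.streamsMetadataForStore("movies")) {
    // call instance.hostInfo() over HTTP and merge the partial results
}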
27. Challenge 2: IQ Querying Limitations
Skewed Data
Query time on B < query time on C
A cannot return until all results are in
Shard Failures
What if B is down (or rebalancing)?
(Diagram: instance A fans the query out to instances B and C, which each expose the API over their own share of the data)
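A toy illustration of why both points hurt: the aggregating instance can only respond once every shard has responded, so the slowest (most data) or an unavailable shard dictates the overall latency. Here instances, filter, Movie and queryInstance(...) are all hypothetical placeholders for the real fan-out code.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

List<CompletableFuture<List<Movie>>> partials = instances.stream()
        .map(instance -> CompletableFuture.supplyAsync(() -> queryInstance(instance, filter)))
        .collect(Collectors.toList());

// A can only answer once *all* shards have answered: latency = max(time(B), time(C), ...)
// and a single down or rebalancing shard stalls or fails the whole request
CompletableFuture.allOf(partials.toArray(new CompletableFuture[0])).join();

List<Movie> result = partials.stream()
        .flatMap(future -> future.join().stream())
        .collect(Collectors.toList());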
31. Challenge 2: IQ Querying Limitations
Paging
Get parts of the result
Requires sorting for consistent results
Does not scale well:
# records = (partitions * (offset + size))
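To see where that formula comes from: without a global order across instances, a correct page can only be produced by over-fetching from every shard and merging. A toy sketch, where instances, Movie and the per-instance call fetchSorted(...) are hypothetical placeholders:

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

int offset = 1000;
int size = 20;

// every instance must hand over its first (offset + size) records ...
List<Movie> candidates = instances.stream()
        .flatMap(instance -> fetchSorted(instance, offset + size).stream())
        .collect(Collectors.toList());   // ≈ partitions * (offset + size) records in total

// ... before the aggregator can sort globally and cut out the one page that was asked for
List<Movie> page = candidates.stream()
        .sorted(Comparator.comparing(Movie::getTitle))
        .skip(offset)
        .limit(size)
        .collect(Collectors.toList());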
34. Challenge 3: Cloud Native Issues
Rebalances will get you
→ require statestores to be built again from scratch
→ Persistent Volumes help, but are a pain
35. Challenge 3: Cloud Native Issues
Rebalances will get you
→ require statestores to be built again from scratch
→ Persistent Volumes help, but are a pain
CI/CD flows trigger A LOT of rebalances
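None of this makes the problem go away, but the usual Kafka Streams knobs take some of the sting out of it; a hedged sketch with purely illustrative values (persistent state directory on the volume, standby replicas, static membership so a quick pod restart during a CI/CD rollout does not immediately trigger a rebalance):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// keep the statestores on the persistent volume so a restarted pod finds them again
props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/kafka-streams");
// warm spare copies of each store on another instance shorten fail-over restores
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
// static membership: a bounced pod keeps its member id and avoids an immediate rebalance
props.put(StreamsConfig.mainConsumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), "api-instance-1");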
42. Conclusion
APIs can be hosted directly from statestores
→ Your Mileage May Vary
Alternatively
→ Use external storage (Elasticsearch, MongoDB, ArangoDB, …)
→ Prefer Connectors over writing directly to external storage
→ Beware of the additional external dependency
→ Flink StateFun (Stateful Functions)?