The Possibilities and Pitfalls of Writing Your Own State Stores. Building an event-driven system will inevitably lead you to exposing your data through APIs, to make it accessible to non-streaming solutions. At first glance, Kafka Streams provides state stores that we could use to build our APIs directly on top of Kafka. But all store implementations are key/value based, which is fine when you only retrieve information by key. APIs, however, require a bit more “searchability”.
Writing your own state store is certainly possible, but it is challenging. At KOR, we went through this process and implemented a state store on top of Nitrite Database, an embedded document database. This allows you not only to retrieve your documents by key, but also to search through the values in the store using a MongoDB-like API. On the surface, state stores seem straightforward, but the devil is certainly in the details. How does partitioning fit into this story, and how do you make sure everything keeps running smoothly, even after restarting or scaling your applications?
We made the project open source for everyone out there wanting to try this approach, but most of all we want to tell you about the dragons we encountered. Join me on a journey of ups and downs that starts with a simple requirement (host an API), continues through implementing a custom state store, and finishes off by describing the challenges we encountered getting our APIs deployed. Don’t expect all “roses and sunshine”: while hosting APIs on Kafka is possible, there are some consequences that we just couldn’t overcome … yet.
2. Setting The Scene
KOR Financial
Regulatory Reporting
Event Driven Organization
Event Driven Foundation
Based on Kafka
Long retention (40 years!)
Rethink common practices
6. What is a state store
Part of Kafka Streams
Embedded “cache”
Internal to the application
Local vs Global Statestores
Fault Tolerant through “changelog topics”
Queryable from outside through Interactive Query Service (IQ)
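To make the last two bullets concrete, here is a minimal sketch of reading a built-in key/value store from outside the topology via Interactive Queries. It assumes a running KafkaStreams instance called streams and a store named "word-counts"; both are illustrative, not taken from the talk.

import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// obtain a read-only view on the local state through IQ
ReadOnlyKeyValueStore<String, Long> view = streams.store(
    StoreQueryParameters.fromNameAndType("word-counts", QueryableStoreTypes.keyValueStore()));

// point lookup by key ...
Long count = view.get("kafka");

// ... or a full scan; either way, access is strictly key/value based
try (KeyValueIterator<String, Long> all = view.all()) {
    all.forEachRemaining(kv -> System.out.println(kv.key + " = " + kv.value));
}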
10. Challenge 1: Key Value Only
Fast!
GET, SCAN
RocksDB
In-Memory, overflow to disk
Tweak the default memory settings!
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB#indexes-and-filter-blocks
https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-and-filter-blocks
Be smart with keys!
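Acting on the memory advice usually means plugging in a RocksDBConfigSetter, along the lines of the two wiki pages above. The sketch below is a hedged example rather than the configuration used at KOR; the class name is made up, and further tuning (block cache size, write buffers) would go in the same place.

import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;

public class BoundedRocksDBConfig implements RocksDBConfigSetter {

    @Override
    public void setConfig(String storeName, Options options, Map<String, Object> configs) {
        BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        // keep index and filter blocks inside the block cache instead of unbounded off-heap memory
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setTableFormatConfig(tableConfig);
        // block cache size and write buffer sizes can be bounded here as well
    }

    @Override
    public void close(String storeName, Options options) {
        // nothing to clean up in this sketch
    }
}

// registered via:
// props.put(StreamsConfig.ROCKSDB_CONFIG_SETTER_CLASS_CONFIG, BoundedRocksDBConfig.class);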
11. Challenge 1: Key Value Only
But what if we had:
- Search capabilities
- Filter-like API
12. Challenge 1: Key Value Only
But what if we had:
- Search capabilities
- Filter-like API
Custom Statestores!
13. Challenge 1: Key Value Only
But what if we had:
- Search capabilities
- Filter-like API
Embedded databases:
- H2, HSQLDB, Derby
- Lucene
- NitriteDB -> https://github.com/nitrite/nitrite-java
14. Challenge 1: Key Value Only
NitriteDB
Document Store
Natural Filtering API
Supports Indexes
Cursor cursor = collection.find(
    // and clause
    and(
        // firstName == John
        eq("firstName", "John"),
        // elements of the data array are less than 4
        elemMatch("data", lt("$", 4)),
        // elements of the fruits list have one element matching orange
        elemMatch("fruits", regex("$", "orange")),
        // note field contains the string 'quick'
        text("note", "quick")
    )
);
for (Document document : cursor) {
    // process the document
}
15. NO2 keys are Integers and values are Documents
→ Use NO2 Documents as envelopes
→ No relation between NO2 keys and Kafka keys
→ But NO2 can search on values (doc.key == …)
Challenge 1: Key Value Only
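A sketch of that envelope idea; the field names "key", "title" and "year" are our own illustration here, not something NO2 prescribes, and the store described in the talk wires this into the Kafka Streams lifecycle rather than opening the database by hand.

import static org.dizitart.no2.filters.Filters.eq;
import org.dizitart.no2.Document;
import org.dizitart.no2.Nitrite;
import org.dizitart.no2.NitriteCollection;

Nitrite db = Nitrite.builder().openOrCreate();
NitriteCollection movies = db.getCollection("movies");

// wrap the Kafka key and value in an NO2 Document "envelope";
// NO2 keeps its own internal id, the Kafka key becomes an ordinary field
Document envelope = Document.createDocument("key", "movie-001")
        .put("title", "The Matrix")
        .put("year", 1999);
movies.insert(envelope);

// the Kafka key is now just another searchable value
Document found = movies.find(eq("key", "movie-001")).firstOrDefault();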
16. What about fault-tolerance?
→ Changelog topic (compacted)
→ On application Start
Data on FS → Load from FS
Data not on FS → Restore from changelog
Challenge 1: Key Value Only
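Inside the custom store that decision lives in init(): the store opens its NO2 file under the task's state directory and registers a restore callback, and Kafka Streams replays the compacted changelog only when the local data is missing or behind. The sketch below is simplified; openNitriteFile, deleteByKafkaKey and upsertEnvelope are hypothetical placeholders for the real store internals.

import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.StateStore;

@Override
public void init(ProcessorContext context, StateStore root) {
    // data on FS → simply reopen the NO2 file in the task's state directory
    this.db = openNitriteFile(context.stateDir(), name());

    // data not on FS → Kafka Streams replays the compacted changelog topic
    // and hands every record to this restore callback
    context.register(root, (byte[] key, byte[] value) -> {
        if (value == null) {
            deleteByKafkaKey(key);       // tombstone: remove the envelope
        } else {
            upsertEnvelope(key, value);  // rebuild the NO2 document from the changelog record
        }
    });

    this.open = true;
}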
17. Challenge 1: Key Value Only
Integrate into Kafka Streams Topology

Topology topology = new Topology()
    // read commands from the movie-events topic
    .addSource("sourceProcessor",
        Serdes.String().deserializer(), eventSerde.deserializer(), "movie-events")
    // add a processor to manipulate the NO2 store based on incoming events
    .addProcessor("commandHandler", MovieEventHandler::new, "sourceProcessor")
    // add the NO2 statestore itself, using "code" as the key field
    .addStateStore(
        DocumentStores.nitriteStore("movies", Serdes.String(), movieSerde, Movie.class, "code"),
        "commandHandler")
    // write out processing results back to the original movie-events topic
    .addSink("sinkProcessor", "movie-events",
        Serdes.String().serializer(), eventSerde.serializer(), "commandHandler");
18. Challenge 1: Key Value Only
Integrate into Kafka Streams Processors (Processor API)

// get the statestore from the processor context
DocumentStore<String, Movie, ObjectFilter> store = context.getStateStore("movies");

// retrieve all movies which contain "Matrix" as part of the title
QueryCursor<Movie> movies = store.find(and(ObjectFilters.regex("title", ".*Matrix.*")));
21. Challenge 2: IQ Querying Limitations
Access data in statestores
Bring-your-own-API
IQv2
IQ Metadata API
Keeps track of which data is located at which instance
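In practice "bring your own API" means using that metadata to decide where a query should run. A hedged sketch (method names as in recent Kafka Streams 3.x releases; streams is the running KafkaStreams instance, thisInstance a HostInfo describing the local endpoint, and the HTTP forwarding is left abstract):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyQueryMetadata;
import org.apache.kafka.streams.StreamsMetadata;
import org.apache.kafka.streams.state.HostInfo;

// lookup by key: the metadata API tells us which instance owns the partition for this key
KeyQueryMetadata meta = streams.queryMetadataForKey(
        "movies", "movie-001", Serdes.String().serializer());
HostInfo owner = meta.activeHost();
if (owner.equals(thisInstance)) {
    // answer from the local NO2 store
} else {
    // forward the HTTP request to owner.host() + ":" + owner.port()
}

// filter-style search: there is no single owner, so fan out to every instance hosting the store
for (StreamsMetadata instance : streams.streamsMetadataForStore("movies")) {
    // call instance.hostInfo() over HTTP and merge the partial results
}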
27. Challenge 2: IQ Querying Limitations
Skewed Data
Query time on B < query time on C
A cannot return until all results are in
Shard Failures
What if B is down (or rebalancing)?
(Diagram: instance A fans the query out to instances B and C, which each expose the API over their own share of the data)
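A toy illustration of why both points hurt: the aggregating instance can only respond once every shard has responded, so the slowest (most data) or an unavailable shard dictates the overall latency. Here instances, filter, Movie and queryInstance(...) are all hypothetical placeholders for the real fan-out code.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

List<CompletableFuture<List<Movie>>> partials = instances.stream()
        .map(instance -> CompletableFuture.supplyAsync(() -> queryInstance(instance, filter)))
        .collect(Collectors.toList());

// A can only answer once *all* shards have answered: latency = max(time(B), time(C), ...)
// and a single down or rebalancing shard stalls or fails the whole request
CompletableFuture.allOf(partials.toArray(new CompletableFuture[0])).join();

List<Movie> result = partials.stream()
        .flatMap(future -> future.join().stream())
        .collect(Collectors.toList());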
31. Challenge 2: IQ Querying Limitations
Paging
Get parts of the result
Requires sorting for consistent results
Does not scale well:
# records = (partitions * (offset + size))
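To see where that formula comes from: without a global order across instances, a correct page can only be produced by over-fetching from every shard and merging. A toy sketch, where instances, Movie and the per-instance call fetchSorted(...) are hypothetical placeholders:

import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

int offset = 1000;
int size = 20;

// every instance must hand over its first (offset + size) records ...
List<Movie> candidates = instances.stream()
        .flatMap(instance -> fetchSorted(instance, offset + size).stream())
        .collect(Collectors.toList());   // ≈ partitions * (offset + size) records in total

// ... before the aggregator can sort globally and cut out the one page that was asked for
List<Movie> page = candidates.stream()
        .sorted(Comparator.comparing(Movie::getTitle))
        .skip(offset)
        .limit(size)
        .collect(Collectors.toList());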
34. Challenge 3: Cloud Native Issues
Rebalances will get you
→ require statestores to be built again from scratch
→ Persistent Volumes help, but are a pain
35. Challenge 3: Cloud Native Issues
Rebalances will get you
→ require statestores to be built again from scratch
→ Persistent Volumes help, but are a pain
CI/CD flows trigger A LOT of rebalances
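None of this makes the problem go away, but the usual Kafka Streams knobs take some of the sting out of it; a hedged sketch with purely illustrative values (persistent state directory on the volume, standby replicas, static membership so a quick pod restart during a CI/CD rollout does not immediately trigger a rebalance):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
// keep the statestores on the persistent volume so a restarted pod finds them again
props.put(StreamsConfig.STATE_DIR_CONFIG, "/data/kafka-streams");
// warm spare copies of each store on another instance shorten fail-over restores
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
// static membership: a bounced pod keeps its member id and avoids an immediate rebalance
props.put(StreamsConfig.mainConsumerPrefix(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG), "api-instance-1");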
42. Conclusion
APIs can be hosted directly from statestores
→ Your Mileage May Vary
Alternatively
→ Use external storage (Elasticsearch, MongoDB, ArangoDB, …)
→ Prefer Connectors over writing directly to external storage
→ Beware of the additional external dependency
→ Flink StateFun (Stateful Functions)?