Change data capture with MongoDB and Kafka.

Any modern web platform ends up needing to store different views of its data in several different datastores. I will cover how we do this reliably at State.com across a range of languages, tools and datastores.

  1. Change Data Capture with Mongo + Kafka. By Dan Harvey
  2. High-level stack • React.js - Website • Node.js - API Routing • Ruby on Rails + MongoDB - Core API • Java - Opinion Streams, Search, Suggestions • Redshift - SQL Analytics
  3. Problems • Keep user experience consistent • Streams / search index need to update • Keep developers efficient • Loosely couple services • Trust denormalisations
  4. Use case • User-to-user recommender • Suggest “interesting” users to a user • Update as soon as you make a new opinion • Instant feedback for contributing content
  5. Log transformation [Diagram. Components: Rails API (JSON/BSON), Mongo, oplog, Optailer, Kafka: User Topic, Kafka: Opinion Topic, User Recommender, Java Services (Avro). Data labels: User, Opinion. Stages: Change data capture, Stream processing]
  6. Op(log)tailer • Converts BSON/JSON to Avro • Guarantees latest document in topic (eventually) • Does not guarantee all changes • Compacting Kafka topic (only keeps latest)
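The "latest document, but not all changes" guarantee on slide 6 follows from how a compacted topic behaves: for each key, older records may be discarded and only the newest one is certain to survive. Below is a minimal sketch of that semantics in Python, using an in-memory list in place of a real topic. The `compact` function, the key format and the sample documents are hypothetical illustrations, not the Optailer's code or Kafka's implementation.

```python
from typing import Any, Iterable

# A record is (key, value): here the key is a document id and the
# value is the full document as of that change (both hypothetical).
Record = tuple[str, dict[str, Any]]


def compact(log: Iterable[Record]) -> list[Record]:
    """Keep only the newest record for each key.

    Surviving records stay in their original relative order, which
    mirrors the idea that compaction drops records but never reorders them.
    """
    latest: dict[str, tuple[int, dict[str, Any]]] = {}
    for offset, (key, value) in enumerate(log):
        # A later record for the same key replaces the earlier one.
        latest[key] = (offset, value)
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


# Three changes, two of them to the same user document.
changes = [
    ("user:1", {"name": "Ann", "opinions": 1}),
    ("user:2", {"name": "Bob", "opinions": 5}),
    ("user:1", {"name": "Ann", "opinions": 2}),
]
compacted = compact(changes)
```

A consumer that reads `compacted` from the start sees the current state of every document but never sees the intermediate `opinions: 1` version of `user:1`. In a real Kafka topic compaction runs in the background and does not touch the most recent segment, so a consumer may still see some intermediate versions; this is why the guarantee only holds eventually.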
  7. Avro Schemas • Each Kafka topic has a schema • Schemas evolve over time • Readers and Writers will have different schemas • Allows us to update services independently
  8. Schema Changes • Schema to ID managed by Confluent registry • Readers and writers discover schemas • Avro deals with resolution to compiled schema • Must be forwards and backwards compatible [Diagram: Kafka message: byte[], made up of schema ID: int and message: byte[]]
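The message layout on slide 8 (a schema ID followed by the Avro-encoded bytes) can be sketched in a few lines of Python. The slide only shows the ID and the payload; the leading magic byte of 0 and the 4-byte big-endian ID used here come from Confluent's documented wire format and are an assumption about what this deployment actually used. The function names `frame` and `unframe` are hypothetical, and the payload is treated as opaque bytes rather than decoded with an Avro library.

```python
import struct

MAGIC_BYTE = 0
# ">bI": one signed byte (magic) then a 4-byte big-endian unsigned schema ID.
_HEADER = struct.Struct(">bI")


def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an already Avro-encoded payload with its schema ID."""
    return _HEADER.pack(MAGIC_BYTE, schema_id) + avro_payload


def unframe(message: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema ID, Avro payload)."""
    if len(message) < _HEADER.size:
        raise ValueError("message is shorter than the 5-byte header")
    magic, schema_id = _HEADER.unpack_from(message)
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[_HEADER.size:]


# Round trip with a placeholder payload standing in for Avro bytes.
framed = frame(42, b"avro-bytes")
schema_id, payload = unframe(framed)
```

A reader would use the recovered `schema_id` to fetch the writer's schema from the registry, then let Avro resolve it against the schema the reader was compiled with. Carrying only the ID keeps every message small while still letting readers and writers run different schema versions.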
  9. Search indexing • User / Topic / Opinion search • Re-use Kafka topics from before • Index from Kafka to Elasticsearch • Need to update quickly and reliably
  10. Samza Indexers • Index from Kafka to Elasticsearch • Used Samza for transform and loading • Far less code than Java Kafka consumers • Stores offsets and state in Kafka
  11. Elasticsearch Producer • Samza consumers/producers deal with I/O • Wrote new ElasticsearchSystemProducer • Contributed back to Samza project • Included in Samza 0.10.0 (released soon)
  12. Samza Good/Bad • Good API • Simple transformations easy • Simple ops: logging, metrics all built in • Only depends on Kafka • Inbuilt state management • Joins tricky, need consistent partitioning • Complex flows are hard (Flink/Spark better)
  13. Decoupling Good/Bad • Easy to try out complex new services • Easy to keep data stores in sync, low latency • Started to duplicate core logic • More overhead with more services • Need high level framework for denormalisations • Samza SQL being developed
  14. Ruby Workers • Ruby Kafka consumers not great… • Optailer to AWS SQS (Shoryuken gem) • No order guarantee like Kafka topics • But guaranteed trigger off database writes • Better for core data transformations
  15. Future • Segment.io user interaction logs to Kafka • Use in product, view counts, etc… • Fill Redshift for analytics (currently batch) • Kafka CopyCat instead of our Optailer • Avro transformation in Samza
  16. Questions? • email: dan@state.com • twitter: @danharvey