2. Presenter
Maheedhar Gunturu, Solutions Architect
Maheedhar has held senior roles in both engineering and sales
organizations. He has over a decade of experience designing and
developing server-side applications in the cloud and working on
big data and ETL frameworks at companies such as Samsung,
MapR, Apple, VoltDB, Zscaler, and Qualcomm. He holds a
master's degree in Electrical and Computer Engineering from
the University of Texas at San Antonio.
7. Intermediate layer for buffering.
• Provides flexibility for downstream consumers.
  • Buffers data during upgrades, migrations, or troubleshooting.
  • Downstream systems don't have to be provisioned for peak traffic, which saves hardware costs.
• Dynamically scalable layer to handle bursty loads.
  • Add more partitions/brokers to increase parallelism and throughput.
  • Use the Kafka operator and it will dynamically scale the cluster based on ingress traffic.
• Provides resiliency and fault tolerance.
  • Each topic has replicas, with multiple partitions spread across multiple brokers.
  • Set TTLs at the topic level to determine retention (see the sketch after this list).
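As a concrete illustration of the last point, here is a minimal sketch using the Java AdminClient to create a buffering topic with partitions for parallelism, replicas for fault tolerance, and a topic-level retention TTL. The broker address, topic name, and counts are assumptions, not values from the talk.

// Create a topic with partitions, replicas, and a retention TTL.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateBufferTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // 12 partitions for parallelism, replication factor 3 for fault tolerance
            NewTopic topic = new NewTopic("sensor-events", 12, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "604800000")); // 7-day TTL
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}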
9. Publish CDC streams
• Publishes record-level changes to the corresponding topics.
  • The level of detail in the change records is usually configurable.
• Upstream changes to watched rows are emitted as change records.
  • The format of these records is configurable (JSON, Avro, etc.).
• Downstream processing for reporting, caching, or full-text indexing.
  • Subscribe with the corresponding consumer (e.g. Scylla, Elasticsearch, Spark); a consumer sketch follows this list.
• Changefeeds are emitted with at-least-once delivery guarantees.
  • In most cases, each version of a row will be emitted once; however, some infrequent
    conditions (e.g. node failures, network partitions) will cause repeats.
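A minimal sketch of a downstream consumer reading such a changefeed topic. The broker address, group id, and topic name are assumptions; because delivery is at-least-once, processing should be idempotent.

// Subscribe to a CDC changefeed topic and apply each change record.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ChangefeedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "reporting-service");       // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("cdc.users")); // assumed changefeed topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is one change record (e.g. JSON); apply it idempotently
                    // so occasional redelivery after node failures is harmless.
                    System.out.printf("key=%s change=%s%n", record.key(), record.value());
                }
            }
        }
    }
}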
10. Integrations
[Diagram: a Kafka topic read by example consumers (App 1, App 2, App 3), each using a serializer that validates records against the Schema Registry; integrations shown include Scylla, Mongo, Elastic, and HBase.]
• Define the expected fields for each Kafka topic (a producer wiring sketch follows this list).
• Automatically handle schema changes (e.g. new fields).
• Prevent backwards-incompatible changes.
• Support multi-data-center environments.
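A minimal sketch of wiring a producer to the Schema Registry so every record is validated against the topic's registered schema. The registry URL and topic name are assumptions, and the Avro serializer requires Confluent's client library on the classpath.

// Configure a producer to serialize values through the Schema Registry.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class AvroProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers/fetches schemas automatically
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry address

        try (KafkaProducer<String, Object> producer = new KafkaProducer<>(props)) {
            // The value must be an Avro record matching the subject's schema;
            // backwards-incompatible changes are rejected by the registry:
            // producer.send(new ProducerRecord<>("sensor-events", key, avroRecord));
        }
    }
}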
11. Streaming Data Transformations.
[Diagram: within a Kafka cluster, a producer writes to topics; Streams API applications read from topics, transform the data, and write back to topics read by consumers.]
Overview
• Write standard Java applications
• No separate processing cluster required
• Exactly-once processing semantics
• Elastic, highly scalable, fault-tolerant
• Fully integrated with Kafka security
Example Use Cases
• Event-driven microservices
• Continuous queries
• Continuous transformations (see the sketch below)
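A minimal sketch of the Overview above: a plain Java application using the Streams API as a continuous transformation, with no separate processing cluster. The topic names and conversion logic are assumptions for illustration.

// Continuously convert Celsius readings to Fahrenheit between two topics.
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class CelsiusToFahrenheit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "temperature-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Opt in to exactly-once processing (EXACTLY_ONCE on pre-3.0 clients)
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> celsius = builder.stream("temperature"); // assumed input topic
        // Continuous transformation: convert each reading and republish
        celsius.mapValues(v -> String.valueOf(Double.parseDouble(v) * 9 / 5 + 32))
               .to("temperature-fahrenheit"); // assumed output topic

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}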
12. Impedance mismatch between applications
• Applications produce and consume data at different rates.
• Provides flexibility for downstream applications to scale based on their SLAs.
  • Downstream applications can be independently scaled (see the sketch after this list).
• Dynamically move partitions to optimize resource utilization and reliability.
• Enable elastic scaling by easily adding and removing nodes from your Kafka cluster.
• Tuning a topic's configuration helps make efficient use of consumers.
  • Determine the ratio between the number of partitions in a topic and the number of consumers.
• ADB traffic is throttled during data transfers to protect network bandwidth.
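A minimal sketch of independently scaling one side of the pipeline: growing a topic's partition count so more consumers in the group can run in parallel. The topic name and counts are assumptions; per the editor's notes, a 1:1 or 1:2 partitions-to-consumers ratio is typical.

// Increase a topic's partition count to allow more parallel consumers.
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class ScaleTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (Admin admin = Admin.create(props)) {
            // Grow to 24 partitions (partition counts can only increase),
            // then add consumers to the group to absorb the extra parallelism.
            admin.createPartitions(Map.of("sensor-events", NewPartitions.increaseTo(24)))
                 .all().get();
        }
    }
}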
13. Event Sourcing
• Every change to the state of an application is captured in an event object.
• The order of the events needs to be maintained.
• Ability to recreate state in your application and the supporting database.
• CQRS provides the benefit of event sourcing analogous to a materialized view.
• Need to keep track of lineage and the transformations that were run on the data.
• Newer versions of ML algorithms can operate on the raw event data to recreate the state in the database (see the sketch after this list).
  • Better model serving/benchmarking.
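A minimal, self-contained sketch of the event-sourcing idea: state is never stored directly, it is recreated by replaying an ordered event log, and a newer model can replay the same raw events to build a different view (the CQRS point above). The account/balance domain is illustrative, not from the talk.

// Rebuild state by replaying an ordered event log.
import java.util.ArrayList;
import java.util.List;

public class AccountEventSourcing {
    record Event(String type, long amount) {} // illustrative event object

    // Replaying the log in order recreates the current balance.
    static long replay(List<Event> log) {
        long balance = 0;
        for (Event e : log) {
            switch (e.type()) {
                case "DEPOSIT"  -> balance += e.amount();
                case "WITHDRAW" -> balance -= e.amount();
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        List<Event> log = new ArrayList<>();
        log.add(new Event("DEPOSIT", 100));
        log.add(new Event("WITHDRAW", 30));
        log.add(new Event("DEPOSIT", 5));
        System.out.println(replay(log)); // prints 75
    }
}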
16. Kafka Connect Features
01. A standard framework for Kafka connectors.
02. Distributed and standalone modes.
03. REST interface for configuration (port 8083).
04. Distributed and scalable by default.
05. Automatic offset management.
06. Streaming/batch integration.
17. Kafka Connect API
[Diagram: sources (CDC streams, databases, Mongo, Cassandra) feed Connect workers through the Kafka Connect API into the Kafka pipeline, and Connect workers write out to sinks (Elastic, Scylla, HDFS).]
• Auto-recovery and fault tolerance.
• Manages hundreds of data sources and sinks.
• Preserves the data schema.
• Integrated within Confluent Control Center.
• Simple parallelism (a connector skeleton sketch follows this list).
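A minimal sketch of what the "standard framework" looks like from a developer's side: a sink connector and its task. The class names and logging body are illustrative; a real connector adds configuration validation and error handling.

// Skeleton of a Kafka Connect sink connector and its task.
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.sink.SinkConnector;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class LoggingSinkConnector extends SinkConnector {
    private Map<String, String> config;

    @Override public String version() { return "0.1"; }
    @Override public void start(Map<String, String> props) { this.config = props; }
    @Override public Class<? extends Task> taskClass() { return LoggingSinkTask.class; }
    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Simple parallelism: the framework spawns up to tasks.max task copies
        return Collections.nCopies(maxTasks, config);
    }
    @Override public void stop() { }
    @Override public ConfigDef config() { return new ConfigDef(); }

    public static class LoggingSinkTask extends SinkTask {
        @Override public String version() { return "0.1"; }
        @Override public void start(Map<String, String> props) { }
        @Override public void put(Collection<SinkRecord> records) {
            // Offsets are committed automatically by the framework
            records.forEach(r -> System.out.println(r.topic() + ": " + r.value()));
        }
        @Override public void stop() { }
    }
}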
18. Configuring Kafka Connect (sink)
# sample cassandra-sink.properties file
name=sink
topics=temperature
tasks.max=1
connector.class=io.confluent.connect.cassandra.CassandraSinkConnector
cassandra.contact.points=<public IPs of your Scylla cluster (IP1,IP2,IP3)>
cassandra.keyspace=demo
cassandra.compression=SNAPPY
cassandra.consistency.level=LOCAL_QUORUM
transforms=prune
transforms.prune.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.prune.whitelist=CreatedAt,Id,Text,Source,Truncated
1. Update the sink.properties file.
2. Update the connect-distributed.properties file.
3. Start the Connect framework using the Cassandra connector in distributed mode (a REST registration sketch follows the reference link below).
ref: https://www.scylladb.com/2018/12/19/scylla-and-confluent-for-iot/
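A minimal sketch of step 3: once the workers are running in distributed mode, the connector is created through the REST interface on port 8083. The host is an assumption, and the JSON body mirrors the properties file above.

// Register the sink connector via the Connect REST API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSinkConnector {
    public static void main(String[] args) throws Exception {
        String body = """
            {"name": "sink",
             "config": {
               "connector.class": "io.confluent.connect.cassandra.CassandraSinkConnector",
               "topics": "temperature",
               "tasks.max": "1",
               "cassandra.contact.points": "IP1,IP2,IP3",
               "cassandra.keyspace": "demo"}}""";

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}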
19. Kafka Connect Security
Encryption
• Kafka Connect also works with SSL-encrypted connections to the brokers.
Authentication
• Kafka Connect works with SASL, e.g. Kerberos or Active Directory.
Authorization
• Restrict who can create, write to, and read from topics, and more.
• The REST API on Kafka Connect nodes is not secured.
  • When configuring a secure cluster, an external proxy (e.g. Apache HTTP Server) is required to act as a secure gateway to the REST services (a settings sketch follows below).
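A minimal sketch of the worker-side settings described above: TLS to the brokers plus SASL authentication. The paths, password, and mechanism are placeholders, not values from the talk.

// Broker-connection security settings for a Connect worker.
import java.util.Properties;

public class SecureWorkerProps {
    static Properties secureProps() {
        Properties props = new Properties();
        props.put("security.protocol", "SASL_SSL");                        // encrypt + authenticate
        props.put("ssl.truststore.location", "/etc/kafka/truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");                  // placeholder
        props.put("sasl.mechanism", "GSSAPI");                             // e.g. Kerberos
        props.put("sasl.kerberos.service.name", "kafka");
        return props;
    }
}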
20. Confluent Hub
• Discover and share connectors.
• Cassandra (OSS) and DynamoDB source/sink connectors are available.
• A Scylla shard-aware connector will be published soon!
https://www.confluent.io/hub/confluentinc/kafka-connect-cassandra
22. Takeaways
• Message queues are useful for a variety of reasons.
• The Scylla Kafka Connector (sink and CDC source) will be coming out soon!
• Event streaming and event-driven microservices are useful; try them out!
23. Thank you! Stay in touch.
Any questions?
Maheedhar Gunturu
maheedhar@scylladb.com
@vanguard_space
24. Some useful links
Here are some useful links for further reading/watching.
1. Useful video explaining most things at an introductory level – https://www.confluent.io/kafka-summit-sf18/so-you-want-to-write-a-connector
2. Confluent's developer guide to connectors, which covers most of the basics – https://docs.confluent.io/current/connect/devguide.html
3. The source for the above developer guide is available through Maven here – https://mvnrepository.com/artifact/org.apache.kafka/connect-file/2.1.1
4. Useful guide providing additional best practices (now deprecated, though still useful) – https://docs.google.com/document/d/1jEn_G-KDsrhdecPTGIWIcke1I4gw4fR0G8OVj8e3iAI/edit#
5. Verification guide, though a little generic as it covers both connectors and consumers/producers – https://www.confluent.io/wp-content/uploads/Verification-Guide-Confluent-Platform-Connectors-Integrations.pdf
6. https://opencredo.com/blogs/kafka-connect-source-connectors-a-detailed-guide-to-connecting-to-what-you-love/
7. https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
8. https://www.confluent.io/blog/the-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
9. https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
Editor's notes
Kafka is highly available, resilient to node failures, and supports automatic recovery. This makes Apache Kafka ideal for communication and integration between components of large-scale, real-world data systems.
A typical ratio of the number of partitions in a topic to the number of consumers in a group would be (1:1) or (1:2)
https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://www.confluent.io/blog/apache-kafka-supports-200k-partitions-per-cluster
The topic partition is the unit of parallelism in Kafka. On both the producer and the broker side, writes to different partitions can be done fully in parallel. Kafka can replicate partitions across a configurable number of Kafka servers. Each partition has a leader server and zero or more follower servers; leaders handle all read and write requests for a partition.
Kafka also uses partitions for parallel consumer handling within a group. Each broker handles its share of data and requests by sharing partition leadership. The partitions of each topic that the consumers are subscribed to are assigned dynamically to the consumers in round-robin fashion.
Phrases coined by Martin Fowler: CQRS and Event Sourcing.
CQRS stands for Command Query Responsibility Segregation.
https://martinfowler.com/eaaDev/EventSourcing.html
Kafka Connect is an open source framework, built as another layer on core Apache Kafka, to support large scale streaming data.
SoC (Separation of Concerns)
Includes single message transforms (SMTs).
Can communicate with the Schema Registry.
Currently the API is primarily Java and Scala only.