5. A Data Format Standard is important!
syntax="proto2";
message envelope {
# required fields
required string data_type = 1;
required string create_at_us = 1;
required string source_name = 1;
# optional
string schema = 1;
# payload
bytes payload = 1;
}
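Compiling the envelope with protoc yields the MessageEnvelopeOuterClass$MessageEnvelope Java class referenced in the converter configuration later on (a sketch; the output path is illustrative, and the file is assumed to be named message_envelope.proto, which is what makes protoc append the OuterClass suffix):

protoc --java_out=src/main/java message_envelope.proto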
6. Why?
• To query Kafka messages in real time
• To quickly find the location of a message
• To trace a historic event for debugging/diagnosis
• To monitor data quality in the pipeline
• To monitor and project data volume in the pipeline for capacity planning
• To detect abnormal data patterns
7. Takeaways
• Quick overview of Kafka Connect
• How data transformation works in Kafka Connect
• What is SMT
• Some use cases for SMT
• SMT vs Kafka Streams for data transformation
• Tips for using Kafka Connect to sink data to Elasticsearch
9. More reasons to use Kafka Connect
• Lightweight and stateless
• Scalable and fault-tolerant
• Integrates with Kafka and many other data systems
• Pluggable architecture makes customization easy and configurable
• Lots of open-source connector and converter plugins available
• Runs in two modes:
• standalone mode is great for dev and local testing
• distributed mode is great for scaling and fault tolerance
• REST API available to monitor and configure your connectors in distributed mode
11. Plugin: Data Converter
● Default Avro or JSON, or write your own
● Configurable
○ Different data converters for key and value
○ Specify how null, invalid, or malformed messages should be handled
● Kafka Connect isolates each plugin from the others, so libraries in one plugin are not affected by the libraries in any other plugin
○ `plugin.path` is configured in the Kafka Connect worker configuration
○ Build your JAR with dependencies and copy it to `plugin.path`.

# Relative path, resolved from the home directory of Confluent Platform; use an
# absolute path when running Kafka Connect from a directory other than the home
# directory of Confluent Platform.
plugin.path=share/java
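As a sketch of the configurability above, a connector can override the worker's converters separately for key and value, and (since Kafka 2.0) control how bad records are handled; the converter choices below are illustrative:

key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
# tolerate and log malformed records instead of failing the task
errors.tolerance=all
errors.log.enable=true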
12. Plugin: Data Converter

# Data converter plugin
value.converter.protoClassName=net.demonware.pipes.connect.data.proto.MessageEnvelopeOuterClass$MessageEnvelope
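Here `protoClassName` is the custom converter's own configuration key, telling it which generated protobuf class to deserialize payloads into. A fuller sketch (the converter class name is an assumption, not the actual plugin):

# value converter plugin and its generated protobuf class (converter class name is illustrative)
value.converter=net.demonware.pipes.connect.data.ProtobufConverter
value.converter.protoClassName=net.demonware.pipes.connect.data.proto.MessageEnvelopeOuterClass$MessageEnvelope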
14. What is SMT?
• Modifies messages going out of Kafka before they reach Elasticsearch
• One message at a time
• Many built-in SMTs are already available
• Flexible within the constraints of the TransformableRecord API and a 1:{0,1} mapping
• Transformations are chained
• Pluggable transformers through Connect configuration
15. Default Kafka Connect SMT
• InsertField: Insert a field using attributes from the record metadata or a configured static value.
• MaskField: Mask specified fields with a valid null value for the field type.
• ReplaceField: Filter or rename fields.
• TimestampConverter: Convert timestamps between different formats such as Unix epoch, strings, and Connect Date and Timestamp types.
• TimestampRouter: Update the record's topic field as a function of the original topic value and the record timestamp.
• RegexRouter: Update the record topic using the configured regular expression and replacement string.
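For example, TimestampConverter can render an epoch field as a formatted string (a sketch; the field name and format are illustrative):

transforms=formatTimestamp
transforms.formatTimestamp.type=org.apache.kafka.connect.transforms.TimestampConverter$Value
transforms.formatTimestamp.field=created_at
transforms.formatTimestamp.target.type=string
transforms.formatTimestamp.format=yyyy-MM-dd HH:mm:ss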
16. Default Kafka Connect SMT (continued)
• Cast: Cast fields or the entire key or value to a specific type, e.g. to force an integer field to a smaller width.
• ExtractField: Extract the specified field from a Struct when a schema is present, or from a Map in the case of schemaless data. Any null values are passed through unmodified.
• ExtractTopic: Replace the record topic with a new topic derived from its key or value.
• Flatten: Flatten a nested data structure. This generates names for each field by concatenating the field names at each level with a configurable delimiter character.
• HoistField: Wrap data using the specified field name in a Struct when a schema is present, or in a Map in the case of schemaless data.
• ValueToKey: Replace the record key with a new key formed from a subset of fields in the record value.
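A sketch chaining two of these built-ins, first narrowing an integer field and then masking a sensitive one (field names are illustrative):

transforms=castId,maskSecrets
transforms.castId.type=org.apache.kafka.connect.transforms.Cast$Value
transforms.castId.spec=user_id:int32
transforms.maskSecrets.type=org.apache.kafka.connect.transforms.MaskField$Value
transforms.maskSecrets.fields=password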
17. Configuring SMT
• An alias in `transforms` implies that some additional keys are configurable.
• Syntax:
• transforms.$alias.type – fully qualified class name for the transformation
• transforms.$alias.* – all other keys as defined in Transformation.config() are embedded with this prefix
• Example:
transforms.insertKafkaMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertKafkaMetadata.topic.field=kafka_topic
transforms.removeFields.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.removeFields.blacklist=context,tracing,payload
transforms.convertTimestampUnit.type=net.demonware.pipes.kafka.connect.transforms.ConvertTimeToMillis$Value
transforms.convertTimestampUnit.timestamp.fields=created_at_us,ingested_at_us
18. Ordering of SMT matters!
• SMTs are chained
• SMTs are applied in the order they are specified in `transforms`.
• If your transformations are order-dependent, make sure they are specified in the correct order
• Example: here insertKafkaMetadata runs first, so the original topic name is captured in `kafka_topic` before indexMapping rewrites the topic.

transforms=insertKafkaMetadata,indexMapping
transforms.indexMapping.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.indexMapping.topic.format=topic-changed-${timestamp}
transforms.indexMapping.timestamp.format=yyyy.MM.dd
transforms.insertKafkaMetadata.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.insertKafkaMetadata.topic.field=kafka_topic
19. Create Custom SMT
• Only if you cannot use the built-in SMTs and cannot use Kafka Streams for the data transformation.
• Must implement the Transformation interface.
• Consider making your SMT configurable.
• If you have multiple custom SMTs, it is better to give each its own Transformation implementation.
20. Interface: Transformation

// Existing base class for SourceRecord and SinkRecord, with a new self type parameter.
public abstract class ConnectRecord<R extends ConnectRecord<R>> {
    // ...
    // New abstract method:
    /** Generate a new record of the same type as itself, with the specified parameter values. **/
    public abstract R newRecord(String topic, Schema keySchema, Object key, Schema valueSchema, Object value, Long timestamp);
}

public interface Transformation<R extends ConnectRecord<R>> extends Configurable, Closeable {
    // via Configurable base interface:
    // void configure(Map<String, ?> configs);

    /**
     * Apply transformation to the {@code record} and return another record object (which may be {@code record} itself) or {@code null},
     * corresponding to a map or filter operation respectively. The implementation must be thread-safe.
     */
    R apply(R record);

    /** Configuration specification for this transformation. **/
    ConfigDef config();

    /** Signal that this transformation instance will no longer be used. **/
    @Override
    void close();
}
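To make this concrete, here is a minimal sketch of what the ConvertTimeToMillis transform referenced earlier might look like. It is illustrative, not the actual implementation: the real transform presumably has $Key/$Value inner variants like the built-ins, and note that in released Kafka versions newRecord also takes the partition as a parameter.

package net.demonware.pipes.kafka.connect.transforms;

import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

/** Sketch: converts configured microsecond timestamp fields in the record value to milliseconds. */
public class ConvertTimeToMillis<R extends ConnectRecord<R>> implements Transformation<R> {

    public static final ConfigDef CONFIG_DEF = new ConfigDef()
            .define("timestamp.fields", ConfigDef.Type.LIST, ConfigDef.Importance.HIGH,
                    "Fields holding microsecond timestamps to convert to milliseconds.");

    private List<String> fields;

    @Override
    public void configure(Map<String, ?> configs) {
        fields = new AbstractConfig(CONFIG_DEF, configs).getList("timestamp.fields");
    }

    @Override
    public R apply(R record) {
        if (!(record.value() instanceof Struct)) {
            return record; // pass non-Struct (e.g. schemaless) records through unchanged
        }
        Struct value = (Struct) record.value();
        for (String field : fields) {
            Long micros = (Long) value.get(field);
            if (micros != null) {
                value.put(field, micros / 1000L); // microseconds -> milliseconds
            }
        }
        return record.newRecord(record.topic(), record.kafkaPartition(), record.keySchema(),
                record.key(), record.valueSchema(), value, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return CONFIG_DEF;
    }

    @Override
    public void close() {
        // no resources to release
    }
}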
25. SMT vs Kafka Streams

Use SMT when:
● Lightweight and simple data transformation.
● Covered by the Kafka Connect built-in SMTs.
● Data footprint cost is a concern. A large amount of transformed data written back to Kafka is too costly.
● Simplicity in the streaming data pipeline is important. Want to keep pipeline stages and services to a minimum.
● Transformation does not interact with external systems.

Use Kafka Streams when:
● Recommended practice in general.
● Transformation involves multiple messages, such as aggregation.
● More complex transformation: aggregation, windowing, joining.
● The transformed data will be consumed by multiple downstream consumers. Reduce overhead by running the transformation only once and allow reuse.
27. Do's
• Overwrite the ES @timestamp internal field
• Overwrite the document '_id' field to control how your data should be de-duplicated
• Remove unnecessary columns/fields to save space and reduce the footprint of your ES cluster
• Manage your ES indices by day. You can use the 'TimestampRouter' and 'RegexRouter' SMTs to generate one ES index per day for your data
• Make binary data available for search in a user-friendly format by transforming the binary data prior to indexing, e.g. a UUID rendered as "123e4567-e89b-12d3-a456-426655440000"
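Several of these can be expressed directly in the Elasticsearch sink connector configuration (a sketch; the topic and alias names are illustrative):

connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
topics=events
# use the record key as the ES document _id so ES de-duplicates on it
key.ignore=false
transforms=dailyIndex
transforms.dailyIndex.type=org.apache.kafka.connect.transforms.TimestampRouter
transforms.dailyIndex.topic.format=${topic}-${timestamp}
transforms.dailyIndex.timestamp.format=yyyy.MM.dd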
28. Don'ts
● Some cosmetic data format tweaking can be done in Kibana instead:
○ Date display format
○ Base64-decode binary data for display
○ Type casting from integer to text
● If you need to modify Kafka Connect source code for any reason, you might want to reconsider using Kafka Connect
○ It can be hard to debug and test. Maybe you should consider Kafka Streams instead
● When implementing your own transformation, keep each transformation implementation separate rather than having a single transformation class that does a bunch of things.