This document provides an agenda for a deep dive on KSQL, the streaming SQL engine for Apache Kafka. It begins with an overview of the Apache Kafka ecosystem and how Kafka Streams serves as the foundation for KSQL. It then discusses the motivations for using KSQL and covers KSQL concepts like streams, tables, and windowing. The agenda also includes two live demos - an introduction to KSQL and a clickstream analysis example. It will discuss building user-defined functions with KSQL and machine learning. Finally, it covers getting started with KSQL.
3. 3Confidential
1.0 Enterprise
Ready J
A Brief History of Apache Kafka and Confluent
0.11 Exactly-once
semantics
0.10 Data processing
(Streams API)
0.9 Data integration
(Connect API)
Intra-cluster
replication
0.8
2012 2014
Cluster mirroring0.7
2015 2016 20172013 2018
CP 4.1
KSQL GA
4. 4Confidential
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
5. 5KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
6. 6KSQL- Streaming SQL for Apache Kafka
Apache Kafka - A Distributed, Scalable Commit Log
8. 8KSQL- Streaming SQL for Apache Kafka
Apache Kafka at Large Scale à No need to do a POC
https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63921
https://qconlondon.com/london2018/presentation/cloud-native-and-scalable-kafka-architecture
(2018) (2018)
9. 9KSQL- Streaming SQL for Apache Kafka
The Log ConnectorsConnectors
Producer Consumer
Streaming Engine
Apache Kafka – The Rise of a Streaming Platform
10. 10KSQL- Streaming SQL for Apache Kafka
Apache Kafka – The Rise of a Streaming Platform
11. 11KSQL- Streaming SQL for Apache Kafka
KSQL – The Streaming SQL Engine for Apache Kafka
12. 12KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
28. 28KSQL- Streaming SQL for Apache Kafka
Kafka
Streams
A complete streaming microservices, ready for production at large-scale
App configuration
Define processing
(here: WordCount)
Start processing
29. 29KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
30. 30KSQL- Streaming SQL for Apache Kafka
Why KSQL?
Population
CodingSophistication
Realm of Stream Processing
New, Expanded Realm
BI
Analysts
Core
Developers
Data
Engineers
Core Developers
who don’t like
Java
Kafka
Streams
KSQL
32. 32KSQL- Streaming SQL for Apache Kafka
What is it for ?
Streaming ETL
• Kafka is popular for data pipelines
• KSQL enables easy transformations of data within the pipe
CREATE STREAM vip_actions AS
SELECT userid, page, action FROM clickstream c
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
33. 33KSQL- Streaming SQL for Apache Kafka
What is it for ?
Analytics, e.g. Anomaly Detection
• Identifying patterns or anomalies in real-time data, surfaced in milliseconds
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 MINUTES)
GROUP BY card_number
HAVING count(*) > 3;
34. 34KSQL- Streaming SQL for Apache Kafka
What is it for ?
Real Time Monitoring
• Log data monitoring, tracking and alerting
• Sensor / IoT data
CREATE TABLE error_counts AS
SELECT error_code, count(*)
FROM monitoring_stream
WINDOW TUMBLING (SIZE 1 MINUTE)
WHERE type = 'ERROR'
GROUP BY error_code;
35. 35KSQL- Streaming SQL for Apache Kafka
What is it for ?
Simple Derivations of Existing Topics
• One-liner to re-partition and / or re-key a topic for new uses
CREATE STREAM views_by_userid
WITH (PARTITIONS=6,
VALUE_FORMAT=‘JSON’,
TIMESTAMP=‘view_time’) AS
SELECT *
FROM clickstream
PARTITION BY user_id;
36. 36KSQL- Streaming SQL for Apache Kafka
What is it for ?
Easily Convert between Data Formats in Real Time
• JSON-to-Avro conversion
CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature
INTEGER, ...) WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON');
CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT *
FROM sensor_events_json;
37. 37KSQL- Streaming SQL for Apache Kafka
Where is KSQL not such a great fit (at least not yet)?
Powerful ad-hoc query
○ Limited span of time usually
retained in Kafka
○ No indexes
BI reports (Tableau etc.)
○ No indexes
○ No JDBC (most Bi tools are not
good with continuous results!)
Keep in mind:
Kafka is a streaming platform!
38. 38KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
39. 39KSQL- Streaming SQL for Apache Kafka
KSQL – A Streaming SQL Engine for Apache Kafka
40. 40KSQL- Streaming SQL for Apache Kafka
KSQL is Equally viable for S / M / L / XL / XXL use cases
Ok. Ok. Ok.
… and KSQL is ready for production, including 24/7 support!
41. 41KSQL- Streaming SQL for Apache Kafka
KSQL is Equally viable for S / M / L / XL / XXL use cases
43. 43KSQL- Streaming SQL for Apache Kafka
Where to develop and operate your applications?
44. 44KSQL- Streaming SQL for Apache Kafka
KSQL Concepts
● No need for source code
• Zero, none at all, not even one line.
• No SerDes, no generics, no lambdas, ...
● All the Kafka Streams “magic” out-of-the-box
• Exactly Once Semantics
• Windowing
• Event-time aggregation
• Late-arriving data
• Distributed, fault-tolerant, scalable, ...
46. 46KSQL- Streaming SQL for Apache Kafka
SELECT statement syntax
SELECT `select_expr` [, ...]
FROM `from_item` [, ...]
[ WINDOW `window_expression` ]
[ WHERE `condition` ]
[ GROUP BY `grouping expression` ]
[ HAVING `having_expression` ]
[ LIMIT n ]
where from_item is one of the following:
stream_or_table_name [ [ AS ] alias]
from_item LEFT JOIN from_item ON join_condition
47. 47KSQL- Streaming SQL for Apache Kafka
CREATE STREAM AS syntax
CREATE STREAM `stream_name`
[WITH (`property = expression` [, …] ) ]
AS SELECT `select_expr` [, ...]
FROM `from_item` [, ...]
[ WHERE `condition` ]
[ PARTITION BY `column_name` ]
● where property can be any of the following:
KAFKA_TOPIC = name - what to call the sink topic
FORMAT = DELIMITED | JSON | AVRO - defaults to format of input stream
AVROSCHEMAFILE = path/to/file - if FORMAT=AVRO, where the output schema file will be written to
PARTITIONS = # - number of partitions in sink topic
TIMESTAMP = column - The name of the column to use as the timestamp. This can be used to define the
event time.
48. 48KSQL- Streaming SQL for Apache Kafka
CREATE TABLE AS syntax
CREATE TABLE `stream_name`
[WITH ( `property_name = expression` [, ...] )]
AS SELECT `select_expr` [, ...]
FROM `from_item` [, ...]
[ WINDOW `window_expression` ]
[ WHERE `condition` ]
[ GROUP BY `grouping expression` ]
[ HAVING `having_expression` ]
● where property values are same as for ‚Create Streams as Select‘
49. 49KSQL- Streaming SQL for Apache Kafka
Automatic Inference of Topic Schema (leveraging Confluent Schema Registry)
https://www.matthowlett.com/2017-12-23-exploring-wikipedia-ksql.html
50. 50KSQL- Streaming SQL for Apache Kafka
WINDOWing
● Not ANSI SQL ! à Continuous Queries
● Three types supported (same as Kafka Streams):
• TUMBLING (repeats at a non-overlapping interval)
• SELECT appname, ip, COUNT(appname) AS problem_count FROM
logstream WINDOW TUMBLING (size 1 minute) WHERE loglevel='ERROR'
GROUP BY appname, ip;
• HOPPING (similar to tumbling, but hopping generally has an overlapping interval)
• SELECT itemid, SUM(arraycol[0]) FROM orders WINDOW HOPPING (
size 20 second, advance by 5 second) GROUP BY itemid;
• SESSION (groups elements by sessions of activity, do not overlap and do not have a fixed start and end time,
closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred)
• SELECT itemid, SUM(sales_price) FROM orders WINDOW SESSION (20
second) GROUP BY itemid;
51. 51KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
52. 52KSQL- Streaming SQL for Apache Kafka
KSQL CLI
ksql> CREATE STREAM pageviews_original (viewtime bigint, userid varchar, pageid varchar) WITH (kafka_topic='pageviews',
value_format='DELIMITED');
ksql> CREATE TABLE users_original (registertime bigint, gender varchar, regionid varchar, userid varchar) WITH (kafka_topic='users',
value_format='JSON');
ksql> SELECT pageid FROM pageviews_original LIMIT 10;
ksql> CREATE STREAM pageviews_female AS SELECT users_original.userid AS userid, pageid, regionid, gender FROM
pageviews_original LEFT JOIN users_original ON pageviews_original.userid = users_original.userid WHERE gender = 'FEMALE';
ksql/bin/ksql-server-start
ksql/bin/ksql
56. 56KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
57. 57KSQL- Streaming SQL for Apache Kafka
KSQL - Components
KSQL has 3 main components:
1. The Engine which actually runs the Kafka Streams topologies
2. The REST server interface enables an Engine to receive instructions from the CLI
or any other client
3. The CLI, designed to be familiar to users of MySQL, Postgres etc.
(Note that you also need a Kafka Cluster… KSQL is deployed independently)
58. 58KSQL- Streaming SQL for Apache Kafka
How to run KSQL è #1 Client – Server (Interactive Mode)
JVM
KSQL Server
KSQL CLI / any REST Client
JVM
KSQL Server
JVM
KSQL Server
Kafka Cluster
59. 59KSQL- Streaming SQL for Apache Kafka
How to run KSQL è #1 Client – Server (Interactive Mode)
• Start any number of server nodes
bin/ksql-server-start
• Start one or more CLIs or REST Clients and point them to a server
bin/ksql https://myksqlserver:8090
• All servers share the processing load
Technically, instances of the same Kafka Streams Applications
scale up / down without restart
60. 60KSQL- Streaming SQL for Apache Kafka
How to run KSQL è #2 as Standalone Application (Headless Mode)
JVM
KSQL Server
JVM
KSQL Server
JVM
KSQL Server
Kafka Cluster
61. 61KSQL- Streaming SQL for Apache Kafka
How to run KSQL è #2 as Standalone Application (Headless Mode)
• Start any number of server nodes
Pass a file of KSQL statement to execute
bin/ksql-node query-file=foo/bar.sql
• Ideal for streaming ETL application deployment
Version-control your queries and transformations as code
• All running engines share the processing load
Technically, instances of the same Kafka Streams Applications
scale up / down without restart
62. 62KSQL- Streaming SQL for Apache Kafka
How to run KSQL è #3 Embedded in an Application (JVM Mode)
JVM App Instance
KSQL Engine
Application Code
JVM App Instance
KSQL Engine
Application Code
JVM App Instance
KSQL Engine
Application Code
Kafka Cluster
63. 63KSQL- Streaming SQL for Apache Kafka
How to run KSQL è #3 Embedded in an Application (JVM Mode)
• Embed directly in your Java application
• Generate and execute KSQL queries through the Java API
Version-control your queries and transformations as code
• All running application instances share the processing load
Technically, instances of the same Kafka Streams Applications
scale up / down without restart
64. 64KSQL- Streaming SQL for Apache Kafka
Dedicating resources
Join Engines to the same
‘service pool’ by means of the
ksql.service.id property
65. 65KSQL- Streaming SQL for Apache Kafka
Sizing of KSQL Servers
3 Categories of KSQL Queries
• Project / Filter, e.g. SELECT <columns> FROM <table/stream> WHERE <condition>
• Joins, e.g. Stream-Table Joins
• Aggregations, e.g. SUM, COUNT, TOPK, TOPKDISTINCT
Sizing Questions
• How much Memory, CPU, Disk, Network?
• When and how to scale up / down?
• How to tune performance?
• Etc…
Guide for Capacity Planning
https://docs.confluent.io/current/ksql/docs/
capacity-planning.html
https://docs.confluent.io/current/streams/
sizing.html#streams-sizing
66. 66KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
67. 67KSQL- Streaming SQL for Apache Kafka
Demo: Clickstream Analysis
Kafka
Producer
Elastic
search
Grafana
Kafka
Cluster
Kafka
Connect
KSQL
Stream of
Log Events
Kafka Ecosystem
Other Components
68. 68KSQL- Streaming SQL for Apache Kafka
Demo - Clickstream Analysis
• https://docs.confluent.io/current/ksql/docs/tutorials/clickstream-docker.html#ksql-clickstream-
docker
• Leverages Apache Kafka, Kafka Connect, KSQL, Elasticsearch and Grafana
• 5min screencast: https://www.youtube.com/watch?v=A45uRzJiv7I
• Setup in 5 minutes (with or without Docker)
SELECT STREAM
CEIL(timestamp TO HOUR) AS timeWindow, productId,
COUNT(*) AS hourlyOrders, SUM(units) AS units
FROM Orders GROUP BY CEIL(timestamp TO HOUR),
productId;
timeWindow | productId | hourlyOrders | units
------------+-----------+--------------+-------
08:00:00 | 10 | 2 | 5
08:00:00 | 20 | 1 | 8
09:00:00 | 10 | 4 | 22
09:00:00 | 40 | 1 | 45
... | ... | ... | ...
70. 70KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
71. 71KSQL- Streaming SQL for Apache Kafka
Advanced Use Case - Machine Learning UDF for Real Time Sensor Analytics
Kafka
Producer
Elastic
search
Grafana
Kafka
Cluster
Kafka
Connect
KSQL
Heath
Check
Sensor
Kafka Ecosystem
Other Components
Emergency
System
All Data
Apply
Analytic
Model
Filter
Predictions
72. 72KSQL- Streaming SQL for Apache Kafka
KSQL and Deep Learning (Autoencoder) for IoT Sensor Analytics
https://www.confluent.io/blog/write-user-defined-function-udf-ksql/
https://github.com/kaiwaehner/ksql-machine-learning-udf
“SELECT event_id, anomaly(SENSORINPUT) FROM health_sensor;“
KSQL UDF using an analytic model under the hood
à Write once, use in any KSQL statement
74. 74KSQL- Streaming SQL for Apache Kafka
Use Case:
Anomaly Detection
(Sensor Healthcheck)
Machine Learning Algorithm:
Autoencoder built with H2O
Streaming Platform:
Apache Kafka and KSQL
Live Demo – Prebuilt Model Embedded in KSQL Function
75. 75KSQL- Streaming SQL for Apache Kafka
Agenda – KSQL Deep Dive
1) Apache Kafka Ecosystem
2) Kafka Streams as Foundation for KSQL
3) Motivation for KSQL
4) KSQL Concepts
5) Live Demo #1 – Intro to KSQL
6) KSQL Architecture
7) Live Demo #2 - Clickstream Analysis
8) Building a User Defined Function (Example: Machine Learning)
9) Getting Started
76. 76KSQL- Streaming SQL for Apache Kafka
KSQL Quick Start
github.com/confluentinc/ksql
Local runtime
or
Docker container
77. 77KSQL- Streaming SQL for Apache Kafka
Resources and Next Steps
Get Involved
• Try the Quickstart on GitHub
• Check out the code
• Play with the examples
KSQL is GA… You can already use it for production deployments!
https://github.com/confluentinc/ksql
http://confluent.io/ksql
https://slackpass.io/confluentcommunity #ksql