This document discusses streaming SQL and compares different streaming SQL implementations. It provides an overview of Apache Calcite's proposal for streaming SQL, which includes windowing, stream-to-relation joins, and stream-to-stream joins. The document also provides examples of streaming SQL statements in Storm SQL and the concepts proposed by Apache Calcite.
2. WHO AM I?
• Software Engineer @ Hortonworks
• remote worker
• Open source prosumer
• PMC member of Apache Storm
• Committer of Jedis
• Contributor of Apache (Spark, Zeppelin, Ambari, Calcite), Redis, and so on.
• Contact: kabhwan@gmail.com
9. STREAMING SQL
• Unbounded real-time data
• can’t be fully expressed in standard SQL; requires new ideas
• No standard yet
• Apache Calcite proposes its own Streaming SQL
• https://calcite.apache.org/docs/stream.html
• aggregations, stream-to-relation joins, and stream-to-stream joins are performed within a window
• most of the proposal is not implemented yet
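To make "done within a window" concrete, here is a minimal sketch of a tumbling (fixed, non-overlapping) window count in plain Python. It is not Calcite or Storm code: the function name `tumbling_counts`, the `(timestamp_ms, key)` event shape, and the assumption that events arrive in timestamp order are all hypothetical choices made for illustration.

```python
from collections import Counter


def tumbling_counts(events, window_ms):
    """Yield (window_start_ms, {key: count}) each time a window closes.

    Hypothetical helper for illustration only. Assumes `events` is an
    iterable of (timestamp_ms, key) pairs arriving in timestamp order.
    """
    current_start = None
    counts = Counter()
    for ts_ms, key in events:
        start = ts_ms - ts_ms % window_ms  # floor to the window boundary
        if current_start is not None and start != current_start:
            # A later window has begun, so the previous one is complete.
            yield current_start, dict(counts)
            counts = Counter()
        current_start = start
        counts[key] += 1
    if current_start is not None:
        # Input ended (a real stream never does): flush the open window.
        yield current_start, dict(counts)


events = [(1_000, "a"), (1_500, "b"), (59_999, "a"), (60_000, "a"), (61_000, "b")]
windows = list(tumbling_counts(events, window_ms=60_000))
```

Because the input is unbounded, an aggregate such as COUNT only has a defined answer once it is scoped to a finite window; the sketch emits one result per window instead of waiting for the end of the stream.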
11. SIMPLE USE CASE
1. Get JSON from Kafka
2. Filter error logs (status >= 400)
3. Project columns with user-defined function and calculations
4. Store rows back to Kafka
12. STORM SQL STATEMENTS
CREATE FUNCTION GET_TIME AS 'org.apache.storm.sql.runtime.functions.scalar.datetime.GetTime2'
CREATE EXTERNAL TABLE APACHE_LOGS (id INT PRIMARY KEY, remote_ip VARCHAR, request_url VARCHAR, request_method VARCHAR, status VARCHAR, request_header_user_agent VARCHAR, time_received_utc_isoformat VARCHAR, time_us DOUBLE) LOCATION 'kafka://localhost:2181/brokers?topic=apachelogs' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
CREATE EXTERNAL TABLE APACHE_ERROR_LOGS (id INT PRIMARY KEY, remote_ip VARCHAR, request_url VARCHAR, request_method VARCHAR, status INT, request_header_user_agent VARCHAR, time_received_utc_isoformat VARCHAR, time_received_timestamp BIGINT, time_elapsed_ms INT) LOCATION 'kafka://localhost:2181/brokers?topic=apacheerrorlogs' TBLPROPERTIES '{"producer":{"bootstrap.servers":"localhost:9092","acks":"1","key.serializer":"org.apache.storm.kafka.IntSerializer","value.serializer":"org.apache.storm.kafka.ByteBufferSerializer"}}'
INSERT INTO APACHE_ERROR_LOGS SELECT ID, REMOTE_IP, REQUEST_URL, REQUEST_METHOD, CAST(STATUS AS INT) AS STATUS_INT, REQUEST_HEADER_USER_AGENT, TIME_RECEIVED_UTC_ISOFORMAT, GET_TIME(TIME_RECEIVED_UTC_ISOFORMAT, 'yyyy-MM-dd''T''HH:mm:ssZZ') AS TIME_RECEIVED_TIMESTAMP, (TIME_US / 1000) AS TIME_ELAPSED_MS FROM APACHE_LOGS WHERE (CAST(STATUS AS INT) / 100) >= 4
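As a rough illustration of what the INSERT statement computes per row, the filter and projection can be sketched in plain Python. This is not how Storm SQL executes the query. The function name `to_error_row` is hypothetical, `datetime.fromisoformat` stands in for `GET_TIME` on the assumption that it returns epoch milliseconds, and truncating `time_us / 1000` to an integer is an assumption about how the value lands in the INT column.

```python
import json
from datetime import datetime


def to_error_row(raw_json):
    """Return the projected error row, or None if the log is not an error.

    Hypothetical stand-in for the INSERT ... SELECT above; field names
    follow the APACHE_LOGS and APACHE_ERROR_LOGS column definitions.
    """
    row = json.loads(raw_json)
    status = int(row["status"])  # CAST(STATUS AS INT)
    if status // 100 < 4:  # WHERE (CAST(STATUS AS INT) / 100) >= 4
        return None
    # Assumption: GET_TIME yields epoch milliseconds for the parsed time.
    received = datetime.fromisoformat(row["time_received_utc_isoformat"])
    return {
        "id": row["id"],
        "remote_ip": row["remote_ip"],
        "request_url": row["request_url"],
        "request_method": row["request_method"],
        "status": status,
        "request_header_user_agent": row["request_header_user_agent"],
        "time_received_utc_isoformat": row["time_received_utc_isoformat"],
        "time_received_timestamp": int(received.timestamp() * 1000),
        # Assumption: microseconds to milliseconds, truncated to an integer.
        "time_elapsed_ms": int(row["time_us"] / 1000),
    }


sample = json.dumps({
    "id": 1,
    "remote_ip": "127.0.0.1",
    "request_url": "/missing",
    "request_method": "GET",
    "status": "404",
    "request_header_user_agent": "curl",
    "time_received_utc_isoformat": "1970-01-01T00:00:01+00:00",
    "time_us": 2500.0,
})
error_row = to_error_row(sample)
```

With integer division, `status // 100 >= 4` selects the same rows as `status >= 400`, which matches step 2 of the use case.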