This document summarizes Robert Hodges' presentation on integrating ClickHouse with remote data sources. It discusses how ClickHouse can be used as a polyglot database to access data from MySQL, Kafka, S3, Snowflake and other sources using database engines, table engines, table functions and dictionaries. Specific examples are provided on accessing MySQL data, consuming messages from Kafka topics, reading and writing to S3 files, and experimental connections to Snowflake via ODBC. The presentation emphasizes that ClickHouse's polyglot capabilities are improving continuously and encourages testing new integrations.
2. Introduction to Presenter
www.altinity.com
Leading software and services provider for ClickHouse
Major committer and community sponsor in US and Western Europe
Robert Hodges - Altinity CEO
30+ years on DBMS plus virtualization and security.
ClickHouse is DBMS #20
3. Introduction to ClickHouse
SQL optimized for analytics
Runs on bare metal to cloud
Stores data in columns
Parallel and vectorized execution
Scales to many petabytes
Is Open source (Apache 2.0)
Is WAY fast on analytic queries
4. What do we mean by a polyglot database?
(Diagram: a spectrum from 100% non-polyglot to 100% polyglot, with apps on one side and object storage, RDBMS, NoSQL, and event queue stores on the other; moving toward polyglot means more direct access, moving away means more "translators".)
9. Access a MySQL database from ClickHouse
-- Database engine: maps the remote MySQL database 'repl' into ClickHouse
CREATE DATABASE mysql_repl
ENGINE = MySQL(
    '127.0.0.1:3306',   -- host:port
    'repl',             -- database
    'root',             -- user
    'secret')           -- password

USE mysql_repl
SHOW TABLES
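Besides the database engine, a single remote table can also be queried ad hoc through the mysql() table function; a minimal sketch reusing the same connection details (the table name traffic is borrowed from the next slide):

SELECT count(*)
FROM mysql('127.0.0.1:3306', 'repl', 'traffic', 'root', 'secret')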
10. Selecting MySQL data from ClickHouse
SELECT
    t.datetime, t.date, t.request_id,
    t.name customer, s.name sku
FROM (
    SELECT t.* FROM traffic t
    JOIN customer c ON t.customer_id = c.id) AS t
JOIN sku s ON t.sku_id = s.id
WHERE customer_id = 5          -- Predicate pushed down to MySQL
ORDER BY t.request_id LIMIT 10
12. Standard flow from Kafka to ClickHouse
Topic (contains messages) → Kafka table engine (encapsulates topic within ClickHouse) → Materialized view (fetches rows) → Target table (stores rows)
13. Create target table
CREATE TABLE readings (
    readings_id Int32 Codec(DoubleDelta, LZ4),
    time DateTime Codec(DoubleDelta, LZ4),
    date ALIAS toDate(time),
    temperature Decimal(5,2) Codec(T64, LZ4)
) Engine = MergeTree
PARTITION BY toYYYYMM(time)
ORDER BY (readings_id, time)
14. Create Kafka Engine table
CREATE TABLE readings_queue (
    readings_id Int32,
    time DateTime,
    temperature Decimal(5,2)
) ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka-headless.kafka:9092',   -- Connection info
    kafka_topic_list = 'readings',
    kafka_group_name = 'readings_consumer_group1',
    kafka_num_consumers = 1,
    kafka_format = 'CSV'                               -- Format
15. Create materialized view to transfer data
CREATE MATERIALIZED VIEW readings_queue_mv
TO readings
AS
SELECT readings_id, time, temperature
FROM readings_queue;
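Once the materialized view exists, ClickHouse consumes the topic in the background and rows land in the target table; a quick sanity check against the tables defined above:

SELECT count(), max(time) FROM readings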
17. Select from S3 CSV to ClickHouse table
SET max_insert_threads=32      -- Parallelize!

INSERT INTO sdata
SELECT * FROM s3(
    'https://s3.us-east-1.amazonaws.com/d1-altinity/data/sdata*.csv',  -- Use host/bucket to enable wildcards
    'aws_access_key_id',
    'aws_secret_access_key',
    'CSVWithNames',                                                    -- Format
    'DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64')  -- Schema
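The INSERT above assumes the target table sdata already exists; a minimal definition matching the structure string (the engine, partition key, and sorting key here are illustrative choices, not from the talk):

CREATE TABLE sdata (
    DevId Int32,
    Type String,
    MDate Date,
    MDatetime DateTime,
    Value Float64
) ENGINE = MergeTree
PARTITION BY toYYYYMM(MDate)
ORDER BY (DevId, MDatetime)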
18. Write from ClickHouse to S3 Parquet file
INSERT INTO TABLE FUNCTION                                           -- Where to write
    s3(
        'https://d1-altinity.s3.amazonaws.com/data/sdata.parquet',   -- Single host does not allow wildcards
        'aws_access_key_id',
        'aws_secret_access_key',
        'Parquet',
        'DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64')
SELECT DevId, Type, MDate, MDatetime, Value FROM sdata               -- What to write
19. Select directly from S3 Parquet
SELECT * FROM s3(
    'https://d1-altinity.s3.amazonaws.com/data/sdata.parquet',
    'aws_access_key_id', 'aws_secret_access_key', 'Parquet',
    'DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64')
┌─DevId─┬─Type─┬──────MDate─┬───────────MDatetime─┬─Value─┐
│     0 │ test │ 2020-08-30 │ 2020-08-30 01:00:00 │     0 │
│     0 │ test │ 2020-08-30 │ 2020-08-30 01:00:15 │   150 │
│     0 │ test │ 2020-08-30 │ 2020-08-30 01:00:30 │   300 │
│     0 │ test │ 2020-08-30 │ 2020-08-30 01:00:45 │   450 │
. . .
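As with the CSV example, the path-style host/bucket URL form allows wildcards, so many Parquet files can be read in one call; a sketch assuming several sdata*.parquet files exist under the same prefix:

SELECT count()
FROM s3(
    'https://s3.us-east-1.amazonaws.com/d1-altinity/data/sdata*.parquet',
    'aws_access_key_id', 'aws_secret_access_key', 'Parquet',
    'DevId Int32, Type String, MDate Date, MDatetime DateTime, Value Float64')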
21. Moving data from Snowflake to ClickHouse
CREATE TABLE nation (
    N_NATIONKEY UInt64,
    N_NAME String,
    N_REGIONKEY UInt64,
    N_COMMENT String )
ENGINE = Log

INSERT INTO nation
SELECT *
FROM odbc('DSN=snowflake', 'TPCH_SF001', 'NATION')   -- Schema, Table: names are case-sensitive!

Use Snowflake history in console
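The 'DSN=snowflake' string refers to an ODBC data source that must already be configured on the ClickHouse host (typically via unixODBC's odbc.ini plus the Snowflake ODBC driver); that setup is not shown in the talk. As a small variant, the create and load steps can be combined into one statement; a sketch using the same DSN and names (nation_copy is an illustrative table name):

CREATE TABLE nation_copy
ENGINE = Log AS
SELECT * FROM odbc('DSN=snowflake', 'TPCH_SF001', 'NATION')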
22. Snowflake data, in a ClickHouse near you
SELECT
N_NATIONKEY, N_NAME,
N_REGIONKEY, substring(N_COMMENT, 1, 25)
FROM nation LIMIT 5
┌─N_NATIONKEY─┬─N_NAME────┬─N_REGIONKEY─┬─substring(N_COMMENT, 1, 25)─┐
│           0 │ ALGERIA   │           0 │ haggle. carefully final     │
│           1 │ ARGENTINA │           1 │ al foxes promise slyly ac   │
│           2 │ BRAZIL    │           1 │ y alongside of the pendin   │
│           3 │ CANADA    │           1 │ eas hang ironic, silent p   │
│           4 │ EGYPT     │           4 │ y above the carefully unu   │
└─────────────┴───────────┴─────────────┴─────────────────────────────┘
24. Key Takeaways
● ClickHouse can access a wide range of data stores
● Stability of connectivity varies
○ MySQL is very stable
○ S3 is new, use 20.6
○ Data stores like Snowflake are experimental
● Try them out, post issues, and make them better
● ClickHouse polyglot capabilities improve constantly
○ MaterializeMySQL engine reads MySQL binlog (20.8)
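For the MaterializeMySQL engine mentioned in the last bullet, setup is a single statement once the experimental feature is enabled via the corresponding allow_experimental setting; a minimal sketch (host, database, and credentials are illustrative, and the source MySQL server needs GTID-based replication enabled):

CREATE DATABASE mysql_mirror
ENGINE = MaterializeMySQL('127.0.0.1:3306', 'repl', 'root', 'secret')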