1. TREASURE DATA
REAL WORLD DISTRIBUTED
DATABASE ON CLOUD TECHNOLOGY
BDI Research Group, Aug 7th
Kai Sasaki
Senior Software Engineer at Arm Treasure Data
2. ABOUT ME
• Kai Sasaki (佐々木 海)
• Senior Software Engineer at Arm Treasure Data
• Ex. Yahoo Japan Corporation
• Presto/Hadoop/TensorFlow contributor
• Hivemall committer
• Started GA Tech OMSCS
(http://www.omscs.gatech.edu/)
3. TREASURE DATA
Data Analytics Platform
Unify all your raw data in a scalable and
secure platform. Supporting 100+
integrations so you can easily connect
all your data sources in real time.
Founded in 2011.
Live with OSS
• Fluentd
• Embulk
• Digdag
• Hivemall
and more
https://www.treasuredata.com/opensource/
4. TREASURE DATA
Data Analytics Platform
Our customers exist across industries.
- Wish
- LG
- Subaru
- KIRIN
etc…
Supporting a vast range of use cases.
- Automotive
- Retail
- Digital Marketing
- IoT
5. TREASURE DATA
Data Analytics Platform
Presto: 11.7+ million
Hive: 1.3+ million
Total Record: 1064+ trillion
Streaming: 60%
Data Connector: 30%
Bulk Import: 10%
Imported Records: 60+ billion
Integrations: 114
6. TREASURE DATA
Treasure CDP
Customer Data Platform (CDP) is a
management system built for marketers.
Clients can unify their customer database with
various kinds of data sources and combine them
to find specific customer profiles.
12. CLOUD STORAGE IN TD
• Our Treasure Data storage service is built on cloud storage
like S3. (Plazma)
• Presto/Hive only provide a distributed query execution layer,
so our storage system must be scalable as well.
• On the other hand, we should take advantage of the maintainability
and availability provided by the cloud service provider (IaaS).
13. PLAZMA
• We built a thin storage layer, called Plazma, on top of existing
cloud storage and a relational database.
• Plazma is a central component that stores all customer data for
analysis in Treasure Data.
• Plazma consists of two components
• Metadata (PostgreSQL)
• Storage (S3 or RiakCS)
15. PLAZMA
• Plazma stores metadata of data files in PostgreSQL hosted
by Amazon RDS.
• This PostgreSQL instance manages the index, file paths on S3,
transactions and deleted files.
16. WHY POSTGRESQL?
• A GiST index easily enables complicated index searches
on data_set_id and time ranges.
CREATE INDEX plazma_index
ON partitions USING gist (
  data_set_id,
  index_range(first_index_key, last_index_key, '[]')
);
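• A minimal sketch of how the query side might probe this index, assuming index_range is a custom range constructor usable in SQL and that PostgreSQL's standard range overlap operator (&&) is applied to it; the partitions table follows the CREATE INDEX above, while the selected columns, data set id and connection settings are illustrative only.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PlazmaIndexLookupSketch {
    public static void main(String[] args) throws Exception {
        long dataSetId = 42L;        // hypothetical data set id
        long start = 1504483200L;    // 2017-09-04 00:00:00 UTC
        long end   = 1504742400L;    // 2017-09-07 00:00:00 UTC

        // Range-overlap query the GiST index above can serve.
        // index_range(...) and the column list are assumptions based on
        // the CREATE INDEX statement and the partition fields on slide 18.
        String sql =
            "SELECT path, record_count, file_size FROM partitions " +
            "WHERE data_set_id = ? " +
            "AND index_range(first_index_key, last_index_key, '[]') && index_range(?, ?, '[]')";

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost/plazma", "plazma", "secret");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setLong(1, dataSetId);
            stmt.setLong(2, start);
            stmt.setLong(3, end);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("path"));  // S3 path of a matching partition
                }
            }
        }
    }
}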
17. PARTITIONING
• To make the best of the high throughput of Presto's parallel
processing, the data source needs to be distributed as well.
• Distributing the data source evenly contributes to high
throughput and stable performance.
• Two basic partitioning methods
• Key range partitioning -> Time-Index partitioning
• Hash partitioning -> User Defined Partitioning
18. PARTITIONING
• A partition record in Plazma represents a file stored in S3, along with some additional
information:
• Data Set ID
• Range Index Key
• Record Count
• File Size
• Checksum
• File Path
19. PARTITIONING
• All partitions in Plazma are indexed by the time when the data
was generated. The time index is recorded as a UNIX epoch.
• A partition keeps first_index_key and last_index_key
to specify the time range the partition covers.
• The Plazma index is a multicolumn index built with the
GiST index of PostgreSQL.
(https://www.postgresql.org/docs/current/static/gist.html)
• (data_set_id, index_range(first_index_key, last_index_key))
21. LIFECYCLE OF PARTITION
• Plazma has two storage management layers.
At the beginning, records are put on the realtime storage layer
in raw format (msgpack.gz).
(Diagram: realtime storage holds raw records with individual timestamps, e.g. time: 100, 300, 500, 3800, 4000; archive storage is still empty.)
22. LIFECYCLE OF PARTITION
• Every hour, a dedicated MapReduce job called the Log Merge
Job runs to merge records in the same time range into one
partition in archive storage (sketched below).
(Diagram: the MapReduce job merges realtime records (time: 100, 300, 500, 3800, 4000) into hourly archive partitions: time: 0~3599 and time: 3600~7200.)
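• Not the actual MapReduce job, just an illustrative sketch of the hourly grouping it performs, assuming each raw record carries a UNIX epoch time; the record and partition types here are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LogMergeSketch {
    record Record(long time, byte[] payload) {}

    // Groups raw realtime records into hourly buckets; each bucket would
    // become one archive partition (time: start ~ start+3599).
    static Map<Long, List<Record>> mergeByHour(List<Record> realtime) {
        Map<Long, List<Record>> hourly = new TreeMap<>();
        for (Record r : realtime) {
            long bucketStart = (r.time() / 3600) * 3600;  // e.g. 0, 3600, 7200, ...
            hourly.computeIfAbsent(bucketStart, k -> new ArrayList<>()).add(r);
        }
        return hourly;
    }

    public static void main(String[] args) {
        List<Record> rt = List.of(
            new Record(100, new byte[0]), new Record(300, new byte[0]),
            new Record(500, new byte[0]), new Record(3800, new byte[0]),
            new Record(4000, new byte[0]));
        mergeByHour(rt).forEach((start, recs) ->
            System.out.printf("archive partition %d~%d: %d records%n",
                              start, start + 3599, recs.size()));
    }
}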
23. LIFECYCLE OF PARTITION
• A query execution engine like Presto needs to fetch data
from both realtime storage and archive storage, but it is
generally more efficient to read data from archive storage.
• Inspired by the C-Store paper by M. Stonebraker.
25. TRANSACTION AND PARTITIONING
• Consistency is the most important factor for enterprise
analytics workloads. Therefore an MPP engine like Presto and the
backend storage MUST always guarantee consistency.
→ UPDATEs are applied atomically by Plazma
• At the same time, we want to achieve high throughput by
distributing the workload to multiple worker nodes.
→ Data files are partitioned in Plazma
26. PLAZMA TRANSACTION
• Plazma supports transactions for queries that have side effects
(e.g. INSERT INTO / CREATE TABLE).
• A Plazma transaction is an atomic operation on the visibility
of the data on S3, not on the actual files.
• A transaction is composed of two phases (sketched below):
• Uploading uncommitted partitions
• Committing the transaction by moving uncommitted partitions
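• A sketch of what the two phases could look like on the PostgreSQL side; the uncommitted/committed table names come from the following slides, but the column names, txid handling and JDBC wiring are assumptions for illustration only.

import java.sql.Connection;
import java.sql.PreparedStatement;

public class PlazmaCommitSketch {
    // Phase 1: called once per uploaded S3 file.
    static void recordUncommitted(Connection conn, long txid, String s3Path,
                                  long firstIndexKey, long lastIndexKey) throws Exception {
        try (PreparedStatement st = conn.prepareStatement(
                "INSERT INTO uncommitted (txid, path, first_index_key, last_index_key) " +
                "VALUES (?, ?, ?, ?)")) {
            st.setLong(1, txid);
            st.setString(2, s3Path);
            st.setLong(3, firstIndexKey);
            st.setLong(4, lastIndexKey);
            st.executeUpdate();
        }
    }

    // Phase 2: the coordinator commits by moving all rows of the
    // transaction from uncommitted to committed in one DB transaction.
    static void commit(Connection conn, long txid) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement move = conn.prepareStatement(
                 "INSERT INTO committed SELECT * FROM uncommitted WHERE txid = ?");
             PreparedStatement clean = conn.prepareStatement(
                 "DELETE FROM uncommitted WHERE txid = ?")) {
            move.setLong(1, txid);
            move.executeUpdate();
            clean.setLong(1, txid);
            clean.executeUpdate();
            conn.commit();  // the new partitions become visible to queries only here
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}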
28. PLAZMA TRANSACTION
• After each upload is done, a record is inserted into the
uncommitted table in PostgreSQL.
(Diagram: uncommitted and committed tables in PostgreSQL.)
29. PLAZMA TRANSACTION
• As more uploads finish, a record is inserted into the
uncommitted table for each uploaded partition.
(Diagram: p1 and p2 now appear in the uncommitted table.)
30. PLAZMA TRANSACTION
• After all upload tasks are completed, the coordinator tries
to commit the transaction by moving
all records from uncommitted to committed.
(Diagram: p1, p2 and p3 sit in the uncommitted table before the commit.)
31. PLAZMA TRANSACTION
• After the commit, the same records appear in the committed table.
(Diagram: p1, p2 and p3 have moved from uncommitted to committed.)
32. PLAZMA DELETE
• A delete query is handled in a similar way. First, newly created
partitions are uploaded that exclude the deleted records.
(Diagram: p1, p2, p3 in the committed table; p1', p2', p3' uploaded as uncommitted partitions.)
33. PLAZMA DELETE
• When the transaction is committed, the records in the committed
table are replaced by the uncommitted records,
which point to different file paths.
(Diagram: the committed table now holds p1', p2', p3'.)
35. WHAT IS PRESTO?
• Presto is an open source scalable distributed SQL
engine for huge OLAP workloads
• Mainly developed by Facebook, Teradata
• Used by FB, Uber, Netflix, etc.
• In-memory processing
• Pluggable architecture
(Hive, Cassandra, Kafka, etc.)
36. PRESTO CONNECTOR
• A Presto connector is a plugin that provides Presto with access
to various kinds of existing data storage.
• The connector is responsible for managing metadata,
transactions and data accessors.
http://prestodb.io/
37. PRESTO CONNECTOR
• Hive Connector
Uses the Hive metastore for metadata and S3/HDFS as storage.
• Kafka Connector
Queries Kafka topics as tables. Each message is interpreted as a row
in a table.
• Redis Connector
Key/value pairs are interpreted as rows in Presto.
• Cassandra Connector
Supports Cassandra 2.1.5 or later.
38. PRESTO CONNECTOR
• Black Hole Connector
Works like /dev/null or /dev/zero on Unix-like systems. Used for
catastrophic tests or integration tests.
• Memory Connector
Metadata and data are stored in RAM on worker nodes.
Still an experimental connector, mainly used for tests.
• System Connector
Provides information about the cluster state and running
query metrics. It is useful for runtime monitoring.
40. PRESTO CONNECTOR
• Plugin defines an interface
to bootstrap your connector
creation.
• It also provides the list of
UDFs available in your
Presto cluster.
• A ConnectorFactory can
provide multiple connector implementations.
(Diagram: Plugin -> getConnectorFactories() -> ConnectorFactory -> create(connectorId, …) -> Connector)
41. PRESTO CONNECTOR
• Connector provides classes to manage metadata, the storage
accessor and table access control.
• ConnectorSplitManager creates the data source metadata
(splits) to be distributed to multiple worker nodes.
• ConnectorPage[Source|Sink]Provider supplies the page
source/sink used by the operators that process each split
(see the sketch below).
(Diagram: a Connector exposes ConnectorMetadata, ConnectorSplitManager, ConnectorPageSourceProvider, ConnectorPageSinkProvider and ConnectorAccessControl.)
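• A simplified, hypothetical sketch of the shape described on the last two slides; the real Presto SPI interfaces have more methods and different signatures, so nothing here should be read as the actual API.

import java.util.List;
import java.util.Map;

// Hypothetical mini-model of the plugin/connector shape; not the real SPI.
interface Plugin {
    Iterable<ConnectorFactory> getConnectorFactories();    // bootstrap entry point
}

interface ConnectorFactory {
    String getName();                                      // connector name used in catalogs
    Connector create(String connectorId, Map<String, String> config);
}

// A Connector groups the services the engine needs for one catalog.
interface Connector {
    ConnectorMetadata getMetadata();                       // tables, columns, begin/finish insert
    ConnectorSplitManager getSplitManager();               // turns a table scan into splits
    ConnectorPageSourceProvider getPageSourceProvider();   // reads data for a split
    ConnectorPageSinkProvider getPageSinkProvider();       // writes data for INSERT / CTAS
    ConnectorAccessControl getAccessControl();             // table-level access checks
}

interface ConnectorMetadata {}
interface ConnectorSplitManager {}
interface ConnectorPageSourceProvider {}
interface ConnectorPageSinkProvider {}
interface ConnectorAccessControl {}

// A Plazma-style plugin would register one factory that builds the
// connector from catalog configuration.
class ExamplePlazmaPlugin implements Plugin {
    @Override
    public Iterable<ConnectorFactory> getConnectorFactories() {
        return List.of(new ConnectorFactory() {
            @Override public String getName() { return "plazma-example"; }
            @Override public Connector create(String connectorId, Map<String, String> config) {
                throw new UnsupportedOperationException("sketch only");
            }
        });
    }
}

public class ConnectorShapeSketch {
    public static void main(String[] args) {
        new ExamplePlazmaPlugin().getConnectorFactories()
            .forEach(f -> System.out.println("registered connector factory: " + f.getName()));
    }
}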
42. PRESTO CONNECTOR
• beginInsert is called on
ConnectorMetadata.
• ConnectorSplitManager creates
splits that include metadata about the
actual data source (e.g. file paths).
• ConnectorPageSourceProvider
downloads the files from the data
source in parallel.
• finishInsert on ConnectorMetadata
commits the transaction
(walked through in the sketch below).
(Diagram: ConnectorMetadata.beginInsert -> ConnectorSplitManager.getSplits -> parallel ConnectorPageSourceProvider reads feeding the operators -> ConnectorMetadata.finishInsert)
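• A hypothetical walk-through of that INSERT flow; the type and method names only loosely mirror the Presto SPI and the Plazma connector, and the thread pool stands in for Presto's own parallel scheduling.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class InsertFlowSketch {
    record Split(String filePath) {}                       // one unit of parallel work
    record InsertHandle(long txid) {}                      // returned by beginInsert

    interface Metadata {
        InsertHandle beginInsert(String table);            // open a Plazma transaction
        void finishInsert(InsertHandle handle);            // commit: uncommitted -> committed
    }
    interface SplitManager {
        List<Split> getSplits(String table);               // partition metadata from PostgreSQL
    }
    interface PageSourceProvider {
        void read(Split split);                            // download the S3 file for a split
    }

    static void runInsert(Metadata metadata, SplitManager splits,
                          PageSourceProvider pages, String table) throws Exception {
        InsertHandle handle = metadata.beginInsert(table);       // 1. begin
        List<Split> work = splits.getSplits(table);              // 2. plan splits
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (Split s : work) {
            pool.submit(() -> pages.read(s));                    // 3. read in parallel
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        metadata.finishInsert(handle);                           // 4. commit atomically
    }

    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata() {
            @Override public InsertHandle beginInsert(String table) { return new InsertHandle(1L); }
            @Override public void finishInsert(InsertHandle handle) {
                System.out.println("committed tx " + handle.txid());
            }
        };
        SplitManager splitManager =
            table -> List.of(new Split("s3://bucket/p1"), new Split("s3://bucket/p2"));
        PageSourceProvider pageSources = split -> System.out.println("reading " + split.filePath());
        runInsert(metadata, splitManager, pageSources, "www_access");
    }
}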
43. PRESTO ON CLOUD STORAGE
• A distributed execution engine like Presto cannot make use of
data locality any more on cloud storage.
• Reading/writing data can be a dominant factor in query
performance, stability and cost.
→ The connector should be implemented to take care of
network I/O cost.
44. TIME INDEX PARTITIONING
• By using the multicolumn index on time ranges in Plazma, Presto
can filter out unnecessary partitions through predicate push
down.
• The TD_TIME_RANGE UDF gives Presto a hint about which partitions
should be fetched from Plazma.
• e.g. TD_TIME_RANGE(time, '2017-08-31 12:30:00', NULL, 'JST')
• ConnectorSplitManager selects the necessary partitions
and calculates the split distribution plan.
45. TIME INDEX PARTITIONING
• Select metadata records from realtime storage and archive
storage according to the given time range.
SELECT * FROM rt/ar WHERE start < time AND time < end;
(Diagram: ConnectorSplitManager selects matching partitions from archive storage (time: 0~3599, 3600~7200) and realtime storage (time: 8000, 8200, 8800, 9000).)
46. TIME INDEX PARTITIONING
• A split is responsible for downloading multiple files from S3 in
order to reduce overhead (see the sketch below).
• ConnectorSplitManager calculates the file assignment to
each split based on the available statistics (e.g. file size,
number of columns, record count).
(Diagram: ConnectorSplitManager assigns files f1, f2, f3 to Split1 and Split2.)
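• An illustrative sketch of packing files into splits by size; the 128 MB target per split is a made-up number, and the real ConnectorSplitManager also weighs column count and record count.

import java.util.ArrayList;
import java.util.List;

public class SplitAssignmentSketch {
    record PartitionFile(String path, long sizeBytes) {}
    record Split(List<PartitionFile> files) {}

    // Greedy size-based packing: each split downloads several small S3
    // files until the target size is reached.
    static List<Split> assign(List<PartitionFile> files, long targetBytesPerSplit) {
        List<Split> splits = new ArrayList<>();
        List<PartitionFile> current = new ArrayList<>();
        long currentSize = 0;
        for (PartitionFile f : files) {
            if (!current.isEmpty() && currentSize + f.sizeBytes() > targetBytesPerSplit) {
                splits.add(new Split(current));
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(f);
            currentSize += f.sizeBytes();
        }
        if (!current.isEmpty()) {
            splits.add(new Split(current));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<PartitionFile> files = List.of(
            new PartitionFile("s3://bucket/f1", 90L << 20),
            new PartitionFile("s3://bucket/f2", 60L << 20),
            new PartitionFile("s3://bucket/f3", 30L << 20));
        System.out.println(assign(files, 128L << 20).size() + " splits");
    }
}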
47. TIME INDEX PARTITIONING
(Chart: elapsed time of "SELECT 10 cols in a range" with TD_TIME_RANGE, from 0 to 180 sec, over scan ranges from 60 days down to 10 days.)
48. TIME INDEX PARTITIONING
(Chart: number of splits for "SELECT 10 cols in a range", from 0 to 60 splits, over scan ranges from 6+ years down to 6 months.)
49. CHALLENGE
• Time-index partitioning has worked very well because
• Most logs from web pages and IoT devices inherently carry the
time at which they were created.
• OLAP workloads from analysts are often limited to a specific
time range (e.g. the last week, during a campaign).
• But it lacks the flexibility to build an index on a column
other than time. This is required especially in digital
marketing and DMP use cases.
51. USER DEFINED PARTITIONING
• We are now evaluating user defined partitioning with Presto.
• User defined partitioning allows customers to set an index on
an arbitrary data attribute flexibly.
• User defined partitioning can co-exist with time-index
partitioning as a secondary index.
53. BUCKETING
• Similar mechanism to Hive bucketing.
• A bucket is a logical group of partition files, grouped by the
specified bucketing column.
(Diagram: a table is divided into buckets; each bucket contains one partition per time range, time range 1 through time range 4.)
54. BUCKETING
• PlazmaDB defines the hash function type on the partitioning key
and the total bucket count, which is fixed in advance.
SELECT COUNT(1) FROM audience
WHERE
TD_TIME_RANGE(time, '2017-09-04', '2017-09-07')
AND
audience.room = 'E'
(Diagram: ConnectorSplitManager routes the query to the table's buckets (bucket1, bucket2, bucket3), each containing partitions.)
55. BUCKETING
• ConnectorSplitManager selects the proper partitions from
PostgreSQL given the time range and the bucket key (see the sketch below).
(Diagram: for the same query, hash('E') -> bucket2 and 1504483200 < time && time < 1504742400, so only the matching partitions in bucket2 are selected.)
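• A sketch of the bucket pruning idea; the concrete hash function and bucket count used by PlazmaDB are not shown in the slides, so CRC32 and 512 buckets are placeholders, while the epoch values match the ones on the diagram above.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class BucketPruningSketch {
    static final int TOTAL_BUCKETS = 512;   // fixed in advance per table (assumption)

    // Maps a partitioning key value to its bucket.
    static int bucketOf(String partitioningKey) {
        CRC32 crc = new CRC32();
        crc.update(partitioningKey.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % TOTAL_BUCKETS);
    }

    public static void main(String[] args) {
        // ... WHERE TD_TIME_RANGE(time, '2017-09-04', '2017-09-07') AND audience.room = 'E'
        int bucket = bucketOf("E");
        long start = 1504483200L;   // 2017-09-04 00:00:00 UTC
        long end   = 1504742400L;   // 2017-09-07 00:00:00 UTC
        System.out.printf(
            "read only partitions in bucket %d with %d < time AND time < %d%n",
            bucket, start, end);
    }
}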
56. USER DEFINED PARTITIONING
(Diagram: partitions laid out on a grid of 1-hour time ranges by values of partitioning column c1 (v1, v2, v3); a query like "… WHERE c1 = 'v1' AND time = …" only reads the partitions for v1.)
• Users can specify the partitioning strategy based on their usage,
using a partitioning key column and a maximum time range.
57. USER DEFINED PARTITIONING
• We can skip reading many unnecessary partitions. This
architecture fits digital marketing use cases very well:
• Creating user segments
• Aggregating by channel
• It still makes use of time-index partitioning.
61. REPARTITIONING
• Many small partition files can put memory pressure on the
partition metadata held by the coordinator.
(Diagram: buckets holding many small partitions for each time range.)
62. REPARTITIONING
• Merging scattered partitions can significantly improve query
performance.
63. CREATE TABLE stella.partition.[remerge|split]
WITH ([max_file_size='256MB',
max_record_count=1000000]*)
AS SELECT * FROM stella.partition.sources WHERE
account_id = 1 AND
table_schema = sample_datasets AND
table_name = www_access AND
TD_TIME_RANGE(time, start, end);
64. DATA DRIVEN REPARTITIONING
• There is a lot of metric data about customer workloads, which
we can use to gain insight for repartitioning.
• We are now designing a new system to optimize the data
layout including:
- Index
- Cache
- Partitions
• Continuous optimization of the storage layout is our next goal
to support enterprise use cases, including IoT.
65. FUTURE WORKS
• Self-Driving Databases (https://blog.treasuredata.com/blog/2018/04/27/self-driving-databases-current-and-future/)
• Repartitioning leveraged by data analysis
• Reindexing based on customer workload
• Partition cache
• Precomputing subexpressions
• "Selecting Subexpressions to Materialize at Datacenter Scale"
• https://www.microsoft.com/en-us/research/publication/thou-shall-not-recompute-selecting-subexpressions-materialize-datacenter-scale-2/
66. RECAP
• Presto provides a plugin mechanism called connector.
• Though Presto itself is a highly scalable distributed engine, the connector is
also responsible for efficient query execution.
• Plazma has desirable features that make it a good fit for such a
connector:
• Transaction support
• Time-Index Partitioning
• User Defined Partitioning