Web analytics at scale with Druid at naver.com

Jason Heo (analytic.js.heo@navercorp.com)
Doo Yong Kim (dooyong.kim@navercorp.com)

Agenda
• Part 1
  • About naver.com
  • What is & Why Druid
  • The Architecture of our service
• Part 2
  • Druid Segment File Structure
  • Spark Druid Connector
  • TopN Query
  • Plywood & Split-Apply-Combine
  • How to fix TopN’s unstable results
• Appendix

About naver.com
https://en.wikipedia.org/wiki/Naver
• naver.com
• The biggest website in South Korea
• The Google of South Korea
• 74.7% of all web searches in South Korea

About Speakers
Jason Heo
• Developed Analytics Systems at Naver
• Working with Databases since 2000
• Author of 3 MySQL books
• Currently working with Elasticsearch, Spark, Kudu, and Druid
Doo Yong Kim
• Working on Spark and Druid-based OLAP platform
• Implemented search infrastructure at coupang.com
• Has been interested in MPP and advanced file formats for big data

Platforms we've tested so far
(diagram placing each platform along two axes: Storage Format and Query Engine)
• Parquet, ORC, Carbon Data
• Elasticsearch, ClickHouse, Kudu, Druid
• SparkSQL, Hive, Impala, Drill, Presto, Kylin, Phoenix

What is & Why Druid
• What is Druid?
• Our Requirements
• Why Druid?
• Experimental Results

What is Druid?
From Hortonworks
• Column-oriented distributed datastore
• Real-time streaming ingestion
• Scalable to petabytes of data
• Approximate algorithms (HyperLogLog, theta sketch)
https://www.slideshare.net/HadoopSummit/scalable-realtime-analytics-using-druid

What is Druid?
From my point of view
• Druid is a cumbersome version of Elasticsearch (w/o search feature)
• Similar points
  • Secondary Index
  • DSLs for query
  • Flow of Query Processing
  • Terms Aggregation ↔ TopN Query, Coordinator ↔ Broker, Data Node ↔ Historical
• Different points
  • more complicated to operate
  • better with much more data
  • better for Ultra High Cardinality
  • less GC overhead
  • better for Spark Connectivity (for Full Scan)

What is Druid? - Architecture
(architecture diagram)
• Index Service: Kafka, Real-time Node, Overlord, Middle Manager
• Segment management: Coordinator
• Query Service: Clients send Druid DSL to the Broker; Real-time Node and Historical serve segments for query
• MySQL: metadata
• Zookeeper: cluster mgmt.
• Deep Storage (HDFS, S3): stores Druid segments for durability; Historical downloads segments from it

What is Druid? - Queries
• SQLs can be converted to Druid DSL
• No JOIN
Queries go to the Broker, which forwards them to Real-time Node and Historical.

groupBy query
{
  "queryType": "groupBy",
  "dataSource": "sample_data",
  "dimensions": ["country", "device"],
  "filter": {...},
  "aggregations": [...],
  "limitSpec": {...}
}

topN query
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "filter": {...},
  "aggregations": [...],
  "threshold": 5
}

SQL
SELECT ... FROM dataSource

Why Druid? - Requirements
1. Random Access (OLTP)
SELECT COUNT(*)
FROM logs
WHERE url = ?;
2. Most Viewed
SELECT url, COUNT(*)
FROM logs
GROUP BY url
ORDER BY COUNT(*) DESC
LIMIT 10;
3. Full Aggregation
SELECT visitor, COUNT(*)
FROM logs
GROUP BY visitor;
4. JOIN
SELECT ...
FROM logs INNER JOIN users
GROUP BY ...
HAVING ...

Why Druid?
Perfect solution for OLTP and OLAP
For OLTP
• Supports Bitmap Index
• Fast Random Access
For OLAP
• Supports TopN Query
  • 100x faster than GroupBy query
• Supports Complex Queries
  • JOIN, HAVING, etc
  • with our Spark Druid Connector
1. Random Access ★★★★☆
2. Most Viewed ★★★★★
3. Full Aggregation ★★★★☆
4. JOIN ★★★★☆

Comparison – Elasticsearch
Pros
• Fast Random Access
• Terms Aggregation
• TopN Query
• Easy to manage
Cons
• Slow full scan with es-hadoop
• Low Performance for multi-field terms aggregation (esp. High Cardinality)
• GC Overhead
1. Random Access ★★★★★
2. Most Viewed ★★★☆☆
3. Full Aggregation ☆☆☆☆☆
4. JOIN ☆☆☆☆☆

Comparison – Kudu + Impala
Pros
• Fast Random Access via Primary Key
• Fast OLAP with Impala
Cons
• No Secondary Index
• No TopN Query
1. Random Access ★★★★★ (PK), ★☆☆☆☆ (non-PK)
2. Most Viewed ☆☆☆☆☆
3. Full Aggregation ★★★★★
4. JOIN ★★★★★

Experimental Results – Response Time
Random Access (sec)
System         Response time
Elasticsearch  0.003
Kudu+Impala    0.14
Druid          0.03
Most Viewed (sec)
System         1 Field  2 Fields
Elasticsearch  0.25     2.7
Kudu+Impala    0.35     2.9
Druid          0.08     0.78

Experimental Results – Notes
Random Access
• ES: Lucene Index
• Kudu+Impala: Primary Key
• Druid: Bitmap Index
Most Viewed
• ES: Terms Aggregation
• Kudu+Impala: Group By
• Druid: TopN
• Split-Apply-Combine for Multi Fields
Data Sets
• 210 mil. rows
• same parallelism
• same number of shards/partitions/segments

The Architecture of our service
(architecture diagram)
• Real-time Ingestion: Logs, Kafka, transform logs, Kafka, Kafka Indexing Service
• Batch Ingestion: Parquet, remove duplicated logs, Parquet, run daily batch job
• Druid: Coordinator, Overlord, Middle Manager, Peon, Historical, Broker, Segments File
• API Server, Plywood, Druid DSL, Broker
• Zeppelin, Spark Thrift Server, SparkSQL, Spark Executor (reads the Segments File)

Introduction – Who am I?
1. Doo Yong Kim
2. Naver
3. Software engineer
4. Big data

Contents
1. Druid Storage Model
2. Spark Druid Connector Implementation
3. TopN Query
4. Plywood & Split-Apply-Combine
5. Extending Druid Query

Druid Storage Model – 4 characteristics
• Columnar format
• Explicitly distinguishes between dimensions and metrics
• Bitmap index
• Dictionary encoded

Druid Storage Model - background
Druid treats dimension and metric separately.
Dimension
• Bitmap Index
• GroupBy Fields
Metric
• Argument of Aggregate Function
Druid Ingestion Spec
{
  "dimensionsSpec": {
    "dimensions": ["country", "device", ...]
  },
  ...
  "metricsSpec": [
    { "type": "count", "name": "count" },
    { "type": "doubleSum", "fieldName": "duration", "name": "duration" }
  ]
}

Druid Storage Model - Dimension
Row  Country (Dimension)  Dictionary Encoded Value
1    Korea                0
2    UK                   1
3    Korea                0
4    Korea                0
5    Korea                0
6    UK                   1
Dictionary for country
Korea ↔ 0
UK ↔ 1
Bitmap for each value
Korea → 101110
UK → 010001 (UK appears in 2nd, 6th rows)
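The dictionary and bitmaps on this slide can be reproduced with a few lines of Python. This is only an illustrative sketch, not Druid's segment code: the function name `encode_dimension` is hypothetical, bitmaps are shown as strings rather than compressed bitmaps, and assigning IDs in sorted order is an assumption of the sketch.

```python
def encode_dimension(values):
    """Dictionary-encode a column and build one bitmap per distinct value."""
    # Assumption of this sketch: IDs follow the sorted order of distinct values.
    dictionary = {value: idx for idx, value in enumerate(sorted(set(values)))}
    encoded = [dictionary[value] for value in values]
    # Bit i of a value's bitmap is 1 when row i holds that value.
    bitmaps = {
        value: "".join("1" if code == idx else "0" for code in encoded)
        for value, idx in dictionary.items()
    }
    return dictionary, encoded, bitmaps


country = ["Korea", "UK", "Korea", "Korea", "Korea", "UK"]
dictionary, encoded, bitmaps = encode_dimension(country)
print(dictionary)  # value -> integer ID
print(encoded)     # one integer ID per row
print(bitmaps)     # value -> one bit per row
```

With the six rows from the slide, the Korea bitmap marks rows 1, 3, 4 and 5 and the UK bitmap marks rows 2 and 6.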

Druid Storage Model - Metric
Country (Dimension)  duration (Metric)
Korea                13
UK                   2
Korea                15
Korea                29
Korea                30
UK                   14
Stored metric column: 13, 2, 15, 29, 30, 14

Druid Storage Model
SELECT country, device, duration
FROM logs
WHERE country = 'Korea'
AND device LIKE 'Iphone%'
• country (Bitmap): filter by bitmap, country = 'Korea'
• device (Bitmap): filter it manually, device LIKE 'Iphone%'
• duration: read for the rows that pass the filters
• Row: ('Korea', 'Iphone 6s', 13)
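The two-stage filtering above can be sketched as follows. This is a toy model under stated assumptions: the rows are hypothetical except for ('Korea', 'Iphone 6s', 13), which is the row shown on the slide, and a real segment would hold a prebuilt bitmap instead of computing it at query time.

```python
# (country, device, duration); only the first row comes from the slide,
# the other device values are hypothetical.
ROWS = [
    ("Korea", "Iphone 6s", 13),
    ("UK", "Iphone 6s", 2),
    ("Korea", "Galaxy S8", 15),
    ("Korea", "Galaxy S8", 29),
    ("Korea", "Iphone 7", 30),
    ("UK", "Galaxy S8", 14),
]


def build_bitmap(rows, column, value):
    """One boolean per row; stands in for a prebuilt bitmap index."""
    return [row[column] == value for row in rows]


def run_query(rows):
    # Step 1: the equality predicate is answered from the bitmap alone.
    country_bitmap = build_bitmap(rows, 0, "Korea")
    candidates = [i for i, bit in enumerate(country_bitmap) if bit]
    # Step 2: the LIKE 'Iphone%' predicate is checked row by row on candidates.
    return [rows[i] for i in candidates if rows[i][1].startswith("Iphone")]


result = run_query(ROWS)
print(result)
```

Only the rows selected by the bitmap are inspected in the second step, which is why the equality filter is cheap even on large segments.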

Spark Druid Connector
1. 3 Ways to implement, Our implementation
2. What is needed to implement
3. Sample Codes, Performance Test
4. How to implement

Spark Druid Connector - 3 Ways to implement
1st way: Spark Driver rewrites SQL to DSL and sends it to the Druid Broker
• Good if SQL is rewritable to DSL
• But DSL does not support all SQL
  • Ex: JOIN, sub-query
2nd way: Spark Driver passes SQL to Spark Executors, which send a Select DSL to Druid Historical and receive a large JSON
• Easy to implement
• No need to understand Druid Index Library
• Ser/de operation is expensive
• Parallelism is bounded to no. of Historical

Spark Druid Connector - 3 Ways to implement
3rd way: Spark Driver passes SQL to Executors, which read the Segment File using the Druid Library
• Read Druid segment files directly.
• Similar to the way of reading Parquet
• Difficult to implement
• Need to understand Druid segment library
• Allocate Spark executors on Historical nodes
We chose this way!

Spark Druid Connector – How to use
Create table
spark.read
  .format("com.navercorp.ni.druid.spark.druid")
  .option("coordinator", "host1.com:18081")
  .option("broker", "host2.com:18082")
  .option("datasource", "logs").load()
  .createOrReplaceTempView("logs")
Execute Query
spark.sql("""
  SELECT country, device, duration
  FROM logs
  WHERE country = 'Korea'
  AND device LIKE 'Iphone%'
""").show(false)

Spark Druid Connector - Performance
Total 4.4B rows. Seconds, lower is better.
Query                 Spark Druid  Spark Parquet
Random Access         0.21         7.5
Full Scan & GROUP BY  24.1         7.7

Spark Druid Connector – How to implement
1. Druid Rest API
2. Druid Segment Library
3. Spark Data Source API

Spark Druid Connector – Get table schema
spark.read
  .format("...")
  .option("coordinator", "...")
  .option("broker", "...")
  .option("datasource", "logs")
  .load()
The Spark Driver sends a segmentMetadata query to the Druid Broker
{
  "queryType": "segmentMetadata",
  "dataSource": "logs",
  "merge": true
}
The Broker returns the columns, which become the Schema
{
  "columns": {
    "__time": {...},
    "country": {...},
    "device": {...},
    "duration": {...}
  }
  ...
}

Spark Druid Connector – Partition pruning
AND __time = CAST('2018-05-23' AS TIMESTAMP)
Segments can be pruned by interval condition and single dimension partition
1. Interval condition
   serverview returns only matched segments
2. Single dimension partition
   compare start and end with given filter
The Spark Driver asks the Druid Coordinator
GET /.../logs/intervals/2018-05-23/serverview
[
  {
    "segment": {
      "shardSpec": {
        "dimension": "country",
        "start": null, "end": "b" ...},
      "id": "segmentId"
    },
    "servers": [
      {"host": "host1"},
      {"host": "host2"}
    ]
  },
  { "segment": ...},
  ...
]
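The single-dimension pruning step ("compare start and end with given filter") might look like the sketch below. It assumes the serverview entries have already been parsed into dictionaries shaped like the response above, treats start as inclusive and end as exclusive with null meaning unbounded, and uses hypothetical names (`may_contain`, `prune`) and hypothetical segment IDs.

```python
def may_contain(shard_spec, dimension, value):
    """Return False only when the shard's [start, end) range excludes value."""
    if shard_spec.get("dimension") != dimension:
        return True  # partitioned on another dimension: cannot prune
    start, end = shard_spec.get("start"), shard_spec.get("end")
    if start is not None and value < start:
        return False
    if end is not None and value >= end:
        return False
    return True


def prune(serverview, dimension, value):
    """Keep the IDs of segments that may hold rows with dimension == value."""
    return [
        entry["segment"]["id"]
        for entry in serverview
        if may_contain(entry["segment"]["shardSpec"], dimension, value)
    ]


# Hypothetical serverview response with three shards partitioned by country.
serverview = [
    {"segment": {"id": "seg_0", "shardSpec": {"dimension": "country", "start": None, "end": "b"}}},
    {"segment": {"id": "seg_1", "shardSpec": {"dimension": "country", "start": "b", "end": "m"}}},
    {"segment": {"id": "seg_2", "shardSpec": {"dimension": "country", "start": "m", "end": None}}},
]
print(prune(serverview, "country", "korea"))
```

A filter on a dimension other than the partition dimension keeps every segment, so pruning never drops data it cannot reason about.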

Spark Druid Connector – Spark filters to Druid filters
AND city = 'Seoul'
buildScan(requiredColumns: [country, device, duration],
          filters: [EqualTo(country, Korea), EqualTo(city, Seoul)])
Spark's filters are converted into Druid's DimFilter
import io.druid.query.filter.{DimFilter, SelectorDimFilter}
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

private def toDruidDimFilters(sparkFilter: Filter): DimFilter = {
  sparkFilter match {
    ...
    // Equality becomes a selector filter; null means no extraction function
    case EqualTo(attribute, value) =>
      new SelectorDimFilter(attribute, value.toString, null)
    case GreaterThan(attribute, value) => ...
  }
}
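The same mapping can be illustrated against Druid's JSON filter syntax instead of the Java classes. This is a sketch, not the connector's code: Spark filters are modelled as plain (operator, attribute, value) tuples, and the helper names are hypothetical. The JSON filter types used ("selector", "bound", "and") are Druid's native ones.

```python
def to_druid_filter(spark_filter):
    """Translate one (operator, attribute, value) tuple into a Druid JSON filter."""
    operator, attribute, value = spark_filter
    if operator == "EqualTo":
        return {"type": "selector", "dimension": attribute, "value": str(value)}
    if operator == "GreaterThan":
        # A bound filter with a strict lower limit expresses attribute > value.
        return {
            "type": "bound",
            "dimension": attribute,
            "lower": str(value),
            "lowerStrict": True,
        }
    raise ValueError(f"unsupported filter: {operator}")


def to_druid_and(spark_filters):
    """Combine all pushed-down filters with a logical AND."""
    fields = [to_druid_filter(f) for f in spark_filters]
    return fields[0] if len(fields) == 1 else {"type": "and", "fields": fields}


druid_filter = to_druid_and([
    ("EqualTo", "country", "Korea"),
    ("EqualTo", "city", "Seoul"),
])
print(druid_filter)
```

Filters that have no translation raise an error here; a real connector could instead leave them for Spark to evaluate after the scan.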

Spark Druid Connector – Attach locality to RACK_LOCAL
• getPreferredLocations(partition: Partition)
• Returns Hosts having Druid Segments
• Caution: Spark does not always guarantee that executors launch on preferred locations
• Set spark.locality.wait to a very large value

Spark Druid Connector - How to implement
Done!
Now Spark executors can read records from Druid segment files.
Segment File → Spark Druid Connector → Spark
TopN Query
1. How TopN Query works
2. Performance
3. Limitation

TopN Query – We heavily use TopN query
TopN Query flow (N=100): User → Broker → Historical nodes (Segment Cache)
1. Each Historical node returns its local top 100 results.
2. The Broker merges the nodes' results and makes the final records.
3. The client gets the results merged from each Historical node.

TopN Query - Example
Top 3 country ORDER BY SUM(duration)
Top 3 of Historical a: korea 114, uk 47, usa 21
Top 3 of Historical b: uk 67, korea 24, usa 3
Top 3 of Historical c: korea 87, uk 57, china 33
Merged at the Broker: korea 225, uk 171, china 33, usa 24
Top 3 Result: korea 225, uk 171, china 33

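TopN – is an approximate approach
Historical a: korea 114, uk 47, usa 21, china 17
Historical b: uk 67, korea 24, usa 3, china 1
Historical c: korea 87, uk 57, usa 22, china 33
Top 3 Result: korea 225, uk 171, china 33
Missing! Values below each node's local top 3 (china 17, china 1, usa 22) never reach the Broker, so china is reported as 33 instead of 51.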
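The effect can be reproduced with the numbers on this slide. The sketch below is a simplified model of the flow (local top N per Historical, then a Broker merge), not Druid's implementation; the function names are hypothetical.

```python
from collections import Counter


def top_n(totals, n):
    """Sort by value descending and keep the first n entries."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def approximate_top_n(historicals, n):
    """Each Historical sends only its local top n; the Broker merges them."""
    merged = Counter()
    for rows in historicals:
        for key, value in top_n(rows, n):
            merged[key] += value
    return top_n(merged, n)


def exact_top_n(historicals, n):
    """Reference result: merge everything first, then take the top n."""
    merged = Counter()
    for rows in historicals:
        merged.update(rows)
    return top_n(merged, n)


historicals = [
    {"korea": 114, "uk": 47, "usa": 21, "china": 17},  # Historical a
    {"uk": 67, "korea": 24, "usa": 3, "china": 1},     # Historical b
    {"korea": 87, "uk": 57, "usa": 22, "china": 33},   # Historical c
]
approx = approximate_top_n(historicals, 3)
exact = exact_top_n(historicals, 3)
print(approx)
print(exact)
```

The ranking happens to be the same in both results, but the approximate value for china is lower because two of its partial sums were cut off before the merge.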

TopN – 100x faster than GroupBy
rank  GroupBy (Few minutes)  TopN (1536 ms)
1     1,948,297              1,948,297
2     1,404,167              1,404,167
3     1,383,538              1,383,538
4     1,141,977              1,141,977
5     1,099,028              1,090,277
6     1,090,277              1,079,242
7     1,051,448              1,051,448
8     996,961                996,961
9     941,284                941,284
10    937,078                937,078
100x Faster!
1. rank changed: rank 5 → rank 6
2. value changed: 1,099,028 → 1,079,242

TopN – Limitations
1. TopN supports only one dimension.
2. Results are unstable when the replication factor is 2 or more.

Plywood
1. Plywood
2. Split-Apply-Combine
3. Our Improvement

Split Apply Combine - SAC
1. https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf
2. http://plywood.imply.io/index
SELECT country, city, device
FROM $TABLE
WHERE …
GROUP BY country, city, device
≒
// Split [ country, city, device ]
ply()
  .apply(dataSource, $(dataSource).filter(...)) // Filter1
  .apply('country', $(dataSource).split(...)
    .apply(...) // Filter to Split1 (country)
    .apply('city', $(dataSource).split(...)
      .apply(...) // Filter to Split2 (city)
      .apply(...) // Filter to Split2 (city)
      .apply('device', $(dataSource).split(...)
        .apply(...) // Filter to Split3 (device)
      )
    )
  )
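The nested split above can be modelled on in-memory rows to show the shape of the result. This sketch is not Plywood: it assumes each level keeps only the top `limit` values of one dimension (as a TopN query would) and then recurses into the rows of each kept value. The function name, the sample rows and the `children` key are hypothetical.

```python
from collections import defaultdict


def split_apply_combine(rows, dimensions, metric, limit):
    """Nested top-`limit` breakdown over the given dimensions, one per level."""
    if not dimensions:
        return []
    dim, rest = dimensions[0], dimensions[1:]
    # Split: group the rows by the current dimension.
    groups = defaultdict(list)
    for row in rows:
        groups[row[dim]].append(row)
    # Apply: aggregate the metric inside each group.
    totals = {key: sum(r[metric] for r in group) for key, group in groups.items()}
    top_keys = sorted(totals, key=lambda k: (-totals[k], k))[:limit]
    # Combine: attach the next level's breakdown to each kept value.
    return [
        {
            dim: key,
            metric: totals[key],
            "children": split_apply_combine(groups[key], rest, metric, limit),
        }
        for key in top_keys
    ]


rows = [
    {"country": "korea", "city": "seoul", "duration": 10},
    {"country": "korea", "city": "busan", "duration": 5},
    {"country": "korea", "city": "seoul", "duration": 7},
    {"country": "uk", "city": "london", "duration": 4},
    {"country": "usa", "city": "nyc", "duration": 1},
]
tree = split_apply_combine(rows, ["country", "city"], "duration", limit=2)
print(tree)
```

Because every level is a single-dimension top-N, this pattern is one way to work around the one-dimension limit of TopN, at the cost of one query per kept value at each level.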

Tuning Results
[Chart: throughput (qps, higher is better), Before vs. After]

Stable TopN - Motivation
Same query but the results can be different under 2+ replica factor configuration
• Historical 1 and Historical 2 both hold Seg_1, Seg_2 and Seg_3
• First Result: the Broker merges TopN(Seg_1 + Seg_2) from one Historical with TopN(Seg_3) from the other
• Second Result: the Broker merges TopN(Seg_1) from one Historical with TopN(Seg_2 + Seg_3) from the other
• First Result != Second Result: results can be different

by_segment patch
Bypass the Historical-side TopN merge; the Broker merges the per-segment TopN results in segment ID order.
• First Result: TopN(Seg_1) + TopN(Seg_2) from one Historical, TopN(Seg_3) from the other
• Second Result: TopN(Seg_1) from one Historical, TopN(Seg_2) + TopN(Seg_3) from the other
• First Result == Second Result: always identical
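A small simulation shows why merging per segment removes the dependence on which replica served which segment. This is a simplified model, not the actual patch: the segment contents are hypothetical numbers chosen so that the two assignments disagree, and the function names are illustrative.

```python
from collections import Counter


def top_n(totals, n):
    """Top n entries by value, ties broken by key for determinism."""
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def merge(partials, n):
    """Broker-side merge of partial top-n lists."""
    total = Counter()
    for partial in partials:
        total.update(dict(partial))
    return top_n(total, n)


def historical_side_merge(segments, assignment, n):
    """Default flow: each Historical merges its segments, then truncates to n."""
    partials = []
    for segment_ids in assignment:
        combined = Counter()
        for segment_id in segment_ids:
            combined.update(segments[segment_id])
        partials.append(top_n(combined, n))
    return merge(partials, n)


def by_segment_merge(segments, assignment, n):
    """Patched flow: truncate per segment, merge at the Broker in segment ID order."""
    per_segment = {
        segment_id: top_n(segments[segment_id], n)
        for segment_ids in assignment
        for segment_id in segment_ids
    }
    return merge([per_segment[sid] for sid in sorted(per_segment)], n)


segments = {  # hypothetical per-segment sums
    "Seg_1": {"a": 10, "b": 8, "c": 1},
    "Seg_2": {"a": 2, "b": 1, "c": 9},
    "Seg_3": {"a": 1, "b": 5, "c": 6},
}
first = [["Seg_1", "Seg_2"], ["Seg_3"]]   # segments served by Historical 1, 2
second = [["Seg_1"], ["Seg_2", "Seg_3"]]  # same data, different replica choice
print(historical_side_merge(segments, first, 2), historical_side_merge(segments, second, 2))
print(by_segment_merge(segments, first, 2), by_segment_merge(segments, second, 2))
```

The per-segment result is still approximate, but it is the same approximation on every run because the truncation points no longer depend on replica selection.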

Special Thanks
Navis @ SK Telecom
Ens @ Naver

Druid Deploy & Configuration (1)
• 10 Broker Nodes
• 40 Historical Nodes
• 2 MiddleManager & Overlord Nodes
• 2 Coordinator Nodes
• 10 Yarn & HDFS Nodes for Batch Ingestion
• Spark Standalone Cluster runs on Historical Nodes
  • for Locality

Druid Deploy & Configuration (2)
• Druid version : 0.11
• H/W Spec for Broker & Historical
  • CPU: 40 cores (w/ hyperthread)
  • RAM: 128GB
  • HDD: SSD w/ RAID 5
• Memory Configuration
Configuration                      Value for Broker  Value for Historical
-Xmx                               20GB              12GB
-XX:MaxDirectMemorySize            30GB              45GB
druid.processing.numMergeBuffers   10                20
druid.processing.numThreads        20                30
druid.processing.buffer.sizeBytes  512MB             800MB
druid.cache.sizeInBytes            0                 5GB
druid.server.http.numThreads       40                40

Use Yarn External Resource for Batch Ingestion
"tuningConfig": {
"type": "hadoop",
"jobProperties": {
"yarn.resourcemanager.hostname" : "host1.com",
"yarn.resourcemanager.address" : "host1.com:8032",
"yarn.resourcemanager.scheduler.address": "host1.com:8030",
"yarn.resourcemanager.webapp.address": "host1.com:8088",
"yarn.resourcemanager.resource-tracker.address": "host1.com:8031",
"yarn.resourcemanager.admin.address": "host1.com:8033"
}
}
Ingest Spec for External Yarn and HDFS

Use External HDFS for intermediate MR output
"tuningConfig": {
"type": "hadoop",
"jobProperties": {
"fs.defaultFS": "hdfs://DEFAULT_FS:8020",
"dfs.namenode.http-address": "NAMENODE:50070",
"dfs.namenode.https-address": "NAMENODE:50470",
"dfs.namenode.servicerpc-address": "NAMENODE:8022"
}
}
Ingest Spec for External Yarn and HDFS

Why Druid? – Simple Lambda Architecture
Lambda Architecture with Two Databases
https://en.wikipedia.org/wiki/Lambda_architecture
Lambda Architecture with Druid
https://www.slideshare.net/gianmerlino/druid-at-sf-big-analytics-2015-1201

https://github.com/knoguchi/cm-druid
Druid on CDH

Extending Druid Query
1. Accumulated Metric in TopN
2. Stable TopN Result

Extending Druid Query
(diagram) Client → Query → Broker → Historical → Result → Broker → Result → Client, then a Second Query
Inside the Historical: Cursor → Row stream → Aggregation

Extending Druid Query - Motivation
2 queries are needed to make the following table
1. TopN query for the top 3 countries
2. Aggregation query for total duration
Country  SUM(duration)  Ratio over total duration
korea    225            20%
uk       171            15.2%
usa      33             2.9%
Can we do it at once?

Extending Druid Query - Background
Yes we can!
Just do TopN operation and SUM operation simultaneously!
Segment Data (country, duration): korea 100, korea 14, uk 40, uk 7, usa 21, china 17
Aggregated in map structure: korea 114, china 17, usa 21, uk 47
Final records: korea 114, uk 47, usa 21
Total duration equals sum of all metric values!
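The idea of computing the TopN and the grand total in the same pass can be sketched with the segment data from this slide. This is only an illustration of the principle in plain Python, not the customized Druid code; the function name and the ratio column are assumptions of the sketch.

```python
def top_n_with_total(rows, n):
    """Aggregate per key and accumulate the grand total in a single pass."""
    totals = {}
    grand_total = 0
    for country, duration in rows:
        totals[country] = totals.get(country, 0) + duration
        grand_total += duration  # accumulated alongside the per-key sums
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    # Each final record carries its value and its ratio over the total.
    return [(key, value, value / grand_total) for key, value in top], grand_total


segment_data = [
    ("korea", 100), ("korea", 14),
    ("uk", 40), ("uk", 7),
    ("usa", 21), ("china", 17),
]
records, total = top_n_with_total(segment_data, 3)
print(records, total)
```

The total includes china even though it is dropped from the final records, which is what makes the ratio column possible without a second query.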

Extending Druid Query in TopN
User Request
{
  "queryType": "topN",
  ...
  "metric": "edits",
  "accMetrics": ["edits"],
  ...
}
Druid Response
{
  ...
  "edits": 33,
  "__acc_edits": 1234
  ...
}
We customized Druid to calculate total edits and metric at once!
(diagram) Broker → Historical: Cursor → Row → TopN Aggregation (TopN Queue, Count Metric)

Druid Spark Batch
Huge intermediate files with MapReduce
• Druid's default Batch Ingestion uses MapReduce
• To ingest 1.4GB Parquet file (Single Dim. Partition)
  • Read: 16.6GB
  • Write: 20.5GB
  • Total: 37.1GB

Druid Spark Batch
We modified the original Druid Spark Batch
• https://github.com/metamx/druid-spark-batch
• Original version of Druid Spark Batch from Metamarkets (creator of Druid)
• We added some features
  • Parquet input
  • Single Dimension Partition
  • Query Granularity
• Same Ingest spec as Druid MapReduce Batch

Druid Spark Batch
Measurement (lower is better)                                      MapReduce  Spark
Disk Read, Write (GB)                                              37.1       7
Ingest time, Single Dim Partition, 3 Segments, 430MB each (sec)    759        2260
Ingest time, Single Dim Partition, 11 Segments, 135MB each (sec)   333        376
