This document summarizes TenMax's data pipeline experience from 2015 to 2017. It describes three versions of the data pipeline used to generate reports from raw event data. Version 1 used MongoDB but suffered from poor write performance. Version 2 used Cassandra, whose LSM-tree storage gave much better write performance, but it was costly to operate. Version 3 uses Kafka, Fluentd, Azure Blob storage and Spark to provide a scalable, cost-effective solution that handles high throughput and complex aggregations. The document also discusses lessons learned about balancing features, cost and technologies such as Spark, stream processing and serverless models.
TenMax Data Pipeline Experience Sharing
1. 2017.09
TenMax Data Pipeline Experience Sharing
Popcorny (陸振恩)
2. DataCon.TW2017
Who am I
• 陸振恩 (a.k.a popcorny, Pop)
• Director of Engineering @TenMax
• Previous experience
– Institute of Computer Science, National Chiao Tung University (NCTU)
– Champion of the 4th Trend Micro Million Programming Contest
– MediaTek (2005–2010)
– SmartQ (2011–2014)
– cacaFly/TenMax (2014–present)
• FB: https://fb.me/popcornylu
3. DataCon.TW2017
Current Workload
• 0.1B ~ 1B events generated per day
• About 200 GB of data generated per day
• Data everywhere
– Reporting
– Analytics
– Content Profiling
– Audience Profiling
– Machine Learning
4. DataCon.TW2017
Context
• Each AD request has a series of events: Request → Impression → Click
• We call this a session, identified by a sessionId.
• Generate an hourly report for sessions
– with some metrics (requests, impressions, clicks)
– grouped by some dimensions (ad, space, geo, device, …)
Session 1: Request → Impression → Click
Session 2: Request → Impression → Click
Session 3: Request → Impression → Click
10. DataCon.TW2017
Data Pipeline Version 1
[Diagram]
Our problem: Bid Request → Raw Events → [group by sessionId, merge events] → Sessions → [aggregate metrics, group by dimensions] → Report
MongoDB solution: Event Stream → [upsert events, in-place update] → MongoDB (sessions) → [aggregation pipeline] → RDBMS (report)
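To make the version 1 write path concrete, here is a minimal sketch using the MongoDB Java driver; the connection string, collection layout, and field names are assumptions for illustration, not TenMax's actual schema. Each incoming event is upserted into its session document, and an aggregation pipeline later rolls sessions up by dimensions.

```java
import com.mongodb.client.*;
import com.mongodb.client.model.*;
import org.bson.Document;
import java.util.Arrays;

public class MongoSessionWriter {
    public static void main(String[] args) {
        // Hypothetical database/collection names for illustration.
        MongoCollection<Document> sessions = MongoClients.create("mongodb://localhost")
                .getDatabase("reporting").getCollection("sessions");

        // Upsert: each event updates (or creates) its session document in place.
        sessions.updateOne(
                Filters.eq("_id", "session-1"),
                Updates.combine(
                        Updates.set("ad", "ad-42"),
                        Updates.set("geo", "TW"),
                        Updates.inc("clicks", 1)),
                new UpdateOptions().upsert(true));

        // Hourly report: an aggregation pipeline groups sessions by dimensions.
        AggregateIterable<Document> report = sessions.aggregate(Arrays.asList(
                Aggregates.group(new Document("ad", "$ad").append("geo", "$geo"),
                        Accumulators.sum("impressions", "$impressions"),
                        Accumulators.sum("clicks", "$clicks"))));
        report.forEach(doc -> System.out.println(doc.toJson()));
    }
}
```

This random in-place update pattern is exactly what hurts under MMAPv1, as the next slide explains.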
11. DataCon.TW2017
Problem: Poor Write Performance
• MMAPv1 storage engine
– In-place update
– Fragmentation
– Random Access
– Big DB file
More Bytes + Random Access = Poor Performance
14. DataCon.TW2017
Data Pipeline Version 2
• Cassandra 2.1
• Features
– Google BigTable-like Architecture
– Excellent Write Performance
– Peer-to-peer architecture
– Data Compression
15. DataCon.TW2017
Data Pipeline Version 2
[Diagram]
Our problem: Bid Request → Raw Events → [group by sessionId, merge events] → Sessions → [aggregate metrics, group by dimensions] → Report
Cassandra solution: Event Stream → [insert events] → Cassandra (sessions; compaction merges the events of a sessionId) → [aggregate in Java] → RDBMS (report)
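A rough sketch of how the insert path could look with the DataStax Java driver: events of the same session share a partition key, so each write is a plain append to the LSM tree and compaction later co-locates them. The keyspace name and table layout here are assumptions for illustration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class CassandraEventWriter {
    public static void main(String[] args) {
        // Assumes a pre-created "adreport" keyspace; the contact point is illustrative.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("adreport")) {

            // sessionId is the partition key, so all events of one session
            // end up in the same partition after compaction.
            session.execute(
                    "CREATE TABLE IF NOT EXISTS sessions (" +
                    "  session_id timeuuid, event_type text, payload text," +
                    "  PRIMARY KEY (session_id, event_type))");

            UUID sessionId = UUIDs.timeBased();  // a TimeUUID, as used for sessionId later on
            session.execute(
                    "INSERT INTO sessions (session_id, event_type, payload) VALUES (?, ?, ?)",
                    sessionId, "impression", "{\"ad\":\"ad-42\"}");
            session.execute(
                    "INSERT INTO sessions (session_id, event_type, payload) VALUES (?, ?, ?)",
                    sessionId, "click", "{\"ad\":\"ad-42\"}");
        }
    }
}
```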
21. DataCon.TW2017
How about aggregation?
• Cassandra has no group-by aggregation
• How to aggregate?
– Java Stream (thanks, Java 8); see the sketch after the diagram below
– Poppy (an in-house dataframe library)
https://github.com/tenmax/poppy
[Diagram] Bid Request raw events → Event Stream → [insert events] → Cassandra → [grouping, aggregate] → RDBMS (report)
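As a minimal sketch of the Java 8 Stream approach: the Session class, dimensions, and metrics below are hypothetical, but grouping by a dimension tuple and summing metrics is the idea; Poppy wraps the same pattern in a dataframe-style API.

```java
import java.util.*;
import java.util.stream.Collectors;

public class SessionAggregator {
    // Hypothetical session record: two dimensions and one metric.
    static class Session {
        final String ad, geo;
        final long clicks;
        Session(String ad, String geo, long clicks) {
            this.ad = ad; this.geo = geo; this.clicks = clicks;
        }
    }

    public static void main(String[] args) {
        List<Session> sessions = Arrays.asList(
                new Session("ad-1", "TW", 0),
                new Session("ad-1", "TW", 1),
                new Session("ad-2", "JP", 0));

        // Group by the (ad, geo) dimension tuple and sum the clicks metric.
        Map<List<String>, Long> clicksByDim = sessions.stream()
                .collect(Collectors.groupingBy(
                        s -> Arrays.asList(s.ad, s.geo),
                        Collectors.summingLong(s -> s.clicks)));

        clicksByDim.forEach((dim, clicks) ->
                System.out.println(dim + " -> clicks=" + clicks));
    }
}
```

This works for low-cardinality dimensions but, as the next slides note, it does not scale to high-cardinality keys.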
22. DataCon.TW2017
Problem: Cost
• SSD disk costs USD $0.135 per GB per month, while Azure Blob costs USD $0.02 per GB per month.
• SSD disk space has to be allocated in advance, while Azure Blob is pay-as-you-use.
• Azure Blob replicates data even at the lowest pricing tier.
• Azure Blob is far more scalable and reliable than a self-hosted cluster.
• People cost
Cloud Storage Rocks!!
23. DataCon.TW2017
Problem: Aggregation
• An in-house solution is hard to evolve, while Hadoop/Spark is a huge ecosystem
• Scalability issues
• Lacks a key feature: group by a high-cardinality key
– Group by visitor
– Aggregate multi-dimensional OLAP cubes
24. DataCon.TW2017
Data Pipeline Version 2.1
• Dump the session data to Azure Blob for further use.
[Diagram] Event Stream (bid request raw events) → Cassandra (sessions) → [generate report] → RDBMS (report); Cassandra → [dump] → Azure Blob, which feeds OLAP cubes, ML models, and sampled data for the BI tool, analytics server, and AD server.
26. DataCon.TW2017
Data Pipeline Version 3
• Kafka 0.11 + Fluentd + Azure Blob + Spark 2.1
• Why
– Azure Blob is cheap
– Azure Blob offers high throughput
– Spark is a Map-Shuffle-Reduce framework, making grouping by a high-cardinality key possible.
27. DataCon.TW2017
Data Pipeline version 3
[Diagram]
Our problem: Bid Request → Raw Events → [group by sessionId, merge events] → Sessions → [aggregate metrics, group by dimensions] → Report
Spark solution: Event Stream → Fluentd → Azure Blob (raw events) → [Spark RDD] → Azure Blob (sessions) → [Spark DataFrame] → Azure Blob / RDBMS (report)
28. DataCon.TW2017
How to Ingest Log to Azure Blob?
[Diagram] The version 3 pipeline again, highlighting the ingestion step: raw events → Fluentd → Azure Blob
29. DataCon.TW2017
How to Ingest Log to Azure Blob?
• Solution 1
– App writes log to a local log file
– Fluentd tails the log files and uploads them to Blob
• Pros
– Simple
• Cons
– Data is not uploaded as soon as the event happens
Log Server → Fluentd → Azure Blob
30. DataCon.TW2017
How to Ingest Log to Azure Blob?
• Solution 2
– App appends log to Kafka (a producer sketch follows below)
– Fluentd consumes the logs and batch-uploads them to Blob
• Pros
– Log is stored as soon as the event happens
– Log can be used for multiple purposes
• Cons
– The server has to be aware of Kafka
– If the connection to Kafka fails, the server has to handle buffering or risk OOM
Server → Kafka → Fluentd → Azure Blob
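For solution 2, the app-side write might look like the following minimal Kafka producer sketch; the broker address, topic name, and event payload are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class EventLogger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String sessionId = "5f2b3e80-95d8-11e7-abc4-cec278b6b50a";
            String event = "{\"sessionId\":\"" + sessionId + "\",\"event\":\"impression\"}";
            // Keying by sessionId keeps all events of a session in one partition.
            producer.send(new ProducerRecord<>("raw-events", sessionId, event));
        }
    }
}
```

The downside listed above still applies: if the producer cannot reach Kafka, the application itself has to buffer or drop events, which is what solution 3 avoids.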
31. DataCon.TW2017
How to Ingest Log to Azure Blob?
• Solution 3
– App writes log to a local log file
– Fluentd tails the log file and pushes it to Kafka (<100 ms latency)
– Fluentd consumes logs from Kafka and batch-uploads them to Blob
• Pros
– Log is stored as soon as the event happens
– Log can be used for multiple purposes
– Decouples the app from Kafka; Fluentd takes care of buffering and error recovery
• Cons
– The most complex solution
Bidder → Log → Fluentd → Kafka → Fluentd → Azure Blob
32. DataCon.TW2017
Event-Time Window
• A click event may happen several minutes after the impression event. How do we merge these events?
[Diagram] Events around the 11:00 hourly boundary:
– Id: 1, ts: 10:58, event: impression
– Id: 2, ts: 10:59, event: impression
– Id: 1, ts: 11:02, event: click
– Id: 3, ts: 11:02, event: impression
– Id: 3, ts: 11:03, event: click
How to merge these events? (Note that the events of session 1 straddle the 11:00 boundary.)
33. DataCon.TW2017
Event-Time Window
• Our solution
– Fluentd uploads events to the partition window according to the session timestamp (partts) instead of the ingest timestamp.
– sessionId is of type TimeUUID, which embeds a timestamp in the UUID.
– For every event: partts = timestampOf(sessionId)
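A small sketch of deriving partts from a time-based (version 1) UUID using only java.util.UUID; the constant is the standard offset between the UUID epoch (1582-10-15) and the Unix epoch, and the sessionId value is just an illustrative v1 UUID.

```java
import java.time.Instant;
import java.util.UUID;

public class PartitionTimestamp {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
    private static final long UUID_EPOCH_OFFSET_100NS = 0x01b21dd213814000L;

    // partts = timestampOf(sessionId): extract the timestamp embedded in a v1 (time-based) UUID.
    static Instant timestampOf(UUID sessionId) {
        long unixMillis = (sessionId.timestamp() - UUID_EPOCH_OFFSET_100NS) / 10_000;
        return Instant.ofEpochMilli(unixMillis);
    }

    public static void main(String[] args) {
        UUID sessionId = UUID.fromString("5f2b3e80-95d8-11e7-abc4-cec278b6b50a");
        System.out.println("partts = " + timestampOf(sessionId));
    }
}
```

The Cassandra Java driver also provides UUIDs.unixTimestamp for the same purpose.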
34. DataCon.TW2017
Event-Time Window
[Diagram] The same events, now partitioned by partts instead of ingest time:
– Id: 1, ts: 10:58, partts: 10:58, event: impression
– Id: 2, ts: 10:59, partts: 10:59, event: impression
– Id: 1, ts: 11:02, partts: 10:58, event: click (same window as its impression)
– Id: 3, ts: 11:02, partts: 11:02, event: impression
– Id: 3, ts: 11:03, partts: 11:02, event: click
• Now we can guarantee that the events of the same session land in the same window.
35. DataCon.TW2017
Spark RDD and Spark SQL
• Use Spark RDD to merge events with the same sessionId (much like a JSON object merge).
• Use Spark DataFrame to aggregate metrics by dimensions (a high-dimension OLAP cube); see the sketch after the diagram below.
• Save the DataFrame to Azure Blob in Parquet format.
• Save sub-dimension data (lower-dimensional aggregates) to the RDBMS.
[Diagram] Event Stream → Azure Blob (raw events) → [Spark RDD] → Azure Blob (sessions) → [Spark DataFrame] → Azure Blob / RDBMS (report)
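A minimal sketch of the two Spark stages in Java; the Blob paths, field names, and the extractSessionId/mergeEvents helpers are placeholders for illustration, not the actual job.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class SessionReportJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("session-report").getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Stage 1 (RDD): merge the events of each session, like a JSON object merge.
        JavaRDD<String> rawEvents = jsc.textFile("wasbs://events@account.blob.core.windows.net/2017/09/01/10/");
        JavaPairRDD<String, String> sessions = rawEvents
                .mapToPair(line -> new scala.Tuple2<>(extractSessionId(line), line))
                .reduceByKey(SessionReportJob::mergeEvents);
        sessions.values().saveAsTextFile("wasbs://sessions@account.blob.core.windows.net/2017/09/01/10/");

        // Stage 2 (DataFrame): aggregate metrics by dimensions, then save as Parquet.
        Dataset<Row> sessionDf = spark.read().json("wasbs://sessions@account.blob.core.windows.net/2017/09/01/10/");
        Dataset<Row> report = sessionDf
                .groupBy(col("ad"), col("space"), col("geo"), col("device"))
                .agg(sum("requests"), sum("impressions"), sum("clicks"));
        report.write().parquet("wasbs://reports@account.blob.core.windows.net/2017/09/01/10/");
    }

    // Placeholder helpers: a real job would parse and merge the event JSON here.
    static String extractSessionId(String jsonLine) { return jsonLine; }
    static String mergeEvents(String a, String b) { return a; }
}
```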
36. DataCon.TW2017
Lessons Learned
• Everything is a tradeoff.
• For big data, trade features for cost effectiveness
– DFS for the batch source
– Kafka for the stream source
• Cloud storage is very cheap!! Use it now.
• Spark is a great tool for processing data, even for non-distributed applications.
37. DataCon.TW2017
Storage Comparison
Capabilities by storage type: RDBMS; Document Store (e.g. MongoDB); BigTable-like Store (e.g. Cassandra); Distributed File System (e.g. Azure Blob, AWS S3, HDFS)
– File/Table Scan: RDBMS, Document Store, BigTable-like Store, DFS
– Point Query: RDBMS, Document Store, BigTable-like Store
– Secondary Index: RDBMS, Document Store, BigTable-like Store*
– Ad-hoc Query: RDBMS, Document Store
– Group and Aggregate: RDBMS, Document Store
– Join: RDBMS, Document Store*
– Transaction: RDBMS
40. DataCon.TW2017
Stream Processing
• Why
– Latency
– Incremental Update
• Trends
– Batch and stream in one system
– Exactly-once semantics
– Support for both ingest time and event time
– Low watermarks for late events
– Structured Streaming
41. DataCon.TW2017
Serverless Model for Analytics Workload
• Analytics workload characteristics
– Low utilization rate
– Needs large resources in sudden bursts
– Interactive
• Not suitable for provisioned-VM solutions, like
– AWS EMR, Azure HDInsight, GCP DataProc
• Serverless Solutions
– Google BigQuery, AWS Athena, Azure Data Lake Analytics