This document summarizes TenMax's data pipeline experience from 2015 to 2017. It describes three versions of the data pipeline used to generate reports from raw event data. Version 1 used MongoDB but suffered from poor write performance. Version 2 used Cassandra, whose LSM-tree storage gave much better write performance, but it was costly to operate. Version 3 uses Kafka, Fluentd, Azure Blob storage and Spark to provide a scalable, cost-effective solution that handles high throughput and complex aggregations. The document also discusses lessons learned about balancing features, cost and technologies such as Spark, stream processing and serverless models.
TenMax Data Pipeline Experience Sharing
1. 2017.09
TenMax Data Pipeline Experience Sharing
Popcorny (陸振恩)
2. DataCon.TW2017
Who am I
• 陸振恩 (a.k.a popcorny, Pop)
• Director of Engineering @TenMax
• Previous experience
– Institute of Computer Science, National Chiao Tung University (NCTU)
– Champion of the 4th Trend Micro Million Programming Contest
– MediaTek (2005–2010)
– SmartQ (2011–2014)
– cacaFly/TenMax (2014–present)
• FB: https://fb.me/popcornylu
3. DataCon.TW2017
Current Workload
• 0.1B ~ 1B events generated per day
• About 200 GB of data generated per day
• Data everywhere
– Reporting
– Analytics
– Content Profiling
– Audience Profiling
– Machine Learning
4. DataCon.TW2017
Context
• Each AD request has a series of events: Request → Impression → Click
• We call this a session, identified by a sessionId.
• Generate an hourly report for sessions
– with some metrics (requests, impressions, clicks)
– grouped by some dimensions (ad, space, geo, device, …)
Session 1: Request → Impression → Click
Session 2: Request → Impression → Click
Session 3: Request → Impression → Click
10. DataCon.TW2017
Data Pipeline Version 1
[Diagram]
Our problem: Bid Request → Raw Events → [group by sessionId, merge events] → Sessions → [aggregate metrics, group by dimensions] → Report
MongoDB solution: Event Stream → [upsert events, in-place update] → MongoDB (sessions) → [aggregation pipeline] → RDBMS (report)
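To make the version 1 write path concrete, here is a minimal sketch using the MongoDB Java driver; the connection string, collection layout, and field names are assumptions for illustration, not TenMax's actual schema. Each incoming event is upserted into its session document, and an aggregation pipeline later rolls sessions up by dimensions.

```java
import com.mongodb.client.*;
import com.mongodb.client.model.*;
import org.bson.Document;
import java.util.Arrays;

public class MongoSessionWriter {
    public static void main(String[] args) {
        // Hypothetical database/collection names for illustration.
        MongoCollection<Document> sessions = MongoClients.create("mongodb://localhost")
                .getDatabase("reporting").getCollection("sessions");

        // Upsert: each event updates (or creates) its session document in place.
        sessions.updateOne(
                Filters.eq("_id", "session-1"),
                Updates.combine(
                        Updates.set("ad", "ad-42"),
                        Updates.set("geo", "TW"),
                        Updates.inc("clicks", 1)),
                new UpdateOptions().upsert(true));

        // Hourly report: an aggregation pipeline groups sessions by dimensions.
        AggregateIterable<Document> report = sessions.aggregate(Arrays.asList(
                Aggregates.group(new Document("ad", "$ad").append("geo", "$geo"),
                        Accumulators.sum("impressions", "$impressions"),
                        Accumulators.sum("clicks", "$clicks"))));
        report.forEach(doc -> System.out.println(doc.toJson()));
    }
}
```

This random in-place update pattern is exactly what hurts under MMAPv1, as the next slide explains.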
11. DataCon.TW2017
Problem: Poor Write Performance
• MMAPv1 storage engine
– In-place update
– Fragmentation
– Random Access
– Big DB file
More Bytes + Random Access = Poor Performance
14. DataCon.TW2017
Data Pipeline Version 2
• Cassandra 2.1
• Features
– Google BigTable-like Architecture
– Excellent Write Performance
– Peer-to-peer architecture
– Data Compression
15. DataCon.TW2017
Data Pipeline Version 2
[Diagram]
Our problem: Bid Request → Raw Events → [group by sessionId, merge events] → Sessions → [aggregate metrics, group by dimensions] → Report
Cassandra solution: Event Stream → [insert events] → Cassandra (sessions; compaction merges the events of a sessionId) → [aggregate in Java] → RDBMS (report)
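A rough sketch of how the insert path could look with the DataStax Java driver: events of the same session share a partition key, so each write is a plain append to the LSM tree and compaction later co-locates them. The keyspace name and table layout here are assumptions for illustration.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;
import java.util.UUID;

public class CassandraEventWriter {
    public static void main(String[] args) {
        // Assumes a pre-created "adreport" keyspace; the contact point is illustrative.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("adreport")) {

            // sessionId is the partition key, so all events of one session
            // end up in the same partition after compaction.
            session.execute(
                    "CREATE TABLE IF NOT EXISTS sessions (" +
                    "  session_id timeuuid, event_type text, payload text," +
                    "  PRIMARY KEY (session_id, event_type))");

            UUID sessionId = UUIDs.timeBased();  // a TimeUUID, as used for sessionId later on
            session.execute(
                    "INSERT INTO sessions (session_id, event_type, payload) VALUES (?, ?, ?)",
                    sessionId, "impression", "{\"ad\":\"ad-42\"}");
            session.execute(
                    "INSERT INTO sessions (session_id, event_type, payload) VALUES (?, ?, ?)",
                    sessionId, "click", "{\"ad\":\"ad-42\"}");
        }
    }
}
```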
21. DataCon.TW2017
How about aggregation?
• Cassandra has no group-by aggregation
• How to aggregate?
– Java Stream (thanks, Java 8); see the sketch after the diagram below
– Poppy (an in-house dataframe library)
https://github.com/tenmax/poppy
[Diagram] Bid Request raw events → Event Stream → [insert events] → Cassandra → [grouping, aggregate] → RDBMS (report)
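As a minimal sketch of the Java 8 Stream approach: the Session class, dimensions, and metrics below are hypothetical, but grouping by a dimension tuple and summing metrics is the idea; Poppy wraps the same pattern in a dataframe-style API.

```java
import java.util.*;
import java.util.stream.Collectors;

public class SessionAggregator {
    // Hypothetical session record: two dimensions and one metric.
    static class Session {
        final String ad, geo;
        final long clicks;
        Session(String ad, String geo, long clicks) {
            this.ad = ad; this.geo = geo; this.clicks = clicks;
        }
    }

    public static void main(String[] args) {
        List<Session> sessions = Arrays.asList(
                new Session("ad-1", "TW", 0),
                new Session("ad-1", "TW", 1),
                new Session("ad-2", "JP", 0));

        // Group by the (ad, geo) dimension tuple and sum the clicks metric.
        Map<List<String>, Long> clicksByDim = sessions.stream()
                .collect(Collectors.groupingBy(
                        s -> Arrays.asList(s.ad, s.geo),
                        Collectors.summingLong(s -> s.clicks)));

        clicksByDim.forEach((dim, clicks) ->
                System.out.println(dim + " -> clicks=" + clicks));
    }
}
```

This works for low-cardinality dimensions but, as the next slides note, it does not scale to high-cardinality keys.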
22. DataCon.TW2017
Problem: Cost
• SSD disk costs USD $0.135 per GB per month, while Azure Blob costs USD $0.02 per GB per month.
• SSD disk space has to be allocated in advance, while Azure Blob is pay-as-you-use.
• Azure Blob replicates data even at the lowest pricing tier.
• Azure Blob is far more scalable and reliable than a self-hosted cluster.
• People cost
Cloud Storage Rocks!!
23. DataCon.TW2017
Problem: Aggregation
• An in-house solution is hard to evolve, while Hadoop/Spark is a huge ecosystem
• Scalability issues
• Lacks a key feature: group by a high-cardinality key
– Group by visitor
– Aggregate multi-dimensional OLAP cubes
24. DataCon.TW2017
Data Pipeline Version 2.1
• Dump the session data to Azure Blob for further use.
[Diagram] Event Stream (bid request raw events) → Cassandra (sessions) → [generate report] → RDBMS (report); Cassandra → [dump] → Azure Blob, which feeds OLAP cubes, ML models, and sampled data for the BI tool, analytics server, and AD server.
26. DataCon.TW2017
Data Pipeline Version 3
• Kafka 0.11 + Fluentd + Azure Blob + Spark 2.1
• Why
– Azure Blob is cheap
– Azure Blob offers high throughput
– Spark is a Map-Shuffle-Reduce framework, making grouping by a high-cardinality key possible.
27. DataCon.TW2017
Data Pipeline version 3
[Diagram]
Our problem: Bid Request → Raw Events → [group by sessionId, merge events] → Sessions → [aggregate metrics, group by dimensions] → Report
Spark solution: Event Stream → Fluentd → Azure Blob (raw events) → [Spark RDD] → Azure Blob (sessions) → [Spark DataFrame] → Azure Blob / RDBMS (report)
28. DataCon.TW2017
How to Ingest Log to Azure Blob?
[Diagram] The version 3 pipeline again, highlighting the ingestion step: raw events → Fluentd → Azure Blob
29. DataCon.TW2017
How to Ingest Log to Azure Blob?
• Solution 1
– App writes log to a local log file
– Fluentd tails the log files and uploads them to Blob
• Pros
– Simple
• Cons
– Data is not uploaded as soon as the event happens
Log Server → Fluentd → Azure Blob
30. DataCon.TW2017
How to Ingest Log to Azure Blob?
• Solution 2
– App appends log to Kafka (a producer sketch follows below)
– Fluentd consumes the logs and batch-uploads them to Blob
• Pros
– Log is stored as soon as the event happens
– Log can be used for multiple purposes
• Cons
– The server has to be aware of Kafka
– If the connection to Kafka fails, the server has to handle buffering or risk OOM
Server → Kafka → Fluentd → Azure Blob
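For solution 2, the app-side write might look like the following minimal Kafka producer sketch; the broker address, topic name, and event payload are assumptions for illustration.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class EventLogger {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String sessionId = "5f2b3e80-95d8-11e7-abc4-cec278b6b50a";
            String event = "{\"sessionId\":\"" + sessionId + "\",\"event\":\"impression\"}";
            // Keying by sessionId keeps all events of a session in one partition.
            producer.send(new ProducerRecord<>("raw-events", sessionId, event));
        }
    }
}
```

The downside listed above still applies: if the producer cannot reach Kafka, the application itself has to buffer or drop events, which is what solution 3 avoids.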
31. DataCon.TW2017
How to Ingest Log to Azure Blob?
• Solution 3
– App writes log to a local log file
– Fluentd tails the log file and pushes it to Kafka (<100 ms latency)
– Fluentd consumes logs from Kafka and batch-uploads them to Blob
• Pros
– Log is stored as soon as the event happens
– Log can be used for multiple purposes
– Decouples the app from Kafka; Fluentd takes care of buffering and error recovery
• Cons
– The most complex solution
Bidder → Log → Fluentd → Kafka → Fluentd → Azure Blob
32. DataCon.TW2017
Event-Time Window
• A click event may happen several minutes after the impression event. How do we merge these events?
[Diagram] Events around the 11:00 hourly boundary:
– Id: 1, ts: 10:58, event: impression
– Id: 2, ts: 10:59, event: impression
– Id: 1, ts: 11:02, event: click
– Id: 3, ts: 11:02, event: impression
– Id: 3, ts: 11:03, event: click
How to merge these events? (Note that the events of session 1 straddle the 11:00 boundary.)
33. DataCon.TW2017
Event-Time Window
• Our solution
– Fluentd uploads events to the partition window according to the session timestamp (partts) instead of the ingest timestamp.
– sessionId is of type TimeUUID, which embeds a timestamp in the UUID.
– For every event: partts = timestampOf(sessionId)
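A small sketch of deriving partts from a time-based (version 1) UUID using only java.util.UUID; the constant is the standard offset between the UUID epoch (1582-10-15) and the Unix epoch, and the sessionId value is just an illustrative v1 UUID.

```java
import java.time.Instant;
import java.util.UUID;

public class PartitionTimestamp {
    // 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01).
    private static final long UUID_EPOCH_OFFSET_100NS = 0x01b21dd213814000L;

    // partts = timestampOf(sessionId): extract the timestamp embedded in a v1 (time-based) UUID.
    static Instant timestampOf(UUID sessionId) {
        long unixMillis = (sessionId.timestamp() - UUID_EPOCH_OFFSET_100NS) / 10_000;
        return Instant.ofEpochMilli(unixMillis);
    }

    public static void main(String[] args) {
        UUID sessionId = UUID.fromString("5f2b3e80-95d8-11e7-abc4-cec278b6b50a");
        System.out.println("partts = " + timestampOf(sessionId));
    }
}
```

The Cassandra Java driver also provides UUIDs.unixTimestamp for the same purpose.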
34. DataCon.TW2017
Event-Time Window
[Diagram] The same events, now partitioned by partts instead of ingest time:
– Id: 1, ts: 10:58, partts: 10:58, event: impression
– Id: 2, ts: 10:59, partts: 10:59, event: impression
– Id: 1, ts: 11:02, partts: 10:58, event: click (same window as its impression)
– Id: 3, ts: 11:02, partts: 11:02, event: impression
– Id: 3, ts: 11:03, partts: 11:02, event: click
• Now we can guarantee that the events of the same session land in the same window.
35. DataCon.TW2017
Spark RDD and Spark SQL
• Use Spark RDD to merge events with the same sessionId (much like a JSON object merge).
• Use Spark DataFrame to aggregate metrics by dimensions (a high-dimension OLAP cube); see the sketch after the diagram below.
• Save the DataFrame to Azure Blob in Parquet format.
• Save sub-dimension data (lower-dimensional aggregates) to the RDBMS.
[Diagram] Event Stream → Azure Blob (raw events) → [Spark RDD] → Azure Blob (sessions) → [Spark DataFrame] → Azure Blob / RDBMS (report)
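A minimal sketch of the two Spark stages in Java; the Blob paths, field names, and the extractSessionId/mergeEvents helpers are placeholders for illustration, not the actual job.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

public class SessionReportJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("session-report").getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // Stage 1 (RDD): merge the events of each session, like a JSON object merge.
        JavaRDD<String> rawEvents = jsc.textFile("wasbs://events@account.blob.core.windows.net/2017/09/01/10/");
        JavaPairRDD<String, String> sessions = rawEvents
                .mapToPair(line -> new scala.Tuple2<>(extractSessionId(line), line))
                .reduceByKey(SessionReportJob::mergeEvents);
        sessions.values().saveAsTextFile("wasbs://sessions@account.blob.core.windows.net/2017/09/01/10/");

        // Stage 2 (DataFrame): aggregate metrics by dimensions, then save as Parquet.
        Dataset<Row> sessionDf = spark.read().json("wasbs://sessions@account.blob.core.windows.net/2017/09/01/10/");
        Dataset<Row> report = sessionDf
                .groupBy(col("ad"), col("space"), col("geo"), col("device"))
                .agg(sum("requests"), sum("impressions"), sum("clicks"));
        report.write().parquet("wasbs://reports@account.blob.core.windows.net/2017/09/01/10/");
    }

    // Placeholder helpers: a real job would parse and merge the event JSON here.
    static String extractSessionId(String jsonLine) { return jsonLine; }
    static String mergeEvents(String a, String b) { return a; }
}
```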
36. DataCon.TW2017
Lessons Learned
• Everything is a tradeoff.
• For big data, trade features for cost effectiveness
– DFS for the batch source
– Kafka for the stream source
• Cloud storage is very cheap!! Use it now.
• Spark is a great tool for processing data, even for non-distributed applications.
37. DataCon.TW2017
Storage Comparison
Capabilities by storage type: RDBMS; Document Store (e.g. MongoDB); BigTable-like Store (e.g. Cassandra); Distributed File System (e.g. Azure Blob, AWS S3, HDFS)
– File/Table Scan: RDBMS, Document Store, BigTable-like Store, DFS
– Point Query: RDBMS, Document Store, BigTable-like Store
– Secondary Index: RDBMS, Document Store, BigTable-like Store*
– Ad-hoc Query: RDBMS, Document Store
– Group and Aggregate: RDBMS, Document Store
– Join: RDBMS, Document Store*
– Transaction: RDBMS
40. DataCon.TW2017
Stream Processing
• Why
– Latency
– Incremental Update
• Trends
– Batch and stream in one system
– Exactly-once semantics
– Support for both ingest time and event time
– Low watermarks for late events
– Structured Streaming
41. DataCon.TW2017
Serverless Model for Analytics Workload
• Analytics workload characteristics
– Low utilization rate
– Needs large resources in sudden bursts
– Interactive
• Not suitable for provisioned-VM solutions, like
– AWS EMR, Azure HDInsight, GCP DataProc
• Serverless Solutions
– Google BigQuery, AWS Athena, Azure Data Lake Analytics