2017.09
TenMax Data Pipeline Experience Sharing
Popcorny (陸振恩)
DataCon.TW2017
Who am I
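• 陸振恩 (a.k.a. popcorny, Pop)
• Director of Engineering @TenMax
• Previous experience
– Institute of Computer Science, National Chiao Tung University (NCTU)
– Champion of the 4th Trend Micro Million-Dollar Programming Contest
– MediaTek (2005-2010)
– SmartQ (2011-2014)
– cacaFly/TenMax (2014-present)
• FB: https://fb.me/popcornylu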
Current Workload
• 100 million to 1 billion events generated per day
• About 200 GB of data generated per day
• Data everywhere
– Reporting
– Analytics
– Content Profiling
– Audience Profiling
– Machine Learning
Context
• Each ad request produces a series of events: Request → Impression → Click
• We call this series a session, identified by a sessionId.
• We generate an hourly report over sessions
– with some metrics (requests, impressions, clicks)
– grouped by some dimensions (ad, space, geo, device, …)
[Figure: three sessions (Session 1, 2, 3), each a Request → Impression → Click sequence]
System Architecture
[Diagram: system architecture showing an Admin Console, a Load Balancer, four Servers, Log Storage, and Report]
Data Pipeline
[Diagram: hourly event stream of Bid Requests → Raw Events → (group by sessionId, merge events) → Sessions → (aggregate metrics, group by dimensions) → Report]
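The first stage of the pipeline (group by sessionId, merge events) can be sketched in a few lines. This is only an illustration under assumptions: the event field names, the sample data, and `merge_sessions` are hypothetical and not taken from TenMax's code.

```python
from collections import defaultdict

# Hypothetical raw events; field names are illustrative, not TenMax's schema.
raw_events = [
    {"sessionId": "s1", "event": "request", "ad": "ad-1", "geo": "TW"},
    {"sessionId": "s1", "event": "impression"},
    {"sessionId": "s2", "event": "request", "ad": "ad-1", "geo": "TW"},
    {"sessionId": "s1", "event": "click"},
    {"sessionId": "s2", "event": "impression"},
]

# Maps an event type to the metric it increments on the session.
METRIC_OF = {"request": "requests", "impression": "impressions", "click": "clicks"}


def merge_sessions(events):
    """Group events by sessionId and merge each group into one session record."""
    sessions = defaultdict(lambda: {"requests": 0, "impressions": 0, "clicks": 0})
    for event in events:
        session = sessions[event["sessionId"]]
        session[METRIC_OF[event["event"]]] += 1
        # Carry dimension fields (ad, geo, ...) from whichever event has them.
        for key, value in event.items():
            if key not in ("sessionId", "event"):
                session.setdefault(key, value)
    return dict(sessions)


sessions = merge_sessions(raw_events)
```

Each resulting session carries both its metrics and its dimensions, which is what the second stage (aggregate metrics, group by dimensions) needs as input.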
Our Data Pipeline Timeline
[Timeline: 2015, 2016, 2017]
Data Pipeline Version 1
• MongoDB 2.3
• Why MongoDB?
– Schemaless
– Horizontal scale out
– Replication
NoSQL is Hot!!
Data Pipeline Version 1
[Diagram: the MongoDB solution mapped onto the pipeline. Raw events from the event stream are upserted into MongoDB, which merges them into sessions via in-place updates; MongoDB's aggregate pipeline then produces the report, which is stored in an RDBMS.]
Problem: Poor Write Performance
• MMAPv1 storage engine
– In-place update
– Fragmentation
– Random Access
– Big DB file
More bytes + random access = poor performance
Problem: Hard to Operate
• Too many server roles
– Mongos
– Shard master
– Shard slave
– Config server
Data Pipeline Version 2
• Cassandra 2.1
• Features
– Google BigTable-like Architecture
– Excellent Write Performance
– Peer-to-peer architecture
– Data Compression
Data Pipeline Version 2
[Diagram: the Cassandra solution mapped onto the pipeline. Raw events from the event stream are inserted into Cassandra, where compaction merges the events of a session; Java code then aggregates the sessions into the report, which is stored in an RDBMS.]
Write Performance
• Uses an LSM tree (log-structured merge-tree)
• Append-only sequential writes (for inserts, updates, and deletes)
• Compression support
• Three operations: flush, compact, read merge
[Diagram: a write goes to the write-ahead log and the memtable; the memtable is flushed to SSTables (Sorted String Tables)]
LSM Tree - Compact
[Diagram: memtable at level 0 and SSTables at levels 1 to 3; compaction merges several SSTables into one]
LSM Tree – Read Merge
[Diagram: a read merges results from the memtable and the SSTables]
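The three LSM operations named above (flush, compact, read merge) can be illustrated with a toy in-memory model. This is a sketch under assumptions, not Cassandra's implementation: `TinyLSM` and its methods are hypothetical names, the "SSTables" are plain sorted lists, and the write-ahead log and compression are left out.

```python
class TinyLSM:
    """Toy LSM tree: one memtable plus a list of immutable sorted tables."""

    TOMBSTONE = object()  # marker written in place of a deleted value

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest table first, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable; existing tables are never modified.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.put(key, self.TOMBSTONE)

    def flush(self):
        # Freeze the memtable into a sorted, immutable table.
        if self.memtable:
            table = sorted(self.memtable.items(), key=lambda kv: kv[0])
            self.sstables.append(table)
            self.memtable = {}

    def get(self, key):
        # Read merge: check the memtable, then tables from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is self.TOMBSTONE else value
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return None if v is self.TOMBSTONE else v
        return None

    def compact(self):
        # Merge all tables into one, keeping the newest value per key
        # and dropping tombstones.
        merged = {}
        for table in self.sstables:  # oldest first, so newer values win
            merged.update(table)
        live = {k: v for k, v in merged.items() if v is not self.TOMBSTONE}
        self.sstables = [sorted(live.items(), key=lambda kv: kv[0])] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)    # memtable is full, so it is flushed to the first table
db.put("a", 10)   # an update is a new write, not an in-place change
db.delete("b")    # tombstone; the second flush happens here
value_before = db.get("a")
tables_before = len(db.sstables)
db.compact()
```

Until `compact()` runs, the old value of "a" and the deleted "b" still occupy space in the first table; reads stay correct because newer tables are consulted first.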
Write Performance
• Who uses LSM trees?
Peer-to-Peer Architecture
• Every node acts as
– Contact server
– Coordinator
– Data node
– Meta server
• Easy to operate!
[Diagram: a coordinator node and three replicas (Replica1, Replica2, Replica3)]
How About Aggregation?
• Cassandra has no built-in group-by aggregation
• How do we aggregate?
– Java Stream (thanks, Java 8)
– Poppy (an in-house dataframe library)
https://github.com/tenmax/poppy
[Diagram: raw events from the event stream are inserted into Cassandra for grouping; application code aggregates them into the report stored in an RDBMS]
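Because the database offers no group-by, the aggregation runs in application code. The talk used Java Streams and Poppy; the sketch below is a Python stand-in for the same idea, comparable in spirit to Java's `Collectors.groupingBy` combined with a summing collector. The session fields, sample rows, and `group_aggregate` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical merged sessions; field names are illustrative only.
session_rows = [
    {"ad": "ad-1", "geo": "TW", "requests": 1, "impressions": 1, "clicks": 1},
    {"ad": "ad-1", "geo": "TW", "requests": 1, "impressions": 1, "clicks": 0},
    {"ad": "ad-2", "geo": "JP", "requests": 1, "impressions": 0, "clicks": 0},
]

METRICS = ("requests", "impressions", "clicks")


def group_aggregate(rows, dimensions):
    """Group rows by the given dimension columns and sum every metric."""
    totals = defaultdict(Counter)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key].update({m: row[m] for m in METRICS})
    return {key: dict(counts) for key, counts in totals.items()}


report_rows = group_aggregate(session_rows, ["ad", "geo"])
```

Everything is held in one process's memory here, which is fine for a few dimensions but is exactly what stops scaling once the grouping key has high cardinality.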
Problem: Cost
• SSD disks cost USD $0.135 per GB per month, while Azure Blob costs USD $0.02 per GB per month.
• SSD disk space must be allocated in advance, while Azure Blob is pay-as-you-go.
• Azure Blob replicates data even at the lowest pricing tier.
• Azure Blob is much more scalable and reliable than a self-hosted cluster.
• People cost
Cloud Storage Rocks!!
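To put the two prices side by side, here is a back-of-the-envelope calculation. The prices come from the slide and the 200 GB/day figure from the workload slide; the 30-day retention, and ignoring replication, compression, and pre-allocated headroom, are assumptions made only for illustration.

```python
# Prices from the slide (USD per GB per month).
SSD_PRICE = 0.135
BLOB_PRICE = 0.02

# Assumptions for illustration only: about 200 GB/day, 30 days retained,
# no replication factor, compression, or pre-allocated headroom.
GB_PER_DAY = 200
RETENTION_DAYS = 30


def monthly_cost(price_per_gb, gb_per_day=GB_PER_DAY, days=RETENTION_DAYS):
    """Monthly storage cost for `days` worth of data at `gb_per_day`."""
    return round(price_per_gb * gb_per_day * days, 2)


ssd_cost = monthly_cost(SSD_PRICE)
blob_cost = monthly_cost(BLOB_PRICE)
price_ratio = round(SSD_PRICE / BLOB_PRICE, 2)
```

Under these assumptions, 30 days of data costs about USD 810 per month on SSD versus USD 120 on Azure Blob, a ratio of 6.75, before counting the replicas a self-hosted cluster would also have to store.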
Problem: Aggregation
• An in-house solution is hard to evolve, while Hadoop/Spark has a huge ecosystem
• Scalability issues
• Lacks a key feature: grouping by a high-cardinality key
– Group by visitor
– Aggregate multi-dimensional OLAP cubes
Data Pipeline Version 2.1
• Dump the session data to Azure Blob for further use.
[Diagram: Cassandra still feeds report generation into the RDBMS; a dump of the session data to Azure Blob feeds OLAP cubes, ML models, and data sampling for the BI tool, the analytics server, and the ad server]
Data Pipeline Version 3
• Kafka 0.11 + Fluentd + Azure Blob + Spark 2.1
• Why?
– Azure Blob is cheap
– Azure Blob offers high throughput
– Spark is a map-shuffle-reduce framework, which makes grouping by a high-cardinality key possible.
Data Pipeline Version 3
[Diagram: the Spark solution mapped onto the pipeline. Fluentd uploads raw events to Azure Blob; Spark RDDs merge them into sessions, stored in Azure Blob; Spark DataFrames aggregate the sessions into the report, stored in Azure Blob or an RDBMS.]
How to Ingest Logs into Azure Blob?
[Diagram: raw events → Fluentd → Azure Blob → Spark RDD → sessions (Azure Blob) → Spark DataFrame → report (Azure Blob/RDBMS)]
How to Ingest Logs into Azure Blob?
• Solution 1
– The app writes logs to a local log file
– Fluentd tails the log files and uploads them to Azure Blob
• Pros
– Simple
• Cons
– Data is not uploaded as soon as the event happens
[Diagram: Server → Log → Fluentd → Azure Blob]
How to Ingest Logs into Azure Blob?
• Solution 2
– The app appends logs to Kafka
– Fluentd consumes the logs and batch-uploads them to Azure Blob
• Pros
– Logs are stored as soon as the event happens
– Logs can be used for multiple purposes
• Cons
– The server must be aware of Kafka
– If the connection to Kafka fails, the server has to buffer the logs itself or risk running out of memory (OOM)
[Diagram: Server → Kafka → Fluentd → Azure Blob]
How to Ingest Logs into Azure Blob?
• Solution 3
– The app writes logs to a local log file
– Fluentd tails the log file and pushes to Kafka (<100 ms latency)
– Fluentd consumes the logs from Kafka and batch-uploads them to Azure Blob
• Pros
– Logs are stored as soon as the event happens
– Logs can be used for multiple purposes
– Decouples the app from Kafka; Fluentd takes care of buffering and error recovery.
• Cons
– The most complex solution.
[Diagram: Bidder → Log → Fluentd → Kafka → Fluentd → Azure Blob]
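The value of handing buffering and error recovery to a separate agent can be seen in a small model of batch upload with retry. This is not Fluentd's implementation or configuration, only a sketch of the behavior: `BatchUploader` is a hypothetical class, and `upload` is a hypothetical callable standing in for the blob upload.

```python
class BatchUploader:
    """Buffer log lines and hand them to `upload` in batches.

    `upload` is a hypothetical callable standing in for the blob upload;
    it is expected to raise an exception on failure.
    """

    def __init__(self, upload, batch_size=3):
        self.upload = upload
        self.batch_size = batch_size
        self.buffer = []

    def append(self, line):
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Try to upload the buffer; keep it for a later retry on failure."""
        if not self.buffer:
            return True
        try:
            self.upload(list(self.buffer))
        except Exception:
            return False  # the buffer is kept, so no line is lost
        self.buffer.clear()
        return True


uploaded_batches = []
storage_is_down = [True]


def fake_upload(batch):
    # Simulated blob store that fails while `storage_is_down` is set.
    if storage_is_down[0]:
        raise IOError("storage unavailable")
    uploaded_batches.append(batch)


uploader = BatchUploader(fake_upload, batch_size=2)
uploader.append("event-1")
uploader.append("event-2")     # upload fails, both lines stay buffered
buffered_during_outage = list(uploader.buffer)
storage_is_down[0] = False
uploader.append("event-3")     # retry succeeds with all three lines
```

In Solution 2 this retry and buffering logic would have to live inside the ad server itself; in Solution 3 it lives in the log agent, and the app only appends to a local file.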
Event-Time Window
• A click event may happen several minutes after the impression event. How do we merge these events?
[Figure: events in arrival order; the window boundary at 11:00 falls between the second and third events]
Id: 1, ts: 10:58, event: impression
Id: 2, ts: 10:59, event: impression
Id: 1, ts: 11:02, event: click
Id: 3, ts: 11:02, event: impression
Id: 3, ts: 11:03, event: click
Event-Time Window
• Our solution
– Fluentd uploads each event to the partition window given by the session timestamp (partts) instead of the ingestion timestamp.
– sessionId is a TimeUUID, which embeds a timestamp in the UUID.
– For every event:
partts = timestampOf(sessionId)
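A TimeUUID is a version-1 UUID, whose timestamp field counts 100-nanosecond intervals since 1582-10-15, so `timestampOf(sessionId)` can be computed without any lookup. The sketch below uses Python's standard `uuid` module; `timestamp_of`, `partition_hour`, and the helper `time_uuid_at` are hypothetical names, and the partition key format is an assumption.

```python
import uuid
from datetime import datetime, timezone

# Offset between the UUID epoch (1582-10-15) and the Unix epoch (1970-01-01),
# in 100-nanosecond intervals.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def timestamp_of(session_id: uuid.UUID) -> datetime:
    """Return the UTC time embedded in a version-1 (time-based) UUID."""
    if session_id.version != 1:
        raise ValueError("not a time-based UUID")
    unix_100ns = session_id.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)


def partition_hour(session_id: uuid.UUID) -> str:
    """Hourly partition key derived from partts (format is an assumption)."""
    return timestamp_of(session_id).strftime("%Y-%m-%d-%H")


def time_uuid_at(moment: datetime) -> uuid.UUID:
    """Build a version-1 UUID for a given time (helper for this example only)."""
    ticks = round(moment.timestamp() * 10_000_000) + UUID_EPOCH_OFFSET
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0x00, 0))


# The impression at 10:58 creates the session. A click arriving at 11:02
# carries the same sessionId, so it maps to the same hourly partition.
session_id = time_uuid_at(datetime(2017, 9, 1, 10, 58, tzinfo=timezone.utc))
partts = timestamp_of(session_id)
partition = partition_hour(session_id)
```

Because the partition depends only on the sessionId, every event of a session resolves to the same window no matter when it is ingested.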
Event-Time Window
[Figure: the same events with partts added; the window boundary is at 11:00]
Id: 1, ts: 10:58, partts: 10:58, event: impression
Id: 2, ts: 10:59, partts: 10:59, event: impression
Id: 1, ts: 11:02, partts: 10:58, event: click (same window as its impression)
Id: 3, ts: 11:02, partts: 11:02, event: impression
Id: 3, ts: 11:03, partts: 11:02, event: click
• Now we can guarantee that all events of the same session land in the same window.
Spark RDD and Spark SQL
• Use Spark RDDs to merge events belonging to the same session (much like merging JSON objects).
• Use Spark DataFrames to aggregate metrics by dimensions (a high-dimensional OLAP cube).
• Save the DataFrame to Azure Blob in Parquet format.
• Save lower-dimensional sub-cubes of the data to the RDBMS.
[Diagram: event stream → raw events (Azure Blob) → Spark RDD → sessions (Azure Blob) → Spark DataFrame → report (Azure Blob/RDBMS)]
Lessons Learned
• Everything is a trade-off.
• For big data, trade features for cost-effectiveness
– DFS for batch sources
– Kafka for stream sources
• Cloud storage is very cheap!! Use it now.
• Spark is a great tool for processing data, even for non-distributed applications.
Storage Comparison
Feature              RDBMS  Document Store   BigTable-like Stores  Distributed File System
                            (e.g. MongoDB)   (e.g. Cassandra)      (e.g. Azure Blob, AWS S3, HDFS)
File/Table Scan      Yes    Yes              Yes                   Yes
Point Query          Yes    Yes              Yes                   -
Secondary Index      Yes    Yes              Yes*                  -
AdHoc Query          Yes    Yes              -                     -
Group and aggregate  Yes    Yes              -                     -
Join                 Yes    Yes*             -                     -
Transaction          Yes    -                -                     -
Storage Comparison
Criterion            RDBMS  Document Store   BigTable-like Stores  Distributed File System
                            (e.g. MongoDB)   (e.g. Cassandra)      (e.g. Azure Blob, AWS S3, HDFS)
Cost                 *      **               **                    *****
Query Latency        *****  ****             ***                   *
Throughput           **     **               **                    *****
Scalability          *      ***              ***                   *****
Availability         *      ***              ***                   *****
Where Do We Go Next?
• Stream processing
• Serverless model for analytics workloads
Stream Processing
• Why?
– Latency
– Incremental updates
• Trends
– Batch and stream processing in one system
– Exactly-once semantics
– Support for both ingestion time and event time
– Low watermarks for late events
– Structured Streaming
Serverless Model for Analytics Workload
• Characteristics of analytics workloads
– Low utilization rate
– Sudden demand for huge resources
– Interactive
• Not a good fit for provisioned-VM solutions such as
– AWS EMR, Azure HDInsight, GCP Dataproc
• Serverless solutions
– Google BigQuery, AWS Athena, Azure Data Lake Analytics
Recap
[Timeline: 2015, 2016, 2017]
Thanks
Questions?
