3. Scale of Event Log Aggregation
How many and how big?
● ~3.4–4.1 Trillion Events a Day
  ○ Across millions of clients
  ○ Still growing
● ~10 PB of Data a Day
  ○ Incoming, uncompressed
4. Events and Event Logs @Twitter
[Architecture diagram] Twitter data centers host a Real Time Cluster, Production Cluster, Ad hoc Cluster, and Cold Storage, fed by the Log Pipeline from micro services and streaming systems; GCP provides Google Cloud Storage, services to manage data, and data processing frameworks.
● Clients log events specifying a Category name. E.g.: ads_click, like_event …
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ads_click/2020/09/01/23
  ○ /logs/like_event/2020/09/01/23
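The hourly bucketing above is just a path template; a minimal sketch (a hypothetical helper, not Twitter's actual code):

```python
from datetime import datetime, timezone

def hourly_log_path(category: str, ts: datetime) -> str:
    """Build the hourly HDFS bucket directory for a log category:
    /logs/<category>/YYYY/MM/DD/HH."""
    return "/logs/{}/{:%Y/%m/%d/%H}".format(category, ts)

# An ads_click event logged at 2020-09-01 23:05 UTC lands in:
print(hourly_log_path("ads_click", datetime(2020, 9, 1, 23, 5, tzinfo=timezone.utc)))
# /logs/ads_click/2020/09/01/23
```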
6. Lessons Learnt
● Modularization
  ○ Each component should be independent and pluggable
  ○ Communication between components should go through a simple protocol
  ○ Each component can scale independently
● Tier-based approach
  ○ Resources should be shared inside a tier to improve utilization and resiliency
  ○ Resources should be isolated between tiers to control the blast radius
● Scalability is always a primary concern
  ○ Traffic grows every year
  ○ Scale leads to problems, e.g. the HDFS file-number limit
  ○ QoS of network traffic
● Users make mistakes
  ○ E.g. a user might make backward-incompatible schema changes
  ○ A user might want to restate the data because of an error
● Debuggability, long-tail problems, DC failover support, etc.
7. Goals
01 Hybrid Environments
● Seamless integration of on-prem clusters and cloud
● On-prem parity on cloud, such as data formats
02 Streaming/Batching
● Empower streaming use cases, e.g. Dataflow
● Support batch use cases such as Spark, Dataflow, and Presto
03 Cloud Native and PDP
● Leverage cloud-native technologies and unlock more cloud-native tools
● PDP (Private Data Protection) is always a big thing at Twitter
04 Scalability
● Traffic grows every year; the new log pipeline should be able to handle it
9. Use Cases Overview
[Diagram] Producers (GKE containers, VM services, serverless Cloud Functions and App Engine) publish through REST and IDL APIs into Pub/Sub topics (Topic 1 … Topic N). Consumers include batch Dataflow jobs writing to GCS, REST and IDL API readers, Kafka, stream-ingestion jobs into BigQuery, and user stream-processing jobs.
10. Log Pipeline In GCP - Architecture
[Diagram] Application → Log Pipeline Client Lib → Google Pub/Sub → Log Processors (DataFlow GCS Processor, DataFlow BQ Processor), with a Scheduler, State Store, and Replication Service.
● Unified client lib
  ○ Abstracts the backend implementation
● Google PubSub as subscribable storage
  ○ Rich meta headers, e.g. checksum
  ○ Exclusive subscription per destination
● Scheduler: schedules processors and exports metrics
● Processors: Dataflow jobs which stream data to different destinations
● State store
  ○ Schema info
  ○ Per-category meta such as owner
● Various destinations: BQ, GCS, Druid, etc.
● Replication service: the glue between destinations
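The "rich meta headers" idea can be sketched as follows. The attribute names and the Pub/Sub publish call shown in the comment are illustrative assumptions, not Twitter's actual client lib:

```python
import hashlib
import json

def build_event_message(category, event):
    """Serialize an event and attach metadata headers (category plus an
    end-to-end checksum) as Pub/Sub message attributes. The attribute
    names here are hypothetical."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    attributes = {
        "category": category,
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    return payload, attributes

# With google-cloud-pubsub the message could then be published roughly as:
#   publisher = pubsub_v1.PublisherClient()
#   publisher.publish(publisher.topic_path(project, category), payload, **attributes)
payload, attrs = build_event_message("ads_click", {"user_id": 42})
```

Carrying the checksum as an attribute lets every downstream processor re-verify the payload end to end without decoding it.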
11. Log Processors
Streaming Processor:
● A per-topic Dataflow job reads from PubSub and writes to BQ
● E2E latency of a few seconds
● Dead-letter table to handle corrupt data and schema errors
● E2E checksum validation
Batch Processor:
● Multiple output formats
  ○ Thrift-LZO: row-based format
  ○ Parquet: column-based format
● E2E checksum validation
● Tackles cold start with dummy events, to handle empty time ranges
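A minimal sketch of the checksum check and dead-letter routing, assuming the checksum travels as a message attribute; names are illustrative and the real jobs are Dataflow pipelines:

```python
import hashlib

def route_record(payload: bytes, attributes: dict) -> str:
    """Recompute the end-to-end checksum; records that fail validation
    go to the dead-letter table instead of the main BQ table."""
    expected = attributes.get("checksum")
    actual = hashlib.sha256(payload).hexdigest()
    return "bigquery" if expected == actual else "dead_letter"

good = b'{"user_id": 42}'
print(route_record(good, {"checksum": hashlib.sha256(good).hexdigest()}))  # bigquery
print(route_record(b"corrupted bytes", {"checksum": "deadbeef"}))          # dead_letter
```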
12. Event Controller
[Diagram] User config and RESTful commands feed the Processor Scheduler (Config Watcher, Status Watcher, REST API), which drives an Event Execution Pool over a Job Abstraction Layer with sinks for GCS stream ingestion, BigQuery stream ingestion, Druid ingestion, and more.
● User-friendly configuration
  ○ No need to worry about the implementation
  ○ Rich options including destination, data format, etc.
● Scalable and extendable
  ○ Multiple destination sinks supported
  ○ Stream and batch support
● Managed execution
  ○ Provides metrics and health checks
  ○ Priority and quota control support (planned)
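The user-facing routing config might be validated along these lines; the field names and allowed values below are assumptions for illustration, not the controller's real schema:

```python
ALLOWED_SINKS = {"bigquery", "gcs", "druid"}
ALLOWED_FORMATS = {"parquet", "thrift-lzo"}

def validate_category_config(cfg: dict) -> list:
    """Check a per-category routing config (destination sinks plus an
    output format) and return a list of problems; empty means valid."""
    problems = []
    for sink in cfg.get("destinations", []):
        if sink not in ALLOWED_SINKS:
            problems.append("unknown destination: {}".format(sink))
    if cfg.get("format") not in ALLOWED_FORMATS:
        problems.append("unknown format: {}".format(cfg.get("format")))
    return problems

# A user routes one category to two sinks with a simple dict:
cfg = {"category": "ads_click", "destinations": ["bigquery", "gcs"], "format": "parquet"}
print(validate_category_config(cfg))  # []
```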
13. Other Components
Client Library
● Uniform way to publish log events
● Per-log-category metrics tracking
● Static schema validation check at the event source
● Privacy Data Protection improvements
● End-to-end checksums
● End-to-end encryption
● Optional Base64 encoding
Schema Management
● A CI job creates the schema jar and uploads it to GCS
● Each processor reloads the latest schema bundle periodically
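The periodic schema-bundle reload could look like the sketch below; the fetch function is injected here, where production code would download the CI-built jar from GCS:

```python
import time

class SchemaBundleCache:
    """Cache the schema bundle and re-fetch it once the refresh
    interval has passed, as each processor does periodically."""

    def __init__(self, fetch_latest, ttl_seconds=300.0):
        self._fetch = fetch_latest      # e.g. downloads the schema jar from GCS
        self._ttl = ttl_seconds
        self._bundle = None
        self._loaded_at = None

    def get(self, now=None):
        now = time.time() if now is None else now
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._bundle = self._fetch()
            self._loaded_at = now
        return self._bundle
```

Pulling the bundle on a TTL instead of pushing it means a processor never has to restart to pick up a new schema.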
14. Log Replication
● Used for batch workflows
● Logs are collected independently at each data center
● The Log Replicator merges the logs across data centers
  ○ Copies data from one DC to the rest
● Uses the GCS connector to write to GCS through the HDFS APIs
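The fan-out can be pictured as below; the DC names and GCS bucket are hypothetical, and the gs:// path works because the GCS connector exposes GCS through the HDFS filesystem API:

```python
def replication_targets(src_dc: str, all_dcs: list, path: str) -> list:
    """List where one hourly bucket gets copied: every other DC's HDFS,
    plus GCS via the HDFS-compatible gs:// scheme."""
    targets = ["hdfs://{}{}".format(dc, path) for dc in all_dcs if dc != src_dc]
    targets.append("gs://example-log-bucket{}".format(path))  # hypothetical bucket
    return targets

print(replication_targets("dc1", ["dc1", "dc2", "dc3"], "/logs/ads_click/2020/09/01/23"))
```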
15. Deployment
● Separate Log Pipeline for each organization (GCP project) for better security and chargeback
● Provisioning a log category
  ○ Map the log category to a GCP project during provisioning
  ○ Create GCP resources (PubSub topics, buckets, BQ datasets) automatically using the demigod service (Terraform)
  ○ Configure event routing
● Access control
  ○ Write access to storage (GCS/BQ) is limited to the pillar-org-specific log processor
  ○ Read access to the GCS bucket/BQ is limited to service accounts only
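Provisioning can be pictured as a mapping from (org, category) to per-project resource names; every name in this sketch is an illustrative assumption, not the actual naming scheme:

```python
def provision_plan(org: str, category: str) -> dict:
    """Resources created when a log category is provisioned into its
    organization's GCP project (in practice via Terraform)."""
    project = "logs-{}".format(org)  # hypothetical per-org project id
    return {
        "project": project,
        "pubsub_topic": "projects/{}/topics/{}".format(project, category),
        "gcs_bucket": "{}-{}".format(project, category),
        "bq_dataset": "{}_{}".format(org, category),
    }

print(provision_plan("ads", "ads_click"))
```

Deriving every resource name from the org keeps write access and billing naturally scoped to that org's project.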
20. Conclusion
● Embrace hybrid cloud environments and provide a unified experience for publishing log events
● The Log Pipeline serves as a global-scale log data delivery mechanism inside Twitter
  ○ Aggregates data across DCs
  ○ Delivers in streaming and batch modes
  ○ Supports various sinks
  ○ Routing configured with simple knobs for the user
26. Job Scheduler
● A processor is a stream or batch ETL job which delivers data to a user-specified destination:
  ○ BigQuery stream ingestion
  ○ GCS stream ingestion
27. Event Controller
[Diagram] Job schedulers (Config Watcher, Status Watcher, REST API) drive an Event Execution Pool over a Job Abstraction Layer with GCS stream ingestion, BigQuery stream ingestion, and Druid ingestion sinks, fed by user config and RESTful commands.
● User-friendly configuration
  ○ Users can easily configure the data destination
  ○ Rich options including output format
● Managed execution environment
● Pluggable engine; simple transfer storage supported
29. Twitter Data Analytics: Scale
● >1EB storage capacity (~1 exabyte)
● >100PB of data read and written daily
● Several Hadoop clusters with >10K nodes
● >50K analytic jobs running on the Data Platform per day
30. Events and Event Logs @Twitter
Life of an event
● Clients log events specifying a Category name. E.g.: ads_click, like_event …
● Events are grouped together across all clients into the Category
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ads_click/2020/09/01/23
  ○ /logs/like_event/2020/09/01/23
● Event logs are replicated to other clusters or GCP
  ○ On-prem HDFS clusters
  ○ GCS
[Diagram] HTTP clients and client daemons send events to an HTTP endpoint; events are aggregated by category into incoming HDFS storage, then copied to replicated HDFS storage.
31. Events and Event Logs @Twitter
Life of an Event
● Clients log events specifying a Category name. E.g. ad_activated_keywords, login_event …
● Events are grouped together across all clients into the Category
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ad_activated_keywords/2017/05/01/23
  ○ /logs/login_event/2017/05/01/23
● Event logs are replicated to other clusters
[Diagram] HTTP clients and client daemons send events through Rufous; events are aggregated by category into incoming HDFS storage, then copied to replicated HDFS storage.
32. Log Pipeline In GCP
● Terminology
  ○ GCS: Google Cloud Storage
  ○ GCP: Google Cloud Platform
  ○ Project: a Google Cloud project, an organization of Google resources including APIs
● The backend components are split into different pillar cloud projects
  ○ A pillar is decided based on the organization, e.g. ads
  ○ Resources are isolated and planned independently
  ○ Better chargeback control
33. Events and Event Logs @Twitter
[Architecture diagram] Twitter data centers host a Real Time Cluster, Production Cluster, Ad hoc Cluster, and Cold Storage, fed by the Log Pipeline from micro services and streaming systems; GCP provides Google Cloud Storage, services to manage data, and data processing frameworks. The diagram is annotated with Data Ingestion, Data Replication, and Data Retention & Management.
34. Lessons learned
● Modularization
  ○ Each component should be independent
  ○ Communication between components should go through a simple protocol
  ○ Each component can scale independently
●