3. Scale of Event Log Aggregation
How many and how big?
● ~3.4–4.1 Trillion Events a Day
  ○ Across millions of clients
  ○ Still growing
● ~10 PB of Data a Day
  ○ Incoming, uncompressed
4. Events and Event Logs @Twitter
[Architecture diagram] Twitter data centers host a Real Time Cluster, Production Cluster, Ad hoc Cluster, and Cold Storage, fed by the Log Pipeline from micro services and streaming systems; GCP provides Google Cloud Storage, services to manage data, and data processing frameworks.
● Clients log events specifying a Category name. E.g.: ads_click, like_event …
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ads_click/2020/09/01/23
  ○ /logs/like_event/2020/09/01/23
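The hourly bucketing above is just a path template; a minimal sketch (a hypothetical helper, not Twitter's actual code):

```python
from datetime import datetime, timezone

def hourly_log_path(category: str, ts: datetime) -> str:
    """Build the hourly HDFS bucket directory for a log category:
    /logs/<category>/YYYY/MM/DD/HH."""
    return "/logs/{}/{:%Y/%m/%d/%H}".format(category, ts)

# An ads_click event logged at 2020-09-01 23:05 UTC lands in:
print(hourly_log_path("ads_click", datetime(2020, 9, 1, 23, 5, tzinfo=timezone.utc)))
# /logs/ads_click/2020/09/01/23
```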
6. Lessons Learnt
● Modularization
  ○ Each component should be independent and pluggable
  ○ Communication between components should go through a simple protocol
  ○ Each component can scale independently
● Tier-based approach
  ○ Resources should be shared inside a tier to improve utilization and resiliency
  ○ Resources should be isolated between tiers to control the blast radius
● Scalability is always a primary concern
  ○ Traffic grows every year
  ○ Scale leads to problems, e.g. the HDFS file-number limit
  ○ QoS of network traffic
● Users make mistakes
  ○ E.g. a user might make backward-incompatible schema changes
  ○ A user might want to restate the data because of an error
● Debuggability, long-tail problems, DC failover support, etc.
7. Goals
01 Hybrid Environments
● Seamless integration of on-prem clusters and cloud
● On-prem parity on cloud, such as data formats
02 Streaming/Batching
● Empower streaming use cases, e.g. Dataflow
● Support batch use cases such as Spark, Dataflow, and Presto
03 Cloud Native and PDP
● Leverage cloud-native technologies and unlock more cloud-native tools
● PDP (Private Data Protection) is always a big thing at Twitter
04 Scalability
● Traffic grows every year; the new log pipeline should be able to handle it
9. Use Cases Overview
[Diagram] Producers (GKE containers, VM services, serverless Cloud Functions and App Engine) publish through REST and IDL APIs into Pub/Sub topics (Topic 1 … Topic N). Consumers include batch Dataflow jobs writing to GCS, REST and IDL API readers, Kafka, stream-ingestion jobs into BigQuery, and user stream-processing jobs.
10. Log Pipeline In GCP - Architecture
[Diagram] Application → Log Pipeline Client Lib → Google Pub/Sub → Log Processors (DataFlow GCS Processor, DataFlow BQ Processor), with a Scheduler, State Store, and Replication Service.
● Unified client lib
  ○ Abstracts the backend implementation
● Google PubSub as subscribable storage
  ○ Rich meta headers, e.g. checksum
  ○ Exclusive subscription per destination
● Scheduler: schedules processors and exports metrics
● Processors: Dataflow jobs which stream data to different destinations
● State store
  ○ Schema info
  ○ Per-category meta such as owner
● Various destinations: BQ, GCS, Druid, etc.
● Replication service: the glue between destinations
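The "rich meta headers" idea can be sketched as follows. The attribute names and the Pub/Sub publish call shown in the comment are illustrative assumptions, not Twitter's actual client lib:

```python
import hashlib
import json

def build_event_message(category, event):
    """Serialize an event and attach metadata headers (category plus an
    end-to-end checksum) as Pub/Sub message attributes. The attribute
    names here are hypothetical."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    attributes = {
        "category": category,
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    return payload, attributes

# With google-cloud-pubsub the message could then be published roughly as:
#   publisher = pubsub_v1.PublisherClient()
#   publisher.publish(publisher.topic_path(project, category), payload, **attributes)
payload, attrs = build_event_message("ads_click", {"user_id": 42})
```

Carrying the checksum as an attribute lets every downstream processor re-verify the payload end to end without decoding it.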
11. Log Processors
Streaming Processor:
● A per-topic Dataflow job reads from PubSub and writes to BQ
● E2E latency of a few seconds
● Dead-letter table to handle corrupt data and schema errors
● E2E checksum validation
Batch Processor:
● Multiple output formats
  ○ Thrift-LZO: row-based format
  ○ Parquet: column-based format
● E2E checksum validation
● Tackles cold start with dummy events, to handle empty time ranges
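A minimal sketch of the checksum check and dead-letter routing, assuming the checksum travels as a message attribute; names are illustrative and the real jobs are Dataflow pipelines:

```python
import hashlib

def route_record(payload: bytes, attributes: dict) -> str:
    """Recompute the end-to-end checksum; records that fail validation
    go to the dead-letter table instead of the main BQ table."""
    expected = attributes.get("checksum")
    actual = hashlib.sha256(payload).hexdigest()
    return "bigquery" if expected == actual else "dead_letter"

good = b'{"user_id": 42}'
print(route_record(good, {"checksum": hashlib.sha256(good).hexdigest()}))  # bigquery
print(route_record(b"corrupted bytes", {"checksum": "deadbeef"}))          # dead_letter
```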
12. Event Controller
[Diagram] User config and RESTful commands feed the Processor Scheduler (Config Watcher, Status Watcher, REST API), which drives an Event Execution Pool over a Job Abstraction Layer with sinks for GCS stream ingestion, BigQuery stream ingestion, Druid ingestion, and more.
● User-friendly configuration
  ○ No need to worry about the implementation
  ○ Rich options including destination, data format, etc.
● Scalable and extendable
  ○ Multiple destination sinks supported
  ○ Stream and batch support
● Managed execution
  ○ Provides metrics and health checks
  ○ Priority and quota control support (planned)
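The user-facing routing config might be validated along these lines; the field names and allowed values below are assumptions for illustration, not the controller's real schema:

```python
ALLOWED_SINKS = {"bigquery", "gcs", "druid"}
ALLOWED_FORMATS = {"parquet", "thrift-lzo"}

def validate_category_config(cfg: dict) -> list:
    """Check a per-category routing config (destination sinks plus an
    output format) and return a list of problems; empty means valid."""
    problems = []
    for sink in cfg.get("destinations", []):
        if sink not in ALLOWED_SINKS:
            problems.append("unknown destination: {}".format(sink))
    if cfg.get("format") not in ALLOWED_FORMATS:
        problems.append("unknown format: {}".format(cfg.get("format")))
    return problems

# A user routes one category to two sinks with a simple dict:
cfg = {"category": "ads_click", "destinations": ["bigquery", "gcs"], "format": "parquet"}
print(validate_category_config(cfg))  # []
```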
13. Other Components
Client Library
● Uniform way to publish log events
● Per-log-category metrics tracking
● Static schema validation check at the event source
● Privacy Data Protection improvements
● End-to-end checksums
● End-to-end encryption
● Optional Base64 encoding
Schema Management
● A CI job creates the schema jar and uploads it to GCS
● Each processor reloads the latest schema bundle periodically
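The periodic schema-bundle reload could look like the sketch below; the fetch function is injected here, where production code would download the CI-built jar from GCS:

```python
import time

class SchemaBundleCache:
    """Cache the schema bundle and re-fetch it once the refresh
    interval has passed, as each processor does periodically."""

    def __init__(self, fetch_latest, ttl_seconds=300.0):
        self._fetch = fetch_latest      # e.g. downloads the schema jar from GCS
        self._ttl = ttl_seconds
        self._bundle = None
        self._loaded_at = None

    def get(self, now=None):
        now = time.time() if now is None else now
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._bundle = self._fetch()
            self._loaded_at = now
        return self._bundle
```

Pulling the bundle on a TTL instead of pushing it means a processor never has to restart to pick up a new schema.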
14. Log Replication
● Used for batch workflows
● Logs are collected independently at each data center
● The Log Replicator merges the logs across data centers
  ○ Copies data from one DC to the rest
● Uses the GCS connector to write to GCS through the HDFS APIs
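The fan-out can be pictured as below; the DC names and GCS bucket are hypothetical, and the gs:// path works because the GCS connector exposes GCS through the HDFS filesystem API:

```python
def replication_targets(src_dc: str, all_dcs: list, path: str) -> list:
    """List where one hourly bucket gets copied: every other DC's HDFS,
    plus GCS via the HDFS-compatible gs:// scheme."""
    targets = ["hdfs://{}{}".format(dc, path) for dc in all_dcs if dc != src_dc]
    targets.append("gs://example-log-bucket{}".format(path))  # hypothetical bucket
    return targets

print(replication_targets("dc1", ["dc1", "dc2", "dc3"], "/logs/ads_click/2020/09/01/23"))
```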
15. Deployment
● Separate Log Pipeline for each organization (GCP project) for better security and chargeback
● Provisioning a log category
  ○ Map the log category to a GCP project during provisioning
  ○ Create GCP resources (PubSub topics, buckets, BQ datasets) automatically using the demigod service (Terraform)
  ○ Configure event routing
● Access control
  ○ Write access to storage (GCS/BQ) is limited to the pillar-org-specific log processor
  ○ Read access to the GCS bucket/BQ is limited to service accounts only
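Provisioning can be pictured as a mapping from (org, category) to per-project resource names; every name in this sketch is an illustrative assumption, not the actual naming scheme:

```python
def provision_plan(org: str, category: str) -> dict:
    """Resources created when a log category is provisioned into its
    organization's GCP project (in practice via Terraform)."""
    project = "logs-{}".format(org)  # hypothetical per-org project id
    return {
        "project": project,
        "pubsub_topic": "projects/{}/topics/{}".format(project, category),
        "gcs_bucket": "{}-{}".format(project, category),
        "bq_dataset": "{}_{}".format(org, category),
    }

print(provision_plan("ads", "ads_click"))
```

Deriving every resource name from the org keeps write access and billing naturally scoped to that org's project.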
20. Conclusion
● Embrace hybrid cloud environments and provide a unified experience for publishing log events
● The Log Pipeline serves as a global-scale log data delivery mechanism inside Twitter
  ○ Aggregates data across DCs
  ○ Delivers in streaming and batch modes
  ○ Supports various sinks
  ○ Routing configured with simple knobs for the user
26. Job Scheduler
● A processor is a stream or batch ETL job which delivers data to a user-specified destination:
  ○ BigQuery stream ingestion
  ○ GCS stream ingestion
27. Event Controller
[Diagram] Job schedulers (Config Watcher, Status Watcher, REST API) drive an Event Execution Pool over a Job Abstraction Layer with GCS stream ingestion, BigQuery stream ingestion, and Druid ingestion sinks, fed by user config and RESTful commands.
● User-friendly configuration
  ○ Users can easily configure the data destination
  ○ Rich options including output format
● Managed execution environment
● Pluggable engine; simple transfer storage supported
29. Twitter Data Analytics: Scale
● >1EB storage capacity (~1 exabyte)
● >100PB of data read and written daily
● Several Hadoop clusters with >10K nodes
● >50K analytic jobs running on the Data Platform per day
30. Events and Event Logs @Twitter
Life of an event
● Clients log events specifying a Category name. E.g.: ads_click, like_event …
● Events are grouped together across all clients into the Category
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ads_click/2020/09/01/23
  ○ /logs/like_event/2020/09/01/23
● Event logs are replicated to other clusters or GCP
  ○ On-prem HDFS clusters
  ○ GCS
[Diagram] HTTP clients and client daemons send events to an HTTP endpoint; events are aggregated by category into incoming HDFS storage, then copied to replicated HDFS storage.
31. Events and Event Logs @Twitter
Life of an Event
● Clients log events specifying a Category name. E.g. ad_activated_keywords, login_event …
● Events are grouped together across all clients into the Category
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ad_activated_keywords/2017/05/01/23
  ○ /logs/login_event/2017/05/01/23
● Event logs are replicated to other clusters
[Diagram] HTTP clients and client daemons send events through Rufous; events are aggregated by category into incoming HDFS storage, then copied to replicated HDFS storage.
32. Log Pipeline In GCP
● Terminology
  ○ GCS: Google Cloud Storage
  ○ GCP: Google Cloud Platform
  ○ Project: a Google Cloud project, an organization of Google resources including APIs
● The backend components are split into different pillar cloud projects
  ○ A pillar is decided based on the organization, e.g. ads
  ○ Resources are isolated and planned independently
  ○ Better chargeback control
33. Events and Event Logs @Twitter
[Architecture diagram] Twitter data centers host a Real Time Cluster, Production Cluster, Ad hoc Cluster, and Cold Storage, fed by the Log Pipeline from micro services and streaming systems; GCP provides Google Cloud Storage, services to manage data, and data processing frameworks. The diagram is annotated with Data Ingestion, Data Replication, and Data Retention & Management.
34. Lessons learned
● Modularization
  ○ Each component should be independent
  ○ Communication between components should go through a simple protocol
  ○ Each component can scale independently
●