Story of moving 4 Trillion Events
(Log Pipeline) from Batch to Streaming
ApacheCon 2020
Lohit VijayaRenu, Zhenzhao Wang, Praveen Killamsetti
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Scale of Event Log Aggregation
How many and how big?
● ~3.4–4.1 Trillion Events a Day, across millions of clients, and still growing
● ~10 PB of Data a Day, incoming uncompressed
Events and Event Logs @Twitter
(Diagram) Twitter DataCenter: Micro Services, Streaming systems, Log Pipeline, Real Time Cluster, Production Cluster, Ad hoc Cluster, Cold Storage. GCP: Google Cloud Storage, services to manage data, data processing frameworks.
● Clients log events specifying a Category name, e.g. ads_click, like_event, ...
● Events are stored on HDFS, bucketed every hour into separate directories (see the path sketch below)
  ○ /logs/ads_click/2020/09/01/23
  ○ /logs/like_event/2020/09/01/23
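As a minimal illustration of the bucketing convention above (a hypothetical helper, not Twitter's actual code; only the /logs/<category>/YYYY/MM/DD/HH layout comes from the slides):

```python
from datetime import datetime, timezone

def hourly_log_dir(category: str, event_time: datetime) -> str:
    """Map a log category and an event timestamp to its hourly HDFS bucket."""
    return "/logs/{}/{:%Y/%m/%d/%H}".format(category, event_time)

print(hourly_log_dir("ads_click", datetime(2020, 9, 1, 23, tzinfo=timezone.utc)))
# -> /logs/ads_click/2020/09/01/23
```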
Events and Event Logs @Twitter
Log Management Components
(Diagram) Clients, HTTP clients, and Client Daemons publish through Rufous; events are aggregated by Category into Storage HDFS (Incoming) and then Storage HDFS (Replicated). Stages: Event Aggregation, Event Log Processing, Event Log Replication, Event Log Management.
Lessons Learnt
● Modularization
  ○ Each component should be independent and pluggable.
  ○ Communication between components should follow a simple protocol.
  ○ Each component should scale independently.
● Tier-based approach
  ○ Resources should be shared within a tier to improve utilization and resiliency.
  ○ Resources should be isolated between tiers to control the blast radius.
● Scalability is always a primary concern
  ○ Traffic grows every year.
  ○ Scale leads to problems, e.g. the HDFS file-count limit.
  ○ QoS of network traffic.
● Users make mistakes
  ○ E.g. a user might make a backward-incompatible schema change.
  ○ A user might want to restate data because of an error.
● Debuggability, long-tail problems, DC failover support, etc.
Goals
01 Hybrid Environments
● Seamless integration of on-prem clusters and cloud
● On-prem parity on cloud, such as data formats
02 Streaming/Batching
● Empower streaming use cases, e.g. Dataflow
● Support batch use cases such as Spark, Dataflow, Presto
03 Cloud Native and PDP
● Leverage cloud-native technologies and unlock more cloud-native tools
● PDP (Private Data Protection) is always a big thing at Twitter
04 Scalability
● Traffic grows every year; the new log pipeline should be able to handle it
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Use cases Overview
(Diagram) Pub/Sub topics sit between producers and consumers. Producers: GKE containers, VMs (services), serverless (Cloud Function, App Engine), publishing via REST and IDL APIs. Consumers: batch Dataflow jobs writing to GCS, stream ingestion jobs into BigQuery, user stream-processing jobs, and Kafka, also via REST and IDL APIs.
Log Pipeline In GCP - Architecture
(Diagram) Application with the Log Pipeline Client Lib publishes to Google Pub/Sub; the DataFlow GCS Processor and DataFlow BQ Processor (the Log Processors) consume from it, coordinated by a Scheduler and a State Store, with a Replication Service downstream.
● Unified client lib
  ○ Abstracts the backend implementation
● Google PubSub as subscribable storage
  ○ Rich metadata headers, e.g. checksum (see the publish sketch below)
  ○ Exclusive subscription per destination
● Schedule processors and export metrics
● Processors: Dataflow jobs which stream data to different destinations
● State store:
  ○ Schema info
  ○ Per-category metadata such as owner
● Various destinations: BQ, GCS, Druid, etc.
● Replication service: glue between destinations
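To make the client-lib-to-Pub/Sub hop concrete, here is a minimal sketch (not Twitter's client library) of publishing one event to a per-category topic, carrying an end-to-end checksum as a Pub/Sub message attribute; the project, topic naming, and attribute names are assumptions.

```python
import hashlib

from google.cloud import pubsub_v1

def publish_event(project_id: str, category: str, payload: bytes) -> None:
    """Publish one event to a per-category topic with an end-to-end checksum."""
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(project_id, category)  # e.g. topic "ads_click"
    future = publisher.publish(
        topic_path,
        payload,
        category=category,                             # message attribute
        checksum=hashlib.sha256(payload).hexdigest(),  # verified by the processor
    )
    future.result()  # block until Pub/Sub acknowledges the publish

publish_event("my-gcp-project", "ads_click", b"<serialized event>")
```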
Log Processors
Streaming Processor:
● Per-topic Dataflow job reads from PubSub and writes to BQ
● E2E latency of a few seconds
● Dead-letter table to handle corrupt data / schema errors (see the dead-letter sketch below)
● E2E checksum validation
Batch Processor:
● Multiple output formats
  ○ Thrift-LZO: row-based format
  ○ Parquet: column-based format
● E2E checksum validation
● Tackle cold start with dummy events
  ○ To handle empty time ranges
(Diagram) Application with the Log Pipeline Client Lib publishes to Google Pub/Sub, feeding the DataFlow GCS Processor and DataFlow BQ Processor, with the Replication Service downstream.
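A minimal Apache Beam (Python SDK) sketch of the dead-letter pattern described above: events whose checksum or parsing fails are routed to a separate BigQuery table instead of failing the pipeline. This is not the actual Twitter job; the topic, table names, and the "checksum" attribute are placeholders, and both tables are assumed to already exist.

```python
import hashlib
import json

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery
from apache_beam.options.pipeline_options import PipelineOptions

class ValidateEvent(beam.DoFn):
    """Route events to the main output or to a 'dead_letter' tagged output."""

    def process(self, msg):
        try:
            if hashlib.sha256(msg.data).hexdigest() != msg.attributes.get("checksum"):
                raise ValueError("checksum mismatch")
            yield json.loads(msg.data)  # schema/parse errors surface here
        except Exception as err:
            yield beam.pvalue.TaggedOutput(
                "dead_letter",
                {"raw": msg.data.decode("utf-8", "replace"), "error": str(err)},
            )

options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    events = (
        p
        | beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/ads_click", with_attributes=True)
        | beam.ParDo(ValidateEvent()).with_outputs("dead_letter", main="ok")
    )
    # Destination tables are assumed to exist with matching schemas.
    events.ok | "WriteMain" >> WriteToBigQuery(
        "my-project:logs.ads_click",
        create_disposition=BigQueryDisposition.CREATE_NEVER)
    events.dead_letter | "WriteDeadLetter" >> WriteToBigQuery(
        "my-project:logs.ads_click_dead_letter",
        create_disposition=BigQueryDisposition.CREATE_NEVER)
```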
Event Controller
Processor Scheduler
Config
Watcher
● User-friendly configuration
  ○ No need to worry about the implementation
  ○ Rich options including destination, data format, etc. (a hypothetical config sketch follows below)
● Scalable and extensible
  ○ Multiple destination sinks supported
  ○ Stream and batch supported
● Managed execution
  ○ Provides metrics and health checks
  ○ Priority and quota control support (planned)
(Diagram) Status Watcher, REST API, and an Event Execution Pool sit on a Job Abstraction Layer with pluggable sinks (GCS Stream Ingestion, BigQuery Stream Ingestion, Druid Ingestion, ...), driven by User Config and RESTful commands.
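As an illustration of the kind of per-category routing configuration the Event Controller consumes, here is a sketch with invented field names; the real config format is not shown in the slides.

```python
# Hypothetical per-category routing config; all field names are illustrative only.
ADS_CLICK_CONFIG = {
    "category": "ads_click",
    "owner": "ads-team",
    "destinations": [
        {"sink": "bigquery", "mode": "stream", "table": "logs.ads_click"},
        {"sink": "gcs", "mode": "batch", "format": "parquet",
         "path": "gs://ads-logs/ads_click/{yyyy}/{MM}/{dd}/{HH}"},
    ],
}

def sinks_for(config: dict, mode: str) -> list:
    """Return the destinations a processor of the given mode should serve."""
    return [d for d in config["destinations"] if d["mode"] == mode]

print(sinks_for(ADS_CLICK_CONFIG, "stream"))
```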
Other Components
Client Library
● Uniform way to publish log events
● Per-log-category metrics tracking
● Static schema validation check at the event source
● Privacy Data Protection improvements
  ○ End-to-end checksums
  ○ End-to-end encryption
  ○ Optional Base64 encoding
Schema Management
● CI job creates a schema jar and uploads it to GCS
● Each processor reloads the latest schema bundle periodically (see the reload sketch below)
(Diagram) Application with the Log Pipeline Client Lib publishes to Google Pub/Sub, feeding the DataFlow GCS Processor and DataFlow BQ Processor, with the Replication Service downstream.
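A minimal sketch of periodically reloading a schema bundle from GCS, in the spirit of the schema-management bullets above; the bucket, object name, and the five-minute interval are assumptions, and the deck actually ships schemas as a jar built by CI.

```python
import threading

from google.cloud import storage

class SchemaBundle:
    """Periodically re-downloads a schema bundle object from GCS."""

    def __init__(self, bucket: str, blob_name: str, interval_s: float = 300.0):
        self._blob = storage.Client().bucket(bucket).blob(blob_name)
        self._interval_s = interval_s
        self._bytes = b""
        self._reload()

    def _reload(self) -> None:
        self._bytes = self._blob.download_as_bytes()  # always fetch the latest generation
        timer = threading.Timer(self._interval_s, self._reload)
        timer.daemon = True  # do not keep the processor alive just for reloads
        timer.start()

    def raw(self) -> bytes:
        return self._bytes

schemas = SchemaBundle("my-schema-bucket", "schemas/latest-bundle.jar")
```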
Log Replication
● Used for batch workflows
● Logs are collected independently at each data center
● Log Replicator merges the logs across data centers
  ○ Copies data from one DC to the rest
  ○ Uses the GCS connector to write to GCS through the HDFS APIs (see the distcp-style sketch below)
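One common way to exercise the GCS connector through HDFS APIs is a distcp-style copy from an hourly HDFS bucket to a gs:// path. The sketch below simply shells out to `hadoop distcp`; the cluster, bucket, and paths are placeholders, and this is not Twitter's replicator.

```python
import subprocess

def replicate_hour(category: str, hour_path: str, gcs_bucket: str) -> None:
    """Copy one hourly log directory from HDFS to GCS via the GCS connector.

    Relies on the GCS connector being on the Hadoop classpath so that
    gs:// URIs are valid destinations for HDFS-API tools like distcp.
    """
    src = f"hdfs:///logs/{category}/{hour_path}"
    dst = f"gs://{gcs_bucket}/logs/{category}/{hour_path}"
    subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)

replicate_hour("ads_click", "2020/09/01/23", "my-replicated-logs")
```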
Deployment
● Separate Log Pipeline per organization (GCP project) for better security and chargeback
● Provisioning a Log Category
  ○ Map the log category to a GCP project during provisioning
  ○ Create GCP resources (Pub/Sub topics, buckets, BQ datasets) automatically using the demigod service (Terraform); an API-level sketch follows below
  ○ Configure event routing
  ○ Access control:
    ■ Limit write access to storage (GCS/BQ) to the pillar-org-specific log processor
    ■ Limit read access to the GCS bucket/BQ to service accounts only
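The slides describe creating these resources through Terraform; as a rough, purely illustrative equivalent (not the actual provisioning code, and with made-up naming conventions), the same resources can be created from Python with the Google Cloud client libraries:

```python
from google.cloud import bigquery, pubsub_v1, storage

def provision_category(project_id: str, category: str) -> None:
    """Create the Pub/Sub topic, GCS bucket, and BQ dataset for one category."""
    publisher = pubsub_v1.PublisherClient()
    publisher.create_topic(name=publisher.topic_path(project_id, category))

    storage.Client(project=project_id).create_bucket(f"{project_id}-logs-{category}")

    bq = bigquery.Client(project=project_id)
    bq.create_dataset(f"{project_id}.logs_{category}", exists_ok=True)

provision_category("ads-pillar-project", "ads_click")
```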
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Streaming Data from Twitter DCs to Cloud Log Pipeline
(Diagram) Applications in TWTTR-DC1 publish through the Client Library and a Scribe Daemon into Flume Aggregation, which forwards to GCP Pub/Sub and on to the Streaming Log Processor (Dataflow).
Log Delivery - Big Picture
(Diagram) Applications in Twitter DCs (TWTTR-DC1, TWTTR-DC2) publish through the Client Library and Scribe Daemons into Flume Aggregation (one per Twitter DC); applications on GCP publish through the Client Library directly. Data flows into GCP Pub/Sub and Kafka (Twitter DC) and is consumed by the Streaming Log Processor (Dataflow), the Tez Log Processor, and the Replication Service.
Possible Routings
● Stream flows:
  ○ LPClient -> PubSub -> BQ
  ○ LPClient -> Flume -> PubSub -> BQ
● Batch flows:
  ○ LPClient -> Flume -> HDFS
  ○ LPClient -> PubSub -> GCS
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Conclusion
● Embrace the hybrid cloud environment and provide a unified experience for publishing log events
● Log Pipeline serves as a global-scale log data delivery mechanism inside Twitter
  ○ Aggregation of data across DCs
  ○ Streaming and batch mode delivery
  ○ Support for various sinks
  ○ Routing configured with simple knobs for users
Q&A
Thank you.
DataFlow Processors
Streaming Processor:
● Per-topic Dataflow job reads from PubSub and writes to BQ
● E2E latency of a few seconds
● Dead-letter table to handle corrupt data / schema errors
● Checksum validation
Batch Processor:
● Multiple output formats
  ○ Thrift-LZO: row-based format
  ○ Parquet: column-based format
● E2E checksum support
● Tackle cold start with dummy events
  ○ To handle empty time ranges
Log Pipeline In GCP - Architecture
(Diagram) Application with the Log Pipeline Client Lib publishes to Google Pub/Sub, feeding DataFlow GCS Processors (the Log Processors), coordinated by a Scheduler, a State Store, and a User Interface (UI/CLI).
Log Pipeline In GCP - Architecture
(Diagram) Application with the Log Pipeline Client Lib publishes to Google Pub/Sub, feeding the DataFlow GCS Processor and DataFlow BQ Processor (the Log Processors), coordinated by a Scheduler and a State Store.
● Unified client lib
  ○ Abstracts client/backend differences
● Google PubSub as subscribable storage
  ○ Rich context headers, e.g. checksum
  ○ Exclusive subscription per destination
● Schedule processors
● Processors: Dataflow jobs which stream data to different destinations
● State store:
  ○ Schema info
  ○ Per-category metadata such as owner
Job Scheduler
● A processor is a Stream/Batch ETL job which delivers data to a user-specified destination:
  ○ BigQuery Stream Ingestion
  ○ GCS Stream Ingestion
Event Controller
Job Schedulers
Config Watcher
● User-friendly configuration
  ○ Users can configure the data destination easily
  ○ Rich options including output format,
● Managed execution environment
  ○ Move
  ○ Pluggable engine; simple transfer storage supported
(Diagram) Status Watcher, REST API, and an Event Execution Pool sit on a Job Abstraction Layer with pluggable sinks (GCS Stream Ingestion, BigQuery Stream Ingestion, Druid Ingestion, ...), driven by User Config and RESTful commands.
Log Pipeline In GCP - Architecture
(Diagram) Application with the Log Pipeline Client Lib publishes to Google Pub/Sub, feeding the DataFlow GCS Processor and DataFlow BQ Processor (the Log Processors), coordinated by a Scheduler and a State Store.
● Unified client lib
  ○ Abstracts client/backend differences
● Google PubSub as subscribable storage
  ○ Rich context headers, e.g. checksum
  ○ Exclusive subscription per destination
● Schedule processors and export metrics
● Processors: Dataflow jobs which stream data to different destinations
● Various destinations: BQ, GCS, Druid, etc.
● State store:
  ○ Schema info
  ○ Per-category metadata such as owner
● Replication service: glue between destinations
Twitter Data Analytics: Scale
● Several Hadoop clusters, >10K nodes per cluster
● >1 EB (~1 Exabyte) storage capacity
● >100 PB of data read and written daily
● >50K analytic jobs running on the Data Platform per day
Events and Event Logs @Twitter — Life of an Event
● Clients log events specifying a Category name, e.g. ads_click, like_event, ...
● Events are grouped together across all clients into the Category
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ads_click/2020/09/01/23
  ○ /logs/like_event/2020/09/01/23
● Event logs are replicated to other clusters or GCP
  ○ On-prem HDFS clusters
  ○ GCS
(Diagram) Clients, HTTP clients, and Client Daemons publish through an HTTP endpoint; events are aggregated by Category into Storage HDFS (Incoming) and then Storage HDFS (Replicated).
Events and Event Logs @Twitter — Life of an Event
● Clients log events specifying a Category name, e.g. ad_activated_keywords, login_event, ...
● Events are grouped together across all clients into the Category
● Events are stored on HDFS, bucketed every hour into separate directories
  ○ /logs/ad_activated_keywords/2017/05/01/23
  ○ /logs/login_event/2017/05/01/23
● Event logs are replicated to other clusters
(Diagram) Clients, HTTP clients, and Client Daemons publish through Rufous; events are aggregated by Category into Storage HDFS (Incoming) and then Storage HDFS (Replicated).
Log Pipeline In GCP
● Terminology
  ○ GCS - Google Cloud Storage
  ○ GCP - Google Cloud Platform
  ○ Project - a Google Cloud project, which is an organization of Google resources including APIs
● The backend components are split into different pillar cloud projects
  ○ A pillar is decided based on the organization, e.g. ads
  ○ Resources are isolated and planned independently
  ○ Better chargeback control
Events and Event Logs @Twitter
(Diagram) Twitter DataCenter: Micro Services, Streaming systems, Log Pipeline, Real Time Cluster, Production Cluster, Ad hoc Cluster, Cold Storage. GCP: Google Cloud Storage, services to manage data, data processing frameworks. Stages: Data Ingestion, Data Replication, Data Retention & Management.
Lessons Learned
● Modularization
  ○ Each component should be independent.
  ○ Communication between components should follow a simple protocol.
  ○ Each component should scale independently.