TWITTER IS REAL TIME
WHAT IS REAL TIME?
REAL TIME PIPELINE
REAL TIME COMPONENTS
REAL TIME USE CASES
ETL BI
PRODUCT
SAFETY
TRENDS
ML MEDIA OPS ADS
20 PB
2 Trillion Events/Day
100 ms e2e latency
400 Real Time Jobs
DLOG & HERON are Open Sourced
WE ARE HIRING!
Messaging
Data Infrastructure
Core Services
Search Infrastructure
Traffic
Real Time Compute
Compute Platform
Platform Engineering
Kernel
#LoveWhereYouWork
Learn more at careers.twitter.com
Hadoop
Core Data Libraries
Data Applications
Core Metrics
- Easy operations
- Small technology portfolio
- Quick development Iteration
- Diverse use cases
[Diagram labels: client, Write Proxy, Read Proxy, Bookkeeper]
[Diagram labels: Publisher, Subscriber, Write Proxy, Read Proxy, DistributedLog, Bookkeeper, Metadata, Self Serve]
20 PB
2 Trillion Events
100 ms e2e latency
- Event
A discrete, self-contained piece of data
- Stream
A persistent, unordered collection of events with a time
- Partition
A portion of a stream with a proportional share of the stream's overall capacity
- Subscriber
A collection of processes collectively consuming a copy of the stream
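The four terms above can be modelled in a few lines. This is a minimal sketch, not the messaging system's actual API: the `Event`, `Stream` and `assign_partitions` names are hypothetical, and both the hash-based choice of partition and the round-robin split of partitions across a subscriber's processes are assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Event:
    """A discrete, self-contained piece of data."""
    key: str
    payload: bytes


@dataclass
class Stream:
    """A collection of events, split into a fixed number of partitions."""
    name: str
    num_partitions: int
    partitions: Dict[int, List[Event]] = field(default_factory=dict)

    def publish(self, event: Event) -> int:
        # Assumption: the partition is picked by a stable hash of the event key.
        partition = zlib.crc32(event.key.encode("utf-8")) % self.num_partitions
        self.partitions.setdefault(partition, []).append(event)
        return partition


def assign_partitions(num_partitions: int, processes: List[str]) -> Dict[str, List[int]]:
    """Split partitions round-robin across the processes of one subscriber,
    so that together they consume exactly one copy of the stream."""
    owned: Dict[str, List[int]] = {proc: [] for proc in processes}
    for partition in range(num_partitions):
        owned[processes[partition % len(processes)]].append(partition)
    return owned


stream = Stream(name="engagements", num_partitions=4)
chosen = stream.publish(Event(key="user-42", payload=b"like"))
ownership = assign_partitions(stream.num_partitions, ["proc-a", "proc-b"])
```

Every partition has exactly one owner inside the subscriber, so no event is consumed twice by the same subscriber.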
[Diagram labels: Publisher, Subscriber, Write Proxy, Read Proxy, DistributedLog, Bookkeeper, Metadata, Self Serve]
[Diagram labels: Flow Control, Stream Configuration, Partition Ownership, DistributedLog, (E => Future[Unit]), Offset Tracking, Offset Store, Metadata, DL Read Proxy]
@DistributedLog
http://distributedlog.io
Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny
<@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar
<@mahakp>, Philip Su <@philipsu522>, Yiming Zang
<@zang_yiming>
Messaging Alumni: David Helder, Aniruddha Laud, Robin
Dhamankar
STORM/HERON TERMINOLOGY
- TOPOLOGY
Directed acyclic graph
Vertices=computation, and edges=streams of data tuples
- SPOUTS
Sources of data tuples for the topology
Examples - Kafka/Distributed Log/MySQL/Postgres
- BOLTS
Process incoming tuples and emit outgoing tuples
Examples - filtering/aggregation/join/arbitrary function
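The terminology above can be illustrated with a toy, single-process sketch. It does not use the Storm or Heron API: `number_spout`, `filter_even_bolt`, `square_bolt`, `sum_bolt` and `run_topology` are hypothetical names, and the topology is a simple linear chain, which is only the simplest case of a directed acyclic graph.

```python
from typing import Callable, Iterable, Iterator, List


def number_spout(limit: int) -> Iterator[int]:
    """Spout: a source of data tuples (a real spout might read Kafka or DistributedLog)."""
    for n in range(limit):
        yield n


def filter_even_bolt(tuples: Iterable[int]) -> Iterator[int]:
    """Bolt (filtering): emits only the even tuples it receives."""
    for t in tuples:
        if t % 2 == 0:
            yield t


def square_bolt(tuples: Iterable[int]) -> Iterator[int]:
    """Bolt (arbitrary function): emits the square of each incoming tuple."""
    for t in tuples:
        yield t * t


def sum_bolt(tuples: Iterable[int]) -> Iterator[int]:
    """Bolt (aggregation): consumes all tuples and emits their total."""
    yield sum(tuples)


def run_topology(spout: Iterable[int],
                 bolts: List[Callable[[Iterable[int]], Iterator[int]]]) -> List[int]:
    """Wire the spout to the bolts in order; each edge is a stream of tuples."""
    stream: Iterable[int] = spout
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


result = run_topology(number_spout(6), [filter_even_bolt, square_bolt, sum_bolt])
print(result)  # [20]: the even numbers 0, 2, 4 squared and summed
```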
STORM/HERON TOPOLOGY
[Diagram: a topology with SPOUT 1 and SPOUT 2 feeding BOLT 1 through BOLT 5]
WHY HERON?
● SCALABILITY and PERFORMANCE PREDICTABILITY
● IMPROVE DEVELOPER PRODUCTIVITY
● EASE OF MANAGEABILITY
TOPOLOGY ARCHITECTURE
[Diagram: a Topology Master and a ZK cluster holding the Logical Plan, Physical Plan and Execution State; the physical plan is synced to two containers, each with a Stream Manager, a Metrics Manager and instances I1 to I4]
HERON ARCHITECTURE
[Diagram labels: Topology Submission, Scheduler, Topology 1, Topology 2, Topology 3, Topology N]
HERON SAMPLE TOPOLOGIES
Large amount of data produced every day
Large cluster
Several hundred topologies deployed
Several million messages every second
HERON @TWITTER
1 stage 10 stages
3x reduction in cores and memory
Heron has been in production for 2 years
STRAGGLERS
Stragglers are the norm in multi-tenant distributed systems
● BAD/SLOW HOST
● EXECUTION SKEW
● INADEQUATE PROVISIONING
APPROACHES TO HANDLE STRAGGLERS

● SENDERS TO STRAGGLER DROP DATA
● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER
● DETECT STRAGGLERS AND RESCHEDULE THEM
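The first two approaches can be compared with a small deterministic simulation. This is a sketch under stated assumptions, not Heron's implementation: `simulate`, its parameters and the single sender, single receiver, bounded-buffer model are all hypothetical and leave out the Stream Managers entirely.

```python
from collections import deque
from typing import Dict


def simulate(strategy: str, ticks: int = 20, send_rate: int = 3,
             drain_rate: int = 1, capacity: int = 5) -> Dict[str, int]:
    """One sender feeds one straggling receiver through a bounded buffer.

    "drop":      tuples that do not fit in the buffer are discarded.
    "slow_down": the sender waits while the buffer is full (back pressure).
    """
    if strategy not in ("drop", "slow_down"):
        raise ValueError(f"unknown strategy: {strategy}")
    buffer = deque()
    accepted = dropped = stalled = processed = 0
    for _ in range(ticks):
        # The sender tries to emit send_rate tuples per tick.
        for _ in range(send_rate):
            if len(buffer) < capacity:
                buffer.append(accepted)
                accepted += 1
            elif strategy == "drop":
                dropped += 1
            else:
                stalled += 1  # the send is postponed, nothing is lost
        # The straggler only manages drain_rate tuples per tick.
        for _ in range(min(drain_rate, len(buffer))):
            buffer.popleft()
            processed += 1
    return {"accepted": accepted, "dropped": dropped, "stalled": stalled,
            "processed": processed, "in_buffer": len(buffer)}


drop_result = simulate("drop")
slow_result = simulate("slow_down")
print(drop_result)
print(slow_result)
```

With back pressure no tuple is lost, but the sender's effective rate falls to what the straggler can process; with dropping, the sender keeps its rate and pays in lost data.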
SLOW DOWN SENDERS STRATEGY
[Diagram: four Stream Managers connecting spout instances S1 and bolt instances B2, B3, B4]
BACK PRESSURE IN PRACTICE
● IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
● SOMETIMES USERS PREFER DROPPING DATA
Care about only latest data
● SUSTAINED BACK PRESSURE
Irrecoverable GC cycles
Bad or faulty host
ENVIRONMENTS SUPPORTED
STORM API
PRE- 1.0.0
POST 1.0.0

SUMMINGBIRD FOR HERON
CURIOUS TO LEARN MORE…
INTERESTED IN HERON?
CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/heron
http://heronstreaming.io
HERON IS OPEN SOURCED
FOLLOW US @HERONSTREAMING
● 100K+ Advertisers, $2B+ revenue/year
● 300M+ Users
● Impressions/Engagements
○ Tens of billions of events daily
Use Heron & EventBus:
● Prediction
● Serving
● Analytics
● Online learning: models require real-time data
○ On-going training for existing ads
■ CTR, conversions, RTs, Likes
○ On-going training for user data
■ Interests change, targeting must stay relevant
○ New ads arrive constantly
● Consumes 150 GB/second from EventBus streams
Ad Server
● Reads Prediction models
● Finalizes Ad selection
● Writes 56GB/second to EventBus
○ Served impressions
○ Spend events
Callback Service
● Receives engagements from clients
● Writes engagements to EventBus
○ Consumed by Prediction and Analytics
Advertiser Dashboard keeps advertisers informed in real-time
For Ads:
● Impressions
● Engagements
● Spend rate
● Uniques
For Users:
● Geolocation
● Gender
● Age
● Followers
● Keywords
● Interests
Offline layer (hours)
● Engagement log
● Billing pipeline
● 14TB/hour
Online layer (seconds)
● Heron topologies read 1M events/sec from EventBus, provide real-time analytics
Advertiser Dashboard
● Ad-hoc queries for desired time range
● View performance of ads in real-time
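One way a dashboard query could combine the two layers is sketched below. It is an illustration only: `query_impressions`, the per-hour dictionaries and the `batch_complete_through` cutoff are hypothetical, and the rule "use the offline count where it exists, otherwise the online count" is an assumption about how the layers are merged.

```python
from typing import Dict


def query_impressions(start_hour: int, end_hour: int,
                      batch_view: Dict[int, int],
                      realtime_view: Dict[int, int],
                      batch_complete_through: int) -> int:
    """Sum impressions for the hours start_hour..end_hour (inclusive).

    Hours the offline layer has finished are read from batch_view;
    later hours are served from the online layer's realtime_view.
    """
    total = 0
    for hour in range(start_hour, end_hour + 1):
        if hour <= batch_complete_through:
            total += batch_view.get(hour, 0)
        else:
            total += realtime_view.get(hour, 0)
    return total


# Hypothetical counts: the offline layer has processed hours 0..2,
# the online layer holds approximate counts for recent hours.
batch_view = {0: 100, 1: 120, 2: 90}
realtime_view = {2: 88, 3: 50, 4: 10}
total = query_impressions(1, 4, batch_view, realtime_view, batch_complete_through=2)
print(total)  # 270 = 120 + 90 (offline) + 50 + 10 (online)
```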
http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
(~6 hrs)
#RealTime processing helps us scale our Ads business:
● Prediction - Online learning
○ Ads
○ Users
● Analytics - Advertisers get real-time visibility into ad performance
This enables us to provide high ROI for Advertisers.
Image Credits:
http://images.clipartpanda.com/cycle-clipart-bike_red.png
http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png
http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png
Observation
● Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter
● Spam campaigns come in large batches
● Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirement
● Real-time: a race against spammers, i.e. “detect” vs “mutate”
● Generic : need to support all common feature representations
Crest is a generic online similarity clustering system
● Inputs are a stream of entities
● The similarity clustering system groups similar entities together (according to a predefined similarity metric)
● Outputs are the clusters and their member entities
“Built on top of Heron” https://github.com/twitter/heron
● Locality sensitive hashing
probabilistic similarity-preserving random projection method
Entity1 => hashValue1 (010010001110010100101001000011)
Entity2 => hashValue2 (000111001110010101100110100100)
Sim(Entity1, Entity2) ~ Sim(hash1, hash2)
● No “Pair-wise” similarity calculation
Similarity match based on “signature band” collision
Cut signatures into bands:
01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands * 5 sigs/band)
Two entities become similarity candidates if they collide on at least one band.
(i.e. need to match all signatures within some band)
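The banding rule can be checked on the two hash values shown earlier. This is a minimal sketch; `cut_into_bands`, `colliding_bands` and `are_candidates` are hypothetical helper names, and signatures are represented as plain strings of `0` and `1`.

```python
from typing import List


def cut_into_bands(signature: str, sigs_per_band: int = 5) -> List[str]:
    """Cut a signature string into consecutive bands of equal width."""
    if len(signature) % sigs_per_band != 0:
        raise ValueError("signature length must be a multiple of the band width")
    return [signature[i:i + sigs_per_band]
            for i in range(0, len(signature), sigs_per_band)]


def colliding_bands(sig_a: str, sig_b: str, sigs_per_band: int = 5) -> List[int]:
    """Indices of bands where every signature within the band matches."""
    bands_a = cut_into_bands(sig_a, sigs_per_band)
    bands_b = cut_into_bands(sig_b, sigs_per_band)
    return [i for i, (a, b) in enumerate(zip(bands_a, bands_b)) if a == b]


def are_candidates(sig_a: str, sig_b: str, sigs_per_band: int = 5) -> bool:
    """Two entities are similarity candidates if at least one band collides."""
    return bool(colliding_bands(sig_a, sig_b, sigs_per_band))


hash1 = "010010001110010100101001000011"  # Entity1
hash2 = "000111001110010101100110100100"  # Entity2

print(cut_into_bands(hash1))          # six bands of five signatures each
print(colliding_bands(hash1, hash2))  # band indices (0-based) that collide
print(are_candidates(hash1, hash2))
```

Because only whole bands are compared, candidates can be found by an exact-match lookup on the band value rather than by comparing every pair of entities.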
1. Given entity features, calculate signatures, and cut into bands
2. Match with all existing clusters in cluster store, which collide with at least one band
3. Find the closest cluster
Incoming Entity: 01001 00011 10010 10010 10010 00011
Known Cluster1: 01011 00011 01010 10111 11110 10011
Known Cluster2: 01101 01011 01000 10010 10010 01111
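The three steps can be run against the example signatures above. This is a sketch with stated assumptions: the cluster store is a plain dictionary, `find_closest_cluster` and its helpers are hypothetical names, and "closest" is taken to mean the smallest Hamming distance over the full signature, which the slides do not specify.

```python
from typing import Dict, List, Optional


def bands(signature: str, width: int = 5) -> List[str]:
    """Strip spaces and cut the signature into bands of the given width."""
    bits = signature.replace(" ", "")
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def collides(sig_a: str, sig_b: str) -> bool:
    """True if at least one band matches exactly."""
    return any(a == b for a, b in zip(bands(sig_a), bands(sig_b)))


def hamming(sig_a: str, sig_b: str) -> int:
    """Number of positions at which the two signatures differ."""
    a, b = sig_a.replace(" ", ""), sig_b.replace(" ", "")
    return sum(x != y for x, y in zip(a, b))


def find_closest_cluster(entity_sig: str, clusters: Dict[str, str]) -> Optional[str]:
    """Keep only clusters colliding on some band, then pick the nearest one."""
    candidates = {name: sig for name, sig in clusters.items()
                  if collides(entity_sig, sig)}
    if not candidates:
        return None
    return min(candidates, key=lambda name: hamming(entity_sig, candidates[name]))


incoming = "01001 00011 10010 10010 10010 00011"
cluster_store = {
    "Known Cluster1": "01011 00011 01010 10111 11110 10011",
    "Known Cluster2": "01101 01011 01000 10010 10010 01111",
}

closest = find_closest_cluster(incoming, cluster_store)
print(closest)
```

Both known clusters collide with the incoming entity (Cluster1 on the second band, Cluster2 on the fourth and fifth), so both are candidates; under the Hamming-distance assumption Cluster2 is the closer of the two.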
1. Count occurrences of each band signature
2. Use Count-Min Sketch to find the hot signatures
3. Send entities with hot signatures for clustering
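A Count-Min Sketch keeps approximate counts in fixed memory and never underestimates. The sketch below is illustrative only: the `CountMinSketch` class, its width and depth, the SHA-256 based row hashing, the `"band:signature"` key format and the threshold are all assumptions, not details taken from Crest.

```python
import hashlib
from typing import Iterable, List, Set


class CountMinSketch:
    """Fixed-size table of counters; each item maps to one counter per row."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table: List[List[int]] = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # A different hash per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def hot_signatures(band_signatures: Iterable[str], threshold: int) -> Set[str]:
    """Return the band signatures whose estimated count reaches the threshold."""
    sketch = CountMinSketch()
    hot: Set[str] = set()
    for sig in band_signatures:
        sketch.add(sig)
        if sketch.estimate(sig) >= threshold:
            hot.add(sig)
    return hot


# Keys are "band index:band value"; band 3 with value 10010 is seen five times.
stream = ["3:10010"] * 5 + ["0:01001", "1:00011"]
hot = hot_signatures(stream, threshold=3)
print(hot)
```

Only entities carrying a hot band signature would then be forwarded for clustering, which keeps rarely seen signatures out of the more expensive step.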
1. Group entities by band signatures
2. Run in-memory clustering algorithm when the group is big enough
3. Save the cluster in cluster key-value store
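The grouping step can be sketched as follows. The names `BandGrouper`, `min_group_size` and `cluster_store` are hypothetical, the key-value store is a plain dictionary, and the real in-memory clustering algorithm is replaced here by the trivial rule "the whole group is one cluster", since the slides do not say which algorithm is used.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List


class BandGrouper:
    """Buffers entities per band signature and emits a cluster once a group is big enough."""

    def __init__(self, min_group_size: int = 3) -> None:
        self.min_group_size = min_group_size
        self.groups: DefaultDict[str, List[str]] = defaultdict(list)
        self.cluster_store: Dict[str, List[str]] = {}  # stand-in for a key-value store

    def add(self, band_signature: str, entity_id: str) -> bool:
        """Add an entity to its group; return True if a cluster was saved."""
        group = self.groups[band_signature]
        group.append(entity_id)
        if len(group) >= self.min_group_size:
            # Placeholder for the clustering algorithm: keep the group as one cluster.
            self.cluster_store[band_signature] = list(group)
            return True
        return False


grouper = BandGrouper(min_group_size=3)
saved = [
    grouper.add("3:10010", "tweet-1"),
    grouper.add("3:10010", "tweet-2"),
    grouper.add("0:01001", "tweet-3"),
    grouper.add("3:10010", "tweet-4"),
]
print(saved)
print(grouper.cluster_store)
```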
1. Real-time: streamline data processing flow
2. Scalability: flexible grouping and shuffling (Application / Signature)
3. Maintenance: separated bolts for system optimizations (Memory, GC, CPU, etc.)
● Crest: similarity clustering system, based on locality-sensitive hashing
● Detect spam in real-time, built on top of a Heron topology
● Generic interface, clustering “everything” happening in Twitter
#TwitterRealTime - Real time processing @twitter

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

#TwitterRealTime - Real time processing @twitter

  • 4. WHAT IS REAL TIME?
  • 7. REAL TIME USE CASES ETL BI PRODUCT SAFETY TRENDS ML MEDIA OPS ADS
  • 8. 20 PB 2 Trillion Events/Day 100 ms e2e latency 400 Real Time Jobs DLOG & HERON are Open Sourced
  • 13. WE ARE HIRING! Messaging Data Infrastructure Core Services Search Infrastructure Traffic Real Time Compute Compute Platform Platform Engineering Kernel #LoveWhereYouWork Learn more at careers.twitter.com Hadoop Core Data Libraries Data Applications Core Metrics
  • 15. - Easy operations - Small technology portfolio - Quick development Iteration - Diverse use cases
  • 18. 20 PB 2 Trillion Events 100 ms e2e latency
  • 19. - Event: a discrete, self-contained piece of data - Stream: a persistent, unordered collection of events with a time - Partition: a portion of a stream with a proportional amount of the overall capacity - Subscriber: a collection of processes collectively consuming a copy of the stream
  • 21. Flow Control Stream Configuration Partition Ownership DistributedLog (E => Future[Unit]) Offset Tracking Offset Store Metadata DL Read Proxy
  • 22. @DistributedLog http://distributedlog.io Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny <@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar <@mahakp>, Philip Su <@philipsu522>, Yiming Zang <@zang_yiming> Messaging Alumni: David Helder, Aniruddha Laud, Robin Dhamankar
  • 25. STORM/HERON TERMINOLOGY - TOPOLOGY Directed acyclic graph Vertices=computation, and edges=streams of data tuples - SPOUTS Sources of data tuples for the topology Examples - Kafka/Distributed Log/MySQL/Postgres - BOLTS Process incoming tuples and emit outgoing tuples Examples - filtering/aggregation/join/arbitrary function
  • 26. STORM/HERON TOPOLOGY BOLT 1 BOLT 2 BOLT 3 BOLT 4 BOLT 5 SPOUT 1 SPOUT 2
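The terminology above (spouts as sources, bolts as processing vertices, edges as tuple streams) can be illustrated with a toy in-memory graph. This is a minimal sketch and not the Storm or Heron API: the `Topology` class, its method names, and the word-splitting example are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class Topology:
    """Toy DAG: vertices are computations, edges carry tuples (hypothetical, not Heron's API)."""

    def __init__(self):
        self.spouts = {}                 # spout name -> iterable of source tuples
        self.bolts = {}                  # bolt name -> function(tuple) -> list of emitted tuples
        self.edges = defaultdict(list)   # upstream vertex -> downstream bolt names

    def add_spout(self, name, source):
        self.spouts[name] = source

    def add_bolt(self, name, fn, inputs):
        self.bolts[name] = fn
        for upstream in inputs:
            self.edges[upstream].append(name)

    def run(self):
        """Push every spout tuple through the graph; return tuples emitted by sink bolts."""
        outputs = defaultdict(list)
        # Each queue entry is (vertex that emitted the tuple, tuple).
        queue = deque((name, t) for name, src in self.spouts.items() for t in src)
        while queue:
            vertex, tup = queue.popleft()
            downstream = self.edges.get(vertex, [])
            if not downstream and vertex in self.bolts:
                outputs[vertex].append(tup)   # a bolt with no consumers is a sink
            for bolt in downstream:
                for emitted in self.bolts[bolt](tup):
                    queue.append((bolt, emitted))
        return dict(outputs)


topo = Topology()
topo.add_spout("spout1", ["a b", "b c"])
topo.add_bolt("split", lambda line: line.split(), inputs=["spout1"])
topo.add_bolt("keep_b", lambda word: [word] if word == "b" else [], inputs=["split"])
result = topo.run()
```

Here `split` is an arbitrary-function bolt and `keep_b` is a filtering bolt; a real topology runs these vertices in parallel across containers rather than in a single loop.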
  • 27. WHY HERON? ● SCALABILITY and PERFORMANCE PREDICTABILITY ● IMPROVE DEVELOPER PRODUCTIVITY ● EASE OF MANAGEABILITY
  • 28. TOPOLOGY ARCHITECTURE [diagram: a Topology Master keeps the Logical Plan, Physical Plan and Execution State in a ZK CLUSTER and syncs the Physical Plan to two CONTAINERs, each running a Stream Manager, a Metrics Manager and instances I1 to I4]
  • 32. HERON @TWITTER ● Large amount of data produced every day ● Large cluster ● Several hundred topologies deployed ● Several million messages every second ● 1 stage / 10 stages ● 3x reduction in cores and memory ● Heron has been in production for 2 years
  • 34. STRAGGLERS Stragglers are the norm in multi-tenant distributed systems ● BAD/SLOW HOST ● EXECUTION SKEW ● INADEQUATE PROVISIONING
  • 35. APPROACHES TO HANDLE STRAGGLERS ● SENDERS TO STRAGGLER DROP DATA ● SENDERS SLOW DOWN TO THE SPEED OF STRAGGLER ● DETECT STRAGGLERS AND RESCHEDULE THEM
  • 36. SLOW DOWN SENDERS STRATEGY [diagram: four Stream Managers exchanging tuples among spout S1 and bolts B2, B3, B4]
  • 37. BACK PRESSURE IN PRACTICE ● IN MOST SCENARIOS BACK PRESSURE RECOVERS Without any manual intervention ● SOMETIMES USERS PREFER DROPPING DATA Care only about the latest data ● SUSTAINED BACK PRESSURE Irrecoverable GC cycles Bad or faulty host
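The difference between the "slow down senders" and "drop data" strategies can be seen in a small tick-based simulation. This is a sketch under stated assumptions, not Heron's Stream Manager logic: the `simulate` function, its parameters, and the fixed per-tick rates are hypothetical.

```python
def simulate(produce, consume, ticks, capacity, drop=False):
    """Feed a slow receiver through a bounded buffer for a number of ticks.

    drop=False: back pressure, the sender holds back whatever does not fit.
    drop=True:  tuples that do not fit in the buffer are discarded.
    All rates are tuples per tick (illustrative values only).
    """
    buffered = backlog = delivered = dropped = 0
    for _ in range(ticks):
        pending = backlog + produce                   # what the sender wants to send
        accepted = min(pending, capacity - buffered)  # what the receiver buffer can take
        buffered += accepted
        if drop:
            dropped += pending - accepted
            backlog = 0
        else:
            backlog = pending - accepted              # sender slows to the straggler's speed
        consumed = min(consume, buffered)             # the straggler drains slowly
        buffered -= consumed
        delivered += consumed
    return {"delivered": delivered, "dropped": dropped,
            "held_at_sender": backlog, "buffered": buffered}


with_back_pressure = simulate(produce=3, consume=1, ticks=4, capacity=2)
with_dropping = simulate(produce=3, consume=1, ticks=4, capacity=2, drop=True)
```

Both runs deliver the same number of tuples, because the straggler sets the pace; the strategies differ only in whether the excess waits at the sender or is lost.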
  • 38. ENVIRONMENTS SUPPORTED ● STORM API (PRE-1.0.0, POST-1.0.0) ● SUMMINGBIRD FOR HERON
  • 39. CURIOUS TO LEARN MORE…
  • 40. INTERESTED IN HERON? CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron http://heronstreaming.io HERON IS OPEN SOURCED FOLLOW US @HERONSTREAMING
  • 43. ● 100K+ Advertisers, $2B+ revenue/year ● 300M+ Users ● Impressions/Engagements ○ Tens of billions of events daily
  • 44. Use Heron & EventBus: ● Prediction ● Serving ● Analytics
  • 46. ● Online learning: models require real-time data ○ On-going training for existing ads ■ CTR, conversions, RTs, Likes ○ On-going training for user data ■ Interests change, targeting must stay relevant ○ New ads arrive constantly ● Consumes 150 GB/second from EventBus streams
  • 47. Ad Server ● Reads Prediction models ● Finalizes Ad selection ● Writes 56GB/second to EventBus ○ Served impressions ○ Spend events Callback Service ● Receives engagements from clients ● Writes engagements to EventBus ○ Consumed by Prediction and Analytics
  • 48. Advertiser Dashboard keeps advertisers informed in real-time For Ads: ● Impressions ● Engagements ● Spend rate ● Uniques For Users: ● Geolocation ● Gender ● Age ● Followers ● Keywords ● Interests
  • 49. Offline layer (hours) ● Engagement log ● Billing pipeline ● 14TB/hour Online layer (seconds) ● Heron topologies read 1M events/sec from EventBus and provide real-time analytics Advertiser Dashboard ● Ad-hoc queries for desired time range ● View performance of ads in real-time http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
  • 51. #RealTime processing helps us scale our Ads business: ● Prediction - Online learning ○ Ads ○ Users ● Analytics - Advertisers get real-time visibility into ad performance This enables us to provide high ROI for Advertisers. Image Credits: http://images.clipartpanda.com/cycle-clipart-bike_red.png http://sweetclipart.com/multisite/sweetclipart/files/motor_scooter_blue.png http://www.clipartkid.com/images/152/clipart-car-car-clip-art-mHtTUp-clipart.png
  • 53. Observation ● Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter ● Spam campaigns come in large batches ● Despite randomized tweaks, enough similarity among spammy entities is preserved Requirement ● Real-time: a competitive game with spammers, i.e. “detect” vs “mutate” ● Generic: need to support all common feature representations
  • 54. Crest is a generic online similarity clustering system ● Inputs are a stream of entities ● The similarity clustering system groups similar entities together (according to a predefined similarity metric) ● Outputs are the clusters and their member entities. “Built on top of Heron” https://github.com/twitter/heron
  • 56. ● Locality-sensitive hashing: a probabilistic, similarity-preserving random projection method Entity1 => hashValue1 (010010001110010100101001000011) Entity2 => hashValue2 (000111001110010101100110100100) Sim(Entity1, Entity2) ~ Sim(hash1, hash2) ● No “pair-wise” similarity calculation Similarity match based on “signature band”
  • 57. Similarity match based on “signature band” collision Cut signatures into bands: 01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands * 5 sigs/band) Two entities become similarity candidates if they collide on at least one band (i.e. they need to match all signatures within some band)
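The banding rule can be sketched in a few lines, using the two 30-bit hash values from the slides as input. The function names (`bands`, `colliding_bands`, `is_candidate`) and the representation of a signature as a bit string are assumptions for illustration, not Crest's implementation.

```python
def bands(signature, sigs_per_band=5):
    """Cut a signature string into consecutive bands of sigs_per_band signatures."""
    return [signature[i:i + sigs_per_band]
            for i in range(0, len(signature), sigs_per_band)]


def colliding_bands(sig_a, sig_b, sigs_per_band=5):
    """Indices of the bands on which both signatures match exactly."""
    pairs = zip(bands(sig_a, sigs_per_band), bands(sig_b, sigs_per_band))
    return [i for i, (a, b) in enumerate(pairs) if a == b]


def is_candidate(sig_a, sig_b, sigs_per_band=5):
    """Two entities are similarity candidates if at least one band collides."""
    return bool(colliding_bands(sig_a, sig_b, sigs_per_band))


hash1 = "010010001110010100101001000011"  # Entity1, from the slide
hash2 = "000111001110010101100110100100"  # Entity2, from the slide

print(bands(hash1))                   # six bands of five signatures each
print(colliding_bands(hash1, hash2))  # band indices where the two entities collide
```

Because only exact band matches are looked up, candidates can be found with a hash-table probe per band instead of a pair-wise comparison against every known entity.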
  • 58. 1. Given entity features, calculate signatures, and cut into bands 2. Match with all existing clusters in cluster store, which collide with at least one band 3. Find the closest cluster Incoming Entity: 01001 00011 10010 10010 10010 00011 Known Cluster1: 01011 00011 01010 10111 11110 10011 Known Cluster2: 01101 01011 01000 10010 10010 01111
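The three matching steps can be sketched with the signatures shown on this slide. The slide does not say which distance defines the "closest" cluster; Hamming distance over the full signature is an assumption here, as are the function names and the dictionary standing in for the cluster store.

```python
def bands(signature, sigs_per_band=5):
    """Cut a signature string into consecutive bands of sigs_per_band signatures."""
    return [signature[i:i + sigs_per_band]
            for i in range(0, len(signature), sigs_per_band)]


def hamming(a, b):
    """Number of positions at which two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_cluster(entity_sig, clusters, sigs_per_band=5):
    """Return the name of the nearest cluster that shares at least one band, or None.

    clusters: dict of cluster name -> signature (stand-in for the cluster store).
    Distance metric is an assumption (Hamming), not taken from the slides.
    """
    entity_bands = bands(entity_sig, sigs_per_band)
    candidates = {
        name: sig for name, sig in clusters.items()
        if any(a == b for a, b in zip(entity_bands, bands(sig, sigs_per_band)))
    }
    if not candidates:
        return None
    return min(candidates, key=lambda name: hamming(entity_sig, candidates[name]))


def sig(text):
    """Helper: strip the spaces used on the slide to separate bands."""
    return text.replace(" ", "")


incoming = sig("01001 00011 10010 10010 10010 00011")
known = {
    "Known Cluster1": sig("01011 00011 01010 10111 11110 10011"),
    "Known Cluster2": sig("01101 01011 01000 10010 10010 01111"),
}
best = closest_cluster(incoming, known)
```

Both known clusters are candidates (Cluster1 collides on the second band, Cluster2 on the fourth and fifth); under the Hamming assumption, Cluster2 is the closer of the two.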
  • 59. 1. Count occurrences of each band signature 2. Use Count-Min Sketch to find the hot signatures 3. Send entities with hot signatures for clustering
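A Count-Min Sketch keeps approximate counts in fixed memory: it may overestimate a count but never underestimates it. The sketch below is a minimal illustration; the table dimensions, the SHA-256-based row hashing, and the `hot_signatures` helper with its threshold are assumptions, not details from the slides.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates are never below the true count."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows limits the damage done by hash collisions.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


def hot_signatures(band_signatures, threshold, sketch=None):
    """Return the band signatures whose estimated count reaches the threshold."""
    if sketch is None:
        sketch = CountMinSketch()
    hot = set()
    for signature in band_signatures:
        sketch.add(signature)
        if sketch.estimate(signature) >= threshold:
            hot.add(signature)
    return hot


stream = ["10010"] * 5 + ["00011"] * 2 + ["01001"]
hot = hot_signatures(stream, threshold=3)
```

Only entities carrying a hot signature would then be forwarded for clustering, which keeps rare signatures from consuming clustering resources.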
  • 60. 1. Group entities by band signatures 2. Run in-memory clustering algorithm when the group is big enough 3. Save the cluster in cluster key-value store
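The grouping stage can be sketched as a small buffer keyed by band signature. Everything named here is hypothetical: `BandGrouper`, the `cluster_fn` callback standing in for the in-memory clustering algorithm, and the plain dictionary standing in for the cluster key-value store.

```python
from collections import defaultdict


class BandGrouper:
    """Buffer entities per band signature and cluster a group once it is big enough."""

    def __init__(self, min_group_size, cluster_fn, store):
        self.min_group_size = min_group_size
        self.cluster_fn = cluster_fn      # placeholder for the in-memory clustering algorithm
        self.store = store                # placeholder for the cluster key-value store
        self.groups = defaultdict(list)   # band signature -> buffered entities

    def add(self, band_signature, entity):
        """Add one entity; return True if this triggered clustering of its group."""
        group = self.groups[band_signature]
        group.append(entity)
        if len(group) < self.min_group_size:
            return False
        self.store[band_signature] = self.cluster_fn(group)
        del self.groups[band_signature]   # start a fresh group for this signature
        return True


store = {}
grouper = BandGrouper(min_group_size=3, cluster_fn=sorted, store=store)
events = [("10010", "e2"), ("00011", "e9"), ("10010", "e1"), ("10010", "e3")]
triggered = [grouper.add(signature, entity) for signature, entity in events]
```

In a topology, grouping tuples by band signature before this stage would route all entities with the same signature to the same instance, so each group can be held in one process's memory.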
  • 61. 1. Real-time: streamlined data processing flow 2. Scalability: flexible grouping and shuffling (Application / Signature) 3. Maintenance: separate bolts for system optimizations (Memory, GC, CPU, etc.)
  • 62. ● Crest: similarity clustering system, based on locality-sensitive hashing ● Detects spam in real time, built on top of a Heron topology ● Generic interface, clustering “everything” happening on Twitter