Who: Karthik Ramasamy (@karthikz)
Date: September 20, 2016
Event: #TwitterRealTime
This slide deck consists of presentations from various teams about Twitter's real-time infrastructure, the components it uses, and how they function. It includes presentations from David Rusek (@davidrusek), Maosong Fu (@Louis_Fumaosong), Sandy Strong (@st5are), and Yimin Tan (@YiminTan_Kevin).
13. WE ARE HIRING!
Messaging
Data Infrastructure
Core Services
Search Infrastructure
Traffic
Real Time Compute
Compute Platform
Platform Engineering
Kernel
#LoveWhereYouWork
Learn more at careers.twitter.com
Hadoop
Core Data Libraries
Data Applications
Core Metrics
15. - Easy operations
- Small technology portfolio
- Quick development iteration
- Diverse use cases
19. - Event
A discrete, self-contained piece of data
- Stream
A persistent, unordered collection of events with a time
- Partition
A portion of a stream, carrying a proportional share of the overall capacity
- Subscriber
A collection of processes collectively consuming a copy of the stream
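These four concepts can be modeled with a toy sketch. All class and field names here are hypothetical illustration choices, not Twitter's actual API; routing by key hash is one simple way to give each partition a proportional share:

```python
# Hypothetical data model for the Event/Stream/Partition concepts above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    """A discrete, self-contained piece of data."""
    payload: bytes

@dataclass
class Partition:
    """A portion of a stream carrying a share of its overall capacity."""
    events: List[Event] = field(default_factory=list)

@dataclass
class Stream:
    """A persistent collection of events, split across partitions."""
    partitions: List[Partition]

    def append(self, event: Event, key: str) -> None:
        # Routing by key hash spreads load proportionally across partitions.
        self.partitions[hash(key) % len(self.partitions)].events.append(event)

stream = Stream(partitions=[Partition() for _ in range(4)])
stream.append(Event(b"tweet-created"), key="user-1")
```

A subscriber, in this model, would be a group of processes that divide the stream's partitions among themselves, together consuming one copy of the stream.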
22. @DistributedLog
http://distributedlog.io
Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny
<@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar
<@mahakp>, Philip Su <@philipsu522>, Yiming Zang
<@zang_yiming>
Messaging Alumni: David Helder, Aniruddha Laud, Robin
Dhamankar
25. STORM/HERON TERMINOLOGY
- TOPOLOGY
Directed acyclic graph
Vertices = computation; edges = streams of data tuples
- SPOUTS
Sources of data tuples for the topology
Examples - Kafka / DistributedLog / MySQL / Postgres
- BOLTS
Process incoming tuples and emit outgoing tuples
Examples - filtering/aggregation/join/arbitrary function
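The terminology above can be sketched as a toy pipeline. This is a plain-Python simulation of the spout/bolt/topology model, not the actual Storm or Heron API (those are Java-based), and it wires the bolts into a linear DAG for simplicity:

```python
# Toy simulation of the topology model: spouts are tuple sources, bolts
# transform tuples, and chaining them forms a (linear) DAG.

class Spout:
    """Source of data tuples for the topology."""
    def __init__(self, records):
        self.records = records

    def next_tuples(self):
        yield from self.records

class Bolt:
    """Processes incoming tuples and emits zero or more outgoing tuples."""
    def __init__(self, fn):
        self.fn = fn

    def process(self, tup):
        yield from self.fn(tup)

def run_topology(spout, bolts):
    """Push every spout tuple through the chain of bolts."""
    outputs = []
    for tup in spout.next_tuples():
        stage = [tup]
        for bolt in bolts:
            stage = [out for t in stage for out in bolt.process(t)]
        outputs.extend(stage)
    return outputs

# Example: a filtering bolt followed by an arbitrary-function bolt.
spout = Spout([1, 2, 3, 4])
keep_even = Bolt(lambda t: [t] if t % 2 == 0 else [])
double = Bolt(lambda t: [t * 2])
print(run_topology(spout, [keep_even, double]))  # [4, 8]
```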
32. HERON @TWITTER
- Large amounts of data produced every day
- Large cluster: several hundred topologies deployed
- Several million messages every second
- 3x reduction in cores and memory, across topologies from 1 stage to 10 stages
- Heron has been in production for 2 years
34. STRAGGLERS
Stragglers are the norm in multi-tenant distributed systems
● BAD/SLOW HOST
● EXECUTION SKEW
● INADEQUATE PROVISIONING
35. APPROACHES TO HANDLE STRAGGLERS
● SENDERS TO THE STRAGGLER DROP DATA
● SENDERS SLOW DOWN TO THE SPEED OF THE STRAGGLER
● DETECT STRAGGLERS AND RESCHEDULE THEM
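The first two strategies can be contrasted with a toy simulation. This is a minimal sketch in plain Python, not Heron's actual back-pressure mechanism; the buffer capacity and drain rate are made-up illustration parameters:

```python
# Toy contrast: dropping data at a straggler vs. slowing the sender down.
from collections import deque

def send_with_drop(events, capacity):
    """Sender keeps its own pace; the straggler's bounded buffer evicts old data."""
    buf = deque(maxlen=capacity)  # deque with maxlen silently drops the oldest
    for e in events:
        buf.append(e)
    return list(buf)  # only the newest `capacity` events survive

def send_with_backpressure(events, capacity, drain):
    """Sender stalls whenever the buffer is full: nothing lost, throughput drops."""
    buf, delivered, stalls = deque(), [], 0
    for e in events:
        while len(buf) >= capacity:  # back pressure: wait for the consumer
            for _ in range(drain):
                delivered.append(buf.popleft())
            stalls += 1
        buf.append(e)
    delivered.extend(buf)
    return delivered, stalls

events = list(range(10))
print(send_with_drop(events, capacity=4))        # [6, 7, 8, 9]
delivered, stalls = send_with_backpressure(events, capacity=4, drain=2)
print(len(delivered), stalls)                    # 10 3: all data kept, 3 stalls
```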
37. BACK PRESSURE IN PRACTICE
● IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
● SOMETIMES USERS PREFER DROPPING DATA
They care only about the latest data
● SUSTAINED BACK PRESSURE
Irrecoverable GC cycles
Bad or faulty host
40. INTERESTED IN HERON?
CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/heron
http://heronstreaming.io
HERON IS OPEN SOURCED
FOLLOW US @HERONSTREAMING
43. ● 100K+ Advertisers, $2B+ revenue/year
● 300M+ Users
● Impressions/Engagements
○ Tens of billions of events daily
46. ● Online learning: models require real-time data
○ On-going training for existing ads
■ CTR, conversions, RTs, Likes
○ On-going training for user data
■ Interests change, targeting must stay relevant
○ New ads arrive constantly
● Consumes 150 GB/second from EventBus streams
47. Ad Server
● Reads Prediction models
● Finalizes Ad selection
● Writes 56GB/second to EventBus
○ Served impressions
○ Spend events
Callback Service
● Receives engagements from clients
● Writes engagements to EventBus
○ Consumed by Prediction
and Analytics
48. Advertiser Dashboard keeps advertisers informed in real-time
For Ads:
● Impressions
● Engagements
● Spend rate
● Uniques
For Users:
● Geolocation
● Gender
● Age
● Followers
● Keywords
● Interests
49. Offline layer (hours)
● Engagement log
● Billing pipeline
● 14TB/hour
Online layer (seconds)
● Heron topologies read 1M events/sec from EventBus and provide real-time analytics
Advertiser Dashboard
● Ad-hoc queries for desired time range
● View performance of ads in real-time
http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
51. #RealTime processing helps us scale our Ads
business:
● Prediction - Online learning
○ Ads
○ Users
● Analytics - Advertisers get real-time
visibility into ad performance
This enables us to provide high ROI for
Advertisers.
53. Observation
● The Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter
● Spam campaigns come in large batches
● Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirement
● Real-time: a cat-and-mouse game with spammers, i.e. “detect” vs. “mutate”
● Generic: need to support all common feature representations
54. Crest is a generic online similarity clustering system
● Inputs are a stream of entities
● The clustering system groups similar entities together (according to a predefined similarity metric)
● Outputs are the clusters and their entity members
Built on top of Heron: https://github.com/twitter/heron
● Locality-sensitive hashing
A probabilistic, similarity-preserving random projection method
Entity1 => hashValue1 (010010001110010100101001000011)
Entity2 => hashValue2 (000111001110010101100110100100)
Sim(Entity1, Entity2) ~ Sim(hash1, hash2)
● No “Pair-wise” similarity calculation
Similarity match based on “signature band”
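The slide does not name the exact LSH variant; a common way to produce bit signatures like the ones above is random-hyperplane (signed random projection) hashing, where each bit records which side of a random hyperplane a feature vector falls on. A minimal sketch, with made-up example vectors:

```python
# Random-hyperplane LSH: nearby vectors land on the same side of most
# hyperplanes, so they share most signature bits.
import random

def make_hyperplanes(num_bits, dim, seed=42):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

def signature(vector, hyperplanes):
    return "".join(
        "1" if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else "0"
        for h in hyperplanes
    )

def sim(sig_a, sig_b):
    """Fraction of signature bits on which the two entities agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

planes = make_hyperplanes(num_bits=30, dim=8)
entity1 = [1.0, 0.2, 0.0, 0.5, 0.1, 0.0, 0.3, 0.9]
entity2 = [1.0, 0.2, 0.1, 0.5, 0.1, 0.0, 0.3, 0.8]     # near-duplicate "tweak"
entity3 = [-1.0, 0.9, -0.5, 0.0, 0.7, -0.2, 0.1, 0.0]  # unrelated entity

# Sim(Entity1, Entity2) ~ Sim(hash1, hash2): the near-duplicate agrees on
# far more signature bits than the unrelated entity does.
assert sim(signature(entity1, planes), signature(entity2, planes)) > \
       sim(signature(entity1, planes), signature(entity3, planes))
```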
57. Similarity match based on “signature band” collision
Cut signatures into bands:
01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands × 5 sigs/band)
Two entities become similarity candidates if they collide on at least one band
(i.e. they must match on all signatures within some band)
58. 1. Given entity features, calculate signatures, and cut into bands
2. Match against all existing clusters in the cluster store that collide on at least one band
3. Find the closest cluster
Incoming Entity: 01001 00011 10010 10010 10010 00011
Known Cluster1: 01011 00011 01010 10111 11110 10011
Known Cluster2: 01101 01011 01000 10010 10010 01111
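The lookup in steps 1–3 can be run directly on the signatures above (6 bands of 5 signature bits each). Hamming similarity stands in here for the predefined similarity metric when picking the closest candidate:

```python
# Band-collision candidate lookup using the slide's exact signatures.

def bands(sig, band_size=5):
    return [sig[i:i + band_size] for i in range(0, len(sig), band_size)]

def colliding_bands(sig_a, sig_b):
    return sum(a == b for a, b in zip(bands(sig_a), bands(sig_b)))

def hamming_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

incoming = "01001 00011 10010 10010 10010 00011".replace(" ", "")
clusters = {
    "Cluster1": "01011 00011 01010 10111 11110 10011".replace(" ", ""),
    "Cluster2": "01101 01011 01000 10010 10010 01111".replace(" ", ""),
}

# Step 2: both clusters collide on at least one band (Cluster1 on the
# second band, Cluster2 on the fourth and fifth), so both are candidates.
candidates = {n: s for n, s in clusters.items() if colliding_bands(incoming, s) >= 1}

# Step 3: Cluster2 is closest (23/30 matching bits vs 22/30 for Cluster1).
closest = max(candidates, key=lambda n: hamming_similarity(incoming, candidates[n]))
print(closest)  # Cluster2
```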
1. Count occurrences of each band signature
2. Use Count-Min Sketch to find the hot signatures
3. Send entities with hot signatures for clustering
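A minimal Count-Min Sketch can illustrate step 2: it estimates signature frequencies in fixed memory, so hot band signatures can be flagged without keeping an exact counter per signature. The width, depth, salted-MD5 row hashes, and hotness threshold below are made-up illustration choices, not Crest's:

```python
# Minimal Count-Min Sketch for finding hot band signatures.
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Never undercounts: hash collisions can only inflate the estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
band_signatures = ["00011"] * 50 + ["10010"] * 3 + ["01111"] * 2
for sig in band_signatures:
    cms.add(sig)

# Step 3: only entities whose band signature is hot move on to clustering.
hot = [s for s in ("00011", "10010", "01111") if cms.estimate(s) >= 10]
print(hot)  # the frequent signature "00011" clears the threshold
```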
60. 1. Group entities by band signatures
2. Run an in-memory clustering algorithm when the group is big enough
3. Save the cluster in the cluster key-value store
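The three steps above can be sketched as a band-keyed buffer. The group threshold, the in-process "key-value store", and the stand-in clustering step (dedupe and sort) are illustration-only simplifications:

```python
# Toy version of steps 1-3: buffer entities per band signature, then
# "cluster" and persist a group once its buffer is large enough.
from collections import defaultdict

GROUP_THRESHOLD = 3
cluster_store = {}            # band signature -> cluster members ("key-value store")
pending = defaultdict(list)   # band signature -> buffered entities

def bands(sig, band_size=5):
    return [sig[i:i + band_size] for i in range(0, len(sig), band_size)]

def ingest(entity_id, sig):
    for band in bands(sig):
        pending[band].append(entity_id)
        if len(pending[band]) >= GROUP_THRESHOLD:
            # Stand-in for the in-memory clustering algorithm.
            cluster_store[band] = sorted(set(pending.pop(band)))

# Three entities sharing the band "00011" (10-bit signatures, 2 bands each):
for eid, sig in [("e1", "0100100011"), ("e2", "0101100011"), ("e3", "0110100011")]:
    ingest(eid, sig)

print(cluster_store)  # {'00011': ['e1', 'e2', 'e3']}
```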
1. Real-time: streamlined data-processing flow
2. Scalability: flexible grouping and shuffling (Application / Signature)
3. Maintenance: separate bolts for system optimizations (memory, GC, CPU, etc.)
62. ● Crest: a similarity clustering system based on locality-sensitive hashing
● Detects spam in real time, built on top of a Heron topology
● Generic interface, clustering “everything” happening on Twitter