Who: Karthik Ramasamy (@karthikz)
Date: September 20, 2016
Event: #TwitterRealTime
This slide deck consists of presentations from various teams about Twitter's real-time infrastructure, the components it uses, and how they function. It includes presentations from David Rusek (@davidrusek), Maosong Fu (@Louis_Fumaosong), Sandy Strong (@st5are), and Yimin Tan (@YiminTan_Kevin).
13. WE ARE HIRING!
Messaging
Data Infrastructure
Core Services
Search Infrastructure
Traffic
Real Time Compute
Compute Platform
Platform Engineering
Kernel
#LoveWhereYouWork
Learn more at careers.twitter.com
Hadoop
Core Data Libraries
Data Applications
Core Metrics
15. - Easy operations
- Small technology portfolio
- Quick development iteration
- Diverse use cases
19. - Event
A discrete, self-contained piece of data
- Stream
A persistent, unordered collection of events with a time
- Partition
A portion of a stream, carrying a proportional share of the overall capacity
- Subscriber
A collection of processes collectively consuming a copy of the stream
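These four concepts can be modeled with a toy sketch. All class and field names here are hypothetical illustration choices, not Twitter's actual API; routing by key hash is one simple way to give each partition a proportional share:

```python
# Hypothetical data model for the Event/Stream/Partition concepts above.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    """A discrete, self-contained piece of data."""
    payload: bytes

@dataclass
class Partition:
    """A portion of a stream carrying a share of its overall capacity."""
    events: List[Event] = field(default_factory=list)

@dataclass
class Stream:
    """A persistent collection of events, split across partitions."""
    partitions: List[Partition]

    def append(self, event: Event, key: str) -> None:
        # Routing by key hash spreads load proportionally across partitions.
        self.partitions[hash(key) % len(self.partitions)].events.append(event)

stream = Stream(partitions=[Partition() for _ in range(4)])
stream.append(Event(b"tweet-created"), key="user-1")
```

A subscriber, in this model, would be a group of processes that divide the stream's partitions among themselves, together consuming one copy of the stream.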
22. @DistributedLog
http://distributedlog.io
Leigh Stewart <@l4stewar>, Sijie Guo <@sijieg>, Franck Cuny
<@franckcuny>, Jordan Bull <@jordangbull>, Mahak Patidar
<@mahakp>, Philip Su <@philipsu522>, Yiming Zang
<@zang_yiming>
Messaging Alumni: David Helder, Aniruddha Laud, Robin
Dhamankar
25. STORM/HERON TERMINOLOGY
- TOPOLOGY
Directed acyclic graph
Vertices = computation; edges = streams of data tuples
- SPOUTS
Sources of data tuples for the topology
Examples - Kafka / DistributedLog / MySQL / Postgres
- BOLTS
Process incoming tuples and emit outgoing tuples
Examples - filtering/aggregation/join/arbitrary function
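The terminology above can be sketched as a toy pipeline. This is a plain-Python simulation of the spout/bolt/topology model, not the actual Storm or Heron API (those are Java-based), and it wires the bolts into a linear DAG for simplicity:

```python
# Toy simulation of the topology model: spouts are tuple sources, bolts
# transform tuples, and chaining them forms a (linear) DAG.

class Spout:
    """Source of data tuples for the topology."""
    def __init__(self, records):
        self.records = records

    def next_tuples(self):
        yield from self.records

class Bolt:
    """Processes incoming tuples and emits zero or more outgoing tuples."""
    def __init__(self, fn):
        self.fn = fn

    def process(self, tup):
        yield from self.fn(tup)

def run_topology(spout, bolts):
    """Push every spout tuple through the chain of bolts."""
    outputs = []
    for tup in spout.next_tuples():
        stage = [tup]
        for bolt in bolts:
            stage = [out for t in stage for out in bolt.process(t)]
        outputs.extend(stage)
    return outputs

# Example: a filtering bolt followed by an arbitrary-function bolt.
spout = Spout([1, 2, 3, 4])
keep_even = Bolt(lambda t: [t] if t % 2 == 0 else [])
double = Bolt(lambda t: [t * 2])
print(run_topology(spout, [keep_even, double]))  # [4, 8]
```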
32. HERON @TWITTER
- Large amounts of data produced every day
- Large cluster: several hundred topologies deployed
- Several million messages every second
- 3x reduction in cores and memory, across topologies from 1 stage to 10 stages
- Heron has been in production for 2 years
34. STRAGGLERS
Stragglers are the norm in multi-tenant distributed systems
● BAD/SLOW HOST
● EXECUTION SKEW
● INADEQUATE PROVISIONING
35. APPROACHES TO HANDLE STRAGGLERS
● SENDERS TO THE STRAGGLER DROP DATA
● SENDERS SLOW DOWN TO THE SPEED OF THE STRAGGLER
● DETECT STRAGGLERS AND RESCHEDULE THEM
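The first two strategies can be contrasted with a toy simulation. This is a minimal sketch in plain Python, not Heron's actual back-pressure mechanism; the buffer capacity and drain rate are made-up illustration parameters:

```python
# Toy contrast: dropping data at a straggler vs. slowing the sender down.
from collections import deque

def send_with_drop(events, capacity):
    """Sender keeps its own pace; the straggler's bounded buffer evicts old data."""
    buf = deque(maxlen=capacity)  # deque with maxlen silently drops the oldest
    for e in events:
        buf.append(e)
    return list(buf)  # only the newest `capacity` events survive

def send_with_backpressure(events, capacity, drain):
    """Sender stalls whenever the buffer is full: nothing lost, throughput drops."""
    buf, delivered, stalls = deque(), [], 0
    for e in events:
        while len(buf) >= capacity:  # back pressure: wait for the consumer
            for _ in range(drain):
                delivered.append(buf.popleft())
            stalls += 1
        buf.append(e)
    delivered.extend(buf)
    return delivered, stalls

events = list(range(10))
print(send_with_drop(events, capacity=4))        # [6, 7, 8, 9]
delivered, stalls = send_with_backpressure(events, capacity=4, drain=2)
print(len(delivered), stalls)                    # 10 3: all data kept, 3 stalls
```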
37. BACK PRESSURE IN PRACTICE
● IN MOST SCENARIOS BACK PRESSURE RECOVERS
Without any manual intervention
● SOMETIMES USERS PREFER DROPPING DATA
They care only about the latest data
● SUSTAINED BACK PRESSURE
Irrecoverable GC cycles
Bad or faulty host
40. INTERESTED IN HERON?
CONTRIBUTIONS ARE WELCOME!
https://github.com/twitter/heron
http://heronstreaming.io
HERON IS OPEN SOURCED
FOLLOW US @HERONSTREAMING
43. ● 100K+ Advertisers, $2B+ revenue/year
● 300M+ Users
● Impressions/Engagements
○ Tens of billions of events daily
46. ● Online learning: models require real-time data
○ On-going training for existing ads
■ CTR, conversions, RTs, Likes
○ On-going training for user data
■ Interests change, targeting must stay relevant
○ New ads arrive constantly
● Consumes 150 GB/second from EventBus streams
47. Ad Server
● Reads Prediction models
● Finalizes Ad selection
● Writes 56GB/second to EventBus
○ Served impressions
○ Spend events
Callback Service
● Receives engagements from clients
● Writes engagements to EventBus
○ Consumed by Prediction
and Analytics
48. Advertiser Dashboard keeps advertisers informed in real-time
For Ads:
● Impressions
● Engagements
● Spend rate
● Uniques
For Users:
● Geolocation
● Gender
● Age
● Followers
● Keywords
● Interests
49. Offline layer (hours)
● Engagement log
● Billing pipeline
● 14TB/hour
Online layer (seconds)
● Heron topologies read 1M events/sec from EventBus and provide real-time analytics
Advertiser Dashboard
● Ad-hoc queries for desired time range
● View performance of ads in real-time
http://tech.lalitbhatt.net/2015/03/big-and-fast-data-lambda-architecture.html
51. #RealTime processing helps us scale our Ads
business:
● Prediction - Online learning
○ Ads
○ Users
● Analytics - Advertisers get real-time
visibility into ad performance
This enables us to provide high ROI for
Advertisers.
53. Observation
● The Anti-Spam Team fights spammy content, engagements, and behaviors on Twitter
● Spam campaigns come in large batches
● Despite randomized tweaks, enough similarity among spammy entities is preserved
Requirement
● Real-time: a cat-and-mouse game with spammers, i.e. “detect” vs. “mutate”
● Generic: need to support all common feature representations
54. Crest is a generic online similarity clustering system
● Inputs are a stream of entities
● The clustering system groups similar entities together (according to a predefined similarity metric)
● Outputs are the clusters and their entity members
Built on top of Heron: https://github.com/twitter/heron
● Locality-sensitive hashing
A probabilistic, similarity-preserving random projection method
Entity1 => hashValue1 (010010001110010100101001000011)
Entity2 => hashValue2 (000111001110010101100110100100)
Sim(Entity1, Entity2) ~ Sim(hash1, hash2)
● No “Pair-wise” similarity calculation
Similarity match based on “signature band”
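The slide does not name the exact LSH variant; a common way to produce bit signatures like the ones above is random-hyperplane (signed random projection) hashing, where each bit records which side of a random hyperplane a feature vector falls on. A minimal sketch, with made-up example vectors:

```python
# Random-hyperplane LSH: nearby vectors land on the same side of most
# hyperplanes, so they share most signature bits.
import random

def make_hyperplanes(num_bits, dim, seed=42):
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]

def signature(vector, hyperplanes):
    return "".join(
        "1" if sum(h_i * v_i for h_i, v_i in zip(h, vector)) >= 0 else "0"
        for h in hyperplanes
    )

def sim(sig_a, sig_b):
    """Fraction of signature bits on which the two entities agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

planes = make_hyperplanes(num_bits=30, dim=8)
entity1 = [1.0, 0.2, 0.0, 0.5, 0.1, 0.0, 0.3, 0.9]
entity2 = [1.0, 0.2, 0.1, 0.5, 0.1, 0.0, 0.3, 0.8]     # near-duplicate "tweak"
entity3 = [-1.0, 0.9, -0.5, 0.0, 0.7, -0.2, 0.1, 0.0]  # unrelated entity

# Sim(Entity1, Entity2) ~ Sim(hash1, hash2): the near-duplicate agrees on
# far more signature bits than the unrelated entity does.
assert sim(signature(entity1, planes), signature(entity2, planes)) > \
       sim(signature(entity1, planes), signature(entity3, planes))
```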
57. Similarity match based on “signature band” collision
Cut signatures into bands:
01001 00011 10010 10010 10010 00011 (30 sigs = 6 bands × 5 sigs/band)
Two entities become similarity candidates if they collide on at least one band
(i.e. they must match on all signatures within some band)
58. 1. Given entity features, calculate signatures, and cut into bands
2. Match against all existing clusters in the cluster store that collide on at least one band
3. Find the closest cluster
Incoming Entity: 01001 00011 10010 10010 10010 00011
Known Cluster1: 01011 00011 01010 10111 11110 10011
Known Cluster2: 01101 01011 01000 10010 10010 01111
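The lookup in steps 1–3 can be run directly on the signatures above (6 bands of 5 signature bits each). Hamming similarity stands in here for the predefined similarity metric when picking the closest candidate:

```python
# Band-collision candidate lookup using the slide's exact signatures.

def bands(sig, band_size=5):
    return [sig[i:i + band_size] for i in range(0, len(sig), band_size)]

def colliding_bands(sig_a, sig_b):
    return sum(a == b for a, b in zip(bands(sig_a), bands(sig_b)))

def hamming_similarity(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

incoming = "01001 00011 10010 10010 10010 00011".replace(" ", "")
clusters = {
    "Cluster1": "01011 00011 01010 10111 11110 10011".replace(" ", ""),
    "Cluster2": "01101 01011 01000 10010 10010 01111".replace(" ", ""),
}

# Step 2: both clusters collide on at least one band (Cluster1 on the
# second band, Cluster2 on the fourth and fifth), so both are candidates.
candidates = {n: s for n, s in clusters.items() if colliding_bands(incoming, s) >= 1}

# Step 3: Cluster2 is closest (23/30 matching bits vs 22/30 for Cluster1).
closest = max(candidates, key=lambda n: hamming_similarity(incoming, candidates[n]))
print(closest)  # Cluster2
```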
1. Count occurrences of each band signature
2. Use Count-Min Sketch to find the hot signatures
3. Send entities with hot signatures for clustering
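A minimal Count-Min Sketch can illustrate step 2: it estimates signature frequencies in fixed memory, so hot band signatures can be flagged without keeping an exact counter per signature. The width, depth, salted-MD5 row hashes, and hotness threshold below are made-up illustration choices, not Crest's:

```python
# Minimal Count-Min Sketch for finding hot band signatures.
import hashlib

class CountMinSketch:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Never undercounts: hash collisions can only inflate the estimate.
        return min(self.table[row][col] for row, col in self._buckets(item))

cms = CountMinSketch()
band_signatures = ["00011"] * 50 + ["10010"] * 3 + ["01111"] * 2
for sig in band_signatures:
    cms.add(sig)

# Step 3: only entities whose band signature is hot move on to clustering.
hot = [s for s in ("00011", "10010", "01111") if cms.estimate(s) >= 10]
print(hot)  # the frequent signature "00011" clears the threshold
```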
60. 1. Group entities by band signatures
2. Run an in-memory clustering algorithm when the group is big enough
3. Save the cluster in the cluster key-value store
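The three steps above can be sketched as a band-keyed buffer. The group threshold, the in-process "key-value store", and the stand-in clustering step (dedupe and sort) are illustration-only simplifications:

```python
# Toy version of steps 1-3: buffer entities per band signature, then
# "cluster" and persist a group once its buffer is large enough.
from collections import defaultdict

GROUP_THRESHOLD = 3
cluster_store = {}            # band signature -> cluster members ("key-value store")
pending = defaultdict(list)   # band signature -> buffered entities

def bands(sig, band_size=5):
    return [sig[i:i + band_size] for i in range(0, len(sig), band_size)]

def ingest(entity_id, sig):
    for band in bands(sig):
        pending[band].append(entity_id)
        if len(pending[band]) >= GROUP_THRESHOLD:
            # Stand-in for the in-memory clustering algorithm.
            cluster_store[band] = sorted(set(pending.pop(band)))

# Three entities sharing the band "00011" (10-bit signatures, 2 bands each):
for eid, sig in [("e1", "0100100011"), ("e2", "0101100011"), ("e3", "0110100011")]:
    ingest(eid, sig)

print(cluster_store)  # {'00011': ['e1', 'e2', 'e3']}
```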
1. Real-time: streamlined data-processing flow
2. Scalability: flexible grouping and shuffling (Application / Signature)
3. Maintenance: separate bolts for system optimizations (memory, GC, CPU, etc.)
62. ● Crest: a similarity clustering system based on locality-sensitive hashing
● Detects spam in real time, built on top of a Heron topology
● Generic interface, clustering “everything” happening on Twitter