3. What we do?
• 14B page views per month
• At peak, 8,000-10,000 per sec
• Deployed Storm to production ~1 month ago
• Storm cluster of ~50 instances on AWS
[Diagram labels: Customer Traffic, Web Servers, Kafka, Data Transform/Aggregation, Storm, Databases, Dashboard, Algo Automation]
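As a rough sanity check on the traffic figures above, the monthly volume can be converted to an average per-second rate and compared with the quoted peak. This is a back-of-the-envelope sketch: the 30-day month and the reading of "per sec" as page views are assumptions, not stated on the slide.

```python
# Back-of-the-envelope check of the traffic figures on the slide.
# Assumptions (not from the slide): a 30-day month, and that the
# "per sec" peak figure counts page views.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def average_rate(page_views_per_month: float) -> float:
    """Average page views per second over one (assumed 30-day) month."""
    return page_views_per_month / SECONDS_PER_MONTH


avg = average_rate(14e9)
peak_low, peak_high = 8_000, 10_000

print(f"average: {avg:,.0f} page views/sec")
print(f"peak/average: {peak_low / avg:.2f}x to {peak_high / avg:.2f}x")
```

Under these assumptions the average works out to roughly 5,400 per second, so the quoted peak is about 1.5x to 1.9x the average.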
4. Before Storm
• Built our own distributed data processing
• ZMQ (ZeroMQ)
• Batch-based processing
• Processing partitioned by hashing on customer
• Advantages
• Simple in-house system built from very basic components
• Well understood
• Disadvantages
• Hard to scale; a constant battle to keep up with pings
• Machine management was clumsy
• Uneven distribution of traffic
• Multiple processes doing similar work, wasting resources
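The combination of "hashing by customer" and "uneven distribution of traffic" can be illustrated with a small sketch. This is not the in-house system described above; the function names, the MD5-based hash, and the traffic numbers are all hypothetical. It shows only that hashing on customer keeps each customer on one worker, so a single large customer makes that worker hot.

```python
import hashlib
from collections import Counter


def worker_for(customer_id: str, num_workers: int) -> int:
    """Map a customer to a worker with a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def load_per_worker(events_by_customer: dict, num_workers: int) -> Counter:
    """Sum the event volume that lands on each worker."""
    load = Counter({w: 0 for w in range(num_workers)})
    for customer_id, events in events_by_customer.items():
        load[worker_for(customer_id, num_workers)] += events
    return load


# Hypothetical traffic: one very large customer and many small ones.
traffic = {"big-publisher": 9_000}
traffic.update({f"small-{i}": 10 for i in range(100)})

load = load_per_worker(traffic, num_workers=4)
print(sorted(load.items()))
```

Whichever worker receives "big-publisher" carries at least 90% of the events in this example, no matter how evenly the small customers spread out.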
5. Why Kafka/Storm?
• Kafka
• open-source, distributed publish-subscribe messaging system
• Storm
• open-source, real-time computation system for continuous computation
• They are awesome
• Distributed, highly scalable, and fault tolerant
• High throughput
• Reliable
• Real-time
• Great at in-memory analytics and real-time decision support
7. Learning / Ideas
1. Kafka + ZooKeeper is extremely scalable and easy to set up. Check out the Brod library if you are doing Python
2. Use the Storm UI (Ganglia based) to monitor your cluster
3. Shell Bolts were inefficient and hard to debug (at least for us)
4. Upgrade to at least Storm version 0.8.2, which gives you capacity metrics on top of other goodies
5. Storm's anchoring/replay capability is awesome but comes with a visible overhead
6. Use a good framework to manage your cluster; we use SaltStack
7. Our unit tests are built in JUnit. Most built-in unit tests for Storm are only available in Clojure for now
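The replay behaviour mentioned in item 5, and where its overhead comes from, can be sketched with a toy model. This is plain Python, not the Storm API: `ToySpout`, `run`, and the failure rule are hypothetical and only mimic the idea that every emitted message is tracked until it is acked, and re-emitted if it fails.

```python
from collections import deque


class ToySpout:
    """Emits messages and replays any that are failed (at-least-once)."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (msg_id, payload)
        self.pending = {}  # msg_id -> payload still awaiting an ack
        self.emitted = 0

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload  # bookkeeping cost paid per message
        self.emitted += 1
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Put the message back on the queue so it is emitted again.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(messages, fail_first_attempt):
    """Process all messages; payloads in fail_first_attempt fail once."""
    spout = ToySpout(messages)
    already_failed = set()
    processed = []
    while True:
        item = spout.next_tuple()
        if item is None:
            break
        msg_id, payload = item
        if payload in fail_first_attempt and msg_id not in already_failed:
            already_failed.add(msg_id)
            spout.fail(msg_id)
        else:
            processed.append(payload)
            spout.ack(msg_id)
    return processed, spout.emitted


processed, emitted = run(["a", "b", "c"], fail_first_attempt={"b"})
print(processed, emitted)
```

Every message is eventually processed, but the failed one is emitted twice and arrives out of order. The per-message tracking in `pending` is the toy analogue of the overhead the slide refers to.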
8. Thank You
Alex Poon
@alexpoon06
@Outbrain
Yes, it is true. We are hiring!
www.visualrevenue.com/jobs