Twitter Heron
Evolution or Revolution?
Analytics Conf, November 15-16, 2016
Grzegorz Kolpuc
@gkolpuc
https://pl.linkedin.com/in/grzegorz-kolpuc-7000b755
There are 310M monthly active users
https://www.brandwatch.com/blog/44-twitter-stats-2016/
A total of 1.3 billion accounts have been
created
https://www.brandwatch.com/blog/44-twitter-stats-2016/
There are 500 million Tweets sent each day.
That’s 6,000 Tweets every second.
https://www.brandwatch.com/blog/44-twitter-stats-2016/
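As a quick sanity check on the per-second figure above, the daily total can be divided by the number of seconds in a day. This is a back-of-the-envelope sketch in Python. The function name is made up for illustration, and it assumes tweets are spread evenly over the day, which real traffic is not.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_per_second(events_per_day: int) -> float:
    """Average event rate, assuming a perfectly even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


# 500 million Tweets per day, as quoted on the slide.
tweets_per_second = average_per_second(500_000_000)
print(f"{tweets_per_second:,.0f} tweets per second on average")
```

The result is a little under 5,800 per second, which the slide rounds to 6,000.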
Enable analytics:
scoring, stats, trends,
recommendations, real-time
reporting
http://www.slideshare.net/KrishnaGade2/storm-at-twitter
A long time ago...
What is Storm?
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Storm makes it easy to reliably process unbounded streams of data, doing for
realtime processing what Hadoop did for batch processing (DAG processing engine)
http://hortonworks.com/blog/brief-history-apache-storm/
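To make the "DAG processing engine" idea concrete, here is a minimal sketch of a topology described as a directed acyclic graph. It is plain Python and does not use the Storm API. The component names are hypothetical, and the only library call is `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical topology: each component maps to the components it reads from.
topology = {
    "sentence_spout": set(),            # source of the stream, no inputs
    "split_bolt": {"sentence_spout"},   # consumes sentences
    "count_bolt": {"split_bolt"},       # consumes words
    "report_bolt": {"count_bolt"},      # consumes counts
}


def processing_order(graph):
    """Return the components so that every producer precedes its consumers."""
    return list(TopologicalSorter(graph).static_order())


order = processing_order(topology)
print(" -> ".join(order))
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is the property the "acyclic" in DAG rules out.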
2011: Twitter acquires @BackType
Storm at Twitter
(2013)
Benchmarked at a million tuples
processed per second
Running 30 topologies in a 200
node cluster
Processing 50 billion messages a
day with an average complete
latency under 50ms
http://www.slideshare.net/KrishnaGade2/storm-at-twitter/39-numbers_benchmarked_at_a_million
Storm is very powerful, but...
http://www.slideshare.net/KrishnaGade2/storm-at-twitter
Apache Storm issues
Performance
● Every worker is homogeneous, which results in inefficient utilization of allocated resources
● There is no backpressure mechanism
● Topologies using a large amount of RAM for a worker encounter GC cycles longer than a minute
Debugging
● Each worker runs a mix of tasks
● Logs from multiple tasks are written into a single file
● Each tuple has to pass through four threads in the worker process from the point of entry to the point of exit
Scheduling
● Multiple levels of scheduling
● An unhandled exception in a single task takes down the whole worker process
● Nimbus is a single point of failure
https://blog.acolyer.org/2015/06/15/twitter-heron-stream-processing-at-scale/
Enhancing Storm would take too long and no other system met their
scaling, throughput and latency needs. Plus, other systems are not
compatible with Storm’s API, requiring rewriting all topologies. The
decision was to create Heron, but keep its external API compatible with
Storm’s.
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Twitter approach...
Flying faster with Twitter Heron
Tuesday, June 2, 2015 | By Karthik Ramasamy (@karthikz), Engineering Manager
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Flying Faster with Twitter Heron
Scheduler
Pluggable solution, fitted to Twitter's infrastructure:
Apache Mesos + Apache Aurora
Back Pressure
Automatically slows down tuple production
when queues are overloaded
Easy Debugging
Moved from a typical thread-based system to a
process-based system (running each task in isolation)
Compatibility with Storm
Easy migration from Storm to Heron
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
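The back pressure idea on this slide can be illustrated with a toy simulation: a producer that emits faster than the consumer can process, and a queue that tells the producer to pause when it gets too deep. This is a sketch of the general technique only, not Heron's implementation. The function name, the tick-based model, and the high/low watermark values are all assumptions made for illustration.

```python
from collections import deque


def simulate(ticks, produce_per_tick, consume_per_tick, high_water, low_water):
    """Toy model of a fast producer feeding a slower consumer through a queue."""
    queue = deque()
    throttled = False
    emitted = processed = 0
    max_depth = throttled_ticks = 0
    for _ in range(ticks):
        if throttled:
            # Back pressure is on: the producer emits nothing this tick.
            throttled_ticks += 1
        else:
            for _ in range(produce_per_tick):
                queue.append(emitted)
                emitted += 1
        max_depth = max(max_depth, len(queue))
        # The consumer drains up to its per-tick capacity.
        for _ in range(min(consume_per_tick, len(queue))):
            queue.popleft()
            processed += 1
        # Hysteresis: pause at the high watermark, resume at the low one.
        if len(queue) >= high_water:
            throttled = True
        elif len(queue) <= low_water:
            throttled = False
    return {
        "emitted": emitted,
        "processed": processed,
        "queued": len(queue),
        "max_depth": max_depth,
        "throttled_ticks": throttled_ticks,
    }


stats = simulate(ticks=100, produce_per_tick=3, consume_per_tick=1,
                 high_water=10, low_water=5)
print(stats)
```

In this model no tuple is ever dropped: every emitted tuple is either processed or still waiting in the queue, and the queue depth stays bounded near the high watermark instead of growing for as long as the producer outpaces the consumer.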
Heron Performance
We compared the performance of Heron with Twitter’s production version of Storm, which was forked from an open source
version in October 2013, using word count topology. This topology counts the distinct words in a stream generated from a
set of 150,000 words.
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
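For readers unfamiliar with the benchmark, the shape of a word count topology can be sketched in a few lines. This is a single-process Python illustration, not the Storm or Heron API and not the benchmark code. The function names, the synthetic `word<N>` vocabulary, the seed, and the number of tuples are assumptions; only the vocabulary size of 150,000 comes from the slide.

```python
import random
from collections import Counter
from typing import Iterable, Iterator


def word_spout(vocabulary_size: int, n_tuples: int, seed: int = 0) -> Iterator[str]:
    """Emit n_tuples words drawn at random from a synthetic vocabulary."""
    rng = random.Random(seed)
    for _ in range(n_tuples):
        yield f"word{rng.randrange(vocabulary_size)}"


def count_bolt(words: Iterable[str]) -> Counter:
    """Keep a running count for every distinct word seen in the stream."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts


counts = count_bolt(word_spout(vocabulary_size=150_000, n_tuples=10_000))
print(len(counts), "distinct words seen in", sum(counts.values()), "tuples")
```

In a real topology the spout and bolt would run as separate, parallel instances with tuples routed between them; here a generator feeding a function stands in for that data flow.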
Heron at Twitter
At Twitter, Heron is used as our primary streaming system, running
hundreds of development and production topologies. Since Heron is
efficient in terms of resource usage, after migrating all Twitter’s
topologies to it we’ve seen an overall 3x reduction in hardware, causing
a significant improvement in our infrastructure efficiency.
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Heron Topology vs. Storm Topology
[Diagram: Storm topology with Nimbus, two Supervisors, two groups of workers W1-W4, and a ZK Cluster; Heron topology with a Topology master, two Stream Managers, two Metrics Managers, two groups of instances I1-I4, and a ZK Cluster]
[Diagram, continued: the same comparison, with Scheduler, Uploader, and Heron Tracker added]
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Open Sourcing Twitter Heron
Wednesday, May 25, 2016 | By Karthik Ramasamy (@karthikz), Engineering Manager
https://blog.twitter.com/2016/open-sourcing-twitter-heron
Inside Heron
Written in Java & Python (~80%)
Critical parts of the framework (the code that manages the topologies
and network communications) are not written in a JVM language
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
In the meantime...
Storm has evolved
Heron's speed improvements are measured from the Storm 0.8.x code it
diverged from, not the current version. If you have already migrated to
Storm 1.0, you might not see much improvement over your
current Storm topologies, and you may run into incompatibilities
between Storm's and Heron's implementations of new features such as
back-pressure support
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
http://hortonworks.com/blog/brief-history-apache-storm/
Storm has evolved
➢ Support for back pressure
➢ Introduced pacemaker (daemon for offloading heartbeat traffic
from ZooKeeper, freeing larger topologies from the infamous
ZooKeeper bottleneck)
➢ Nimbus HA
➢ Distributed cache
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Storm has evolved
➢ Improved debugging and profiling options
➢ 60 percent decrease in latency
➢ Up to 16x speed improvement
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
When to use Storm?
➢ Want to avoid infrastructure configuration overhead (Heron is
currently tied to Mesos, so if you don't have existing Mesos
infrastructure, you'll need to set that up as well, which is no small
undertaking)
➢ Don’t need extremely large scale
➢ DRPC (deprecated in Heron)
➢ More ready-to-use integrations
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
When to use Heron?
➢ Have Mesos infrastructure
➢ Larger scale
➢ Running multiple clusters
https://blog.twitter.com/2015/flying-faster-with-twitter-heron
Evolution or Revolution?
Q&A
@gkolpuc


Editor's Notes

  1. Let us count the ways…
     ● Multiple levels of scheduling and their complex interaction lead to uncertainty about when tasks are being scheduled.
     ● Each worker runs a mix of tasks, making it difficult to reason about the behaviour and performance of a particular task, since it is not possible to isolate its resource usage.
     ● Logs from multiple tasks are written into a single file, making it hard to identify errors and exceptions associated with a particular task, and causing tasks that log verbosely to swamp the logs of other tasks.
     ● An unhandled exception in a single task takes down the whole worker process, killing other (perfectly fine) tasks.
     ● Storm assumes that every worker is homogeneous, which results in inefficient utilization of allocated resources, and often results in over-provisioning.
     ● Because of the large amount of memory allocated to workers, use of common profiling tools becomes very cumbersome. Dumps take so long that the heartbeats are missed and the supervisor kills the process (preventing the dump from completing).
     ● Re-architecting Storm to run one task per worker would lead to big inefficiencies in resource usage and limit the degree of parallelism achieved.
     ● Each tuple has to pass through four (count ’em) threads in the worker process from the point of entry to the point of exit. This design leads to significant overhead and contention issues.
     ● Nimbus is functionally overloaded and becomes an operational bottleneck.
     ● Storm workers belonging to different topologies but running on the same machine can interfere with each other, which leads to untraceable performance issues. Thus Twitter had to run production Storm topologies in isolation on dedicated machines, which of course leads to wasted resources.
     ● Nimbus is a single point of failure. When it fails, you can’t submit any new topologies or kill existing ones. Nor can any topology that undergoes failures be detected and recovered.
     ● There is no backpressure mechanism. This can result in unbounded tuple drops with little visibility into the situation when acknowledgements are disabled. Work done by upstream components can be lost, and in extreme scenarios the topology can fail to make any progress while consuming all resources.
     ● A tuple failure anywhere in the tuple tree leads to failure of the whole tuple tree.
     ● Topologies using a large amount of RAM for a worker encounter GC cycles greater than a minute.
     ● There can be a lot of contention at the transfer queues, especially when a worker runs several executors.
     To mitigate some of these performance risks, Twitter often had to over-provision the allocated resources. And they really do mean over-provision: one of their topologies used 600 cores at an average 20-30% utilization. From the analysis, one would have expected the topology to require only 150 cores.
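The over-provisioning figures quoted above can be checked with simple arithmetic. The notes do not say how the 150-core estimate was derived, so the sketch below only shows that it is consistent with the quoted utilization range. The function name is hypothetical, and the 25% midpoint is an assumption.

```python
def cores_actually_used(allocated_cores: int, utilization_percent: int) -> float:
    """Cores' worth of work done at a given average utilization."""
    return allocated_cores * utilization_percent / 100


allocated = 600  # cores allocated to the topology, from the notes

low = cores_actually_used(allocated, 20)   # lower end of the quoted range
high = cores_actually_used(allocated, 30)  # upper end of the quoted range
mid = cores_actually_used(allocated, 25)   # assumed midpoint of the range

overprovision_factor = allocated / mid
print(f"busy cores: {low:.0f} to {high:.0f}, midpoint {mid:.0f}")
print(f"allocated / needed at the midpoint: {overprovision_factor:.0f}x")
```

At the midpoint of the range the topology does about 150 cores' worth of work, which matches the expected requirement and implies roughly four times more cores were allocated than the analysis called for.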