SlideShare ist ein Scribd-Unternehmen logo
1 von 30
Stream Computing
(The engineer’s perspective)
Ilya Ganelin
Batch vs. Stream
• Batch
• Process chunks of data instead of one at a time
• Throughput over latency (seconds, minutes, hours)
• E.g. MapReduce, Spark, Tez
• Stream
• Data processed one at a time
• Latency over throughput (microseconds, milliseconds)
• E.g. Storm, Flink, Apex, KafkaStreams, GearPump
Scalability, Performance, Durability, Availability
• How do we handle more data?
• Quickly?
• Without ever losing data or compute?
• And ensure the system keeps working, even if there are failures?
What are the tradeoffs?
• If we focus on scalability, it’s harder to guarantee
• Durability – more moving pieces, more coordination, more failures
• Availability – more failures, harder to stay operational
• Performance – bottlenecks and synchronization
• If we focus on availability, it’s harder to guarantee
• Performance – monitoring and synchronization overhead
• Scalability and performance
• Durability – must recover without losing data
• If we focus on durability, it’s harder to guarantee
• Performance
• Scalability
Batch compute has it easy.
• Get scale-out and performance by adding hardware and taking longer
• Get durability with a durable data store and recompute
• Get availability by taking longer to recover (this makes life easier!)
• In stream processing, you don’t have time!
It’s not about performance and scale.
• Most platforms handle large volume of data relatively quickly
• It’s about:
• Ease of use – how quickly can I build a complex application? Not word count.
• Failure-handling – what happens when things break?
• Durability – how do I avoid losing data without sacrificing performance?
• Availability – how can I keep my system operational with a minimum of labor
and without sacrificing performance?
Next: Case Studies in Open-Source Streaming
• Storm
• Flink
• Apex
Apache Storm
• Tried and true, was deployed on 10,000 node clusters at Twitter
• Scalable
• Performant
• Easy to use
• Weaknesses:
• Failure handling
• Operationalization at scale
• Flexibility
• Obsolete?
How does it work?
How does it work?
Failure Detection
Failure Detection
No durability of data in flight or guarantee of exactly once processing!
Where do the weakness come from?
• Nimbus was a single point of failure (fixed as of 1.0.0 release)
• Upstream bolt/spout failure triggers re-compute on entire tree
• Can only create parallel independent stream by having separate redundant
topologies
• Bolts/spouts share JVM  Hard to debug
• Failed tuples cannot be replayed quicker than 1s (lower limit on Ack)
• No dynamic topologies
• Cannot add or remove applications without service interruption
• Poor resource sharing in large clusters
Enter the Competition – Apache Flink
• Declarative functional API (like Spark)
• But, true streaming platform with support for CEP
• Optimized query execution
• Fast-growing popularity
How does it work?
Failure Handling
So what’s different from Storm?
• Flink handles planning and optimization for you
• Abstracts lower level internals
• Clear semantics around windowing (which Storm has lacked)
• Failure handling is lightweight and fast!
• Exactly once processing (given appropriate connectors at start/end)
• Can run Storm
What can’t it do?
• Dynamically update topology
• Dynamically scale
• Recover from errors without stopping the entire DAG
• Allow fine-grained control of how data moves through the system –
locality, data partitioning, routing
• You can do these individually, but not all at once
• The high level API can be a curse (like in Spark)!
So what else is there?
Onyx
Which are unique?
• Apache Beam (Google’s baby - unifies all the platforms)
• Apache Apex (Robust architecture, scalable, fast, durable)
• IBM InfoSphere Streams (proprietary, expensive, the best)
Let’s look at Apex
• Unique provenance
• Built for the business from experience at Yahoo! – not a research project
• Built for reliability and strict processing semantics, not performance
• Apex just works
• Strengths
• Dynamism
• Scalability
• Failure-handling
• Weaknesses
• Nascent high-level API
• More complex architecture
How does it work?
Failure Handling
So it’s the best? Sort of!
• Most robust failure-handling
• Allows fine-tuning of data flows and DAG setup
• Excellent exploratory UI
• But
• Learning curve
• Nascent high-level API
• No machine learning support
• Built for business, not for simplicity
Streaming is great – what about state?
• What if I need to persist data?
• Across operators?
• Retrieve it quickly?
• Do complex analytics?
• And build models?
Why state?
• Historical features (e.g. spend amount over 30 days)
• Statistical aggregates
• Machine learning model training
• Why Cross operator? Because of how data is partitioned, allows
aggregation over multiple fields.
Distributed In-Memory Databases
• Can support low-latency streaming use cases
• Durability becomes complicated because memory is volatile
• Memory is expensive and limited
• Examples: ScyllaDB, Geode, Memcached, Redis, MemSQL, Ignite,
Hazelcast, Distributed Hash Tables
Your Guide to Streaming - The Engineer's Perspective

Weitere ähnliche Inhalte

Was ist angesagt?

Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Databricks
 
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUsCreating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Databricks
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Databricks
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Thoughtworks
 

Was ist angesagt? (20)

Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Spark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar VeliqiSpark Summit EU talk by Ruben Pulido Behar Veliqi
Spark Summit EU talk by Ruben Pulido Behar Veliqi
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache SparkResource-Efficient Deep Learning Model Selection on Apache Spark
Resource-Efficient Deep Learning Model Selection on Apache Spark
 
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
Apache Spark on Supercomputers: A Tale of the Storage Hierarchy with Costin I...
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Migration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQLMigration From Oracle to PostgreSQL
Migration From Oracle to PostgreSQL
 
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUsCreating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
Creating an 86,000 Hour Speech Dataset with Apache Spark and TPUs
 
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph BradleyDeploying MLlib for Scoring in Structured Streaming with Joseph Bradley
Deploying MLlib for Scoring in Structured Streaming with Joseph Bradley
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Structured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at AppleStructured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at Apple
 
Open Air 2016 Mini Talk
Open Air 2016 Mini TalkOpen Air 2016 Mini Talk
Open Air 2016 Mini Talk
 
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
 
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
Big data pipeline with scala by Rohit Rai, Tuplejump - presented at Pune Scal...
 
Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator
Deploying Apache Spark Jobs on Kubernetes with Helm and Spark OperatorDeploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator
Deploying Apache Spark Jobs on Kubernetes with Helm and Spark Operator
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...
FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...
FOSS4G In The Cloud: Using Open Source to build Cloud based Spatial Infrastru...
 

Ähnlich wie Your Guide to Streaming - The Engineer's Perspective

Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
d0nn9n
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
Yves Goeleven
 

Ähnlich wie Your Guide to Streaming - The Engineer's Perspective (20)

Getting Deep on Orchestration - Nickoloff - DockerCon16
Getting Deep on Orchestration - Nickoloff - DockerCon16Getting Deep on Orchestration - Nickoloff - DockerCon16
Getting Deep on Orchestration - Nickoloff - DockerCon16
 
Tiger oracle
Tiger oracleTiger oracle
Tiger oracle
 
Five Years of EC2 Distilled
Five Years of EC2 DistilledFive Years of EC2 Distilled
Five Years of EC2 Distilled
 
Building large scale, job processing systems with Scala Akka Actor framework
Building large scale, job processing systems with Scala Akka Actor frameworkBuilding large scale, job processing systems with Scala Akka Actor framework
Building large scale, job processing systems with Scala Akka Actor framework
 
Next-Gen Decision Making in Under 2ms
Next-Gen Decision Making in Under 2msNext-Gen Decision Making in Under 2ms
Next-Gen Decision Making in Under 2ms
 
Tale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache FlinkTale of two streaming frameworks- Apace Storm & Apache Flink
Tale of two streaming frameworks- Apace Storm & Apache Flink
 
Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)Tale of two streaming frameworks (Karthik D - Walmart)
Tale of two streaming frameworks (Karthik D - Walmart)
 
Capital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 msCapital One's Next Generation Decision in less than 2 ms
Capital One's Next Generation Decision in less than 2 ms
 
Azug - successfully breeding rabits
Azug - successfully breeding rabitsAzug - successfully breeding rabits
Azug - successfully breeding rabits
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology A Practical Guide to Selecting a Stream Processing Technology
A Practical Guide to Selecting a Stream Processing Technology
 
World-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon RedshiftWorld-class Data Engineering with Amazon Redshift
World-class Data Engineering with Amazon Redshift
 
Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)Mtc learnings from isv & enterprise (dated - Dec -2014)
Mtc learnings from isv & enterprise (dated - Dec -2014)
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interaction
 
Performance architecture for cloud connect
Performance architecture for cloud connectPerformance architecture for cloud connect
Performance architecture for cloud connect
 
Scaling Systems: Architectures that grow
Scaling Systems: Architectures that growScaling Systems: Architectures that grow
Scaling Systems: Architectures that grow
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
 
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
 
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...Fixing Twitter  Improving The Performance And Scalability Of The Worlds Most ...
Fixing Twitter Improving The Performance And Scalability Of The Worlds Most ...
 

Kürzlich hochgeladen

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Christo Ananth
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
MsecMca
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Kürzlich hochgeladen (20)

Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
Call for Papers - African Journal of Biological Sciences, E-ISSN: 2663-2187, ...
 
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bangalore ☎ 7737669865 🥵 Book Your One night Stand
 
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELLPVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
PVC VS. FIBERGLASS (FRP) GRAVITY SEWER - UNI BELL
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar  ≼🔝 Delhi door step de...
Call Now ≽ 9953056974 ≼🔝 Call Girls In New Ashok Nagar ≼🔝 Delhi door step de...
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 

Your Guide to Streaming - The Engineer's Perspective

  • 1. Stream Computing (The engineer’s perspective) Ilya Ganelin
  • 2. Batch vs. Stream • Batch • Process chunks of data instead of one at a time • Throughput over latency (seconds, minutes, hours) • E.g. MapReduce, Spark, Tez • Stream • Data processed one at a time • Latency over throughput (microseconds, milliseconds) • E.g. Storm, Flink, Apex, KafkaStreams, GearPump
  • 3. Scalability, Performance, Durability, Availability • How do we handle more data? • Quickly? • Without ever losing data or compute? • And ensure the system keeps working, even if there are failures?
  • 4.
  • 5. What are the tradeoffs? • If we focus on scalability, it’s harder to guarantee • Durability – more moving pieces, more coordination, more failures • Availability – more failures, harder to stay operational • Performance – bottlenecks and synchronization • If we focus on availability, it’s harder to guarantee • Performance – monitoring and synchronization overhead • Scalability and performance • Durability – must recover without losing data • If we focus on durability, it’s harder to guarantee • Performance • Scalability
  • 6. Batch compute has it easy. • Get scale-out and performance by adding hardware and taking longer • Get durability with a durable data store and recompute • Get availability by taking longer to recover (this makes life easier!) • In stream processing, you don’t have time!
  • 7. It’s not about performance and scale. • Most platforms handle large volume of data relatively quickly • It’s about: • Ease of use – how quickly can I build a complex application? Not word count. • Failure-handling – what happens when things break? • Durability – how do I avoid losing data without sacrificing performance? • Availability – how can I keep my system operational with a minimum of labor and without sacrificing performance?
  • 8. Next: Case Studies in Open-Source Streaming • Storm • Flink • Apex
  • 9. Apache Storm • Tried and true, was deployed on 10,000 node clusters at Twitter • Scalable • Performant • Easy to use • Weaknesses: • Failure handling • Operationalization at scale • Flexibility • Obsolete?
  • 10. How does it work?
  • 11. How does it work?
  • 13. Failure Detection No durability of data in flight or guarantee of exactly once processing!
  • 14. Where do the weakness come from? • Nimbus was a single point of failure (fixed as of 1.0.0 release) • Upstream bolt/spout failure triggers re-compute on entire tree • Can only create parallel independent stream by having separate redundant topologies • Bolts/spouts share JVM  Hard to debug • Failed tuples cannot be replayed quicker than 1s (lower limit on Ack) • No dynamic topologies • Cannot add or remove applications without service interruption • Poor resource sharing in large clusters
  • 15. Enter the Competition – Apache Flink • Declarative functional API (like Spark) • But, true streaming platform with support for CEP • Optimized query execution • Fast-growing popularity
  • 16. How does it work?
  • 18. So what’s different from Storm? • Flink handles planning and optimization for you • Abstracts lower level internals • Clear semantics around windowing (which Storm has lacked) • Failure handling is lightweight and fast! • Exactly once processing (given appropriate connectors at start/end) • Can run Storm
  • 19. What can’t it do? • Dynamically update topology • Dynamically scale • Recover from errors without stopping the entire DAG • Allow fine-grained control of how data moves through the system – locality, data partitioning, routing • You can do these individually, but not all at once • The high level API can be a curse (like in Spark)!
  • 20. So what else is there? Onyx
  • 21. Which are unique? • Apache Beam (Google’s baby - unifies all the platforms) • Apache Apex (Robust architecture, scalable, fast, durable) • IBM InfoSphere Streams (proprietary, expensive, the best)
  • 22. Let’s look at Apex • Unique provenance • Built for the business from experience at Yahoo! – not a research project • Built for reliability and strict processing semantics, not performance • Apex just works • Strengths • Dynamism • Scalability • Failure-handling • Weaknesses • Nascent high-level API • More complex architecture
  • 23. How does it work?
  • 24.
  • 26. So it’s the best? Sort of! • Most robust failure-handling • Allows fine-tuning of data flows and DAG setup • Excellent exploratory UI • But • Learning curve • Nascent high-level API • No machine learning support • Built for business, not for simplicity
  • 27. Streaming is great – what about state? • What if I need to persist data? • Across operators? • Retrieve it quickly? • Do complex analytics? • And build models?
  • 28. Why state? • Historical features (e.g. spend amount over 30 days) • Statistical aggregates • Machine learning model training • Why Cross operator? Because of how data is partitioned, allows aggregation over multiple fields.
  • 29. Distributed In-Memory Databases • Can support low-latency streaming use cases • Durability becomes complicated because memory is volatile • Memory is expensive and limited • Examples: ScyllaDB, Geode, Memcached, Redis, MemSQL, Ignite, Hazelcast, Distributed Hash Tables