SlideShare ist ein Scribd-Unternehmen logo
1 von 39
Apache Storm and Spark
Streaming Compared
P. Taylor Goetz, Hortonworks
@ptgoetz
Honestly...
• I know a lot more about Apache Storm than I do
Apache Spark Streaming.
• I've been involved with Apache Storm, in one
way or another, since it was open-sourced.
• I'm admittedly biased.
But...
• A number of articles/papers comparing Apache
Storm and Spark Streaming are inaccurate in
terms of Storm’s features and performance
characteristics.
• Code and configuration for those studies is not
available, so independent verification is
impossible.
• Claims don't match real-world observations.
But...
• There is an inherent “Home Team Advantage” in
any benchmark comparison.
• Without open source code, any benchmark
claims are essentially marketing fluff, and should
be taken with a grain or two of NaCl.
• Any benchmark claim should be independently
verifiable.
Spark Streaming Paper
• Compares Spark Streaming (Micro-Batch) to
Core Storm (One-at-a-Time)
• A more appropriate comparison would have
been with Storm’s Trident (Micro-Batch) API
• Trident mentioned only in passing (on pages 3
and 12)
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Spark Streaming Paper
• Benchmark code/configuration not publicly
available
• Performance claims not independently verifiable
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Spark Streaming Paper
• Granted, the Spark Streaming paper is almost 2
years old and written at a time when Trident was
relatively new.
• However, that paper is often cited when
comparing Apache Storm and Spark Streaming,
particularly in terms of performance.
• A lot can change in 2 years.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
Streaming and batch
processing are
fundamentally different.
Batch vs. Streaming
• Storm is a stream processing framework that
also does micro-batching (Trident).

• Spark is a batch processing framework that also
does micro-batching (Spark Streaming).
Batch vs. Streaming
Batch Streaming
Batch vs. Streaming
Batch Streaming
Micro-Batch
Apache Storm: Two
Streaming APIs
Core Storm (Spouts and Bolts)!
• One at a Time
• Lower Latency
• Operates on Tuple Streams
Trident (Streams and Operations)!
• Micro-Batch
• Higher Throughput
• Operates on Streams of Tuple Batches and Partitions
Language Options
Core Storm Storm Trident Spark Streaming
• Java
• Clojure
• Scala
• Python
• Ruby
• others*
• Java
• Clojure
• Scala
• Java
• Scala
• Python
*Storm’s Multi-Lang feature allows the use of virtually any programming language.
Reliability Models
Core Storm Storm Trident
Spark
Streaming
At Most Once Yes Yes No
At Least Once Yes Yes No*
Exactly Once No Yes Yes*
*In some node failure scenarios, Spark Streaming
falls back to at-least-once processing or data loss.
Programing Model
Core Storm Storm Trident Spark Streaming
Stream Primitive Tuple
Tuple, Tuple
Batch, Partition
DStream
Stream Source Spouts
Spouts, Trident
Spouts
HDFS, Network
Computation/
Transformation
Bolts
Filters,
Functions,
Aggregations,
Joins
Transformation,
Window
Operations
Stateful
Operations
No
(roll your own)
Yes Yes
Output/
Persistence
Bolts State, MapState foreachRDD
Production Deployments
Apache Storm Spark Streaming
• Too many to list



http://
storm.incubator.apache.org/
documentation/Powered-
By.html
• Sharethrough



http://
engineering.sharethrough.com/blog/
2014/06/27/sharethrough-at-spark-
summit-2014-spark-streaming-for-
realtime-auctions/
Support
Apache Storm Spark
Spark
Streaming
Hadoop Distro
Hortonworks,
MapR
Cloudera,
MapR,
Hortonworks
(preview)
Hortonworks,
Cloudera,
MapR
Resource
Management
YARN, Mesos YARN, Mesos YARN*, Mesos
Provisioning/
Monitoring
Apache
Ambari
Cloudera
Manager
?
*With issues: http://spark-summit.org/wp-content/uploads/2014/07/
Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
Failure Scenarios
Worker Failure:
Spark Streaming
"So if a worker node fails, then the system can recompute
the lost from the the left over copy of the input data.
However, if the worker node where a network receiver was
running fails, then a tiny bit of data may be lost, that is, the
data received by the system but not yet replicated to other
node(s)."
Only HDFS-backed data sources are fully fault tolerant.
https://spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-properties
Worker Failure:
Spark Streaming
https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and-
zero-data-loss-in-spark-streaming.html
Solution?: Write Ahead Logs (SPARK-3129)
• Enabling WAL requires DFS (HDFS, S3) — no such
requirement with Storm
• Incurs a performance penalty that adds to overall latency
• Full fault tolerance still requires a data source that can
replay data (e.g. Kafka)!
• Architectural band aid?
Worker Failure:
Apache Storm
• If a supervisor node fails, Nimbus will reassign that node's
tasks to other nodes in the cluster.
• Any tuples sent to a failed node will time out and be
replayed (In Trident, any batches will be replayed).
• Delivery guarantees dependent on a reliable data source.
Data Source Reliability
• A data source is considered unreliable if there is no means
to replay a previously-received message.
• A data source is considered reliable if it can somehow replay
a message if processing fails at any point.
• A data source is considered durable if it can replay any
message or set of messages given the necessary selection
criteria.
!
(These are my terms.)
Reliability Limitations:
Apache Storm
• Exactly once processing requires a durable data source.
• At least once processing requires a reliable data source.
• An unreliable data source can be wrapped to provide
additional guarantees.
• With durable and reliable sources, Storm will not drop data.
• Common pattern: Back unreliable data sources with
Apache Kafka (minor latency hit traded for 100% durability).
Apache Storm Spouts
Durable!
Kafka











Reliable!
JMS
RabbitMQ /
AMQP
Kestrel
Amazon SQS
Amazon Kinesis
Unreliable!
Twitter
Scribe
MongoDB
Apache Storm Output
(Bolts, Trident State)
• Cassandra
• HBase
• HDFS
• Kafka
• Redis
• Memcached
• R
• JMS
• MongoDB
• RDBMS
Apache Storm + Kafka
Apache Kafka is an ideal source for Storm topologies. It
provides everything necessary for:
• At most once processing
• At least once processing
• Exactly once processing
Apache Storm includes Kafka spout implementations for all
levels of reliability.
Kafka Supports a wide variety of languages and integration
points for both producers and consumers.
Reliability Limitations:
Spark Streaming
• Fault tolerance and reliability guarantees require
HDFS-backed data source.
• Moving data to HDFS prior to stream processing
introduces additional latency.
• Network data sources (Kafka, etc.) are
vulnerable to data loss in the event of a worker
node failure.
https://spark.apache.org/docs/latest/streaming-programming-
guide.html#fault-tolerance-properties
Performance
“The main reason cited by Tathagata for Spark's
performance gain over Storm is the aggregation of
small records that occurs through the mechanics
of RDDs.”
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
In other words: Micro-Batching
Performance
http://www.cs.duke.edu/~kmoses/cps516/dstream.html
Storm capped at 10k msgs/sec/node?
Spark Streaming 40x faster than Storm?
Others may disagree…
https://twitter.com/
nathanmarz/status/
207989068519317505
http://www.slideshare.net/
JamesSirota/cisco-opensoc
Netty Transport
• Introduced in Apache Storm
0.9.0
• Faster, pure Java alternative
for 0MQ
• Yahoo! Engineering
announcement:

http://yahooeng.tumblr.com/post/
64758709722/making-storm-fly-
with-netty
• Performance Test Code:

https://github.com/yahoo/storm-
perf-test
Netty
0mq
STORM-297
• Introduced in Apache Storm
0.9.2-incubating
• Big performance boost,
especially for small messages
• JIRA Discussion:

https://issues.apache.org/jira/
browse/STORM-297
• Performance Test Code:

https://github.com/yahoo/storm-
perf-test
Benchmarking Storm
• 5 nodes on AWS (m1.large - not very powerful)
• 1 ZooKeeper, 1 Nimbus, 3 Supervisors
• Storm Core API and Trident API benchmarks
• Is Trident API slower than Core API?
https://github.com/ptgoetz/storm-benchmark
Is Trident API slower than
Core API?
• On low-power hardware with 3 supervisor nodes…
• Core API:
~150k msg./sec. with ~80 ms. latency
• Trident API:
~300k msg./sec. with ~250 ms. latency
• Higher throughput possible with increased latency
• Better performance with bigger hardware
Is Spark + Spark Streaming a
"Lambda Architecture in a Box?"
• No!
• Lambda is a lot more than batch + streaming.
• Lambda is powerful when applied correctly, but is
not right for every use case.
• Spark and Spark Streaming have overlapping
programming models for batch and micro-batch.
• The rest is up to you (as it is with Storm).
Final Thoughts
In general (not specific to Spark Streaming):!
• Beware any claim that A is X times faster than B.
• Performance is a matter of proper tuning for the
use case at hand.
• Any system can be hobbled to look bad in a
benchmark.
Recommendation
• It is up to you, and your specific use case.
• Consider fault tolerance. Is data loss
acceptable?
• Consider all facets and make informed
decisions.
• Rely on your own benchmarks
Questions?
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservicespflueras
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin PodvalMartin Podval
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsGuozhang Wang
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibTaras Matyashovsky
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connectconfluent
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksGuido Schmutz
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkDataWorks Summit
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsAndrzej Michałowski
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafkaemreakis
 

Was ist angesagt? (20)

Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Stability Patterns for Microservices
Stability Patterns for MicroservicesStability Patterns for Microservices
Stability Patterns for Microservices
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Kafka - Martin Podval
Apache Kafka - Martin PodvalApache Kafka - Martin Podval
Apache Kafka - Martin Podval
 
Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Exactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka StreamsExactly-once Stream Processing with Kafka Streams
Exactly-once Stream Processing with Kafka Streams
 
Introduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlibIntroduction to ML with Apache Spark MLlib
Introduction to ML with Apache Spark MLlib
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Stream Processing – Concepts and Frameworks
Stream Processing – Concepts and FrameworksStream Processing – Concepts and Frameworks
Stream Processing – Concepts and Frameworks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Real-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache FlinkReal-time Stream Processing with Apache Flink
Real-time Stream Processing with Apache Flink
 
Kafka 101
Kafka 101Kafka 101
Kafka 101
 
Feature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systemsFeature store: Solving anti-patterns in ML-systems
Feature store: Solving anti-patterns in ML-systems
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Apache Kafka
Apache KafkaApache Kafka
Apache Kafka
 

Andere mochten auch

Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureP. Taylor Goetz
 
지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기Ted Won
 
검색로그시스템 with Python
검색로그시스템 with Python검색로그시스템 with Python
검색로그시스템 with Pythonitproman35
 
파이썬 데이터 분석 3종세트
파이썬 데이터 분석 3종세트파이썬 데이터 분석 3종세트
파이썬 데이터 분석 3종세트itproman35
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for HadoopJoe Crobak
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopDataWorks Summit
 

Andere mochten auch (8)

Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Hadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm ArchitectureHadoop Summit Europe 2014: Apache Storm Architecture
Hadoop Summit Europe 2014: Apache Storm Architecture
 
지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기지금 핫한 Real-time In-memory Stream Processing 이야기
지금 핫한 Real-time In-memory Stream Processing 이야기
 
검색로그시스템 with Python
검색로그시스템 with Python검색로그시스템 with Python
검색로그시스템 with Python
 
파이썬 데이터 분석 3종세트
파이썬 데이터 분석 3종세트파이썬 데이터 분석 3종세트
파이썬 데이터 분석 3종세트
 
Workflow Engines for Hadoop
Workflow Engines for HadoopWorkflow Engines for Hadoop
Workflow Engines for Hadoop
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Realtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and HadoopRealtime Analytics with Storm and Hadoop
Realtime Analytics with Storm and Hadoop
 

Ähnlich wie Apache storm vs. Spark Streaming

Past, Present, and Future of Apache Storm
Past, Present, and Future of Apache StormPast, Present, and Future of Apache Storm
Past, Present, and Future of Apache StormP. Taylor Goetz
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormJungtaek Lim
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Spark Summit
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm Chandler Huang
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormEugene Dvorkin
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLArnab Biswas
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Yuval Itzchakov
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Chris Fregly
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming apphadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Storm presentation
Storm presentationStorm presentation
Storm presentationShyam Raj
 
Terraform - Taming Modern Clouds
Terraform  - Taming Modern CloudsTerraform  - Taming Modern Clouds
Terraform - Taming Modern CloudsNic Jackson
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?MapR Technologies
 
Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017
Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017
Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017Quentin Adam
 
lessons from managing a pulsar cluster
 lessons from managing a pulsar cluster lessons from managing a pulsar cluster
lessons from managing a pulsar clusterShivji Kumar Jha
 
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE PlatformsFIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE PlatformsFIWARE
 
Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh LalwaniDatabricks
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
 

Ähnlich wie Apache storm vs. Spark Streaming (20)

Apache Storm
Apache StormApache Storm
Apache Storm
 
Past, Present, and Future of Apache Storm
Past, Present, and Future of Apache StormPast, Present, and Future of Apache Storm
Past, Present, and Future of Apache Storm
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And Storm
 
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
Towards Benchmaking Modern Distruibuted Systems-(Grace Huang, Intel)
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Machine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkMLMachine Learning With H2O vs SparkML
Machine Learning With H2O vs SparkML
 
Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)Spark Streaming @ Scale (Clicktale)
Spark Streaming @ Scale (Clicktale)
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Terraform - Taming Modern Clouds
Terraform  - Taming Modern CloudsTerraform  - Taming Modern Clouds
Terraform - Taming Modern Clouds
 
Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?Stream Processing Everywhere - What to use?
Stream Processing Everywhere - What to use?
 
Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017
Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017
Monitoring the unknown, 1000*100 series a day - Big Data Vilnius 2017
 
lessons from managing a pulsar cluster
 lessons from managing a pulsar cluster lessons from managing a pulsar cluster
lessons from managing a pulsar cluster
 
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE PlatformsFIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
FIWARE Tech Summit - Docker Swarm Secrets for Creating Great FIWARE Platforms
 
Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh Lalwani
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 

Mehr von P. Taylor Goetz

Flux: Apache Storm Frictionless Topology Configuration & Deployment
Flux: Apache Storm Frictionless Topology Configuration & DeploymentFlux: Apache Storm Frictionless Topology Configuration & Deployment
Flux: Apache Storm Frictionless Topology Configuration & DeploymentP. Taylor Goetz
 
From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...P. Taylor Goetz
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphP. Taylor Goetz
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache StormP. Taylor Goetz
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014P. Taylor Goetz
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceP. Taylor Goetz
 

Mehr von P. Taylor Goetz (6)

Flux: Apache Storm Frictionless Topology Configuration & Deployment
Flux: Apache Storm Frictionless Topology Configuration & DeploymentFlux: Apache Storm Frictionless Topology Configuration & Deployment
Flux: Apache Storm Frictionless Topology Configuration & Deployment
 
From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...From Device to Data Center to Insights: Architectural Considerations for the ...
From Device to Data Center to Insights: Architectural Considerations for the ...
 
Large Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
 
The Future of Apache Storm
The Future of Apache StormThe Future of Apache Storm
The Future of Apache Storm
 
Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014Scaling Apache Storm - Strata + Hadoop World 2014
Scaling Apache Storm - Strata + Hadoop World 2014
 
Cassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market SceinceCassandra and Storm at Health Market Sceince
Cassandra and Storm at Health Market Sceince
 

Kürzlich hochgeladen

Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software DevelopersVinodh Ram
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantAxelRicardoTrocheRiq
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 

Kürzlich hochgeladen (20)

Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Professional Resume Template for Software Developers
Professional Resume Template for Software DevelopersProfessional Resume Template for Software Developers
Professional Resume Template for Software Developers
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Salesforce Certified Field Service Consultant
Salesforce Certified Field Service ConsultantSalesforce Certified Field Service Consultant
Salesforce Certified Field Service Consultant
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 

Apache storm vs. Spark Streaming

  • 1. Apache Storm and Spark Streaming Compared P. Taylor Goetz, Hortonworks @ptgoetz
  • 2. Honestly... • I know a lot more about Apache Storm than I do Apache Spark Streaming. • I've been involved with Apache Storm, in one way or another, since it was open-sourced. • I'm admittedly biased.
  • 3. But... • A number of articles/papers comparing Apache Storm and Spark Streaming are inaccurate in terms of Storm’s features and performance characteristics. • Code and configuration for those studies is not available, so independent verification is impossible. • Claims don't match real-world observations.
  • 4. But... • There is an inherent “Home Team Advantage” in any benchmark comparison. • Without open source code, any benchmark claims are essentially marketing fluff, and should be taken with a grain or two of NaCl. • Any benchmark claim should be independently verifiable.
  • 5. Spark Streaming Paper • Compares Spark Streaming (Micro-Batch) to Core Storm (One-at-a-Time) • A more appropriate comparison would have been with Storm’s Trident (Micro-Batch) API • Trident mentioned only in passing (on pages 3 and 12) http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 6. Spark Streaming Paper • Benchmark code/configuration not publicly available • Performance claims not independently verifiable http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 7. Spark Streaming Paper • Granted, the Spark Streaming paper is almost 2 years old and written at a time when Trident was relatively new. • However, that paper is often cited when comparing Apache Storm and Spark Streaming, particularly in terms of performance. • A lot can change in 2 years. http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
  • 8. Streaming and batch processing are fundamentally different.
  • 9. Batch vs. Streaming • Storm is a stream processing framework that also does micro-batching (Trident).
 • Spark is a batch processing framework that also does micro-batching (Spark Streaming).
  • 11. Batch vs. Streaming Batch Streaming Micro-Batch
  • 12. Apache Storm: Two Streaming APIs Core Storm (Spouts and Bolts)! • One at a Time • Lower Latency • Operates on Tuple Streams Trident (Streams and Operations)! • Micro-Batch • Higher Throughput • Operates on Streams of Tuple Batches and Partitions
  • 13. Language Options Core Storm Storm Trident Spark Streaming • Java • Clojure • Scala • Python • Ruby • others* • Java • Clojure • Scala • Java • Scala • Python *Storm’s Multi-Lang feature allows the use of virtually any programming language.
  • 14. Reliability Models Core Storm Storm Trident Spark Streaming At Most Once Yes Yes No At Least Once Yes Yes No* Exactly Once No Yes Yes* *In some node failure scenarios, Spark Streaming falls back to at-least-once processing or data loss.
  • 15. Programing Model Core Storm Storm Trident Spark Streaming Stream Primitive Tuple Tuple, Tuple Batch, Partition DStream Stream Source Spouts Spouts, Trident Spouts HDFS, Network Computation/ Transformation Bolts Filters, Functions, Aggregations, Joins Transformation, Window Operations Stateful Operations No (roll your own) Yes Yes Output/ Persistence Bolts State, MapState foreachRDD
  • 16. Production Deployments Apache Storm Spark Streaming • Too many to list
 
 http:// storm.incubator.apache.org/ documentation/Powered- By.html • Sharethrough
 
 http:// engineering.sharethrough.com/blog/ 2014/06/27/sharethrough-at-spark- summit-2014-spark-streaming-for- realtime-auctions/
  • 17. Support Apache Storm Spark Spark Streaming Hadoop Distro Hortonworks, MapR Cloudera, MapR, Hortonworks (preview) Hortonworks, Cloudera, MapR Resource Management YARN, Mesos YARN, Mesos YARN*, Mesos Provisioning/ Monitoring Apache Ambari Cloudera Manager ? *With issues: http://spark-summit.org/wp-content/uploads/2014/07/ Productionizing-a-247-Spark-Streaming-Service-on-YARN-Ooyala.pdf
  • 19. Worker Failure: Spark Streaming "So if a worker node fails, then the system can recompute the lost from the the left over copy of the input data. However, if the worker node where a network receiver was running fails, then a tiny bit of data may be lost, that is, the data received by the system but not yet replicated to other node(s)." Only HDFS-backed data sources are fully fault tolerant. https://spark.apache.org/docs/latest/streaming-programming- guide.html#fault-tolerance-properties
  • 20. Worker Failure: Spark Streaming https://databricks.com/blog/2015/01/15/improved-driver-fault-tolerance-and- zero-data-loss-in-spark-streaming.html Solution?: Write Ahead Logs (SPARK-3129) • Enabling WAL requires DFS (HDFS, S3) — no such requirement with Storm • Incurs a performance penalty that adds to overall latency • Full fault tolerance still requires a data source that can replay data (e.g. Kafka)! • Architectural band aid?
  • 21. Worker Failure: Apache Storm • If a supervisor node fails, Nimbus will reassign that node's tasks to other nodes in the cluster. • Any tuples sent to a failed node will time out and be replayed (In Trident, any batches will be replayed). • Delivery guarantees dependent on a reliable data source.
  • 22. Data Source Reliability • A data source is considered unreliable if there is no means to replay a previously-received message. • A data source is considered reliable if it can somehow replay a message if processing fails at any point. • A data source is considered durable if it can replay any message or set of messages given the necessary selection criteria. ! (These are my terms.)
  • 23. Reliability Limitations: Apache Storm • Exactly once processing requires a durable data source. • At least once processing requires a reliable data source. • An unreliable data source can be wrapped to provide additional guarantees. • With durable and reliable sources, Storm will not drop data. • Common pattern: Back unreliable data sources with Apache Kafka (minor latency hit traded for 100% durability).
  • 24. Apache Storm Spouts Durable! Kafka
 
 
 
 
 
 Reliable! JMS RabbitMQ / AMQP Kestrel Amazon SQS Amazon Kinesis Unreliable! Twitter Scribe MongoDB
  • 25. Apache Storm Output (Bolts, Trident State) • Cassandra • HBase • HDFS • Kafka • Redis • Memcached • R • JMS • MongoDB • RDBMS
  • 26. Apache Storm + Kafka Apache Kafka is an ideal source for Storm topologies. It provides everything necessary for: • At most once processing • At least once processing • Exactly once processing Apache Storm includes Kafka spout implementations for all levels of reliability. Kafka Supports a wide variety of languages and integration points for both producers and consumers.
  • 27. Reliability Limitations: Spark Streaming • Fault tolerance and reliability guarantees require HDFS-backed data source. • Moving data to HDFS prior to stream processing introduces additional latency. • Network data sources (Kafka, etc.) are vulnerable to data loss in the event of a worker node failure. https://spark.apache.org/docs/latest/streaming-programming- guide.html#fault-tolerance-properties
  • 28. Performance “The main reason cited by Tathagata for Spark's performance gain over Storm is the aggregation of small records that occurs through the mechanics of RDDs.” http://www.cs.duke.edu/~kmoses/cps516/dstream.html In other words: Micro-Batching
  • 29. Performance http://www.cs.duke.edu/~kmoses/cps516/dstream.html Storm capped at 10k msgs/sec/node? Spark Streaming 40x faster than Storm? Others may disagree…
  • 31. Netty Transport • Introduced in Apache Storm 0.9.0 • Faster, pure Java alternative for 0MQ • Yahoo! Engineering announcement:
 http://yahooeng.tumblr.com/post/ 64758709722/making-storm-fly- with-netty • Performance Test Code:
 https://github.com/yahoo/storm- perf-test Netty 0mq
  • 32. STORM-297 • Introduced in Apache Storm 0.9.2-incubating • Big performance boost, especially for small messages • JIRA Discussion:
 https://issues.apache.org/jira/ browse/STORM-297 • Performance Test Code:
 https://github.com/yahoo/storm- perf-test
  • 33. Benchmarking Storm • 5 nodes on AWS (m1.large - not very powerful) • 1 ZooKeeper, 1 Nimbus, 3 Supervisors • Storm Core API and Trident API benchmarks • Is Trident API slower than Core API? https://github.com/ptgoetz/storm-benchmark
  • 34. Is Trident API slower than Core API? • On low-power hardware with 3 supervisor nodes… • Core API: ~150k msg./sec. with ~80 ms. latency • Trident API: ~300k msg./sec. with ~250 ms. latency • Higher throughput possible with increased latency • Better performance with bigger hardware
  • 35. Is Spark + Spark Streaming a "Lambda Architecture in a Box?" • No! • Lambda is a lot more than batch + streaming. • Lambda is powerful when applied correctly, but is not right for every use case. • Spark and Spark Streaming have overlapping programming models for batch and micro-batch. • The rest is up to you (as it is with Storm).
  • 36. Final Thoughts In general (not specific to Spark Streaming):! • Beware any claim that A is X times faster than B. • Performance is a matter of proper tuning for the use case at hand. • Any system can be hobbled to look bad in a benchmark.
  • 37. Recommendation • It is up to you, and your specific use case. • Consider fault tolerance. Is data loss acceptable? • Consider all facets and make informed decisions. • Rely on your own benchmarks