Performance Comparison of Streaming Big Data Platforms
Reza Farivar
Capital One Inc.
Kyle Knusbaum
Yahoo Inc.
Streaming Computation engines
• Designed to process a continuous stream of data.
• Designed to process data with low latency – data (ideally) doesn’t buffer up before
being processed, in contrast with batch processing systems such as MapReduce.
• Designed to handle big data. The systems are distributed by design.
• Apache Storm has the TopologyBuilder API to create a directed graph (topology) through
which streams of data flow.
• “Spouts” are the entry point to the graph, and “bolts” perform the processing.
• Data flows through the system as individual tuples.
• Graphs are not necessarily acyclic (although that is often the case).
[Diagram: a Kafka spout feeding processing bolts, with results written to a database]
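The spout/bolt dataflow above can be sketched as a toy model. This is plain Java, not the actual Storm TopologyBuilder API; `spout` and `bolt` here are illustrative stand-ins for the entry point and a processing step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Minimal model of Storm's spout/bolt dataflow (illustrative only):
// a spout emits tuples, and each bolt transforms the stream it receives.
public class ToyTopology {
    // "Spout": the entry point producing a stream of tuples.
    public static List<String> spout() {
        List<String> tuples = new ArrayList<>();
        for (int i = 0; i < 3; i++) tuples.add("event-" + i);
        return tuples;
    }

    // "Bolt": a processing step applied to every tuple flowing through.
    public static List<String> bolt(List<String> in, Function<String, String> f) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(f.apply(t));
        return out;
    }

    public static void main(String[] args) {
        // Wire spout -> bolt, the way a topology graph chains stages.
        List<String> processed = bolt(spout(), t -> t.toUpperCase());
        System.out.println(processed); // [EVENT-0, EVENT-1, EVENT-2]
    }
}
```

In real Storm, the bolt would emit tuples downstream to further bolts rather than return a list, and the graph is declared up front and executed by the cluster.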
• Apache Flink has the DataStream API to perform operations on streams of data. (map,
filter, reduce, join, etc.)
• These operations are turned into a graph at job submission time by Flink.
• Underlying graph works similarly to Storm’s model.
• Also supports a Storm-compatible API
[Diagram: a Flink streaming job writing results to a database]
• Apache Spark has the DStream API to perform operations on streams of data. (map,
filter, reduce, join, etc.) Based on Spark’s RDD (Resilient Distributed Dataset)
abstraction.
• Similar to Flink’s API.
• Streaming accomplished through micro-batches.
• Spark streaming job consists of one small batch after another.
Spark Streaming
[Diagram: Spark Streaming as a sequence of RDD micro-batches, with results written to a database]
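A minimal sketch of the micro-batch idea, in plain Java with assumed helper names (not Spark's API): events are bucketed by fixed-duration batch, and an event is only processed when its batch closes.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative micro-batching: group a continuous stream of timestamped
// events into fixed-duration batches, the way Spark Streaming turns a
// stream into one small batch after another.
public class MicroBatcher {
    // Assign an event timestamp (ms) to the start of its batch window.
    public static long batchStart(long eventTimeMs, long batchDurationMs) {
        return (eventTimeMs / batchDurationMs) * batchDurationMs;
    }

    // Count events per micro-batch, keyed by batch start time.
    public static Map<Long, Integer> batch(long[] eventTimesMs, long batchDurationMs) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMs)
            counts.merge(batchStart(t, batchDurationMs), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        long[] events = {100, 900, 1100, 1950, 2001};
        // 1-second batches: two events land in batch 0, two in batch 1000,
        // one in batch 2000.
        System.out.println(batch(events, 1000)); // {0=2, 1000=2, 2000=1}
    }
}
```

The key consequence for latency: an event arriving just after a batch opens waits nearly a full batch duration before processing even begins.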
Benchmark
• We would like to compare the platforms, but which benchmark?
– How to compare the relative effectiveness of these systems?
• Throughput (events per second)
• End-to-end latency (How long for an event to get through the system)
• Completeness (Is the computation correct?)
– Current benchmarks did not test with workloads similar to a real-world use
case
• Speed-of-light tests only reveal so much information
• So we created a new benchmark (on GitHub)
– A simple advertisement counting application
– Mimic some common ETL operations on data streams
Our Streaming benchmark
• Goal is to correlate latency with throughput.
• Simulation of an advertisement analytics pipeline.
• Must be implemented and run in all three engines.
• Initial data:
– Some number of advertising campaigns.
– Some number of ads per campaign.
• Initial data stored in Redis.
• Our producers read the initial data, and start generating various events. (view, click, purchase)
• Events are then sent to a Kafka cluster.
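A sketch of such a producer in plain Java, with assumed field names (the real benchmark emits JSON events into Kafka; here they just go into a list):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative benchmark event producer: pick a random campaign, ad, and
// event type, and emit a JSON-ish event string. Field names are assumed
// for illustration, not the benchmark's exact schema.
public class EventProducer {
    static final String[] EVENT_TYPES = {"view", "click", "purchase"};

    public static List<String> generate(int count, int campaigns,
                                        int adsPerCampaign, long seed) {
        Random rnd = new Random(seed);
        List<String> events = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int campaign = rnd.nextInt(campaigns);          // e.g. 100 campaigns
            int ad = rnd.nextInt(adsPerCampaign);           // e.g. 10 ads each
            String type = EVENT_TYPES[rnd.nextInt(EVENT_TYPES.length)];
            events.add(String.format(
                "{\"campaign\":%d,\"ad\":%d,\"type\":\"%s\"}", campaign, ad, type));
        }
        return events;
    }

    public static void main(String[] args) {
        generate(3, 100, 10, 42L).forEach(System.out::println);
    }
}
```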
Flow of an event
[Diagram: events flow from the benchmark event producer into Kafka, through the streaming engine, and into Redis]
Measuring Latency
• Windows are periodically stored into Redis, along with a timestamp of when the window
was written into Redis.
• The application is given an SLA (Service-Level Agreement) as part of the simulation,
demanding that tuples be processed in under 1 second.
• The period of writes was chosen to meet the SLA. Writes to Redis were performed once
per second. Spark is an exception: it wrote windows out once per batch.
Measuring Latency
• Ten second window
• First event generated near the start of the window
• 10 seconds of events – tens of thousands of events
per second
• Last event generated near the end of the window
• At some point later, the window is written into Redis.
• We know the time of the end of the window,
and the time the window was written.
• This time gives us a data point of latency – length of
time between event generation and being written in
database.
• Events processed late will cause their windows to be
written at a later time, and will be reflected in the
data.
[Diagram: a 10 s window from its first event to its last; the latency data point is the time from the end of the window until the window data is written into Redis – ideally less than the SLA]
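The latency data point described above is simple arithmetic; a minimal sketch (method names are assumed for illustration):

```java
// A latency data point: the time between the end of the event-time window
// and the moment the window's aggregate was written into the database.
public class WindowLatency {
    public static long latencyMs(long windowEndMs, long writeTimeMs) {
        return writeTimeMs - windowEndMs;
    }

    public static boolean meetsSla(long latencyMs, long slaMs) {
        return latencyMs <= slaMs;
    }

    public static void main(String[] args) {
        // A 10 s window ending at t = 10 000 ms, written at t = 10 850 ms.
        long latency = latencyMs(10_000, 10_850);
        System.out.println(latency);                 // 850
        System.out.println(meetsSla(latency, 1_000)); // true – under the 1 s SLA
    }
}
```

Events processed late push the window's write time out, so the same formula automatically reflects stragglers in the data.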
Our methodology
• Generate a particular throughput of events, then measure the latency.
– Throughputs measured varied between 50,000 events/s and 170,000 events/s
• 100 advertising campaigns
• 10 ads per campaign
• SLA set at 1 second
• 10 second windows
• 5 Kafka nodes with 5 topic partitions
• 1 Redis node
• 3 ZooKeeper nodes (cluster-coordination software)
• 10 worker nodes (doing computation)
• A handful of nodes were used by the systems as masters and other non-compute servers.
Our methodology
1. Totally clear Kafka of data
2. Populate Redis with initial data
3. Launch the advertising analytics application on Spark, Flink, or Storm
4. Wait a bit for all workers to finish launching
5. Start up producers with instructions to produce tuples at a given rate – this rate determines the throughput.
– Ex: 5 producers writing 10,000 events per second generates a throughput of 50,000 events/s.
6. Let the system run for 30 minutes after starting the producers, then shut the producers down.
7. Run data gathering tool on the Redis database to generate latency points from the windows.
Hardware Setup
• Homogeneous nodes, each with two Intel E5530 processors @ 2.4 GHz – 16 hardware threads
(hyperthreading) per node
• 24GiB of memory
• Machines on the same rack
• Gigabit Ethernet switch
• The cluster has 40 nodes, 20–25 of which were used in the benchmark
• Multiple instances of Kafka producers to create load
– individual producers fall behind at around 17,000 events per second
• The use of 10 workers for a topology is near the average number we see being used by
topologies internal to Yahoo
– The Storm clusters are larger, but multi-tenant & run many topologies
About the implementations
• Apache Flink
– Tested 0.10.1-SNAPSHOT (commit hash 7364ce1).
– Application written in Java using the DataStream API.
– Checkpointing – a feature that guarantees at-least-once processing – was disabled.
• Apache Spark
– Tested version 1.5
– Application written in Scala using the DStreams API.
– At-least-once processing not implemented.
• Apache Storm
– Tested both versions 0.10 and 0.11-SNAPSHOT (commit hash a8d253a).
– Application written using the Java API.
– Acking provides at-least-once processing – turned off for high throughputs in 0.11-SNAPSHOT
Flink
• Most tuples finished
within 1 second SLA.
• Sharp curve indicates
there was a very small
number of straggling
tuples that were written
into Redis late.
• Red dots mark the 1st, 10th,
25th, 50th, 75th, 90th, 99th,
and 100th percentiles.
Flink
Late Tuples
• Of late tuples, most were
written within a few
milliseconds of the SLA’s
deadline.
• This emphasizes only a
very small number were
significantly late.
• Beyond about 170,000
events/s, Flink was unable
to handle the
throughput, and tuples
backed up.
Spark Streaming
• Benchmark written in Scala, using DStreams (a.k.a streaming RDDs) and direct
Kafka Consumer
• Micro-batching
– different from the pure streaming nature of Storm and Flink
– To meet the 1 s SLA, the batch duration was set to 1 second
• Forced to increase the batch duration for larger throughputs
• Transformations (e.g. maps and filters) applied on the DStreams
• Joining data with Redis is a special case
– We should not create a separate connection to Redis for each record → use a mapPartitions
operation that can give control of a whole RDD partition to our code
• create one connection to Redis and use this single connection to query information from Redis for
all the events in that RDD partition.
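The pattern can be modeled in plain Java (illustrative stand-ins, not Spark's mapPartitions signature, and a dummy stand-in for a Redis client): one connection serves every record in the partition.

```java
import java.util.Arrays;
import java.util.List;

// One-connection-per-partition pattern: instead of opening a database
// connection per record, open a single connection for the whole partition
// and reuse it for every lookup.
public class PartitionJoin {
    // Stand-in for a Redis client connection (hypothetical).
    static class Connection {
        String lookup(String adId) { return "campaign-for-" + adId; }
    }

    // Processes one partition; returns how many connections were opened.
    public static int joinPartition(List<String> partition) {
        int opened = 0;
        Connection conn = new Connection();   // opened once per partition
        opened++;
        for (String adId : partition) {
            conn.lookup(adId);                // reused for every record
        }
        return opened;
    }

    public static void main(String[] args) {
        int opened = joinPartition(Arrays.asList("ad1", "ad2", "ad3"));
        System.out.println(opened); // 1 – one connection for three records
    }
}
```

A per-record variant would pay connection setup cost on every event, which at tens of thousands of events per second dominates the actual lookup work.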
Spark 2-dimensional Parameter Adjustment
• Micro-batch duration
– This is a control dimension that is not present in a pure streaming system like Storm
– Increasing the duration increases latency while reducing overhead and therefore increasing
maximum throughput
– Finding the optimal batch duration that minimizes latency while allowing Spark to handle the
throughput is a time-consuming process
• Set a batch duration, run the benchmark for 30 minutes, check the results → decrease/increase the
duration
• Parallelism
– Increasing parallelism is easier said than done in Spark
– In a true streaming system like Storm, one bolt instance can send its results to any number of
subsequent bolt instances
– In a micro-batch system like Spark, one must perform a reshuffle operation
• similar to how intermediate data in a Hadoop MapReduce program are shuffled and merged across the
cluster.
• But the reshuffling itself introduces considerable overhead.
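The manual tuning loop could be sketched as follows, under an assumed and much-simplified cost model (fixed per-batch scheduling overhead plus a per-event cost – these numbers are illustrative, not measurements from the benchmark):

```java
// Hypothetical batch-duration search: double the duration until the
// (simulated) processing time of a batch fits inside the batch duration,
// i.e. the system keeps up instead of falling behind.
public class BatchTuner {
    // Assumed cost model: fixed scheduling/launch overhead per batch plus
    // a per-event processing cost.
    public static long processingTimeMs(long batchMs, long eventsPerSec,
                                        long overheadMs, double perEventMs) {
        long eventsPerBatch = eventsPerSec * batchMs / 1000;
        return overheadMs + (long) (eventsPerBatch * perEventMs);
    }

    public static long smallestStableBatchMs(long eventsPerSec,
                                             long overheadMs, double perEventMs) {
        for (long batchMs = 250; batchMs <= 16_000; batchMs *= 2) {
            if (processingTimeMs(batchMs, eventsPerSec, overheadMs, perEventMs) <= batchMs) {
                return batchMs;   // keeps up at this duration
            }
        }
        return -1;                // falls behind at every tried duration
    }

    public static void main(String[] args) {
        // 100k events/s, 300 ms overhead per batch, 0.004 ms per event.
        System.out.println(smallestStableBatchMs(100_000, 300, 0.004)); // 500
    }
}
```

In the actual study each probe of this loop cost a 30-minute benchmark run, which is why the tuning was one of the largest problems faced.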
Spark
• Spark had more
interesting results than
Flink.
• Due to the micro-batch
design, it was unable to
process events at low
latencies
• The overhead of
scheduling and
launching a task per
batch is very high
• Batch size had to be
increased – this
overcame the launch
overhead.
Spark
• If we reduce the batch
duration sufficiently, we
get into a region where
the incoming events are
processed within 3 or 4
subsequent batches.
• The system is on the verge
of falling behind, but is
still manageable, and
results in better latency.
Spark
Falling behind
• Without increasing the
batch size, Spark was
unable to keep up with
the throughput, tuples
backed up, and latencies
continuously increased
until the job was shut
down.
• After increasing the
batch size, Spark handled
larger throughputs than
either Storm or Flink.
Spark
• Tuning the batch size was time-consuming, since it had to be done manually – this was one of the largest
problems we faced in testing Spark’s Streaming capabilities.
• If the batch size was set too high, latency numbers would be bad. If it was set too low, Spark would fall behind,
tuples would back up, and latency numbers would be worse.
• Spark had a new feature at the time called ‘backpressure’ which was supposed to help address this, but we were
unable to make it work properly. In fact, enabling backpressure hindered our numbers in all cases.
Storm Results
• Benchmark uses Java API, One worker process per host, each worker has 16 tasks to run in 16
executors - one for each core.
• In 0.11.0, Storm added a simple back-pressure controller to avoid the overhead of acking
– In 0.10.0 benchmark topology, acking was used for flow control but not for processing guarantees.
• With acking disabled, Storm even beat Flink for latency at high throughput.
– But no tuple failure handling
[Charts: latency vs. throughput for Storm 0.10.0 and Storm 0.11.0]
Storm
• Storm behaved very
similarly to Flink.
• However, Storm was
unable to handle more
than 130,000 events/s
with its acking system
enabled.
• Acking keeps track of
successfully processed
events within Storm.
• With acking disabled,
Storm achieved numbers
similar to Flink at
throughputs up to
170,000 events/s.
Storm
Late Tuples
• Similar to Flink’s late
tuple graph.
• Tuples that were late
were slightly less late
than Flink’s.
Three-way Comparison
• Flink and Storm have
similar linear
performance profiles
– These two systems
process an incoming
event as it becomes
available
• Spark Streaming has
much higher latency,
but is expected to
handle higher
throughputs
– System behaves in a
stepwise function, a
direct result from its
micro-batching
nature
[Chart: latency vs. throughput for Flink, Spark, and Storm]
• Comparisons of 99th-
percentile latencies are
revealing.
• Storm 0.11 had consistently
lower latency than Flink
and Spark.
• Flink’s latency was comparable
to Storm 0.10, but Flink
handled higher
throughput with at-least-
once guarantees.
• Spark had the highest
latency, but was able to
handle higher throughput
than either Storm or Flink.
Future work
• Many variables involved – many we didn’t adjust.
• Applications were not optimized – all were written in a fairly plain manner and configuration
settings were not tweaked
• SLA deadline of 1 second is very low. We did this to test the limits of the low-latency streaming
systems. Higher SLA deadlines are reasonable, and testing those would be worthwhile – likely
showing Spark being highly competitive with the others.
• The throughputs we tested at were incredibly high.
– 170,000 events/s comes to 14,688,000,000 events per day – about 1.5 × 10^10 events per day
• Didn’t test with exactly-once semantics.
• Ran small tests and checked for correctness of computations, but didn’t check correctness at
large scale.
• There are many more tests that can be run.
• Other streaming engines can be added.
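The daily-rate arithmetic above is easy to check (plain Java):

```java
// Events per day at a sustained per-second throughput.
public class DailyRate {
    public static long perDay(long eventsPerSecond) {
        return eventsPerSecond * 86_400;   // seconds in a day
    }

    public static void main(String[] args) {
        System.out.println(perDay(170_000)); // 14688000000
    }
}
```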
Conclusions
• The competition between near real time streaming systems is
heating up, and there is no clear winner at this point
• Each of the platforms studied here has its advantages and
disadvantages
• Other important factors:
– Security or integration with tools and libraries
• Active communities for these and other big data processing
projects continue to innovate and benefit from each other’s
advancements
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 

Mehr von DataWorks Summit/Hadoop Summit (20)

Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 

Kürzlich hochgeladen

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 

Kürzlich hochgeladen (20)

Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 

Performance Comparison of Streaming Big Data Platforms

Our Streaming benchmark
• Goal is to correlate latency with throughput.
• Simulation of an advertisement analytics pipeline.
• Must be implemented and run in all three engines.
• Initial data:
– Some number of advertising campaigns.
– Some number of ads per campaign.
• Initial data stored in Redis.
• Our producers read the initial data and start generating various events (view, click, purchase).
• Events are then sent to a Kafka cluster.
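The producer side described above can be sketched roughly as follows. This is a minimal illustration, not the actual benchmark from GitHub: the campaign and ad IDs are random UUIDs, the event field names (ad_id, ad_type, event_time) follow the ones named later in the talk, and the Redis seeding and Kafka send are left out.

```python
import json
import random
import time
import uuid

AD_TYPES = ["view", "click", "purchase"]

def make_initial_data(num_campaigns=100, ads_per_campaign=10):
    # Campaign -> list of ad ids; the benchmark seeds this mapping into Redis.
    return {
        str(uuid.uuid4()): [str(uuid.uuid4()) for _ in range(ads_per_campaign)]
        for _ in range(num_campaigns)
    }

def generate_event(campaigns):
    # One JSON ad event, as a producer would push it into Kafka.
    campaign_id = random.choice(list(campaigns))
    return json.dumps({
        "ad_id": random.choice(campaigns[campaign_id]),
        "ad_type": random.choice(AD_TYPES),
        "event_time": int(time.time() * 1000),  # epoch milliseconds
    })
```

A real producer would call `generate_event` in a rate-limited loop and hand each string to a Kafka producer client.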
Flow of an event
Measuring Latency
• Windows are periodically stored into Redis along with a timestamp of when the window was written.
• The application is given an SLA (Service-Level Agreement) as part of the simulation, demanding that tuples be processed in under 1 second.
• The period of writes was chosen to meet the SLA: writes to Redis were performed once per second.
– Spark is an exception: it wrote windows out once per batch.
Measuring Latency
• Ten-second window.
• First event generated at the start of the window; 10 seconds of events follow, at tens of thousands of events per second.
• Last event generated near the end of the window.
• At some point later, the window is written into Redis.
• We know the time of the end of the window and the time the window was written.
• This gives us a latency data point: the length of time between event generation and being written into the database (ideally less than the SLA).
• Events processed late will cause their windows to be written at a later time, and will be reflected in the data.
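The latency computation described above amounts to a simple subtraction per window. A minimal sketch; the record field names (`window_start_ms`, `last_updated_ms`) are hypothetical stand-ins for whatever the data-gathering tool actually reads out of Redis:

```python
def window_latencies(windows, window_duration_ms=10_000):
    # Latency data point = time the window was last written into Redis
    # minus the time the window ended (ideally under the 1-second SLA).
    return [
        w["last_updated_ms"] - (w["window_start_ms"] + window_duration_ms)
        for w in windows
    ]
```

A window whose final write lands 400 ms after the window closed yields a 400 ms data point; a late event pushes the final write, and therefore the data point, further out.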
Our methodology
• Generate a particular throughput of events, then measure the latency.
– Throughputs measured varied between 50,000 events/s and 170,000 events/s.
• 100 advertising campaigns
• 10 ads per campaign
• SLA set at 1 second
• 10-second windows
• 5 Kafka nodes with 5 topic partitions
• 1 Redis node
• 3 ZooKeeper nodes (cluster-coordination software)
• 10 worker nodes (doing computation)
• A handful of nodes used by the systems as masters and other non-compute servers.
Our methodology
1. Totally clear Kafka of data.
2. Populate Redis with initial data.
3. Launch the advertising analytics application on Spark, Flink, or Storm.
4. Wait a bit for all workers to finish launching.
5. Start up producers with instructions to produce tuples at a given rate; this rate determines the throughput.
– Ex: 5 producers each writing 10,000 events per second generates a throughput of 50,000 events/s.
6. Let the system run for 30 minutes after starting the producers, then shut the producers down.
7. Run the data-gathering tool on the Redis database to generate latency points from the windows.
Hardware Setup
• Homogeneous nodes, each with two Intel E5530 CPUs @ 2.4 GHz; 16 hyperthreaded cores per node.
• 24 GiB of memory.
• Machines on the same rack, connected by a Gigabit Ethernet switch.
• The cluster has 40 nodes; 20-25 were used in the benchmark.
• Multiple instances of Kafka producers were needed to create load, since individual producers fall behind at around 17,000 events per second.
• The use of 10 workers for a topology is near the average number we see being used by topologies internal to Yahoo.
– The Storm clusters are larger, but multi-tenant and run many topologies.
About the implementations
• Apache Flink
– Tested 0.10.1-SNAPSHOT (commit hash 7364ce1).
– Application written in Java using the DataStream API.
– Checkpointing (a feature that guarantees at-least-once processing) was disabled.
• Apache Spark
– Tested version 1.5.
– Application written in Scala using the DStreams API.
– At-least-once processing not implemented.
• Apache Storm
– Tested both versions 0.10 and 0.11-SNAPSHOT (commit hash a8d253a).
– Application written using the Java API.
– Acking provides at-least-once processing; turned off for high throughputs in 0.11-SNAPSHOT.
Flink
• Most tuples finished within the 1-second SLA.
• The sharp curve indicates there was a very small number of straggling tuples that were written into Redis late.
• Red dots mark the 1st, 10th, 25th, 50th, 75th, 90th, 99th, and 100th percentiles.
Flink Late Tuples
• Of the late tuples, most were written within a few milliseconds of the SLA's deadline.
• This emphasizes that only a very small number were significantly late.
• Beyond about 170,000 events/s, Flink was unable to handle the throughput, and tuples backed up.
Spark Streaming
• Benchmark written in Scala, using DStreams (a.k.a. streaming RDDs) and the direct Kafka consumer.
• Micro-batching is different from the pure streaming nature of Storm and Flink.
– To meet the 1-second SLA, the batch duration was set to 1 second.
– We were forced to increase the batch duration for larger throughputs.
• Transformations (e.g. maps and filters) are applied on the DStreams.
• Joining data with Redis is a special case.
– We should not create a separate connection to Redis for each record, so we use a mapPartitions operation that gives control of a whole RDD partition to our code.
– We create one connection to Redis and use this single connection to query information from Redis for all the events in that RDD partition.
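The one-connection-per-partition pattern can be sketched outside Spark as a plain generator over a partition iterator. This is an illustration only: the `connect` factory and the fake Redis client in the usage below are stand-ins, and in the actual Scala application this body would run inside a mapPartitions call.

```python
def join_campaigns(partition, connect):
    # Open ONE connection for the whole partition, not one per record.
    redis = connect()
    for ad_id, event_time in partition:
        # Look up the campaign this ad belongs to and join it into the tuple.
        yield (redis.get(ad_id), ad_id, event_time)
```

Opening the connection once amortizes the connection cost over every record in the partition, which is exactly the overhead mapPartitions lets you avoid compared with a per-record map.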
Spark 2-dimensional Parameter Adjustment
• Micro-batch duration
– This is a control dimension that is not present in a pure streaming system like Storm.
– Increasing the duration increases latency while reducing overhead, thereby increasing maximum throughput.
– Finding the optimal batch duration that minimizes latency while allowing Spark to handle the throughput is a time-consuming process: set a batch duration, run the benchmark for 30 minutes, check the results, then decrease or increase the duration.
• Parallelism
– Increasing parallelism is easier said than done in Spark.
– In a true streaming system like Storm, one bolt instance can send its results to any number of subsequent bolt instances.
– In a micro-batch system like Spark, we must perform a reshuffle operation, similar to how intermediate data in a Hadoop MapReduce program are shuffled and merged across the cluster.
– But the reshuffling itself introduces considerable overhead.
Spark
• Spark had more interesting results than Flink.
• Due to the micro-batch design, it was unable to process events at low latencies.
• The overhead of scheduling and launching a task per batch is very high.
• The batch size had to be increased to overcome the launch overhead.
Spark
• If we reduce the batch duration sufficiently, we get into a region where incoming events are processed within 3 or 4 subsequent batches.
• The system is on the verge of falling behind, but is still manageable, and this results in better latency.
Spark Falling Behind
• Without increasing the batch size, Spark was unable to keep up with the throughput: tuples backed up, and latencies continuously increased until the job was shut down.
• After increasing the batch size, Spark handled larger throughputs than either Storm or Flink.
Spark
• Tuning the batch size was time-consuming, since it had to be done manually. This was one of the largest problems we faced in testing Spark's streaming capabilities.
• If the batch size was set too high, latency numbers would be bad. If it was set too low, Spark would fall behind, tuples would back up, and latency numbers would be worse.
• Spark had a new feature at the time called 'backpressure' which was supposed to help address this, but we were unable to make it work properly. In fact, enabling backpressure hindered our numbers in all cases.
Storm Results
• Benchmark uses the Java API, with one worker process per host; each worker has 16 tasks to run in 16 executors, one per core.
• In 0.11.0, Storm added a simple back pressure controller, which makes it possible to avoid the overhead of acking.
– In the 0.10.0 benchmark topology, acking was used for flow control but not for processing guarantees.
• With acking disabled, Storm even beat Flink for latency at high throughput.
– But there is then no tuple failure handling.
Storm
• Storm behaved very similarly to Flink.
• However, Storm was unable to handle more than 130,000 events/s with its acking system enabled.
• Acking keeps track of successfully processed events within Storm.
• With acking disabled, Storm achieved numbers similar to Flink at throughputs up to 170,000 events/s.
Storm Late Tuples
• Similar to Flink's late-tuple graph.
• Tuples that were late were slightly less late than Flink's.
Three-way Comparison
• Flink and Storm have similar linear performance profiles.
– These two systems process an incoming event as it becomes available.
• Spark Streaming has much higher latency, but is expected to handle higher throughputs.
– Its behavior is a stepwise function, a direct result of its micro-batching nature.
Flink vs. Spark vs. Storm
• Comparisons of 99th-percentile latencies are revealing.
• Storm 0.11 had consistently lower latency than Flink and Spark.
• Flink's latency was comparable to Storm 0.10, but Flink handled higher throughput with at-least-once guarantees.
• Spark had the highest latency, but was able to handle higher throughput than either Storm or Flink.
Future work
• Many variables are involved, many of which we didn't adjust.
• Applications were not optimized: all were written in a fairly plain manner, and configuration settings were not tweaked.
• The SLA deadline of 1 second is very low. We did this to test the limits of the low-latency streaming systems. Higher SLA deadlines are reasonable, and testing those would be worthwhile, likely showing Spark being highly competitive with the others.
• The throughputs we tested at were incredibly high.
– 170,000 events/s comes to 14,688,000,000 events per day, i.e. over 1.4 × 10^10 events per day.
• We didn't test with exactly-once semantics.
• We ran small tests and checked for correctness of computations, but didn't check correctness at large scale.
• There are many more tests that can be run.
• Other streaming engines can be added.
Conclusions
• The competition between near-real-time streaming systems is heating up, and there is no clear winner at this point.
• Each of the platforms studied here has its advantages and disadvantages.
• Other important factors: security, and integration with tools and libraries.
• Active communities for these and other big data processing projects continue to innovate and benefit from each other's advancements.

Editor's Notes

  1. Streaming computation engines – what are they? They are systems designed to process a continuous stream of data. They are designed to have very low latency. What this means is that – ideally – data gets processed as soon as it reaches the system; it doesn’t buffer up. This is in contrast to something like Hadoop’s MapReduce, where incoming data goes into a file somewhere, and every couple of hours a job runs that processes it all in one big batch. These are so-called “big-data” systems. They’re designed to be distributed and handle massive quantities of data. We have three of them here that we’re going to look at today.
  2. The first one we’re going to look at is Apache Storm. Storm’s API gives users tools to create a directed graph, called a topology in Storm, through which data flows. Each node of this graph is a piece of user code that does some processing. Nodes are either spouts or bolts. Spouts are the entry point to the graph, and bolts perform the processing. The data moves through the system as individual tuples. It’s the job of the spout to take incoming data and turn it into tuples to pass on to the bolts. Storm’s graphs are not necessarily acyclic – which is interesting. Most use cases we’ve seen seem to involve acyclic data flows, but it is possible to have cycles.
  3. Flink! Flink has its DataStream API to perform operations on streams of data, operations like map, filter, reduce, join and so on. Instead of having the users construct a graph, users just describe what they want to happen to the data, and Flink builds a graph for them. The underlying graph works very similarly to Storm’s. So similar, in fact, that Flink actually built a Storm-compatible API, and they claim you can run unmodified Storm applications on Flink.
  4. Spark Streaming! Spark Streaming has the DStream API to perform operations on streams of data. It is based on Spark’s RDDs, or Resilient Distributed Datasets. The API is super similar to Flink’s. The underlying model, however, is very different from both Storm’s and Flink’s. Spark’s streaming capabilities are accomplished through something called micro-batching. Micro-batching is basically just running very small batch jobs in quick succession. So each one of these RDDs down here would be a tiny batch of data in a Spark streaming job.
  5. We used our benchmark to correlate latency and throughput in the systems. We simulated an advertisement analytics pipeline, which counts clicks in ad campaigns. The application needed to be implemented and run in all three engines. We started out with some initial data: some number of advertising campaigns, and some number of ads in each campaign. We made these numbers adjustable. The initial data we stored in a Redis instance. Some producer processes then read the initial data out of Redis and began generating various events for advertisements, like views, clicks, and purchases. These events were then sent into Kafka. Kafka is a distributed pub/sub system: events go into Kafka from publishers and go out of Kafka to subscribers.
  6. The application itself performs operations on each event, and they go like this: First: deserialize the JSON string and turn it into a native data structure. Second: Filter the events. We’re only counting clicks in this application, so we drop all events that don’t have an ad_type of “click”. Third: We take what’s called a projection of the events – that just means we drop all of the fields in the tuple that we aren’t interested in. We’re left with just ad_id and event_time. If you remember, earlier I highlighted three fields that were important. We’re down to two important fields now because we already used ad_type and we’re done with it. All of our events have the same ad_type now, so we can drop it. Fourth: Pull the campaign_id associated with the ad_id out of Redis. This is part of the initial data that we put into Redis. Join this field into the tuple. Fifth: Take a windowed count of events per campaign – so we keep track of how many clicks each campaign has gotten in each time window. Last: Periodically write these windows into Redis – this will be the data we use to calculate latencies. The system needs to be able to take late events into account – this is just a constraint we put on the application, since it’s one we see often in the real world.
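Condensed into plain Python, the steps above look roughly like this. It is a sketch, not the benchmark's actual code (which is Java/Scala per engine, and whose step four is a real Redis lookup); the function names and the in-memory `ad_to_campaign` dict are assumptions for illustration.

```python
import json

def process_event(raw, ad_to_campaign, window_size_ms=10_000):
    """Steps 1-5 for a single event; returns a (campaign_id, window) key,
    or None if the event is filtered out."""
    event = json.loads(raw)                        # 1. deserialize the JSON string
    if event["ad_type"] != "click":                # 2. filter: keep only clicks
        return None
    ad_id = event["ad_id"]                         # 3. projection: keep ad_id
    event_time = int(event["event_time"])          #    and event_time only
    campaign_id = ad_to_campaign[ad_id]            # 4. join (a Redis lookup in the real app)
    window = event_time - event_time % window_size_ms  # 5. assign to a time window
    return (campaign_id, window)

def count_clicks(raw_events, ad_to_campaign):
    """Step 5 continued: windowed count of clicks per campaign.
    (Step 6, periodically flushing these counts to Redis, is omitted.)"""
    counts = {}
    for raw in raw_events:
        key = process_event(raw, ad_to_campaign)
        if key is not None:
            counts[key] = counts.get(key, 0) + 1
    return counts
```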
  7. As I mentioned, the windows are periodically written into Redis along with a timestamp of when the window was written into Redis. This last part is important. Each window has a timestamp like this, and it represents when that window was last written into Redis. The application is given an SLA or Service-Level Agreement as part of the simulation, which says that tuples must be processed completely end-to-end in under 1 second. This is just another constraint that we put on our application as part of simulating a real-world use case. The 1-second SLA is basically just a target end-to-end latency; it’s what the systems are trying to achieve. To this end, we had the applications write their windows out once per second. Spark is the exception here. Its computation model doesn’t allow us to write windows out once per second. Instead, we write the windows out once per batch.
  8. Now we actually get to look at how we acquire the latency data. For our experiment we ran with 10-second windows. - In every window, the first event is generated basically right when the window begins. - After that, it’s 10 seconds of events – tens of thousands of events per second. - The last event is generated very near the end of the window – within microseconds before it. The last event goes off to be processed… - Some time later, the window is written into Redis by the application. - Now, we know the time of the end of the window – when the last event was generated – and we know the time when the window was written to Redis. - This gives us a latency data point: the amount of time that passed between the last event’s generation and when the window was written into Redis. This is the end-to-end latency of the application. - You can see how events that are processed late will cause their windows to be written at a later time, which will be reflected as higher end-to-end latency in the data. So that’s how we measure latency.
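The latency computation itself is just a subtraction over the two timestamps described above; a sketch (the function name and millisecond units are our assumptions):

```python
def window_latency_ms(window_start_ms, window_size_ms, last_write_ms):
    """End-to-end latency data point for one window: the time between the end
    of the window (where the last event was generated) and the moment the
    window was last written into Redis. Late-processed events push the Redis
    write later, so they show up here as higher latency."""
    window_end_ms = window_start_ms + window_size_ms
    return last_write_ms - window_end_ms
```

For example, a 10-second window starting at t=12,340,000 ms whose counts were last flushed at t=12,350,750 ms yields a 750 ms latency sample.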
  9. Our methodology for testing was pretty simple. We have our producers generate a certain event throughput, and then we measure the latency of tuples going through the system. Throughputs measured varied between 50,000 events per second and 170,000 events per second. We had…
  10. Steps were:
  11. Now we’re going to look at the benchmark results from each system. - First is Flink: The version we tested was a 0.10.1-SNAPSHOT. We wrote the application using the Java DataStream API. Checkpointing was disabled – so there were no processing guarantees. - Spark: The version we tested was 1.5. We wrote this one in Scala using the DStream API. In addition, we did not implement at-least-once semantics. - Storm: For Storm, we tested both version 0.10 and a 0.11-SNAPSHOT. The application was written using the Java TopologyBuilder API. Storm’s acking provides at-least-once processing and flow control, but a new feature in 0.11 allowed us to turn that off for high throughputs.
  12. Some things we noticed about Flink: Most of the tuples were processed within the 1-second SLA we specified. The graph here shows percentiles – so the red dots in the middle there are the 50th-percentile mark: 50% of the tuples completed within about 0.75 seconds. The sharp curve at the end is interesting – it shows that a small number were quite late.
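As a reading aid for these percentile graphs, here is one common way to pull percentile marks out of a list of latency samples. This uses the nearest-rank convention; the talk doesn't specify which convention its plots use, and the function name is ours.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value such that at
    least p% of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```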
  13. Here is a graph of the latency for late tuples in Flink. Late tuples are the ones that finished processing after the 1 second SLA. This graph emphasizes that most tuples were on time or very nearly on time. Only a small percentage were late by any significant amount.
  14. Initially, we thought our operations were CPU-bound, and so the benefits of reshuffling to a higher number of partitions would outweigh the cost of reshuffling. Instead, we found the bottleneck to be scheduling, and so reshuffling only added overhead. We suspect that at higher throughput rates or with operations that are CPU-bound, the reverse would be true.
  15. Spark was more difficult to get results out of, but the results were more interesting. The micro-batching prevented Spark from being able to meet the 1-second SLA for anything but very low throughputs. This was due to the large overhead of scheduling and launching a task for each micro-batch. Once we increased the batch size, Spark was able to keep up with various throughputs. This graph shows a Spark streaming job that’s keeping up with the throughput.
  17. If we didn’t increase the batch size enough, Spark wasn’t able to keep up with the throughput; tuples got backed up and buffered in Kafka, and the latency figures increased until the job was killed. This is a graph of a Spark streaming job that’s falling behind in its processing duties, where latencies have grown to almost 70 seconds. However, after increasing the batch size enough, Spark was able to handle more throughput than either Storm or Flink.
  18. So… tuning the batch size was very time-consuming and frustrating. It was a manual trial-and-error process and was a big obstacle while we were testing Spark. If the batch size was set too high, latency would be high; if it was set too low, Spark wouldn’t keep up with the throughput, tuples would back up, and latency would be even higher. We were trying to get fair numbers out of Spark, so we didn’t just want to turn the batch size way up; we wanted to find the lowest latency we could get for a particular throughput. When we benchmarked Spark, there was a new feature called “backpressure” which was supposed to help address this difficulty. We tried it, but unfortunately we were unable to get it to improve our latency or prevent Spark from falling behind. Instead, Spark’s backpressure actually made our numbers worse whenever we enabled it.
  19. Storm – Storm had results very similar to Flink’s. The graphs look almost identical. The problem we found with Storm was that beyond 130,000 events per second, Storm couldn’t keep up with the throughput; tuples backed up, and latencies grew, just like in Spark. This was caused by the acking system, which keeps track of successfully-processed events within Storm and performs flow control. A new feature in 0.11 allowed us to disable acking, and with it Storm got numbers similar to Flink’s at throughputs up to 170,000 events per second.
  20. Storm’s late tuple graph is, again, almost identical to Flink’s. There aren’t really any surprises here.
  21. This is a graph comparing the 99-th percentile latencies of the various engines at different throughputs. We can see Storm 0.11 has consistently lower latency than Flink and Spark. Flink’s latency is comparable to Storm 0.10’s, but Flink was able to handle more throughput without falling over. Spark had the highest latency by far, but was able to handle higher throughput than either Storm or Flink.
  22. Future work! So, there are a lot of variables involved, and many of them we didn’t adjust. We didn’t optimize any of the applications. They were written plainly, and we didn’t really mess with the configs. The SLA is important. An SLA of 1 second is super low. We did this to try to test the low-latency limits of the low-latency systems. Many real SLAs are on the order of minutes, and it would be worth it to test with those SLAs. We expect that Spark would be more competitive in those time frames. The throughputs we tested were incredibly high. Our highest throughput of 170,000 events per second is equivalent to about 1.4 × 10^10 events per day. Most workloads are many orders of magnitude less than that. Writing a benchmark that performs heavier computation on a smaller throughput might better reflect real workloads. We didn’t test exactly-once semantics. This is an important feature, and something that can add a lot of overhead. Testing competing implementations could yield interesting results. Correctness. We ran some small tests for each of the systems to ensure they were processing data correctly, but we didn’t check correctness when running the benchmarks at full scale. The project is open-source, so you can go run your own tests; there are many, many more possible configurations. That also means you can add an implementation for your favorite streaming engine. There are a few other popular ones out there.
  23. How do we actually measure the latency? We start by having the producers write an integer timestamp representing the time of the event’s creation into the event. This becomes the field event_time. We next need to understand how the windowing scheme works. - The window an event belongs in is determined by truncating the event_time of incoming tuples. - (Example) - If these are timestamps representing seconds, what we have then are 10-second windows of events. So in our example window here, all events with timestamps in the range of 12340 – 12349 seconds will belong to the same window. - Window sizes can be adjusted by truncating more or fewer digits from the timestamps. If you cut off two digits, you end up with 100-second windows. If you don’t cut off any, you end up with 1-second windows.
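The digit-truncation scheme above can be sketched directly (the function name is ours; timestamps are assumed to be integer seconds as in the example):

```python
def window_for(event_time_s, digits_truncated=1):
    """Assign an event to a window by truncating the last `digits_truncated`
    decimal digits of its timestamp. With seconds timestamps, truncating one
    digit gives 10-second windows, two digits gives 100-second windows, and
    zero digits gives 1-second windows."""
    window_size = 10 ** digits_truncated
    return event_time_s // window_size * window_size
```

So events at 12340 through 12349 seconds all map to the same window, 12340.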