
Performance Comparison of Streaming Big Data Platforms


  1. 1. Performance Comparison of Streaming Big Data Platforms – Reza Farivar (Capital One Inc.), Kyle Knusbaum (Yahoo Inc.)
  2. 2. Streaming Computation engines • Designed to process a continuous stream of data. • Designed to process data with low latency – data (ideally) doesn’t buffer up before being processed. Contrasts with batch processing - MapReduce. • Designed to handle big data. The systems are distributed by design.
  3. 3. • Apache Storm has the TopologyBuilder API to create a directed graph (topology) through which streams of data flow. • “Spouts” are the entry point to the graph, and “bolts” perform the processing. • Data flows through the system as individual tuples. • Graphs are not necessarily acyclic (although that is often the case). (Diagram: Kafka → spout → bolts → database)
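A minimal sketch of wiring a topology with TopologyBuilder. The spout/bolt classes (EventSpout, FilterBolt, CountBolt), stream names, and parallelism hints are hypothetical placeholders rather than the benchmark's actual code, and 0.10-era Storm used the backtype.storm packages instead of org.apache.storm.

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class AdTopologySketch {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Spout: entry point of the graph (e.g. reading events from Kafka). Hypothetical class.
            builder.setSpout("events", new EventSpout(), 5);
            // Bolts: processing steps; tuples flow spout -> filter -> count. Hypothetical classes.
            builder.setBolt("filter", new FilterBolt(), 8)
                   .shuffleGrouping("events");
            builder.setBolt("count", new CountBolt(), 8)
                   .fieldsGrouping("filter", new Fields("campaign_id"));
            StormSubmitter.submitTopology("ad-analytics", new Config(), builder.createTopology());
        }
    }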
  4. 4. • Apache Flink has the DataStream API to perform operations on streams of data (map, filter, reduce, join, etc.). • These operations are turned into a graph at job submission time by Flink. • Underlying graph works similarly to Storm’s model. • Also supports a Storm-compatible API. (Diagram: Kafka → Flink operators → database)
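A minimal sketch of the DataStream API under hedged assumptions: a socket source stands in for the benchmark's Kafka consumer, and the JSON filter condition is illustrative. Flink derives its execution graph from these declarative calls when the job is submitted.

    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Socket source as a stand-in for the Kafka consumer used in the benchmark.
            DataStream<String> events = env.socketTextStream("localhost", 9999);
            events.filter(new FilterFunction<String>() {
                      @Override
                      public boolean filter(String json) {
                          // Keep click events only; the JSON field name is an assumption.
                          return json.contains("\"event_type\":\"click\"");
                      }
                  })
                  .print();   // stand-in for the Redis sink
            env.execute("ad-analytics-sketch");
        }
    }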
  5. 5. • Apache Spark has the DStream API to perform operations on streams of data (map, filter, reduce, join, etc.), based on Spark’s RDD (Resilient Distributed Dataset) abstraction. • Similar to Flink’s API. • Streaming accomplished through micro-batches. • A Spark Streaming job consists of one small batch after another. (Diagram: Kafka → Spark Streaming as a sequence of RDD micro-batches → database)
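The benchmark's Spark application was written in Scala; the Java sketch below only illustrates the micro-batch model, where each batch interval yields one RDD and the DStream is the sequence of those RDDs. The socket source and the 1-second batch duration are illustrative assumptions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class SparkStreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("ad-analytics-sketch").setMaster("local[2]");
            // The batch duration is the micro-batch size; 1 second is an illustrative choice.
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));
            JavaDStream<String> events = ssc.socketTextStream("localhost", 9999);
            events.filter(json -> json.contains("\"event_type\":\"click\""))  // keep clicks only
                  .count()                                                    // one count per micro-batch
                  .print();
            ssc.start();
            ssc.awaitTermination();
        }
    }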
  6. 6. Benchmark • We would like to compare the platforms, but which benchmark? – How to compare the relative effectiveness of these systems? • Throughput (events per second) • End-to-end latency (How long for an event to get through the system) • Completeness (Is the computation correct?) – Current benchmarks did not test with workloads similar to a real world use case • Speed of light tests only reveal so much information • So we created a new benchmark (on github) – A simple advertisement counting application – Mimic some common ETL operations on data streams
  7. 7. Our Streaming benchmark • Goal is to correlate latency with throughput. • Simulation of an advertisement analytics pipeline. • Must be implemented and run in all three engines. • Initial data: – Some number of advertising campaigns. – Some number of ads per campaign. • Initial data stored in Redis. • Our producers read the initial data, and start generating various events (view, click, purchase). • Events are then sent to a Kafka cluster. (Diagram: benchmark event producers → Kafka)
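A minimal sketch of such an event producer, assuming a hypothetical ad-events topic, illustrative JSON field names, and a crude sleep-based throttle; the benchmark's actual producers and schema may differ.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducerSketch {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            String[] types = {"view", "click", "purchase"};
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (long i = 0; ; i++) {
                    String event = String.format(
                        "{\"ad_id\":\"%d\",\"event_type\":\"%s\",\"event_time\":\"%d\"}",
                        i % 1000, types[(int) (i % 3)], System.currentTimeMillis());
                    producer.send(new ProducerRecord<>("ad-events", event));
                    if (i % 10 == 0) Thread.sleep(1);  // crude throttle, on the order of 10,000 events/s
                }
            }
        }
    }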
  8. 8. Flow of an event
  9. 9. Measuring Latency • Windows are periodically stored into Redis along with a timestamp of when each window was written. • Application given an SLA (Service-Level Agreement) as part of the simulation, demanding that tuples be processed in under 1 second. • The period of writes was chosen to meet the SLA: writes to Redis were performed once per second. Spark is the exception; it wrote windows out once per batch.
  10. 10. Measuring Latency • Ten-second window • First event generated at the start of the window • 10 seconds of events – 10’s of thousands of events per second • Last event generated near end of window • At some point later, the window is written into Redis. • We know the time of the end of the window, and the time the window was written. • The difference gives us a latency data point – the length of time between event generation and being written into the database. • Events processed late will cause their windows to be written at a later time, and will be reflected in the data. (Diagram: 10 s window – 1st event in window … last event in window → window data written into Redis; latency data point, ideally less than SLA)
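A small sketch of turning one window into one latency sample, under the assumption that the window boundaries and the Redis write time are available as epoch milliseconds; the parameter names are illustrative, not the benchmark's schema.

    public class LatencySample {
        /**
         * @param windowStartMs    start of the window (epoch ms)
         * @param windowDurationMs window length, e.g. 10_000 for the 10 s windows used here
         * @param lastUpdatedMs    time the window was last written into Redis (epoch ms)
         * @return time between the end of the window and the final write
         */
        static long latencyMs(long windowStartMs, long windowDurationMs, long lastUpdatedMs) {
            long windowEndMs = windowStartMs + windowDurationMs;
            return lastUpdatedMs - windowEndMs;
        }

        public static void main(String[] args) {
            // Example: a 10 s window whose final write landed 850 ms after the window closed.
            System.out.println(latencyMs(1_472_000_000_000L, 10_000L, 1_472_000_010_850L)); // 850
        }
    }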
  11. 11. Our methodology • Generate a particular throughput of events, then measure the latency. – Throughputs measured varied between 50,000 events/s and 170,000 events/s • 100 advertising campaigns • 10 ads per campaign • SLA set at 1 second • 10-second windows • 5 Kafka nodes with 5 topic partitions • 1 Redis node • 3 ZooKeeper nodes (cluster-coordination software) • 10 worker nodes (doing computation) • A handful of nodes used by the systems as masters and other non-compute servers.
  12. 12. Our methodology 1. Totally clear Kafka of data 2. Populate Redis with initial data 3. Launch the advertising analytics application on Spark, Flink, or Storm 4. Wait a bit for all workers to finish launching 5. Start up producers with instructions to produce tuples at a given rate – this rate determines the throughput. – Ex: 5 producers writing 10,000 events per second generates a throughput of 50,000 events/s. 6. Let the system run for 30 minutes after starting the producers, then shut the producers down. 7. Run data gathering tool on the Redis database to generate latency points from the windows.
  13. 13. Hardware Setup • Homogeneous nodes, each with two Intel E5530 @2.4GHz, 16 hyperthreading cores per node • 24GiB of memory • Machines on the same rack • Gigabit Ethernet switch • The cluster has 40 nodes, 20-25 used in benchmark • Multiple instances of Kafka producers to create load – individual producers fall behind at around 17,000 events per second • The use of 10 workers for a topology is near the average number we see being used by topologies internal to Yahoo – The Storm clusters are larger, but multi-tenant & run many topologies
  14. 14. About the implementations • Apache Flink – Tested 0.10.1-SNAPSHOT (commit hash 7364ce1). – Application written in Java using the DataStream API. – Checkpointing – a feature that guarantees at-least-once processing – was disabled. • Apache Spark – Tested version 1.5 – Application written in Scala using the DStreams API. – At-least-once processing not implemented. • Apache Storm – Tested both versions 0.10 and 0.11-SNAPSHOT (commit hash a8d253a). – Application written using the Java API. – Acking provides at-least-once processing – turned off for high throughputs in 0.11-SNAPSHOT
  15. 15. Flink • Most tuples finished within the 1-second SLA. • Sharp curve indicates there was a very small number of straggling tuples that were written into Redis late. • Red dots mark the 1st, 10th, 25th, 50th, 75th, 90th, 99th, and 100th percentiles.
  16. 16. Flink Late Tuples • Of late tuples, most were written within a few milliseconds of the SLA’s deadline. • This emphasizes only a very small number were significantly late. • Beyond about 170,000 tuples, Flink was unable to handle the throughput, and tuples backed up.
  17. 17. Spark Streaming • Benchmark written in Scala, using DStreams (a.k.a. streaming RDDs) and the direct Kafka consumer • Micro-batching – different from the pure streaming nature of Storm and Flink – To meet the 1 sec SLA, the batch duration was set to 1 second • Forced to increase the batch duration for larger throughputs • Transformations (e.g. maps and filters) applied on the DStreams • Joining data with Redis is a special case – Should not create a separate connection to Redis for each record → use a mapPartitions operation that can give control of a whole RDD partition to our code • Create one connection to Redis and use this single connection to query information from Redis for all the events in that RDD partition.
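A minimal Java sketch of that mapPartitions pattern (the benchmark's implementation is Scala): one Jedis connection is opened per RDD partition and reused for every lookup in that partition. The Event POJO, Redis host, and ad_id → campaign_id key layout are assumptions, and the lambda follows current Spark releases, where FlatMapFunction returns an Iterator (Spark 1.5's returned an Iterable).

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import redis.clients.jedis.Jedis;
    import scala.Tuple2;

    public class RedisJoinSketch {
        public static class Event implements java.io.Serializable {
            public String adId;
            public long eventTime;
        }

        static JavaRDD<Tuple2<String, Long>> joinWithCampaigns(JavaRDD<Event> events) {
            return events.mapPartitions((Iterator<Event> partition) -> {
                List<Tuple2<String, Long>> out = new ArrayList<>();
                // One Redis connection per partition, not per record.
                try (Jedis jedis = new Jedis("redis-host", 6379)) {
                    while (partition.hasNext()) {
                        Event e = partition.next();
                        String campaignId = jedis.get(e.adId);  // ad_id -> campaign_id lookup
                        out.add(new Tuple2<>(campaignId, e.eventTime));
                    }
                }
                return out.iterator();
            });
        }
    }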
  18. 18. Spark 2-dimensional Parameter Adjustment • Micro-batch duration – This is a control dimension that is not present in a pure streaming system like Storm – Increasing the duration increases latency while reducing overhead and therefore increasing maximum throughput – Finding the optimal batch duration that minimizes latency while allowing Spark to handle the throughput is a time-consuming process • Set a batch duration, run the benchmark for 30 minutes, check the results → decrease/increase the duration • Parallelism – increasing parallelism is easier said than done in Spark – In a true streaming system like Storm, one bolt instance can send its results to any number of subsequent bolt instances – In a micro-batch system like Spark, perform a reshuffle operation • similar to how intermediate data in a Hadoop MapReduce program are shuffled and merged across the cluster. • But the reshuffling itself introduces considerable overhead.
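A short hedged sketch of the parallelism knob: DStream.repartition forces the reshuffle described above so that downstream stages can run with more tasks than the source has partitions. The socket source, 3-second batch duration, and partition count of 20 are illustrative assumptions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class RepartitionSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("repartition-sketch").setMaster("local[4]");
            JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(3));
            JavaDStream<String> events = ssc.socketTextStream("localhost", 9999);
            events.repartition(20)   // explicit reshuffle: more downstream tasks, but extra shuffle overhead
                  .count()
                  .print();
            ssc.start();
            ssc.awaitTermination();
        }
    }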
  19. 19. Spark • Spark had more interesting results than Flink. • Due to the micro-batch design, it was unable to process events at low latencies • The overhead of scheduling and launching a task per batch is very high • Batch size had to be increased – this overcame the launch overhead.
  20. 20. Spark • If we reduce the batch duration sufficiently, we get into a region where the incoming events are processed within 3 or 4 subsequent batches. • The system is on the verge of falling behind, but is still manageable, and results in better latency.
  21. 21. Spark Falling behind • Without increasing the batch size, Spark was unable to keep up with the throughput, tuples backed up, and latencies continuously increased until the job was shut down. • After increasing the batch size, Spark handled larger throughputs than either Storm or Flink.
  22. 22. Spark • Tuning the batch size was time-consuming, since it had to be done manually – this was one of the largest problems we faced in testing Spark’s Streaming capabilities. • If the batch size was set too high, latency numbers would be bad. If it was set too low, Spark would fall behind, tuples would back up, and latency numbers would be worse. • Spark had a new feature at the time called ‘backpressure’ which was supposed to help address this, but we were unable to make it work properly. In fact, enabling backpressure hindered our numbers in all cases.
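For reference, backpressure is switched on through a single configuration flag; the sketch below shows only that flag with illustrative surrounding values and does not reflect any tuning the benchmark used.

    import org.apache.spark.SparkConf;

    public class BackpressureConfigSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                .setAppName("ad-analytics-sketch")
                // Lets Spark Streaming adapt ingestion rates to processing rates (off by default in 1.5).
                .set("spark.streaming.backpressure.enabled", "true");
            System.out.println(conf.get("spark.streaming.backpressure.enabled"));
        }
    }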
  23. 23. Storm Results • Benchmark uses the Java API, one worker process per host, each worker has 16 tasks to run in 16 executors – one for each core. • In 0.11.0, Storm added a simple back pressure controller → avoids the overhead of acking – In the 0.10.0 benchmark topology, acking was used for flow control but not for processing guarantees. • With acking disabled, Storm even beat Flink for latency at high throughput. – But no tuple failure handling. (Charts: Storm 0.10.0, Storm 0.11.0)
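A minimal sketch of the relevant Storm knobs; the values are illustrative rather than the benchmark's settings, and 0.10-era code used backtype.storm.Config. Setting the acker count to zero disables acking and, with it, tuple-failure tracking; max spout pending only throttles the spout while acking is on.

    import org.apache.storm.Config;

    public class StormConfigSketch {
        public static void main(String[] args) {
            Config conf = new Config();
            conf.setNumWorkers(10);         // one worker process per compute host
            conf.setMaxSpoutPending(1000);  // flow control, effective only when acking is enabled
            conf.setNumAckers(0);           // disable acking: lower overhead, no at-least-once tracking
            System.out.println(conf);
        }
    }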
  24. 24. Storm • Storm behaved very similarly to Flink. • However, Storm was unable to handle more than 130,000 events/s with its acking system enabled. • Acking keeps track of successfully processed events within Storm. • With acking disabled, Storm achieved numbers similar to Flink at throughputs up to 170,000 events/s.
  25. 25. Storm Late Tuples • Similar to Flink’s late tuple graph. • Tuples that were late were slightly less late than Flink’s.
  26. 26. Three-way Comparison • Flink and Storm have similar linear performance profiles – These two systems process an incoming event as it becomes available • Spark Streaming has much higher latency, but is expected to handle higher throughputs – System behaves in a stepwise function, a direct result from its micro-batching nature
  27. 27. Flink Spark Storm • Comparisons of 99th-percentile latencies are revealing. • Storm 0.11 consistently lower latency than Flink and Spark. • Flink’s latency comparable to Storm 0.10, but handled higher throughput with at-least-once guarantees. • Spark had the highest latency, but was able to handle higher throughput than either Storm or Flink
  28. 28. Future work • Many variables involved – many we didn’t adjust. • Applications were not optimized – all were written in a fairly plain manner and configuration settings were not tweaked • SLA deadline of 1 second is very low. We did this to test the limits of the low-latency streaming systems. Higher SLA deadlines are reasonable, and testing those would be worthwhile – likely showing Spark being highly competitive with the others. • The throughputs we tested at were incredibly high. – 170,000 events/s comes to 14,688,000,000 events per day – about 1.4 × 10^10 events per day • Didn’t test with exactly-once semantics. • Ran small tests and checked for correctness of computations, but didn’t check correctness at large scale. • There are many more tests that can be run. • Other streaming engines can be added.
  29. 29. Conclusions • The competition between near-real-time streaming systems is heating up, and there is no clear winner at this point • Each of the platforms studied here has its advantages and disadvantages • Other important factors: – Security or integration with tools and libraries • Active communities for these and other big data processing projects continue to innovate and benefit from each other’s advancements

Editor's Notes

  • Streaming computation engines – what are they?

    They are systems designed to process a continuous stream of data.

    They are designed to have very low latency. What this means is that – ideally – data gets processed as soon as it reaches the system; it doesn’t buffer up.
    This is in contrast to something like Hadoop’s MapReduce, where incoming data goes into a file somewhere, and every couple hours or so a job runs that processes it all in one big batch.

    These are so-called “big-data” systems. They’re designed to be distributed and handle massive quantities of data.

    We have three of them here that we’re going to look at today.
  • The first one we’re going to look at is Apache Storm.

    Storm’s API gives users tools to create a directed graph, called a topology in Storm, through which data flows. Each node of this graph is a piece of user code that does some processing.
    Nodes are either spouts or bolts. Spouts are the entry point to the graph, and bolts perform the processing.

    The data moves through the system as individual tuples. It’s the job of the spout to take incoming data and turn it into tuples to pass on to the bolts.

    Storm’s graphs are not necessarily acyclic – which is interesting. Most use cases we’ve seen seem to involve acyclic data flows, but it is possible to have cycles.
  • Flink!

    Flink has its DataStream API to perform operations on streams of data, operations like map, filter, reduce, join and so on.
    Instead of having the users construct a graph, users just describe what they want to happen to the data, and Flink builds a graph for them.
    The underlying graph works very similarly to Storm’s
    So similar that Flink actually built a Storm-compatible API, and they claim you can run unmodified Storm applications on Flink.
  • Spark Streaming!

    Spark Streaming has the DStream API to perform operations on streams of data. It is based on Spark’s RDDs, or Resilient Distributed Datasets
    The API is super similar to Flink’s

    The underlying model, however, is very different from both Storm’s and Flink’s.

    Spark’s streaming capabilities are accomplished through something called micro-batching.
    Micro-batching is basically just running very small batch jobs in quick succession.

    So each one of these RDDs down here would be a tiny batch of data in a spark streaming job.
  • We used our benchmark to correlate latency and throughput in the systems.

    We simulated an advertisement analytics pipeline, which counts clicks in ad-campaigns.
    The application needed to be implemented and run in all three engines.

    We started out with some initial data, which were some number of advertising campaigns, and some number of ads in each campaign. We made these numbers adjustable.
    -
    The initial data we stored in a Redis instance.
    -
    We had some producer processes then read the initial data out of Redis, and begin generating various events for advertisements like views, clicks, and purchases.
    -
    These events are then sent into Kafka - Kafka is a distributed pub/sub system. Events go into Kafka from publishers and go out of Kafka to subscribers.
  • The application itself performs operations on each event, and they go like this:

    First: deserialize the JSON string and turn it into a native data structure.
    Second: Filter the events. We’re only counting clicks in this application, so we drop all events that don’t have an ad_type of “click”.
    Third: We take what’s called a projection of the events – That just means we drop all of the fields in the tuple that we aren’t interested in. We’re left with just ad_id and event_time.
    If you remember earlier I highlighted three fields that were important. We’re down to two important fields now because we already used ad_type and we’re done with it.
    All of our events have the same ad_type now, so we can drop it.
    Fourth: Go and pull the campaign_id associated with the ad_id out of Redis. This is part of the initial data that we put into Redis. Join this field into the tuple.
    Fifth: Take a windowed count of events per campaign – so we keep track of how many clicks each campaign has gotten in each time window.
    Last: Periodically write these windows into Redis – This will be the data we use to calculate latencies. (These six steps are sketched below.)

    The system needs to be able to take late events into account – This is just a constraint we put on the application since it’s one we see often in the real world.
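An engine-agnostic sketch of these six steps in plain Java, assuming a flat JSON layout, the ad_id → campaign_id mapping in Redis, and a hypothetical parseField helper standing in for a real JSON parser.

    import java.util.HashMap;
    import java.util.Map;
    import redis.clients.jedis.Jedis;

    public class EventFlowSketch {
        // campaign_id -> (window id -> click count)
        static final Map<String, Map<Long, Long>> windowCounts = new HashMap<>();

        static void process(String json, Jedis redis) {
            String adType = parseField(json, "ad_type");              // 1. deserialize
            if (!"click".equals(adType)) return;                      // 2. filter: clicks only
            String adId = parseField(json, "ad_id");                  // 3. projection: ad_id, event_time
            long eventTime = Long.parseLong(parseField(json, "event_time"));
            String campaignId = redis.get(adId);                      // 4. join campaign_id from Redis
            long window = eventTime / 10_000;                         // 5. 10 s window (event_time in ms)
            windowCounts.computeIfAbsent(campaignId, c -> new HashMap<>())
                        .merge(window, 1L, Long::sum);                //    windowed count per campaign
            // 6. a separate periodic task writes windowCounts to Redis with a "last updated" timestamp
        }

        // Hypothetical helper: naive field extraction, stand-in for a real JSON library.
        static String parseField(String json, String field) {
            int start = json.indexOf("\"" + field + "\":\"") + field.length() + 4;
            return json.substring(start, json.indexOf('"', start));
        }
    }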
  • As I mentioned, the windows are periodically written into Redis along with a timestamp of when the window was written into Redis.
    This last part is important. Each window has a timestamp like this, and it represents when that window was last written into Redis.

    The application is given an SLA or Service-Level Agreement as part of the simulation, which says that tuples must be processed completely end-to-end in under 1 second.
    This is just another constraint that we put on our application as part of simulating a real-world use case. The 1-second SLA is basically just a target end-to-end latency; it’s what the systems are trying to achieve.

    To this end, we had the applications write their windows out once per second. Spark is the exception here. Its computation model doesn’t allow us to write windows out once per second. Instead, we write the windows out once per batch.
  • Now we actually get to look at how we acquire the latency data.

    For our experiment we ran with 10-second windows.
    -
    In every window the first event is generated basically right when the window begins
    -
    After that, it’s 10 seconds of events – 10’s of thousands of events per second.
    -
    The last event is generated very near the end of the window – within microseconds before it.
    The last event goes off to be processed…
    -
    Some time later, the window is written into Redis by the application.
    -
    Now, we know the time of the end of the window – where the last event was written, and we know the time when the window was written to Redis.
    -
    This gives us a latency data point. This chunk of time here is the amount of time that passed between the last event’s generation and when it was written into Redis. – This is the end-to-end latency of the application.
    -
    You can see how events that are processed late will cause their windows to be written at a later time, and will be reflected as higher end-to-end latency in the data.

    So that’s how we measure latency. Next
  • Our methodology for testing was pretty simple.

    We have our producers generate a certain event throughput, and then we measure the latency of tuples going through the system.

    Throughputs measured varied between 50,000 events per second and 170,000 events per second.

    We had…
  • Steps were:
  • Now we’re going to look at the benchmark results from each system.
    -
    First is Flink:
    The version we tested was a 0.10.1-SNAPSHOT
    We wrote the application using the Java DataStream API.
    Checkpointing was disabled – so there were no processing guarantees.
    -
    Spark:
    The version we tested was 1.5
    We wrote this one in Scala using the DStreams API.
    In addition, we did not implement at-least-once semantics.
    -
    Storm:
    For storm, we tested both versions 0.10 and a 0.11-SNAPSHOT
    Application written using the java TopologyBuilder API.
    Storm’s acking provides at-least-once processing and flow control, but a new feature allowed us to turn that off for high throughputs in 0.11
  • Some things we noticed about flink:

    Most of the tuples were processed within the 1-second SLA we specified.

    The graph here shows percentiles - so the red dots in the middle there are the 50th percentile mark – 50% of the tuples were in at about 0.75 seconds.

    The sharp curve at the end is interesting – shows that a small number were quite late.
  • Here is a graph of the latency for late tuples in Flink.
    Late tuples are the ones that finished processing after the 1 second SLA.

    This graph emphasizes that most tuples were on time or very nearly on time. Only a small percentage were late by any significant amount.
  • Initially, we thought our operations were CPU-bound, and so the benefits of reshuffling to a higher number of partitions would outweigh the cost of reshuffling. Instead, we found the bottleneck to be scheduling, and so reshuffling only added overhead. We suspect that at higher throughput rates or with operations that are CPU-bound, the reverse would be true.
  • Spark was more difficult to get results out of, but the results were more interesting.

    The micro-batching prevented Spark from being able to meet the 1 second SLA for anything but very low throughputs.
    This was due to the large overhead of scheduling and launching a task for each micro-batch.
    Once we increased the batch size, Spark was able to keep up with various throughputs.

    This graph shows a Spark Streaming job that’s keeping up with the throughput.
  • If we didn’t increase the batch size enough, Spark wasn’t able to keep up with the throughput, tuples got backed up and buffered in Kafka, and the latency figures increased until the job was killed.

    This is a graph of a Spark Streaming job that’s falling behind in its processing duties, and latencies have grown to almost 70 seconds.

    However, after increasing the batch size enough, Spark was able to handle more throughput than either Storm or Flink.
  • So… Tuning the batch size was very time consuming and frustrating. It was a manual trial-and-error process and was a big obstacle while we were testing Spark.

    If the batch size was too high, latency would be high, if batch size was set too low, Spark wouldn’t keep up with the throughput, tuples would back up, and latency would be even higher.
    We were trying to get fair numbers out of Spark, so we didn’t just want to turn the batch size way up. We wanted to find the lowest latency we could get for a particular throughput.

    When we benchmarked Spark, there was a new feature called “backpressure” which was supposed to help address this difficulty. We tried this, but unfortunately we were unable to get it to improve our latency or prevent Spark falling behind. Instead, Spark’s backpressure actually made our numbers worse whenever we enabled it.
  • Storm –

    Storm had results very similar to Flink. The graphs look almost identical.

    The problem we found with Storm was that beyond 130,000 events per second, Storm couldn’t keep up with the throughput, tuples backed up, and latencies grew, just like in Spark.
    This was caused by the acking system, which keeps track of successfully processed events within Storm and performs flow control.

    A new feature in 0.11 allowed us to disable acking, and it got numbers similar to Flink at throughputs up to 170,000 events per second.
  • Storm’s late tuple graph is, again, almost identical to Flink’s. There aren’t really any surprises here.
  • This is a graph comparing the 99-th percentile latencies of the various engines at different throughputs.

    We can see Storm 0.11 has consistently lower latency than Flink and Spark.

    Flink’s latency is comparable to Storm 0.10’s, but Flink was able to handle more throughput without falling over.

    Spark had the highest latency by far, but was able to handle higher throughput than either Storm or Flink.
  • Future work!

    So, there are a lot of variables involved and many of them we didn’t adjust.

    We didn’t optimize any of the applications. They were written plainly and we didn’t really mess with the configs.

    The SLA is important. SLA of 1 second is super low. We did this to try and test the low-latency limits of the low-latency systems.
    Many real SLA’s are on the order of minutes, and it would be worth it to test with these SLA’s.
    We expect that Spark would be more competitive in these time frames.

    The throughputs we tested were incredibly high. Our highest throughput of 170,000 events per second is equivalent to 1.4 times ten to the ten events per day. Most workloads are many orders of magnitude less than that. Writing a benchmark that performs heavier computation on a smaller throughput might better reflect real workloads.

    We didn’t test exactly-once semantics. This is an important feature, and something that can add a lot of overhead. Testing competing implementations could yield interesting results.

    Correctness. We ran some small tests for each of the systems to ensure they were processing data correctly, but we didn’t check correctness when running the benchmarks at full scale.

    The project is open-source, so you can go run your own tests; there are many, many more possible configurations.
    That also means you can add an implementation for your favorite streaming engine. There are a few other popular ones out there.

  • How do we actually measure the latency?

    We start by having the producers write an integer timestamp representing the time of the event’s creation into the event. This becomes the field event_time.
    We next need to understand how the windowing scheme works.
    -
    The window an event belongs in is determined by truncating the event_time of incoming tuples.
    -

    (Example)
    -
    If these are timestamps representing seconds, what we have then are 10-second windows of events. So in our example window here, all events with timestamps in the range of 12340 – 12349 seconds will belong to the same window.
    -
    Window sizes can be adjusted by truncating more or fewer digits from the timestamps. If you cut off two digits, you end up with 100-second windows. If you don’t cut off any, you end up with 1-second windows, as sketched below.
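A minimal sketch of this truncation, assuming event_time is an integer number of seconds; dropping one decimal digit yields 10-second windows, two digits yields 100-second windows.

    public class WindowTruncationSketch {
        // Drop the last `digits` decimal digits of the timestamp to get a window identifier.
        static long windowOf(long eventTimeSeconds, int digits) {
            long divisor = (long) Math.pow(10, digits);
            return eventTimeSeconds / divisor;
        }

        public static void main(String[] args) {
            System.out.println(windowOf(12345, 1)); // 1234 -> same 10 s window as 12340..12349
            System.out.println(windowOf(12349, 1)); // 1234
            System.out.println(windowOf(12350, 1)); // 1235 -> next 10 s window
        }
    }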
