Apache Storm vs. Spark Streaming - two stream processing platforms compared

4,141 views

Storm and Spark Streaming are both open-source frameworks for distributed stream processing. Storm originated at BackType, was open-sourced after Twitter acquired the company in 2011, and is a free distributed real-time computation system that can be used with any programming language. It is written primarily in Clojure and supports Java by default. Spark is a fast and general engine for large-scale data processing, designed as a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. This presentation shows how to implement stream processing solutions with the two frameworks, discusses how they compare, and highlights their differences and similarities.

Published in: Software

  1. Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared
     Guido Schmutz, June 2015 (slide footer: July 2015)
     2015 © Trivadis
     BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
  2. Guido Schmutz
     •  Working for Trivadis for more than 18 years
     •  Oracle ACE Director for Fusion Middleware and SOA
     •  Co-author of several books
     •  Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data
     •  Member of the Trivadis Architecture Board
     •  Technology Manager @ Trivadis
     •  More than 25 years of software development experience
     •  Contact: guido.schmutz@trivadis.com
     •  Blog: http://guidoschmutz.wordpress.com
     •  Twitter: gschmutz
  3. Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on […] and […] technologies in Switzerland, Germany and Austria.
     We offer our services in the following strategic business fields:
     •  OPERATION: Trivadis Services takes over the interacting operation of your IT systems.
  4. Agenda
     1.  Introduction / Motivation
     2.  Apache Storm
     3.  Apache Spark (Streaming)
     4.  Stream Processing in the Architecture
  5. What is Stream Processing?
     •  Infrastructure for continuous data processing
     •  The computational model can be as general as MapReduce, but with the ability to produce low-latency results
     •  Data collected continuously is naturally processed continuously
     •  Also known as Event Processing / Complex Event Processing (CEP)
  6. Trivadis Stream Processing Demo System
     Use hashtag #JFS2015 plus #storm and/or #spark
  7. How to design a Stream Processing System?
     [Diagram: three designs. (a) Event Stream → Collecting/Processing → result. (b) Event Stream → Collecting → Processing → result. (c) Event Stream → Collecting → Queue (Persist) → Processing → result.]
  8. How to scale a Stream Processing System?
     [Diagram: Event Stream → Collecting Thread 1 … n → Queue (Persist) → Processing Thread 1 … n → result]
  9. How to scale a Stream Processing System?
     [Diagram: Event Stream → several collecting processes → Queue 1 … n (Persist) → several processing processes → result]
  10. How to scale a Stream Processing System?
      [Diagram: Event Stream → Collecting Process 1 and 2 → per-thread queues Q1 … Qn → Processing A and Processing B, each running as Thread 1 … n spread over Process 1 and Process 2]
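The queue-plus-worker layout on the scaling slides can be sketched in a few lines. This is a minimal illustration only: `run_pipeline`, the `STOP` sentinel and the upper-casing "processing" step are hypothetical, and the in-memory `queue.Queue` is not persisted, unlike the "Queue (Persist)" on the slides.

```python
import queue
import threading


def run_pipeline(events, n_workers=3):
    """Fan events from one queue out to n_workers processing threads."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()
    STOP = object()  # sentinel telling a worker thread to exit

    def process():
        while True:
            event = q.get()
            if event is STOP:
                break
            with lock:
                results.append(event.upper())  # stand-in for real processing

    workers = [threading.Thread(target=process) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for e in events:          # the "collecting" side puts events on the queue
        q.put(e)
    for _ in workers:         # one sentinel per worker
        q.put(STOP)
    for w in workers:
        w.join()
    return results


out = run_pipeline(["a", "b", "c", "d"])
```

Because several threads consume from the same queue, the order of `out` is not guaranteed; only its contents are.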
  11. How to make a (stateful) Stream Processing System reliable?
      •  Faults and stragglers are inevitable in large clusters running big data applications
      •  Streaming applications must recover from them quickly
      [Diagram: two identical pipelines, Event Stream → Collecting → Processing A → Processing B]
  12. How to make a (stateful) Stream Processing System reliable?
      Solution 1: active/passive system (hot replication)
      •  Both systems process the full load
      •  In case of a failure, automatically switch over and use the “passive” system
      •  Stragglers slow down both the active and the passive system
      [Diagram: an active and a passive pipeline, each holding State; State = state in-memory and/or on-disk]
  13. How to make a (stateful) Stream Processing System reliable?
      Solution 2: Upstream backup
      •  Nodes buffer messages and replay them to a new node in case of failure
      •  Stragglers are treated as failures
      [Diagram legend: State = state in-memory and/or on-disk; buffer = buffer for replay, in-memory and/or on-disk]
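Upstream backup can be illustrated with a toy model. The classes `UpstreamNode` and `CountingNode`, the acknowledgement call and the checkpoint value are all assumptions made for this sketch; the slides only state that nodes buffer messages and replay them to a new node after a failure.

```python
class UpstreamNode:
    """Keeps every sent message until the downstream node acknowledges it."""

    def __init__(self):
        self.buffer = []  # messages sent but not yet acknowledged

    def send(self, msg, downstream):
        self.buffer.append(msg)
        downstream.receive(msg)

    def ack_up_to(self, count):
        # Downstream has durably handled the first `count` buffered messages.
        del self.buffer[:count]

    def replay(self, new_downstream):
        # After a failure, re-send everything that was never acknowledged.
        for msg in self.buffer:
            new_downstream.receive(msg)


class CountingNode:
    """Stateful downstream node: keeps a running total."""

    def __init__(self):
        self.total = 0

    def receive(self, msg):
        self.total += msg


upstream = UpstreamNode()
node = CountingNode()
for m in [1, 2, 3]:
    upstream.send(m, node)

checkpoint = 1            # assume the node checkpointed its state after message 1
upstream.ack_up_to(1)     # so the upstream buffer can drop that message

# The node fails; a replacement restores the checkpoint and gets the rest replayed.
replacement = CountingNode()
replacement.total = checkpoint
upstream.replay(replacement)
```

The replacement ends up with the same total the failed node had, at the cost of buffering unacknowledged messages upstream.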
  14. Processing Models
      Batch Processing
      •  Familiar concept of processing data en masse
      •  Generally incurs high latency
      (Event-) Stream Processing
      •  A one-at-a-time processing model
      •  A datum is processed as it arrives
      •  Sub-second latency
      •  Difficult to process state data efficiently
      Micro-Batching
      •  A special case of batch processing with very small batch sizes
      •  A mix between batching and streaming
      •  Comes at the cost of latency
      •  Allows stateful computation, making windowing an easy task
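To make the micro-batching model concrete, the following sketch cuts a stream of timestamped events into fixed-length batches. The function name `micro_batches` and the `(timestamp, value)` event shape are assumptions for illustration; intervals that contain no events are simply skipped here.

```python
def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = {}
    for ts, value in events:
        # All events whose timestamp falls into the same interval share a batch.
        batches.setdefault(int(ts // batch_interval), []).append(value)
    return [batches[k] for k in sorted(batches)]


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = micro_batches(events, 0.5)
```

Each batch can then be handed to ordinary batch code, which is why an event may wait up to one batch interval before it is processed.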
  15. Message Delivery Semantics
      At most once [0,1]
      •  Messages may be lost
      •  Messages are never redelivered
      At least once [1 .. n]
      •  Messages will never be lost
      •  But messages may be redelivered (might be OK if the consumer can handle it)
      Exactly once [1]
      •  Messages are never lost
      •  Messages are never redelivered
      •  Perfect message delivery
      •  Incurs higher latency for transactional semantics
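One common way for a consumer to "handle" redelivery under at-least-once semantics is to make processing idempotent. The sketch below assumes every message carries a unique id; the class `IdempotentCounter` and the message ids are hypothetical.

```python
class IdempotentCounter:
    """Applies each message id at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.seen = set()   # ids already applied
        self.count = 0

    def handle(self, msg_id, amount):
        if msg_id in self.seen:
            return False    # duplicate delivery: ignore
        self.seen.add(msg_id)
        self.count += amount
        return True


# "m1" is redelivered once, as may happen with at-least-once delivery.
deliveries = [("m1", 1), ("m2", 1), ("m1", 1), ("m3", 1)]
counter = IdempotentCounter()
applied = [counter.handle(msg_id, amount) for msg_id, amount in deliveries]
```

The result matches what exactly-once delivery would have produced, but the consumer pays for it by remembering which ids it has already seen.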
  16. Agenda
      1.  Introduction / Motivation
      2.  Apache Storm
      3.  Apache Spark (Streaming)
      4.  Stream Processing in the Architecture
  17. Apache Storm
      A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
      •  Highly distributed real-time computation system
      •  Provides general primitives to do real-time computation
      •  Simplifies working with queues & workers
      •  Scalable and fault-tolerant
      Originated at BackType, acquired by Twitter in 2011
      Open-sourced late 2011
      Part of Apache since September 2013
  18. Apache Storm – Core concepts
      Tuple
      •  Immutable set of key/value pairs
      Stream
      •  An unbounded sequence of tuples that can be processed in parallel by Storm
      Topology
      •  Wires data and functions via a DAG (directed acyclic graph)
      •  Executes on many machines, similar to a MapReduce job in Hadoop
      Spout
      •  Source of data streams (tuples)
      •  Can be run in “reliable” and “unreliable” mode
      Bolt
      •  Consumes 1+ streams and produces new streams
      •  Complex operations often require multiple steps and thus multiple bolts
      [Diagram: two spouts (one labelled “Source of Stream B”) and four bolts passing tuples (T): one subscribes to A and emits C, one subscribes to A and emits D, one subscribes to A & B and emits nothing, one subscribes to C & D and emits nothing]
  19. Apache Storm – Core concepts
      Each spout or bolt runs N instances in parallel
      [Diagram: Text Spout → Shuffle → Split Text (1st … nth) → Fields → Word Count (1st … nth) → Global → Report]
      •  Shuffle grouping: tuples are distributed randomly across tasks
      •  Fields grouping: tuples are grouped by value, so that equal values go to the same task
      •  All grouping: replicates each tuple to all tasks
      •  Global grouping: makes all tuples go to one task
      •  None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
      •  Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
      •  Local or shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
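The routing decision behind four of these groupings can be sketched as plain functions that map a tuple to a task index. This is not Storm's API: the function names are hypothetical, and CRC32 is only a stand-in for whatever hash Storm applies to the grouping fields.

```python
import random
import zlib


def shuffle_grouping(n_tasks, rng=random):
    """Pick a random task for each tuple."""
    return rng.randrange(n_tasks)


def fields_grouping(value, n_tasks):
    """Equal field values always map to the same task."""
    return zlib.crc32(value.encode("utf-8")) % n_tasks


def all_grouping(n_tasks):
    """Every task receives a copy of the tuple."""
    return list(range(n_tasks))


def global_grouping():
    """All tuples go to a single task (task 0 in this sketch)."""
    return 0


task_for_barca = fields_grouping("barca", 4)
```

Fields grouping is what makes a distributed word count correct: every occurrence of a word reaches the one task that holds that word's counter.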
  20. Storm – How does it work?
      [Diagram, step 1: the Twitter Spout emits tweets (“Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca”, “… #barca”, “… #fcb”) via shuffle grouping to three Sentence Splitter bolts, which emit the words real, juve, barca, barca, bayern, fcb]
  21. Storm – How does it work?
      [Diagram, step 2: the Sentence Splitter bolts pass the words real, juve, barca, barca, bayern, fcb via fields grouping to two Word Counter bolts]
  22. Storm – How does it work?
      [Diagram, step 3: each Word Counter bolt issues one INCR per word (INCR barca, INCR real, INCR juve, INCR barca, INCR bayern, INCR fcb), giving real = 1, juve = 1, bayern = 1, fcb = 1 and barca = 1, then barca = 2]
  23. Storm – How does it work?
      [Diagram, step 4: the Word Counter bolts send their counts via global grouping to a single Report bolt (label: 30sec), which shows real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1]
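The four-step walkthrough above can be imitated in plain Python to check the numbers on the slides. This is not Storm code: the `TRACKED` watch list is an assumption (the slides do not say how the splitter decides which words to keep), and the byte-sum hash is only a stand-in for fields grouping.

```python
import re
from collections import Counter

TRACKED = {"barca", "real", "juve", "bayern", "fcb"}  # assumed watch list


def split_sentence(tweet):
    """Splitter stand-in: emit the tracked terms of a tweet, lower-cased."""
    return [w for w in re.findall(r"[a-z]+", tweet.lower()) if w in TRACKED]


def word_count(tweets, n_counters=2):
    counters = [Counter() for _ in range(n_counters)]
    for tweet in tweets:
        for word in split_sentence(tweet):
            # Fields grouping: the same word always reaches the same counter.
            task = sum(word.encode("utf-8")) % n_counters
            counters[task][word] += 1   # the INCR step
    report = Counter()
    for c in counters:                  # global grouping: one report merges all
        report.update(c)
    return dict(report)


tweet = "Who will win: Barca, Real, Juve or Bayern? bit.ly/1yRsPmE #fcb #barca"
result = word_count([tweet])
```

For this single tweet, `result` holds the same counts as the Report bolt on the slide.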
  24. Using a NoSQL datastore for persisting results
      •  Keep state in a NoSQL datastore
      •  Using counter type columns of Cassandra
      [Diagram: Twitter Stream → Twitter Spout → Sentence Splitter → Word Counter; the Word Counter bolts issue INCR per word against the datastore, which holds real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1]
  25. Storm Trident
      High-level abstraction on top of Storm
      •  Processing as a series of batches (micro-batches)
      •  Stream is partitioned among nodes in the cluster
      5 kinds of operations in Trident
      •  Operations that apply locally to each partition and cause no network transfer
      •  Repartitioning operations that don't change the contents
      •  Aggregation operations that do network transfer
      •  Operations on grouped streams
      •  Merges and joins
      [Diagram labels: Twitter Stream, Twitter Spout, tweet, Sentence Splitter, hashtag, Sentence Normalizer, local, groupBy, Persistent Aggregate, Bolt, Bolt]
  26. Storm Core vs. Storm Trident

                          | Storm Core                            | Storm Trident
      Community           | > 100 contributors                    | > 100 contributors
      Adoption            | ***                                   | *
      Language Options    | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
      Processing Models   | Event-Streaming                       | Micro-Batching
      Processing DSL      | No                                    | Yes
      Stateful Ops        | No                                    | Yes
      Distributed RPC     | Yes                                   | Yes
      Delivery Guarantees | At most once / At least once          | Exactly Once
      Latency             | sub-second                            | seconds
      Platform            | Storm Cluster, YARN                   | Storm Cluster, YARN
  27. Agenda
      1.  Introduction
      2.  Apache Storm
      3.  Apache Spark (Streaming)
      4.  Unified Log (Enterprise Event Bus)
      5.  Stream Processing in the Architecture
  28. Apache Spark
      Apache Spark is a fast and general engine for large-scale data processing
      •  The hot trend in Big Data!
      •  Based on the 2007 Microsoft Dryad paper
      •  Written in Scala, supports Java, Python, SQL and R
      •  Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
      •  Runs everywhere: on Hadoop, Mesos, standalone or in the cloud
      •  One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
      •  Originally developed in 2009 at UC Berkeley's AMPLab
      •  Open-sourced in 2010; part of the Apache Software Foundation since 2014
  29. Apache Spark
      Spark Core
      •  General execution engine for the Spark platform
      •  In-memory computing capabilities deliver speed
      •  General execution model supports a wide variety of use cases
      •  DAG-based
      •  Ease of development: native APIs in Java, Scala and Python
      Spark Streaming
      •  Runs a streaming computation as a series of very small, deterministic batch jobs
      •  Batch size as low as ½ sec, latency of about 1 sec
      •  Exactly-once semantics
      •  Potential for combining batch and streaming processing in the same system
      •  Started in 2012, first alpha release in 2013
  30. Apache Spark - Generality
      [Diagram: the Spark stack, top to bottom]
      •  Libraries: Spark SQL (Batch Processing), BlinkDB (Approximate Querying), Spark Streaming (Real-Time), MLlib and SparkR (Machine Learning), GraphX (Graph Processing)
      •  Core Runtime: Spark Core API and Execution Model
      •  Cluster Resource Managers: Spark Standalone, Mesos, YARN
      •  Data Stores: HDFS, Elasticsearch, Cassandra, S3 / DynamoDB
      Adapted from C. Fregly: http://slidesha.re/11PP7FV
  31. Apache Spark – Core concepts
      Resilient Distributed Dataset (RDD)
      •  Core Spark abstraction
      •  Collections of objects (partitions) spread across the cluster
      •  Can be stored in-memory or on-disk (local)
      •  Enables parallel processing on data sets
      •  Built through parallel transformations
      •  Immutable, re-computable, fault tolerant
      •  Contains the transformation history (“lineage”) for the whole data set
      Operations
      •  Stateless transformations (map, filter, groupBy)
      •  Actions (count, collect, save)
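The split between lazy transformations and actions, and the idea of lineage, can be shown with a toy class. `ToyRDD` is a hypothetical single-machine stand-in written for this sketch, not Spark's RDD API: it only records transformations and replays them from the source when an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source      # callable that (re)loads the base data
        self.lineage = lineage     # tuple of (name, function) steps

    def map(self, f):
        # Transformation: returns a new ToyRDD, nothing is computed yet.
        return ToyRDD(self._source, self.lineage + (("map", f),))

    def filter(self, f):
        return ToyRDD(self._source, self.lineage + (("filter", f),))

    def collect(self):
        # Action: replay the whole lineage from the source.
        data = list(self._source())
        for name, f in self.lineage:
            if name == "map":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data

    def count(self):
        return len(self.collect())


loads = []


def load():
    loads.append("load")           # record every time the source is read
    return [1, 2, 3, 4]


evens_squared = ToyRDD(load).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
loads_before_action = list(loads)  # still empty: transformations are lazy
result = evens_squared.collect()
```

Because the lineage is enough to rebuild the data from its source, a lost result can be recomputed rather than restored from a replica, which is the fault-tolerance argument made on the slide.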
  32. RDD Lineage Example
      [Diagram: HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD; HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD; both → join() → ShuffledRDD → SparkContext.saveAsHadoopFile() → HDFS File Output. The steps up to join() are transformations (lazy); saveAsHadoopFile() is the action that executes the transformations.]
      Adapted from Chris Fregly: http://slidesha.re/11PP7FV
  33. Apache Spark Streaming – Core concepts
      Discretized Stream (DStream)
      •  Core Spark Streaming abstraction
      •  Micro-batches of RDDs
      •  Operations similar to RDD
      Input DStreams
      •  Represent the stream of raw data received from streaming sources
      •  Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
      •  Custom sources can be easily written for custom data sources
      Operations
      •  Same as Spark Core
      •  Additional stateful transformations (window, reduceByWindow)
  34. Discretized Stream (DStream)
      [Diagram: as time increases (time 1 … time n), the input stream of messages is cut into a DStream of RDDs (RDD @time 1 … RDD @time n, each holding message 1 … message n); map() turns it into a MappedDStream of RDDs holding f(message 1) … f(message n); saveAsHadoopFiles() writes result 1 … result n for each batch. DStream transformations form the lineage; actions trigger Spark jobs.]
      Adapted from Chris Fregly: http://slidesha.re/11PP7FV
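A DStream can be thought of as a list of batches, with each operation applied batch by batch. The sketch below models that idea in plain Python and is not the Spark Streaming API: `map_batches` and `count_by_window` are hypothetical names, window length and slide interval are counted in batches, and, as a simplification, only full windows are emitted.

```python
from collections import Counter


def map_batches(batches, f):
    """Apply f to every element of every batch (per-batch map)."""
    return [[f(x) for x in batch] for batch in batches]


def count_by_window(batches, window_length, slide_interval):
    """Count items over the last window_length batches, every slide_interval batches."""
    out = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        out.append(Counter(x for batch in window for x in batch))
    return out


# Four consecutive micro-batches of hashtags; the last one is empty.
batches = [["#storm"], ["#spark", "#storm"], ["#spark"], []]
windows = count_by_window(batches, window_length=2, slide_interval=1)
```

Each entry of `windows` summarises two adjacent batches, which is the kind of stateful, windowed result that is awkward to compute one event at a time.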
  35. Storm Core vs. Storm Trident vs. Spark Streaming

                          | Storm Core                            | Storm Trident        | Spark Streaming
      Community           | > 100 contributors                    | > 100 contributors   | > 280 contributors
      Adoption            | ***                                   | *                    | *
      Language Options    | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
      Processing Models   | Event-Streaming                       | Micro-Batching       | Micro-Batching, Batch (Spark Core)
      Processing DSL      | No                                    | Yes                  | Yes
      Stateful Ops        | No                                    | Yes                  | Yes
      Distributed RPC     | Yes                                   | Yes                  | No
      Delivery Guarantees | At most once / At least once          | Exactly Once         | Exactly Once
      Latency             | sub-second                            | seconds              | seconds
      Platform            | Storm Cluster, YARN                   | Storm Cluster, YARN  | YARN, Mesos, Standalone, DataStax EE
  36. Agenda
      1.  Introduction / Motivation
      2.  Apache Storm
      3.  Apache Spark (Streaming)
      4.  Stream Processing in the Architecture
  37. Architectural Pattern: Standalone Event Stream Processing
      [Diagram components: Event Cloud (Internet of Things, Social Media Streams); Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and Rules from a Business Rule Management System; Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications; DB]
  38. Architectural Pattern: Event Stream Processing as part of Lambda Architecture
      [Diagram components: Event Cloud (Internet of Things, Social Media Streams); Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and Result Store; Hadoop Big Data Infrastructure with HDFS, Map/Reduce and a second Result Store; Enterprise Event Bus; Enterprise Service Bus; Analytical Applications; DB]
  39. Architectural Pattern: Event Stream Processing as part of Kappa Architecture
      [Diagram components: Event Cloud (Internet of Things, Social Media Streams); Enterprise Event Bus (Ingress); Event Processing (ESP / CEP) with State Store / Event Store and Result Store; Hadoop Big Data Infrastructure with HDFS used for Replay; Enterprise Service Bus; Analytical Applications; DB]
  40. Unified Log (Event) Architecture
      •  Stream processing allows for computing feeds off of other feeds
      •  Derived feeds are no different from the original feeds they are computed from
      •  Single deployment of the “Unified Log”, but logically different feeds
      [Diagram: Meter Readings → Collector → feed “Raw Meter Readings” → Enrich / Transform (with Customer) → feed “Meter with Customer” → Aggregate by Minute → feeds “Meter by Minute” and “Meter by Customer by Minute”; Persist steps store “Raw Meter Readings” and “Meter by Minute”]
  41. Unified Log / Event Architecture for Trivadis Streaming Demo System
      [Diagram components: Distribution Layer (Kafka) with feeds Tweets, Filtered Tweets, Sensor Readings, Feature Occurrences, Count and Matches; Speed Layer (Storm) with Filter, Feature extractor, Feature counter, Skill Matcher and Persist steps; stores Cassandra, Elasticsearch and Titan]
  42. Unified Log / Event Architecture for Trivadis Streaming Demo System
      [Diagram: same components as the previous slide, plus a Storm topology detail: Kafka → Kafka Spout → Shuffle → Splitter → Fields → Word Remover → Kafka]
  43. Central Unified Log for (real-time) subscription
      Take all the organization’s data and put it into a central log for subscription
      Properties of the Unified Log:
      •  Unified: “Enterprise”, single deployment
      •  Append-only: events are appended, no update in place => immutable
      •  Ordered: each event has an offset, which is unique within a shard
      •  Fast: should be able to handle thousands of messages / sec
      •  Distributed: lives on a cluster of machines
      [Diagram: a log with offsets 0 … 11; the Collector writes at the end, Consumer System A reads at time = 6, Consumer System B reads at time = 10]
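The append-only and ordered properties, and the way each consumer tracks its own position, can be sketched as follows. `UnifiedLog` and `Consumer` are hypothetical single-process classes for illustration; a real unified log is distributed and sharded.

```python
class UnifiedLog:
    """Append-only list of events; an event's index is its offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)       # no update in place
        return len(self._events) - 1     # offset of the new event

    def read(self, offset, max_events=10):
        return self._events[offset:offset + max_events]


class Consumer:
    """Each consumer remembers its own position in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_events=10):
        events = self.log.read(self.offset, max_events)
        self.offset += len(events)
        return events


log = UnifiedLog()
offsets = [log.append(f"event-{i}") for i in range(12)]   # offsets 0 .. 11

system_a = Consumer(log)
system_b = Consumer(log)
system_a.poll(6)     # System A has read up to offset 6
system_b.poll(10)    # System B has read up to offset 10
```

Because reading does not remove anything from the log, the two consumers progress independently, as in the slide's diagram.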
  44. Apache Kafka - Overview
      •  A distributed publish-subscribe messaging system
      •  Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
      •  Initially developed at LinkedIn, now part of Apache
      •  Does not follow JMS standards and does not use the JMS API
      •  Kafka maintains feeds of messages in topics
      [Diagram: Producers → Kafka Cluster → Consumers. Anatomy of a topic: Partition 0, Partition 1 and Partition 2, each an ordered sequence of offsets from old to new, with writes appended at the new end]
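The "anatomy of a topic" can be modelled as a list of append-only partitions. This sketch does not use a Kafka client: the `Topic` class is hypothetical, and CRC32 is only a stand-in for the partitioner a real producer would apply to the message key.

```python
import zlib


class Topic:
    """A topic as a fixed number of append-only partitions."""

    def __init__(self, n_partitions):
        self.partitions = [[] for _ in range(n_partitions)]

    def send(self, key, value):
        # Messages with the same key always land in the same partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)


topic = Topic(3)
p1, o1 = topic.send("meter-1", "reading-1")
p2, o2 = topic.send("meter-1", "reading-2")
```

Offsets are only meaningful within one partition, so ordering is guaranteed per key rather than across the whole topic.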
  45. Trivadis Stream Processing Demo System - Update
  46. Questions and answers ...
      Guido Schmutz, Technology Manager