
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared

Storm and Spark Streaming are both open-source frameworks for distributed stream processing. Storm, developed by Twitter, is a free and open-source distributed real-time computation system that can be used with any programming language. It is written primarily in Clojure and supports Java by default. Spark is a fast and general engine for large-scale data processing, designed as a more efficient alternative to Hadoop MapReduce. Spark Streaming brings Spark's language-integrated API to stream processing, letting you write streaming applications the same way you write batch jobs. It supports both Java and Scala. This presentation shows how to implement stream processing solutions with the two frameworks, discusses how they compare, and highlights their differences and similarities.

Published in: Software


  1. Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared
     DBTA Workshop on Stream Processing, Berne, 3.12.2014
     Guido Schmutz
     BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA
     2014 © Trivadis
  2. Guido Schmutz
     § Working for Trivadis for more than 18 years
     § Oracle ACE Director for Fusion Middleware and SOA
     § Co-author of several books
     § Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data
     § Member of the Trivadis Architecture Board
     § Technology Manager @ Trivadis
     § More than 25 years of software development experience
     § Contact: guido.schmutz@trivadis.com
     § Blog: http://guidoschmutz.wordpress.com
     § Twitter: gschmutz
  3. Our company
     Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.
     We offer our services in the following strategic business fields:
     [Figure: strategic business fields; only the label "OPERATION" is recoverable]
     Trivadis Services takes over the interacting operation of your IT systems.
  4. Agenda
     1. Introduction
     2. Apache Storm
     3. Apache Spark (Streaming)
     4. Unified Log
     5. Stream Processing Architectures
  5. What is Stream Processing?
     • Infrastructure for continuous data processing
     • The computational model can be as general as MapReduce, but with the ability to produce low-latency results
     • Data that is collected continuously is naturally processed continuously
     • Also known as Event Processing / Complex Event Processing (CEP)
  6. Why Stream Processing?
     [Figure: response latency spectrum. Labels: RPC (synchronous); Stream Processing (milliseconds to minutes); "Later. Possibly much later."]
  7. How to design a Stream Processing System?
     [Figure: three design variants.
      (a) Event Stream → Collecting/Processing → result
      (b) Event Stream → Collecting → Processing → result
      (c) Event Stream → Collecting → Queue (Persist) → Processing → result]
  8. How to scale a Stream Processing System?
     [Figure: scaling with threads. Event Stream → Collecting Thread 1 .. n → Queue (Persist) → Processing Thread 1 .. n → result]
  9. How to scale a Stream Processing System?
     [Figure: scaling across processes. Several collecting processes, persistent queues (Queue 1 .. Queue n) and processing processes sit between the event stream and the result]
  10. How to scale a Stream Processing System?
      [Figure: scaling a multi-stage pipeline. Collecting Process 1 and 2 feed queues Q1 .. Qn, which are consumed by Processing A and Processing B stages running as threads (Thread 1 .. n) in Process 1 and Process 2]
  11. How to make a (stateful) Stream Processing System reliable?
      • Faults and stragglers are inevitable in large clusters running big data applications
      • Streaming applications must recover from them quickly
      [Figure: Event Stream → Collecting → Processing A → Processing B pipeline, connected by queues]
  12. How to make a (stateful) Stream Processing System reliable?
      Solution 1: active/passive system (hot replication)
      • Both systems process the full load
      • In case of a failure, automatically switch over and use the “passive” system
      • Stragglers slow down both the active and the passive system
      [Figure: identical active and passive pipelines (Collecting → Processing A → Processing B), each with its own State. Legend: State = state in-memory and/or on-disk]
  13. How to make a (stateful) Stream Processing System reliable?
      Solution 2: Upstream backup
      • Nodes buffer sent messages and replay them to a new node in case of failure
      • Stragglers are treated as failures
      [Figure: pipeline (Collecting → Processing A → Processing B) with a replay buffer on the upstream nodes. Legend: buffer = buffer for replay, in-memory and/or on-disk; State = state in-memory and/or on-disk]
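The upstream-backup idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the class names `UpstreamNode` and `CountingNode`, the sequence numbers and the `ack` protocol are hypothetical and are not taken from the slides or from any framework.

```python
from collections import deque


class UpstreamNode:
    """Hypothetical upstream node that keeps sent messages until they are acknowledged."""

    def __init__(self):
        self.buffer = deque()  # (sequence number, message) pairs awaiting an ack
        self.next_seq = 0

    def send(self, message, downstream):
        seq = self.next_seq
        self.next_seq += 1
        self.buffer.append((seq, message))  # keep a copy for a possible replay
        downstream.receive(seq, message)

    def ack(self, up_to_seq):
        # The downstream node has checkpointed everything up to up_to_seq,
        # so those messages no longer need to be buffered.
        while self.buffer and self.buffer[0][0] <= up_to_seq:
            self.buffer.popleft()

    def replay(self, replacement):
        # Re-send every unacknowledged message to the node that takes over.
        for seq, message in self.buffer:
            replacement.receive(seq, message)


class CountingNode:
    """Hypothetical stateful downstream node: its state is a message count."""

    def __init__(self, count=0):
        self.count = count

    def receive(self, seq, message):
        self.count += 1


upstream = UpstreamNode()
worker = CountingNode()
for word in ["a", "b", "c", "d"]:
    upstream.send(word, worker)
upstream.ack(1)  # the first two messages are covered by a checkpoint

# The worker fails; a replacement starts from the checkpointed state (count 2)
# and receives the buffered, unacknowledged messages again.
replacement = CountingNode(count=2)
upstream.replay(replacement)
```

After the replay, the replacement node has caught up with the failed worker without the event source having to resend anything.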
  14. Processing Models
      Batch Processing
      • Familiar concept of processing data en masse
      • Generally incurs high latency
      (Event-) Stream Processing
      • A one-at-a-time processing model
      • A datum is processed as it arrives
      • Sub-second latency
      • Difficult to process state data efficiently
      Micro-Batching
      • A special case of batch processing with very small (tiny) batch sizes
      • A nice mix between batching and streaming
      • Comes at the cost of latency
      • Gives stateful computation, making windowing an easy task
  15. Message Delivery Semantics
      At most once [0,1]
      • Messages may be lost
      • Messages are never redelivered
      At least once [1 .. n]
      • Messages will never be lost
      • But messages may be redelivered (might be OK if the consumer can handle it)
      Exactly once [1]
      • Messages are never lost
      • Messages are never redelivered
      • Perfect message delivery
      • Incurs higher latency for transactional semantics
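A minimal sketch of why at-least-once delivery "might be OK if the consumer can handle it": the consumer below deduplicates by message id, so redeliveries do not change its result. All names (`IdempotentCounter`, `deliver_at_least_once`, the `fail_first_ack_for` parameter) are hypothetical, and the lost acknowledgement is simulated.

```python
class IdempotentCounter:
    """Hypothetical consumer that tolerates redelivery by remembering message ids."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0

    def handle(self, message_id, amount):
        if message_id in self.seen_ids:
            return False  # duplicate delivery, ignored
        self.seen_ids.add(message_id)
        self.total += amount
        return True


def deliver_at_least_once(messages, consumer, fail_first_ack_for):
    """Redeliver each message until it is acknowledged; return the delivery count."""
    deliveries = 0
    for message_id, amount in messages:
        first_attempt = True
        acked = False
        while not acked:
            consumer.handle(message_id, amount)
            deliveries += 1
            # Simulated fault: the ack for the first attempt of some messages is lost.
            acked = not (first_attempt and message_id in fail_first_ack_for)
            first_attempt = False
    return deliveries


messages = [("m1", 10), ("m2", 5), ("m3", 7)]
counter = IdempotentCounter()
deliveries = deliver_at_least_once(messages, counter, fail_first_ack_for={"m2"})
```

Here message `m2` is delivered twice, yet it contributes to `counter.total` only once. A consumer without the `seen_ids` check would overcount under the same delivery sequence.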
  16. Requirements dictate the choice
      Latency
      • Is the performance of the streaming application paramount?
      Development Cost
      • Is it desired to have similar code bases for batch and stream processing? => lambda architecture
      Message Delivery Guarantees
      • Is it highly important to process every single record, or is some normal amount of data loss acceptable?
      Process Fault Tolerance
      • Is high availability of primary concern?
  17. Agenda
      1. Introduction
      2. Apache Storm
      3. Apache Spark (Streaming)
      4. Unified Log
      5. Stream Processing Architectures
  18. Apache Storm
      A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.
      • A highly distributed real-time computation system
      • Provides general primitives to do real-time computation
      • Simplifies working with queues & workers
      • Scalable and fault-tolerant
      • Complementary to Hadoop
      • Written in Clojure, supports Java, Clojure
      • Originated at Backtype, acquired by Twitter in 2011
      • Open-sourced in late 2011
      • Part of Apache Incubator since September 2013
  19. Apache Storm – Core concepts
      Tuple
      • Core data structure in Storm
      • Immutable set of key/value pairs
      • You can think of Storm tuples as events
      • Values must be serializable
      Stream
      • Key abstraction of Storm
      • An unbounded sequence of tuples that can be processed in parallel by Storm
      • Each stream is given an ID, and bolts can produce and consume tuples from these streams on the basis of their ID
      • Each stream also has an associated schema for the tuples that will flow through it
  20. Apache Storm – Core concepts
      Topology
      • Wires data and functions via a DAG (directed acyclic graph)
      • Executes on many machines, similar to a MR job in Hadoop
      Spout
      • Source of data streams (tuples)
      • Can be run in “reliable” and “unreliable” mode
      Bolt
      • Consumes 1+ streams and potentially produces new streams
      • Complex operations often require multiple steps and thus multiple bolts
      • Calculate, Filter, Aggregate, Join, Talk to database
      [Figure: example topology with two spouts and four bolts. One spout is labelled "Source of Stream B". Bolt labels: "Subscribes: A, Emits: C"; "Subscribes: A, Emits: D"; "Subscribes: A & B, Emits: -"; "Subscribes: C & D, Emits: -"]
  21. Storm – How does it work?
      [Figure: word-count topology, Twitter Spout → Split Sentence → Word Count (two instances each of Split Sentence and Word Count). Example tweet: "NFL: Peyton Manning and Denver’s elite offense fall flat in #Superbowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl". The tweet is split into words: NFL, Peyton, Manning, …, #Superbowl]
  22. Storm – How does it work?
      [Figure: same topology. Each word triggers an INCR in a Word Count bolt (INCR Superbowl, INCR NFL, INCR Manning, INCR Peyton). Resulting counts: NFL = 1, Manning = 1, Peyton = 1, Superbowl = 2]
  23. Storm – How does it work?
      [Figure: same topology with an additional Report bolt that collects the counts: Peyton = 1, Superbowl = 2, NFL = 1, Manning = 1]
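The data flow of the word-count walk-through can be imitated in plain Python with generators standing in for the spout and the bolts. This is a single-process sketch, not Storm code: the function names are hypothetical, the tweet is shortened, and stripping "#" and ":" and lower-casing each token is an assumption made so that both spellings of the hashtag are counted together.

```python
from collections import Counter


def twitter_spout():
    """Stand-in for the Twitter spout: emits one (shortened) example tweet."""
    yield ("NFL: Peyton Manning and Denver's elite offense fall flat in "
           "#SuperBowl XLVIII #seahawks #broncos #Superbowl")


def split_sentence(sentences):
    """Stand-in for the Split Sentence bolt: emits one normalized word per token."""
    for sentence in sentences:
        for token in sentence.split():
            word = token.strip("#:").lower()
            if word:
                yield word


def word_count(words):
    """Stand-in for the Word Count bolt: increments a counter per word (INCR)."""
    counts = Counter()
    for word in words:
        counts[word] += 1
    return counts


# Stand-in for the Report bolt: the final counts.
report = word_count(split_sentence(twitter_spout()))
```

As on the slide, `report` holds a count of 2 for "superbowl" and 1 each for "nfl", "peyton" and "manning".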
  24. Storm - Topology
      Each Spout or Bolt runs N instances in parallel.
      [Figure: Twitter Spout → (Shuffle) → Split Sentence → (Fields) → Word Count → (Global) → Report]
      • Shuffle grouping: random distribution of tuples
      • Fields grouping: grouped by value, such that equal values go to the same task
      • All grouping: replicates to all tasks
      • Global grouping: makes all tuples go to one task
      • None grouping: makes the bolt run in the same thread as the bolt/spout it subscribes to
      • Direct grouping: the producer (the task that emits) controls which consumer will receive the tuple
      • Local or Shuffle grouping: similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise falls back to shuffle grouping behavior
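How a grouping picks a target task can be sketched as a function from a tuple to one or more task indexes. This is a simplified model, not Storm's implementation: the function names are hypothetical, and CRC32 is used as a stand-in for whatever hash Storm applies to the grouping fields.

```python
import random
import zlib
from collections import Counter


def shuffle_grouping(num_tasks, rng):
    """Pick a random task."""
    return [rng.randrange(num_tasks)]


def fields_grouping(key, num_tasks):
    """Equal key values always map to the same task (CRC32 is an assumed hash)."""
    return [zlib.crc32(key.encode("utf-8")) % num_tasks]


def all_grouping(num_tasks):
    """Replicate the tuple to every task."""
    return list(range(num_tasks))


def global_grouping():
    """Send every tuple to a single task (task 0 in this sketch)."""
    return [0]


# Word count with fields grouping: each task keeps its own local counter, and
# because equal words reach the same task, no word is split across tasks.
num_tasks = 3
task_counts = [Counter() for _ in range(num_tasks)]
for word in "superbowl nfl peyton manning superbowl".split():
    for task in fields_grouping(word, num_tasks):
        task_counts[task][word] += 1

tasks_with_superbowl = [c for c in task_counts if "superbowl" in c]
```

With shuffle grouping instead, the two occurrences of "superbowl" could land on different Word Count tasks and each task would report a partial count, which is why the counting bolt is fed through a fields grouping.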
  25. Storm - Creating Topology
  26. Using a NoSQL database for storing results (keeping state with counter type columns)
      [Figure: Twitter Stream → Twitter Spout → Hashtag Splitter → Hashtag Counter (two instances each of splitter and counter). Example tweet: "NFL: Peyton Manning and Denver’s elite offense fall flat in #SuperBowl XLVIII ow.ly/tdQZn #seahawks #broncos #Superbowl". Each hashtag (superbowl, seahawks, broncos) triggers an INCR on a counter column: seahawks = 1, broncos = 1, superbowl = 1 and then 2]
  27. Storm Trident
      • High-level abstraction on top of Storm
      • Simplifies building topologies
      • Core data model is the stream
        • Processed as a series of batches (micro-batches)
        • Stream is partitioned among nodes in the cluster
      • 5 kinds of operations in Trident
        • Operations that apply locally to each partition and cause no network transfer
        • Repartitioning operations that don't change the contents
        • Aggregation operations that do network transfer
        • Operations on grouped streams
        • Merges and Joins
  28. Storm Trident - Creating Topology
      [Figure: Twitter Stream → Twitter Spout → (tweet) → Hashtag Splitter → (hashtag) → Hashtag Normalizer → (hashtag) → local groupBy → Persistent Aggregate; the operations are shown grouped into two bolts]
  29. Trident Concepts - Function
      • Takes in a set of input fields and emits zero or more tuples as output
      • Fields of the output tuple are appended to the original input tuple in the stream
      • If a function emits no tuples, the original input tuple is filtered out
      • Otherwise the input tuple is duplicated for each output tuple
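The four rules above can be modelled with dictionaries as tuples. This is a plain-Python sketch of the semantics only, not the Trident API: `apply_function` and `split_hashtags` are hypothetical names.

```python
def apply_function(stream, function, input_fields):
    """Apply a Trident-style function to each tuple (modelled as a dict)."""
    for tup in stream:
        args = [tup[field] for field in input_fields]
        # Zero emitted tuples: the input tuple disappears from the output.
        # k emitted tuples: the input tuple is duplicated k times, each copy
        # extended with the emitted fields.
        for emitted in function(*args):
            yield {**tup, **emitted}


def split_hashtags(text):
    """Example function: emits one output tuple per hashtag in the text."""
    for token in text.split():
        if token.startswith("#"):
            yield {"hashtag": token[1:].lower()}


tweets = [
    {"id": 1, "text": "go #seahawks #Superbowl"},
    {"id": 2, "text": "no tags here"},
]
result = list(apply_function(tweets, split_hashtags, ["text"]))
```

Tweet 1 appears twice in `result`, once per hashtag and each time with the original `id` and `text` fields kept, while tweet 2 is filtered out because the function emitted nothing for it.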
  30. Storm Core vs. Storm Trident

                          | Core Storm                            | Storm Trident
      --------------------+---------------------------------------+---------------------
      Community           | > 100 contributors                    | > 100 contributors
      Adoption            | ***                                   | *
      Language Options    | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
      Processing Models   | Event-Streaming                       | Micro-Batching
      Processing DSL      | No                                    | Yes
      Stateful Ops        | No                                    | Yes
      Distributed RPC     | Yes                                   | Yes
      Delivery Guarantees | At most once / At least once          | Exactly Once
      Latency             | sub-second                            | seconds
      Platform            | Storm Cluster, YARN                   | Storm Cluster, YARN
  31. Agenda
      1. Introduction
      2. Apache Storm
      3. Apache Spark (Streaming)
      4. Unified Log
      5. Stream Processing Architectures
  32. Apache Spark
      Apache Spark is a fast and general engine for large-scale data processing
      • The hot trend in Big Data!
      • Based on the 2007 Microsoft Dryad paper
      • Written in Scala, supports Java, Python, SQL and R
      • Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
      • Runs everywhere: on Hadoop, Mesos, standalone or in the cloud
      • One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
      • Originally developed in 2009 in UC Berkeley’s AMPLab
      • Open-sourced in 2010; since 2014 part of the Apache Software Foundation
  33. Apache Spark
      Spark Core
      • General execution engine for the Spark platform
      • In-memory computing capabilities deliver speed
      • General execution model supports a wide variety of use cases
      • DAG-based
      • Ease of development: native APIs in Java, Scala and Python
      Spark Streaming
      • Runs a streaming computation as a series of very small, deterministic batch jobs
      • Batch size as low as ½ sec, latency of about 1 sec
      • Exactly-once semantics
      • Potential for combining batch and streaming processing in the same system
      • Started in 2012, first alpha release in 2013
  34. Apache Spark - Generality
      [Figure: the Spark stack, top to bottom]
      • Libraries: Spark SQL (Batch Processing), Blink DB (Approximate Querying), Spark Streaming (Real-Time), MLLib, Spark R (Machine Learning), GraphX (Graph Processing)
      • Core Runtime: Spark Core API and Execution Model
      • Cluster Resource Managers: Spark Standalone, MESOS, YARN
      • Data Stores: HDFS, Elastic Search, Cassandra, S3 / DynamoDB
      Adapted from C. Fregly: http://slidesha.re/11PP7FV
  35. Apache Spark – Core concepts
      Resilient Distributed Dataset (RDD)
      • Core Spark abstraction
      • Collections of objects (partitions) spread across the cluster
      • Partitions can be stored in-memory or on-disk (local)
      • Enables parallel processing on data sets
      • Built through parallel transformations
      • Immutable, recomputable, fault tolerant
      • Contains the transformation history (“lineage”) for the whole data set
      Operations
      • Stateless Transformations (map, filter, groupBy)
      • Actions (count, collect, save)
  36. RDD Lineage Example
      [Figure: lineage graph]
      • HDFS File Input 1 → SparkContext.hadoopFile() → HadoopRDD → filter() → FilteredRDD → map() → MappedRDD
      • HDFS File Input 2 → SparkContext.hadoopFile() → HadoopRDD → map() → MappedRDD
      • Both MappedRDDs → join() → ShuffledRDD → SparkContext.saveAsHadoopFile() → HDFS File Output
      • hadoopFile(), filter(), map() and join() are transformations (lazy); saveAsHadoopFile() is the action (executes the transformations)
      Adapted from Chris Fregly: http://slidesha.re/11PP7FV
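The split between lazy transformations and an executing action can be illustrated without Spark. The sketch below is a toy model under stated assumptions: `LazyDataset` is a hypothetical class, it is not partitioned or distributed, and it only records a lineage of `map` and `filter` steps that `collect` replays against the source.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, lineage=()):
        self._source = source          # zero-argument callable producing base records
        self.lineage = tuple(lineage)  # recorded (kind, function) steps

    def map(self, fn):
        return LazyDataset(self._source, self.lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self.lineage + (("filter", predicate),))

    def collect(self):
        """Action: read the source and replay the whole lineage."""
        records = list(self._source())
        for kind, fn in self.lineage:
            if kind == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records


reads = []  # records every access to the (simulated) input file


def read_lines():
    reads.append("read")
    return ["GET /a 200", "POST /b 302", "GET /c 304"]


get_paths = (LazyDataset(read_lines)
             .filter(lambda line: line.startswith("GET"))
             .map(lambda line: line.split()[1]))
reads_before_action = len(reads)  # still 0: nothing has been executed yet

paths = get_paths.collect()        # the action triggers the computation
paths_again = get_paths.collect()  # lost results can be recomputed from the lineage
```

Because every dataset is immutable and carries its lineage, the second `collect` rebuilds the same result from the source, which is the property the slides describe as "recomputable, fault tolerant".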
  37. RDD Execution Example
      [Figure: partition-level execution of a lineage. The partitions of the FileRDDs (Partition 1, Partition 2, …, Partition 5) pass through filter() and map() into a MappedRDD and are then redistributed by join() and groupByKey() into ShuffledRDD partitions]
  38. Apache Spark Streaming – Core concepts
      Discretized Stream (DStream)
      • Core Spark Streaming abstraction
      • Micro-batches of RDDs
      • Operations similar to RDD
      Input DStreams
      • Represent the stream of raw data received from streaming sources
      • Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
      • Custom sources can be easily written for custom data sources
      Operations
      • Same as Spark Core
      • Additional stateful transformations (window, reduceByWindow)
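A windowed transformation combines the most recent micro-batches each time the window slides. The following sketch shows the idea on plain lists; it is not the Spark Streaming API. The function name `reduce_by_window` is hypothetical, window length and slide interval are expressed in numbers of batches rather than in time, and partial windows at start-up are included as a simplifying assumption.

```python
def reduce_by_window(batches, window_length, slide_interval, reduce_fn, initial):
    """Reduce all messages in the last `window_length` batches, every `slide_interval` batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        window = batches[max(0, end - window_length):end]
        value = initial
        for batch in window:
            for message in batch:
                value = reduce_fn(value, message)
        results.append(value)
    return results


# Four micro-batches of messages; count the messages in the last two batches,
# re-evaluated after every batch.
batches = [["a", "b"], ["c"], ["d"], ["e", "f", "g"]]
counts = reduce_by_window(batches, window_length=2, slide_interval=1,
                          reduce_fn=lambda acc, _msg: acc + 1, initial=0)
```

For the input above, `counts` is `[2, 3, 2, 4]`: the first window only covers the first batch, and every later window covers two batches.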
  39. Discretized Stream (DStream)
      [Figure: as time increases (time 1, time 2, time 3, …, time n), the input stream is cut into one RDD per batch interval (RDD @time 1 … RDD @time n, each holding message 1 … message n). map() turns the DStream into a MappedDStream whose RDDs hold f(message 1) … f(message n), and saveAsHadoopFiles() writes result 1 … result n. The DStream transformations form a lineage; actions trigger Spark jobs]
      Adapted from Chris Fregly: http://slidesha.re/11PP7FV
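The discretization step itself is simple to sketch: timestamped messages are assigned to batches by their arrival interval, and the same function is then applied batch by batch. This is an illustration in plain Python, not Spark Streaming; `discretize` and `map_batches` are hypothetical names, and intervals without messages are simply skipped here.

```python
from collections import defaultdict


def discretize(events, batch_interval):
    """Group (timestamp, message) pairs into batches of `batch_interval` seconds."""
    batches = defaultdict(list)
    for timestamp, message in events:
        batches[int(timestamp // batch_interval)].append(message)
    # One list per non-empty interval, in time order.
    return [batches[index] for index in sorted(batches)]


def map_batches(batches, fn):
    """Apply the same per-message function to every batch, as map() does per RDD."""
    return [[fn(message) for message in batch] for batch in batches]


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
batches = discretize(events, batch_interval=0.5)
mapped = map_batches(batches, str.upper)
```

With a batch interval of 0.5 seconds the four events fall into three batches, `[["a", "b"], ["c"], ["d"]]`. A message is therefore only processed once its interval has closed, which is where the latency of roughly one batch interval comes from.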
  40. Spark Streaming Example
  41. Storm Core vs. Storm Trident vs. Spark Streaming

                          | Core Storm                            | Storm Trident        | Spark Streaming
      --------------------+---------------------------------------+----------------------+--------------------------------------
      Community           | > 100 contributors                    | > 100 contributors   | > 280 contributors
      Adoption            | ***                                   | *                    | *
      Language Options    | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
      Processing Models   | Event-Streaming                       | Micro-Batching       | Micro-Batching, Batch (Spark Core)
      Processing DSL      | No                                    | Yes                  | Yes
      Stateful Ops        | No                                    | Yes                  | Yes
      Distributed RPC     | Yes                                   | Yes                  | No
      Delivery Guarantees | At most once / At least once          | Exactly Once         | Exactly Once
      Latency             | sub-second                            | seconds              | seconds
      Platform            | Storm Cluster, YARN                   | Storm Cluster, YARN  | YARN, Mesos, Standalone, DataStax EE
  42. Agenda
      1. Introduction
      2. Apache Storm
      3. Apache Spark (Streaming)
      4. Unified Log
      5. Stream Processing Architectures
  43. Unified Log
      That’s what most people think about logs:
      137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-admin/images/date-button.gif HTTP/1.1" 200 111
      137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 200 13593
      137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
      137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
      137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
      137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
      137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
      137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
      137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
      But this is what we mean here by Log:
      • A structured log (records are numbered beginning with 0, based on the order they are written)
      • Also known as commit log or journal
      [Figure: records 0 1 2 3 4 5 6 7 8 9 10 11; record 0 is labelled "1st record", the position after 11 is labelled "Next record written"]
  44. Central Unified Log for (real-time) subscription
      Take all the organization’s data and put it into a central log for subscription
      Properties of the Unified Log:
      • Unified: “Enterprise”, single deployment
      • Append-Only: events are appended, no update in place => immutable
      • Ordered: each event has an offset, which is unique within a shard
      • Fast: should be able to handle thousands of messages / sec
      • Distributed: lives on a cluster of machines
      [Figure: a Collector writes to the log (offsets 0 to 11); Consumer System A reads at time = 6; Consumer System B reads at time = 10]
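The append-only and ordered properties, together with consumers that read at their own pace, fit into a short single-machine sketch. The class names `UnifiedLog` and `Consumer` and their methods are hypothetical; a real unified log would also be sharded and distributed, which this sketch leaves out.

```python
class UnifiedLog:
    """Toy append-only log: records are never updated, only appended."""

    def __init__(self):
        self._records = []

    def append(self, event):
        """Append an event and return its offset (0-based, in write order)."""
        self._records.append(event)
        return len(self._records) - 1

    def read(self, offset, max_records=None):
        """Return records starting at `offset`, without removing them."""
        end = len(self._records) if max_records is None else offset + max_records
        return self._records[offset:end]


class Consumer:
    """Each consumer keeps its own position, independent of other consumers."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=None):
        records = self.log.read(self.position, max_records)
        self.position += len(records)
        return records


log = UnifiedLog()
offsets = [log.append(f"event-{i}") for i in range(12)]  # the collector writes 0..11

system_a = Consumer(log)
system_b = Consumer(log)
system_a.poll(max_records=6)   # System A has read up to position 6
system_b.poll(max_records=10)  # System B has read up to position 10
remaining_for_a = system_a.poll()
```

Reading does not consume anything: System B being further ahead has no effect on what System A still gets to read, which is what makes the log suitable as a shared subscription point.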
  45. Apache Kafka - Overview
      • A distributed publish-subscribe messaging system
      • Designed for processing of real-time activity stream data (logs, metrics collections, social media streams, …)
      • Initially developed at LinkedIn, now part of Apache
      • Does not follow JMS standards and does not use the JMS API
      • Kafka maintains feeds of messages in topics
      [Figure: three Producers → Kafka Cluster → three Consumers. Anatomy of a topic: Partition 0 (offsets 0 to 12), Partition 1 (offsets 0 to 9), Partition 2 (offsets 0 to 12); writes are appended at the "new" end of each partition (old → new)]
  46. 46. Apache Kafka - Motivation
      LinkedIn’s motivation for Kafka was:
      § “A unified platform for handling all the real-time data feeds a large company might have.”
      Must-haves:
      § High throughput to support high-volume event feeds.
      § Support real-time processing of these feeds to create new, derived feeds.
      § Support large data backlogs to handle periodic ingestion from offline systems.
      § Support low-latency delivery to handle more traditional messaging use cases.
      § Guarantee fault tolerance in the presence of machine failures.
  47. 47. Apache Kafka - Performance
      Kafka at LinkedIn
      § 10+ billion writes per day
      § 172k messages per second (average)
      § 55+ billion messages per day to real-time consumers
      Up to 2 million writes/sec on 3 cheap machines
      § Using 3 producers on 3 different machines
      http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  48. 48. Apache Kafka - Partition offsets
      Offset: each message in a partition is assigned a sequential id, unique within that partition, called the offset
      • Consumers track their position via (offset, partition, topic) tuples
      [Diagram: Consumer group C1]
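How a consumer group might track its position per (topic, partition) can be sketched with plain dictionaries. Everything here is an in-memory stand-in: the topic name `meter-readings`, the `committed` map and the `poll` helper are hypothetical and do not correspond to the Kafka consumer API.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for one topic with two partitions.
partitions: Dict[Tuple[str, int], List[str]] = {
    ("meter-readings", 0): ["r0", "r1", "r2"],
    ("meter-readings", 1): ["s0", "s1"],
}

# Positions of one consumer group, keyed by (topic, partition).
# The stored value is the offset of the next message to read.
committed: Dict[Tuple[str, int], int] = {}


def poll(topic: str, partition: int, max_records: int = 2) -> List[Tuple[int, str]]:
    """Return (offset, message) pairs from the stored position and advance it."""
    key = (topic, partition)
    start = committed.get(key, 0)
    log = partitions[key]
    batch = list(enumerate(log[start:start + max_records], start=start))
    committed[key] = start + len(batch)
    return batch


first = poll("meter-readings", 0)    # reads offsets 0 and 1 of partition 0
second = poll("meter-readings", 0)   # resumes at offset 2 of partition 0
other = poll("meter-readings", 1)    # partition 1 has its own offset sequence
```

Since the position is just a number per partition, a consumer that restarts can resume where it left off, or deliberately rewind to an earlier offset to reprocess messages.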
  49. 49. Apache Kafka – two Options for Log Cleanup
      Retaining a window of data
      • Ideal for event data
      • Window can be defined in time (days) or space (GBs); defaults to 1 week
      Retaining a complete log (log compaction)
      • Ideal for keyed data
      • Keeps a space-efficient, complete log of changes
      • Log compaction runs in the background
      • Ensures that at least the last known value for each message key in the log is always retained
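The difference between the two cleanup options can be shown on a small in-memory list. This is a simplified sketch, not Kafka's implementation: the record layout `(timestamp, key, value)` and the function names `retain_window` and `compact` are assumptions made for illustration.

```python
from typing import List, Tuple

Record = Tuple[float, str, str]  # (timestamp in seconds, key, value)


def retain_window(log: List[Record], now: float, max_age_s: float) -> List[Record]:
    """Window retention: keep only records that are at most max_age_s old."""
    return [r for r in log if now - r[0] <= max_age_s]


def compact(log: List[Record]) -> List[Record]:
    """Compaction: keep only the latest record per key, preserving log order."""
    last_index = {key: i for i, (_, key, _) in enumerate(log)}
    return [r for i, r in enumerate(log) if last_index[r[1]] == i]


log: List[Record] = [
    (0.0, "meter-1", "10"),
    (1.0, "meter-2", "7"),
    (2.0, "meter-1", "12"),
    (3.0, "meter-3", "4"),
    (4.0, "meter-2", "9"),
]

windowed = retain_window(log, now=4.0, max_age_s=2.0)
compacted = compact(log)
```

Window retention discards old records regardless of key, so the last value of a rarely updated key can disappear; compaction keeps one record for every key, however old, which is why it suits keyed data.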
  50. 50. Data Flow Graphs using Unified Log
      Stream processing allows feeds to be computed from other feeds.
      Derived feeds are no different from the original feeds they are computed from.
      Single deployment of the “Unified Log”, but logically different feeds.
      [Diagram: data flow with the processing steps Meter Readings Collector, Enrich / Transform, Aggregate by Minute, Customer Aggregate and Persist, connected by the feeds Raw Meter Readings, Meter with Customer, Meter by Minute and Meter by Customer by Minute]
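A derived feed is simply the output of a function applied to another feed, as the following sketch based on the meter-reading example suggests. The lookup table `METER_TO_CUSTOMER`, the sample readings and the choice of summing as the aggregate are assumptions for illustration; the slide does not specify them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical lookup table used for enrichment: meter id -> customer id.
METER_TO_CUSTOMER: Dict[str, str] = {"m1": "alice", "m2": "alice", "m3": "bob"}

# Raw feed: (timestamp in seconds, meter id, reading).
raw_meter_readings: List[Tuple[int, str, float]] = [
    (0, "m1", 1.0), (30, "m2", 2.0), (45, "m3", 4.0), (70, "m1", 3.0),
]


def enrich(feed: List[Tuple[int, str, float]]) -> List[Tuple[int, str, str, float]]:
    """Derive 'meter with customer' by attaching the customer to each reading."""
    return [(ts, meter, METER_TO_CUSTOMER[meter], value) for ts, meter, value in feed]


def aggregate_by_minute(
    feed: List[Tuple[int, str, str, float]]
) -> List[Tuple[str, int, float]]:
    """Derive 'by customer by minute' by summing readings per (customer, minute)."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for ts, _meter, customer, value in feed:
        totals[(customer, ts // 60)] += value
    return sorted((c, minute, total) for (c, minute), total in totals.items())


meter_with_customer = enrich(raw_meter_readings)
customer_by_minute = aggregate_by_minute(meter_with_customer)
```

Each intermediate result could itself be written back to the log as a feed, so downstream consumers can subscribe to `meter_with_customer` exactly as they would to the raw readings.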
  51. 51. Agenda
      1. Introduction
      2. Apache Storm
      3. Apache Spark (Streaming)
      4. Unified Log
      5. Stream Processing Architectures
      6. Summary
  52. 52. Architectural Pattern: Standalone Event Stream Processing
      [Diagram components: Social Media, Internet of Things, Event Cloud Streams, Enterprise Event Bus (Ingress), Event Processing (ESP / CEP), State Store / Event Store, Business Rule Management System (Rules), Event Processing, Result Store, Enterprise Event Bus, Enterprise Service Bus, Analytical Applications, DB]
  53. 53. Architectural Pattern: Event Stream Processing as part of Lambda Architecture
      [Diagram components: Social Media, Internet of Things, Event Cloud Streams, Enterprise Event Bus (Ingress), Event Processing (ESP / CEP), State Store / Event Store, Event Processing, Result Store, Hadoop Big Data Infrastructure (HDFS, Map/Reduce, Result Store), Enterprise Event Bus, Enterprise Service Bus, Analytical Applications, DB]
  54. 54. Architectural Pattern: Event Stream Processing as part of Kappa Architecture
      [Diagram components: Social Media, Internet of Things, Event Cloud Streams, Enterprise Event Bus (Ingress), Event Processing (ESP / CEP), State Store / Event Store, Event Processing, Result Store, Hadoop Big Data Infrastructure (HDFS, Replay), Enterprise Service Bus, Analytical Applications, DB]
  55. 55. Questions and answers ...
      Guido Schmutz, Technology Manager
      BASEL, BERN, BRUGES, LAUSANNE, ZÜRICH, DÜSSELDORF, FRANKFURT A.M., FREIBURG I.BR., HAMBURG, MUNICH, STUTTGART, VIENNA
