Stream Processing Frameworks


SirKetchup

An overview of the most used stream processing frameworks in the industry today.


Stream Processing
DAVID OSTROVSKY | COUCHBASE

Streaming Data
Stream Processing
◦ Stream Processing Engines
◦ Complex Event Processing Engines

Types of Data Processing
[Figure: throughput per second (100s, 1000s, 100000s) plotted against time frame (ms, sec, min, hr, day). Labeled regions: Real-Time Processing (CEP, ESP); Interactive Query; DBMS; In-Memory Computing; Batch Processing (MapReduce).]

Processing Model
◦ Continuous: events flow one at a time through a graph of operators
◦ Micro-Batching: a collector groups events into batches (time window), and the batches flow through the graph of operators

Programming Model
◦ Storm: Continuous
◦ Trident: Micro-Batch
◦ Spark Streaming: Micro-Batch
◦ Samza: Continuous
◦ Flink: Continuous*
* Has a batch abstraction on top of streaming

API and Expressiveness

Compositional (Storm):

    public class PrinterBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, ...) {
            System.out.println(tuple);
        }
    }

    topology.setBolt("print", new PrinterBolt())
            .shuffleGrouping("twitter");

Declarative (Spark Streaming):

    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.socketTextStream("localhost", 9999)
       .flatMap(_.split(" "))
       .map(word => (word, 1))
       .reduceByKey(_ + _)
       .print()

API and Expressiveness
◦ Storm: Compositional; JVM, Python, Ruby, JS, Perl
◦ Trident: Compositional; JVM
◦ Spark Streaming: Declarative; JVM, Python
◦ Samza: Compositional; JVM
◦ Flink: Declarative; JVM, Python*
* Only for the DataSet API (batch)

Storm + Trident
Topology:
◦ Spouts
◦ Bolts
Stream Groupings:
◦ Shuffle
◦ Fields
◦ All
◦ …
Nimbus (Master)
◦ Workers

Spark Streaming
Resilient Distributed Datasets (RDD)
DStreams – sequences of RDDs

Samza
Uses Kafka for streaming
◦ Topics (streams)
◦ Partitioned across Brokers
◦ Producers
◦ Consumers
Uses YARN for resource management
◦ ResourceManager
◦ NodeManager
◦ ApplicationMaster

Flink
Dataflows
◦ Streams
◦ Source(s)
◦ Sink(s)
◦ Transformations (operators)

Orleans
Virtual Actor System in .NET
◦ Grains (operators)
◦ Silos (containers)
◦ Streams

Message Delivery Guarantees
Source
◦ At Most Once: Sockets, Twitter Streaming API, any non-repeatable
◦ At Least Once: Files, Simple Queues, any forward-only
◦ Exactly Once: Kafka, RabbitMQ, Collections, Stateful
Sink
◦ At Least Once: Data Stores, Sockets, Files
◦ Exactly Once: HDFS rolling sink

Highest Possible Guarantee
◦ Storm: At least once
◦ Trident: Exactly once*
◦ Spark Streaming: Exactly once**
◦ Samza: At least once
◦ Flink: Exactly once*
* Doesn’t apply to side-effects
** Only at the batch level

Reliability and Fault Tolerance
◦ Storm: ACK per tuple
◦ Spark Streaming: RDD checkpoints
◦ Samza: Partition offset checkpoints
◦ Flink: Barrier checkpoints

State Management
◦ Storm: Manual
◦ Trident: Dedicated state providers (memory, external)
◦ Spark Streaming: RDD with per-key state
◦ Samza: Local K/V store + changelog in Kafka
◦ Flink: Stored with snapshots, configurable backends

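Performance
◦ Storm: Low latency, Medium throughput
◦ Trident: Medium latency, Medium throughput
◦ Spark Streaming: Medium-High* latency, High throughput
◦ Samza: Low latency, High throughput
◦ Flink: Low** latency, High throughput
* Depends on batching
** For streaming, not micro-batching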

Extended Ecosystem
◦ Storm: SAMOA (ML)
◦ Trident: Trident-ML
◦ Spark: Spark SQL, MLlib, GraphX
◦ Samza: SAMOA (ML)
◦ Flink: CEP, Gelly*, FlinkML*, Table API (SQL)*
* DataSet API (batch)
** Currently v0.0.4

Production and Maturity
◦ Storm: Mature, many users, 224 contributors
◦ Spark: Relatively mature, many users, 957 contributors*
◦ Samza: Newer, built on mature components, fewer users, 57 contributors
◦ Flink: New, high momentum, few users, 219 contributors
* Spark, not just Spark Streaming
** Contributor numbers as of 5/9/2016

Editor's Notes

Talk about sources and use cases of streaming data: web/social, fraud detection, log and machine data, real-time aggregation, etc.
◦ 6k+ tweets per second
◦ 50k+ Google searches per second
◦ 120k+ YouTube videos viewed per second
◦ 200+ MILLION emails per second (mostly spam)
Not all data has value. The value of data decays over time, sometimes very fast, and newer data often supersedes older data. It can be enough to process data without storing it, especially since it’s often impractical to store so much data.
Stream processing is not a new concept. Complex event processing (CEP) engines have been around for a long time (since the early 90s), although they mostly originate in stock-market use cases. The main difference between CEP and event stream processing (ESP) engines is that CEP engines tend to focus on higher-level querying of multiple data streams, such as with SQL, whereas ESP engines have been geared more towards running (ordered) events through a graph of processing operators. This isn’t a clear distinction, and it is becoming more and more blurred as things like Spark SQL and Flink CEP come into play.
Newer frameworks include: Apache Apex, Apache Beam (formerly part of Google Dataflow), Kafka Streams.
Source: http://www.cakesolutions.net/teamblogs/comparison-of-apache-stream-processing-frameworks-part-1
Apache Storm was originally created by Nathan Marz and his team at BackType in 2010. BackType was later acquired by Twitter, which open-sourced Storm, and it became an Apache top-level project in 2014. Without any doubt, Storm was a pioneer in large-scale stream processing and became the de facto industry standard. Storm is a native streaming system and provides a low-level API. Storm also uses Thrift for topology definition and implements the Storm multi-language protocol, which allows solutions to be implemented in a large number of languages. This is pretty unique, and Scala is of course one of them.
Trident is a higher-level micro-batching system built atop Storm. It simplifies the topology-building process and adds higher-level operations like windowing, aggregations, and state management, which are not natively supported in Storm. In contrast to Storm’s at-least-once guarantee, Trident provides exactly-once delivery. Trident has Java, Clojure and Scala APIs.
As we all know, Spark is a very popular batch processing framework these days, with a couple of built-in libraries like Spark SQL and MLlib and, of course, Spark Streaming. Spark’s runtime is built for batch processing, and therefore Spark Streaming, which was added a little later, does micro-batching. The stream of input data is ingested by receivers, which create micro-batches, and these micro-batches are processed in a similar way to other Spark jobs. Spark Streaming provides a high-level declarative API in Scala, Java and Python.
Samza was originally developed at LinkedIn as a proprietary streaming solution, and together with Kafka, which is another great LinkedIn contribution to our community, it became a key part of their infrastructure. As you’re going to see a little later, Samza builds heavily on Kafka’s log-based philosophy, and the two integrate very well. Samza provides a compositional API, and of course Scala is supported.
And last but not least, Flink. Flink is a pretty old project, with its origins in 2008, but right now it is getting quite a lot of attention. Flink is a native streaming system and provides a high-level API. Flink also provides an API for batch processing, like Spark, but there is a fundamental distinction between the two: Flink handles batch as a special case of streaming. Everything is a stream, and this is definitely the better abstraction, because this is what the world really looks like.
The continuous model generally provides lower-latency processing, better expressiveness, and easier state management. On the other hand, it has lower throughput and expensive fault tolerance due to per-event overhead, and is harder to load-balance. Micro-batching provides higher throughput and simpler load balancing, but has higher latency (depending on the batch interval) and makes it harder to maintain state, because state updates aren’t per-event.
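To make the contrast concrete, here is a minimal sketch in plain Java. It does not use any of the frameworks above: the class `MicroBatchSketch`, its `Event` type, and both method names are hypothetical, and the "operator" is just an event count. The continuous version emits a result after every event, while the micro-batch version first collects events into fixed time windows and emits one result per window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the API of Storm, Spark, Samza or Flink.
final class MicroBatchSketch {

    // An event with an arrival timestamp (milliseconds, assumed non-negative).
    static final class Event {
        final long timestampMs;
        final String payload;

        Event(long timestampMs, String payload) {
            this.timestampMs = timestampMs;
            this.payload = payload;
        }
    }

    // Continuous model: state is updated and a result is emitted per event.
    static List<Integer> runningCount(List<Event> events) {
        List<Integer> emitted = new ArrayList<>();
        int count = 0;
        for (Event ignored : events) {
            count++;
            emitted.add(count);
        }
        return emitted;
    }

    // Micro-batching: a collector groups events by window start time,
    // then the operator runs once per batch.
    static Map<Long, Integer> countPerWindow(List<Event> events, long windowMs) {
        Map<Long, List<Event>> batches = new TreeMap<>();
        for (Event e : events) {
            long windowStart = (e.timestampMs / windowMs) * windowMs;
            batches.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(e);
        }
        Map<Long, Integer> counts = new TreeMap<>();
        for (Map.Entry<Long, List<Event>> batch : batches.entrySet()) {
            counts.put(batch.getKey(), batch.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event(100, "a"), new Event(900, "b"), new Event(1200, "c"));
        System.out.println(runningCount(events));         // prints [1, 2, 3]
        System.out.println(countPerWindow(events, 1000)); // prints {0=2, 1000=1}
    }
}
```

In the micro-batch version no result for a window is available until the whole window has been collected, which is where the extra latency comes from.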
The compositional approach provides basic building blocks, like sources and operators, which must be tied together to create the desired topology. New components can usually be defined by implementing some kind of interface. This approach provides low-level control over execution and parallelism. In contrast, operators in a declarative API are defined as higher-order functions. This allows us to write functional code with abstract types, and the system creates and optimizes the topology itself. Declarative APIs give less control over precise execution parameters, but usually provide more advanced operations, like windowing (batching) or state management, out of the box.
Topology: a directed acyclic graph (DAG) of operators; each operator can have multiple instances that execute in parallel.
Spout: a source of streaming data (tuples). A spout can be reliable or unreliable, that is, it either can or cannot re-send data from a specified point.
Bolt: a custom operator that consumes one or more streams and potentially emits new streams.
Stream groupings: part of defining a topology is specifying, for each bolt, which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt's tasks. There are eight built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:
◦ Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
◦ Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the "user-id" field, tuples with the same "user-id" will always go to the same task, but tuples with different "user-id"s may go to different tasks.
◦ Partial Key grouping: The stream is partitioned by the fields specified in the grouping, like the Fields grouping, but is load-balanced between two downstream bolts, which provides better utilization of resources when the incoming data is skewed. This paper provides a good explanation of how it works and the advantages it provides.
◦ All grouping: The stream is replicated across all the bolt's tasks. Use this grouping with care.
◦ Global grouping: The entire stream goes to a single one of the bolt's tasks. Specifically, it goes to the task with the lowest id.
◦ None grouping: This grouping specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually, though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
◦ Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task ids of its consumers either by using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
◦ Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
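The routing decisions behind a few of these groupings can be sketched in plain Java. This is a simplified illustration, not Storm's implementation: the class `GroupingSketch` and its methods are hypothetical, tasks are assumed to be numbered from 0, and the shuffle here is a plain random pick that does not reproduce the equal-share guarantee described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of how groupings choose target tasks;
// this is not Storm code and does not use Storm's classes.
final class GroupingSketch {

    // Fields grouping: the same key always maps to the same task.
    static int fieldsGrouping(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Shuffle grouping (simplified): pick a task at random.
    static int shuffleGrouping(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }

    // All grouping: the tuple is replicated to every task.
    static List<Integer> allGrouping(int numTasks) {
        List<Integer> targets = new ArrayList<>();
        for (int task = 0; task < numTasks; task++) {
            targets.add(task);
        }
        return targets;
    }

    // Global grouping: the whole stream goes to the task with the lowest id.
    static int globalGrouping(List<Integer> taskIds) {
        return Collections.min(taskIds);
    }

    public static void main(String[] args) {
        int first = fieldsGrouping("user-42", 4);
        int second = fieldsGrouping("user-42", 4);
        System.out.println(first == second); // prints true
        System.out.println(allGrouping(3));  // prints [0, 1, 2]
    }
}
```

The fields grouping is the one that makes per-key state practical: because every tuple for a given key lands on the same task, that task can keep the key's state locally.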
A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. This class contains the basic operations available on all RDDs, such as map, filter, and persist. Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.
In the diagram, different colors denote different machines. The components are the YARN ResourceManager (RM), the YARN NodeManager (NM), and the Samza ApplicationMaster (AM). The Samza client uses YARN to run a Samza job: YARN starts and supervises one or more SamzaContainers, and your processing code (using the StreamTask API) runs inside those containers. The input and output for the Samza StreamTasks come from Kafka brokers that are (usually) co-located on the same machines as the YARN NMs.
At-least-once semantics guarantee that every message will be processed (eventually), but some may be processed more than once due to various factors, such as timing, concurrency, or failures. At most once
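The way at-least-once delivery produces duplicates can be sketched with a small deterministic simulation in plain Java. Everything here is hypothetical: the class `AtLeastOnceSketch`, the `deliver` method, and the idea of marking messages whose first acknowledgement is lost. The sender redelivers until it sees an acknowledgement, so a lost acknowledgement means the message is processed twice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation; not tied to any framework or messaging system.
final class AtLeastOnceSketch {

    // Redelivers each message until its acknowledgement arrives.
    // Messages listed in ackLostOnce have their first acknowledgement dropped.
    // Returns every processing attempt made by the consumer, in order.
    static List<String> deliver(List<String> messages, Set<String> ackLostOnce) {
        List<String> processed = new ArrayList<>();
        Set<String> pendingLosses = new HashSet<>(ackLostOnce);
        for (String message : messages) {
            boolean acked = false;
            while (!acked) {
                processed.add(message);                 // consumer processes the message
                acked = !pendingLosses.remove(message); // ack is lost at most once per message
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        Set<String> lost = new HashSet<>(Arrays.asList("b"));
        System.out.println(deliver(Arrays.asList("a", "b"), lost)); // prints [a, b, b]
    }
}
```

No message is ever skipped, but "b" is processed twice, which is why side effects in an at-least-once system generally need to be idempotent or deduplicated downstream.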
Storm spouts keep a record of all in-flight tuples until every operator sends back an acknowledgement that it has processed the tuple successfully. The ACKs are handled by acker tasks. Each acker task holds a mapping from each spout tuple id to an ‘ack val’. The ack val is the XOR of the ids of all tuples in the tuple tree derived from the source tuple that have been emitted and/or acked. When the ack val becomes 0, every tuple id that was emitted has also been acked. If the ack val doesn’t reach 0 within a certain time, the spout tuple is replayed.
Spark checkpointing is only relevant for stateful DStreams. It persists each batch to HDFS (by default) every X seconds. Typically, the checkpoint interval should be set to 5-10 times the sliding window interval.
Samza uses Kafka’s partitioned, offset-based messaging system for fault tolerance. Each Samza job container has one or more stream tasks, which correspond to message partitions in the Kafka topic. Each task periodically checkpoints the offset in each partition it’s processing and can then replay messages from the last stored offset if needed.
Flink splits streams into discrete segments, or snapshots, by injecting a barrier marker into streams at certain intervals. Each barrier carries the ID of the snapshot whose records are pushed in front of it. When an intermediate operator has received a barrier for a particular snapshot from ALL of its input streams, it emits a new barrier for that snapshot into all of its outgoing streams. Once a sink operator receives barrier N from all input streams, it acknowledges snapshot N to the checkpoint coordinator. When all sinks have done that, the snapshot is considered completed. (Operators can align input streams, buffering some until all get to snapshot N.)
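The XOR bookkeeping described for Storm's ackers can be sketched in a few lines of plain Java. This is an illustration of the idea only, not Storm's acker code: the class `AckerSketch` and its method names are hypothetical, and the tuple ids in the example are fixed small numbers, whereas the scheme relies on ids being effectively random so that the value only reaches 0 when every id has been XORed in twice.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of XOR-based ack tracking; not Storm's implementation.
final class AckerSketch {

    // Maps each spout (root) tuple id to its current ack val.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // A tuple was emitted into the tree rooted at rootId.
    void emitted(long rootId, long tupleId) {
        xorInto(rootId, tupleId);
    }

    // A tuple in the tree rooted at rootId was acknowledged.
    void acked(long rootId, long tupleId) {
        xorInto(rootId, tupleId);
    }

    // Each id is XORed in once on emit and once on ack, so the two cancel out.
    private void xorInto(long rootId, long tupleId) {
        ackVals.merge(rootId, tupleId, (current, id) -> current ^ id);
    }

    // The tree is fully processed once the ack val has returned to 0.
    boolean isFullyProcessed(long rootId) {
        Long ackVal = ackVals.get(rootId);
        return ackVal != null && ackVal == 0L;
    }

    public static void main(String[] args) {
        AckerSketch acker = new AckerSketch();
        long root = 1L;
        acker.emitted(root, 5L);
        acker.emitted(root, 9L);
        acker.acked(root, 5L);
        System.out.println(acker.isFullyProcessed(root)); // prints false
        acker.acked(root, 9L);
        System.out.println(acker.isFullyProcessed(root)); // prints true
    }
}
```

The appeal of this design is that the acker needs only a fixed amount of memory per spout tuple, no matter how large the tuple tree grows.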
Storm provides no built-in state mechanism, so it’s quite common to use external state (i.e. a database), particularly fast key-value stores.
Trident adds dedicated state operators, such as persistentAggregate, which can use one of several state providers, including MemoryState, which is replicated periodically, MemcachedState, and other custom providers, such as Kafka or Cassandra.
Spark can attach state to keyed RDDs, which is then stored together with the checkpoints. Version 1.6 introduced a brand new mechanism, mapWithState, which has much higher performance than updateStateByKey.
Samza uses a combination of local state (LevelDB) together with a compacted changelog stored as a Kafka topic. The state locality improves performance, especially in memory, and the changelog can be used to restore the local state store on a new machine in the event of failure. Each task explicitly gets a reference to the state and uses it as a normal K/V store.
Flink lets you register any instance field in an operator as managed state by implementing an interface. It also has a built-in key/value API for tracking state. Local state is stored per-operator, while partitioned state is stored per-key globally. Flink can use a MemoryStateBackend, which is replicated to the master, an FsStateBackend, which can write to a file or HDFS, or a RocksDBStateBackend.
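The "local store plus changelog" pattern described for Samza can be sketched in plain Java. This is a simplified illustration, not Samza's API: the class `ChangelogStoreSketch` and its `Change` type are hypothetical, an in-memory list stands in for the Kafka changelog topic, a `HashMap` stands in for the local store, and log compaction is not modeled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a changelog-backed key/value store; not Samza code.
final class ChangelogStoreSketch {

    // One entry in the changelog; a null value marks a deletion.
    static final class Change {
        final String key;
        final String value;

        Change(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, String> local = new HashMap<>();
    private final List<Change> changelog; // stands in for a Kafka topic

    // Restores local state by replaying the existing changelog from the start.
    ChangelogStoreSketch(List<Change> changelog) {
        this.changelog = changelog;
        for (Change change : changelog) {
            apply(change);
        }
    }

    // Every write is appended to the changelog and applied locally.
    void put(String key, String value) {
        Change change = new Change(key, value);
        changelog.add(change);
        apply(change);
    }

    // Reads are served from the local store only.
    String get(String key) {
        return local.get(key);
    }

    private void apply(Change change) {
        if (change.value == null) {
            local.remove(change.key);
        } else {
            local.put(change.key, change.value);
        }
    }

    public static void main(String[] args) {
        List<Change> topic = new ArrayList<>();
        ChangelogStoreSketch original = new ChangelogStoreSketch(topic);
        original.put("a", "1");
        original.put("a", "2");
        // Simulate failover: a new instance rebuilds its state from the changelog.
        ChangelogStoreSketch restored = new ChangelogStoreSketch(topic);
        System.out.println(restored.get("a")); // prints 2
    }
}
```

Reads never leave the machine, which is the locality benefit the note mentions, and the changelog is touched only on writes and on recovery.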
Trident-ML currently supports:
◦ Linear classification (Perceptron, Passive-Aggressive, Winnow, AROW)
◦ Linear regression (Perceptron, Passive-Aggressive)
◦ Clustering (KMeans)
◦ Feature scaling (standardization, normalization)
◦ Text feature extraction
◦ Stream statistics (mean, variance)
◦ Pre-trained Twitter sentiment classifier
Storm is the de facto standard streaming framework today. It will be interesting to see what Twitter’s Heron does if/when they open-source it. Spark is hugely popular and included in everything Hadoop-related today. Samza is built on top of Kafka, which is a hugely popular and mature message queue. Flink is very promising, fixes a lot of pain points from older technologies like Storm, and seems to have impressive performance.