A simple presentation about big data stream processing systems such as SPARK, SAMZA and STORM, covering the differences between their architectures and purposes. It also discusses streaming-layer tools such as Kafka and RabbitMQ. The presentation refers to this paper:
https://vsis-www.informatik.uni-hamburg.de/getDoc.php/publications/561/Real-time%20stream%20processing%20for%20Big%20Data.pdf and other useful links.
2. INTRODUCTION
Rise of Web 2.0 and the Internet of Things.
Huge amounts of data (e.g. sensors, social media, online marketing).
Tracking all kinds of information that is only valuable for a short time and therefore has to be
processed immediately.
Monitoring user activity to optimize product or video recommendations for the current user
context.
Traditional batch-oriented approaches.
Complex Event Processing (CEP) engines and DBMSs.
Distributed data processing.
MapReduce.
3. Real-time analytics: Big Data in motion
Real-time data infrastructure:
Built from distributed components.
Communicating via an asynchronous network.
Engineered on top of the JVM (Java Virtual Machine).
Real-time Big Data basic architecture model:
Collecting data from various places.
Moving data to the streaming layer.
Analyzing data in the stream processor.
Forwarding outputs to the serving layer.
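The four steps of the basic architecture model can be sketched as a toy, single-process pipeline. This is only an illustration under stated assumptions: the in-memory `deque` stands in for a streaming layer such as Kafka, the dictionary stands in for a serving layer, and the function names and the `(user, clicks)` event shape are hypothetical.

```python
from collections import deque


def collect(sources):
    """Step 1: collect raw events from several (hypothetical) sources."""
    for source in sources:
        yield from source


def run_pipeline(sources):
    streaming_layer = deque()  # assumption: stands in for a queue such as Kafka
    serving_layer = {}         # assumption: stands in for a queryable store

    # Step 2: move collected data to the streaming layer.
    for event in collect(sources):
        streaming_layer.append(event)

    # Step 3: analyze each event in the "stream processor".
    while streaming_layer:
        user, clicks = streaming_layer.popleft()
        # Step 4: forward the running result to the serving layer.
        serving_layer[user] = serving_layer.get(user, 0) + clicks

    return serving_layer


result = run_pipeline([[("alice", 1), ("bob", 2)], [("alice", 3)]])
```

In a real deployment each step runs on separate machines and the streaming layer is consumed continuously rather than drained once.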
4. Real-time analytics: Big Data in motion
Big Data Architecture Model: Lambda Architecture
[Figure: Collecting Data, Streaming Data, Batch processing, Store Data, Stream processing, Serving Layer]
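As a rough sketch of how the Lambda Architecture components fit together: a batch layer recomputes a view from all stored data, a stream (speed) layer maintains a view over recent events, and the serving layer merges both at query time. The function names and the page-view example are assumptions made for illustration only.

```python
from collections import Counter


def batch_view(stored_events):
    """Batch processing: recompute the view from all stored data."""
    return Counter(stored_events)


def speed_view(recent_events):
    """Stream processing: incremental view over events not yet batch-processed."""
    view = Counter()
    for event in recent_events:
        view[event] += 1
    return view


def query(key, batch, speed):
    """Serving layer: answer a query by merging the batch and speed views."""
    return batch[key] + speed[key]


stored = ["page_a", "page_b", "page_a"]  # already in the data store
recent = ["page_a", "page_c"]            # arrived after the last batch run
batch = batch_view(stored)
speed = speed_view(recent)
```

Here `query("page_a", batch, speed)` combines two counts from the batch view with one from the speed view.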
5. Real-time analytics: Big Data in motion
Big Data Architecture Model: Kappa Architecture
[Figure: Collecting Data, Streaming Data, Store, retain Data, Stream processing, Serving Layer]
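The Kappa Architecture drops the separate batch path: data is retained in the stream, and results are recomputed by replaying the retained data through the stream processor. A minimal sketch, assuming the retained stream is a plain Python list and that `process` and its `transform` argument are hypothetical names:

```python
def process(retained_log, transform):
    """Stream processor: fold the retained log into a view for the serving layer."""
    view = {}
    for key, value in retained_log:
        view[key] = transform(view.get(key, 0), value)
    return view


retained_log = [("sensor1", 5), ("sensor2", 7), ("sensor1", 3)]

# First version of the job: sum of readings per sensor.
view_v1 = process(retained_log, lambda acc, v: acc + v)

# Changed logic: replay the same retained log instead of running a batch job.
view_v2 = process(retained_log, lambda acc, v: max(acc, v))
```

The design choice illustrated here is that reprocessing uses the same code path as live processing, so only one system has to be maintained.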
6. Real-time streamers
RabbitMQ.
Broker-centric, with message acknowledgement.
Focused on delivery guarantees between producers and consumers.
Can fall over if consumers are too slow.
[Figure: Producer, BROKER, Consumer; labels: Message, Ack]
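The acknowledgement idea can be illustrated with a toy in-memory broker. This is not the RabbitMQ API: `ToyBroker` and all of its methods are hypothetical, and the sketch only shows that a message stays tracked by the broker until the consumer acknowledges it, so that it can be delivered again if the consumer fails.

```python
from collections import deque


class ToyBroker:
    """Hypothetical in-memory broker that tracks unacknowledged deliveries."""

    def __init__(self):
        self.queue = deque()
        self.unacked = {}   # delivery tag -> message awaiting an ack
        self.next_tag = 0

    def publish(self, message):
        self.queue.append(message)

    def deliver(self):
        """Hand the next message to a consumer and remember it until acked."""
        if not self.queue:
            return None
        self.next_tag += 1
        message = self.queue.popleft()
        self.unacked[self.next_tag] = message
        return self.next_tag, message

    def ack(self, tag):
        """Consumer confirms processing; the broker can forget the message."""
        self.unacked.pop(tag)

    def requeue_unacked(self):
        """Assumed to be called when a consumer is lost: redeliver its messages."""
        for tag in sorted(self.unacked):
            self.queue.append(self.unacked[tag])
        self.unacked.clear()


broker = ToyBroker()
broker.publish("m1")
broker.publish("m2")

tag1, first = broker.deliver()
broker.ack(tag1)                 # "m1" was processed successfully

tag2, second = broker.deliver()  # consumer crashes before acking "m2"
broker.requeue_unacked()
tag3, redelivered = broker.deliver()
```

Because unacknowledged messages accumulate on the broker, a consumer that falls behind puts pressure on the broker itself.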
8. Real-time processors:
Latency:
Handling data items immediately as they arrive.
Throughput & Efficiency:
Buffering data items and processing them in batches increases efficiency.
From low latency to high throughput:
Stream: STORM, SAMZA
Micro-Batch: Trident (groups tuples into batches), SPARK Streaming (restricts batch size)
Batch: SPARK
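The micro-batch idea, grouping incoming items into small batches of restricted size, can be sketched with a generator. This is a minimal illustration, not how Trident or Spark Streaming implement it; `micro_batches` and `max_batch_size` are hypothetical names, and real systems typically bound batches by time as well as by size.

```python
from itertools import islice


def micro_batches(stream, max_batch_size):
    """Group items from a stream into batches of at most max_batch_size."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, max_batch_size))
        if not batch:
            return
        yield batch


batches = list(micro_batches(range(7), 3))
```

A smaller `max_batch_size` moves the system toward the low-latency end; a larger one moves it toward the high-throughput end.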
9. Real-time processors
STORM
Storm was developed by Nathan Marz at BackType, which was acquired by Twitter in 2011.
Initially promoted as the “Hadoop of real-time”.
A vital part of a Storm deployment is a ZooKeeper cluster for reliable coordination.
10. Real-time processors
STORM
Topology:
A network made of spouts and bolts.
Similar to a Hadoop MapReduce job.
Stream:
An unbounded pipeline of tuples.
Spouts & bolts:
Spouts receive data continuously, transform it into an actual stream of tuples and finally send them to the bolts to be processed.
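The spout and bolt roles can be mimicked with plain Python generators. This is only a conceptual sketch and does not use the Storm API: `sentence_spout`, `split_bolt` and `count_bolt` are hypothetical names, the spout emits a fixed list instead of an unbounded stream, and everything runs in one process.

```python
from collections import Counter


def sentence_spout():
    """Spout: turns incoming data into a stream of tuples (finite here)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)


def split_bolt(tuples):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)


def count_bolt(tuples):
    """Bolt: keeps a running count per word."""
    counts = Counter()
    for (word,) in tuples:
        counts[word] += 1
    return counts


# The "topology": spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

In an actual Storm topology each spout and bolt runs as parallel tasks across worker processes, and the stream never ends.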
12. Real-time processors
STORM
Nodes
Worker Node:
Runs a daemon called ‘Supervisor’.
The Supervisor runs one or more worker processes on its node.
Apache ZooKeeper facilitates communication between Nimbus and Supervisors with the help of message acknowledgements and processing status.
13. Real-time processors
SAMZA
Samza was initially created at LinkedIn and submitted to the Apache Incubator in July 2013.
Samza was co-developed with the queueing system Kafka.
Samza requires a little more work to deploy than Storm, as it not only depends on a ZooKeeper cluster but also runs on top of Hadoop YARN.
14. Real-time processors
SAMZA - YARN
YARN is a cluster scheduler. It allows you to allocate a number of containers (processes) in a cluster of machines and execute arbitrary commands on them. The Samza client uses YARN to run a Samza job.
NodeManager: responsible for launching processes on its machine.
ResourceManager: talks to all of the NodeManagers to tell them what to run.
ApplicationMaster: responsible for managing the application’s workload, asking for containers, and handling notifications when one of its containers fails.
15. Real-time processors
SAMZA
Decouples individual processing steps.
Buffering data between processing steps makes (intermediate) results available to unrelated parties.
Prevents data loss by periodically checkpointing current progress and, after a failure, reprocessing all data from the last checkpoint.
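Checkpointing and reprocessing can be sketched as follows. This is a simplified illustration under stated assumptions, not Samza's implementation: `run_task`, `checkpoint_every` and `fail_at` are hypothetical names, the input is a plain list indexed by offset, and the failure is simulated.

```python
def run_task(log, checkpoint, checkpoint_every, fail_at=None):
    """Process the log starting at the saved checkpoint offset.

    Returns the messages processed in this run and the last saved checkpoint.
    """
    processed = []
    for offset in range(checkpoint, len(log)):
        if offset == fail_at:
            # Simulated crash: progress since the last checkpoint is not saved.
            return processed, checkpoint
        processed.append(log[offset])
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = offset + 1  # periodically save current progress
    return processed, len(log)


log = ["a", "b", "c", "d", "e"]

# First run checkpoints every 2 messages and crashes before offset 3.
first_run, saved = run_task(log, checkpoint=0, checkpoint_every=2, fail_at=3)

# Restart from the saved checkpoint; "c" is processed a second time.
second_run, _ = run_task(log, checkpoint=saved, checkpoint_every=2)
```

No message is lost, but messages processed after the last checkpoint are seen again after a restart, which corresponds to an at-least-once guarantee.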
16. Real-time processors
SPARK
A batch-processing framework that is often mentioned as the unofficial successor of Hadoop, as it offers several benefits in comparison.
Significant performance improvements through in-memory caching.
Spark provides a variety of machine learning algorithms out of the box through the MLlib library.
18. Discussion
                     STORM          SAMZA                     SPARK
Achievable latency   << 100 ms      < 100 ms                  < 1 s
Processing model     one-at-a-time  one-at-a-time             micro-batch
Ordering guarantees  No             within stream partitions  between batches
Elasticity           Yes            No                        Yes
Taken together, these systems show that low latency has to be traded off against other desirable properties such as throughput, fault-tolerance, reliability (processing guarantees) and ease of development.