Apache Frameworks
for Big and Fast Data
- Naveen Korakoppa
Traditional way of request / response
● Request/response model — API consumers send requests to an API server and receive a response.
● Pull-based interaction — API consumers send an API request when data or functionality is required (e.g. from a user interface, or at a pre-scheduled time).
● Synchronous — API consumers receive the response after a request is sent.
● Multiple content types — since REST APIs are built upon HTTP, responses may be JSON, XML, or other
content types as necessary to support consumer needs (e.g. CSV, PDF).
● Internal and external access — REST APIs may be restricted for internal use or for external use by
partners or public developers.
Traditional way of request / response
● Flexible interactions — building upon the available HTTP verbs, consumers may interact with REST-based APIs through resources in a variety of ways: queries/search, creating new resources, modifying existing resources, and deleting resources. We can also build complex workflows by combining these interactions into higher-level processes.
● Caching and concurrency protocol support — HTTP has caching semantics built in, allowing caching servers to be placed between the consumer and API server, as well as cache control of responses and ETags for concurrency control to prevent overwriting content (see the sketch below).
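A minimal sketch of HTTP caching and ETag-based concurrency control using Python's requests library; the URL and resource are hypothetical:

import requests

URL = "https://api.example.com/orders/42"  # hypothetical REST resource

# First fetch: the server returns the representation plus an ETag.
resp = requests.get(URL)
etag = resp.headers.get("ETag")

# Conditional read: 304 Not Modified means a cached copy is still valid.
cached = requests.get(URL, headers={"If-None-Match": etag})
print(cached.status_code)  # 304 if unchanged, 200 with a fresh ETag otherwise

# Conditional write: If-Match prevents overwriting a concurrent update.
update = requests.put(URL, json={"status": "shipped"},
                      headers={"If-Match": etag})
print(update.status_code)  # 412 Precondition Failed if the ETag is stale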
Modern way of data stream
● Publish/subscribe model — rather than a request/response model, apps or APIs publish messages to a topic, which may have zero, one, or many subscribers (see the sketch after this list).
● Subscriber notification interaction — Apps receive notification when a new message is available, such as
when data is modified or new data is available.
● Asynchronous — Unlike REST APIs, apps cannot use message streams to submit a request and receive a
response back without complex coordination between parties.
● Single content type — at Capital One, our message streaming is built upon Avro, a compact binary format useful for data serialization. Unlike HTTP, Avro doesn’t support other content types (e.g. CSV, PDF).
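A hedged sketch of the publish/subscribe flow using the kafka-python client; the broker address, topic name, and payload are assumptions:

import json
from kafka import KafkaProducer, KafkaConsumer

# Publisher: writes messages to a topic without knowing who subscribes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("customer-events", {"id": 42, "change": "address-updated"})
producer.flush()

# Subscriber: receives each new message asynchronously as it arrives.
consumer = KafkaConsumer(
    "customer-events",
    bootstrap_servers="localhost:9092",
    group_id="audit-service",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for message in consumer:
    print(message.offset, message.value)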
Modern way of data stream
● Replayability — because message streaming is built on Kafka, subscribers may revisit and replay previous messages sequentially (see the sketch after this list).
● No caching or concurrency protocol support — Message streaming doesn’t offer caching
semantics, cache-control, or concurrency control between publisher and subscriber.
● Internal access only — Subscribers must be internal to the organization, unlike HTTP which
may be externalized to partner or public consumers.
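A minimal replay sketch with kafka-python, rewinding a single partition to the beginning; the topic and partition number are illustrative:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
partition = TopicPartition("customer-events", 0)  # assumed topic/partition
consumer.assign([partition])

# Kafka persists messages on disk, so history can be re-read in order.
consumer.seek_to_beginning(partition)
for message in consumer:
    print(message.offset, message.value)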
Important concepts for Big & Fast data architectures
Big data architecture is the overarching system used to ingest and process enormous
amounts of data (often referred to as "big data") so that it can be analyzed for business
purposes. The architecture can be considered the blueprint for a big data solution based
on the business needs of an organization. Big data architecture is designed to handle the
following types of work:
● Batch processing of big data sources.
● Real-time processing of big data.
● Predictive analytics and machine learning.
A well-designed big data architecture can save your company money and help you predict
future trends so you can make good business decisions.
Concepts of Big data architecture
Types:
● Lambda Architecture ( Batch-first approach ) — batching is used as the primary processing method, with streams used to supplement it and provide early but unrefined results.
● Kappa Architecture ( Stream-first approach ) — streams are used for everything; this simplifies the model and has only recently become possible as stream processing engines have grown more sophisticated.
Lambda Architecture
This architecture was introduced by Nathan Marz. It has three layers that together provide real-time streaming and compensate for any data errors that occur: the Batch Layer, the Speed Layer, and the Serving Layer.
Data is routed to the batch layer and the speed layer concurrently by our data collector. Hadoop is our batch layer, Apache Storm is our speed layer, and a NoSQL datastore like Cassandra or MongoDB is our serving layer, in which the analyzed results are stored.
The idea behind these layers is that the speed layer provides real-time results to the serving layer, and if any data errors occur or any data is missed during stream processing, the batch job compensates for that: the MapReduce job runs at regular intervals and updates the serving layer, providing accurate results.
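A toy illustration of the serving-layer idea, assuming a hypothetical page-view count: the batch view is accurate but stale, the speed view is fresh but only covers events since the last batch run, and queries merge the two:

# All names and figures below are hypothetical.
batch_view = {"2024-06-01": 10_412}                 # recomputed by the batch job
speed_view = {"2024-06-01": 7, "2024-06-02": 981}   # incremental stream counts

def page_views(day: str) -> int:
    # The batch view covers everything up to its last run;
    # the speed view fills in events that arrived afterwards.
    return batch_view.get(day, 0) + speed_view.get(day, 0)

print(page_views("2024-06-01"))  # 10419: batch result plus late stream events
print(page_views("2024-06-02"))  # 981: not yet absorbed into the batch view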
Kappa Architecture
The lambda architecture above solves our data-error problem and also provides the flexibility to deliver real-time yet accurate results to the user.
But the founders of Apache Kafka raised questions about the lambda architecture: they liked the benefits it provides, but they also pointed out that it is very hard to build the pipeline and maintain the analysis logic in both the batch and speed layers.
If we instead use frameworks like Apache Spark Streaming, Flink, or Beam, which support both batch and real-time streaming, it becomes much easier for developers to maintain the logical part of the data pipeline.
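A hedged sketch of that single-codebase idea using PySpark Structured Streaming to read from Kafka; the broker address and topic are assumptions, and the same DataFrame logic could be replayed over the full topic instead of maintaining a separate batch implementation:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kappa-sketch").getOrCreate()

# One declarative pipeline: the stream of events is the system of record.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # assumed
          .option("subscribe", "customer-events")               # assumed
          .load())

counts = (events.selectExpr("CAST(value AS STRING) AS body")
          .groupBy("body").count())

# To "re-run the batch", replay the topic from the start with this same code.
query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()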
Data Ingestion tools
List of open-source data ingestion tools:
1. Apache Kafka
2. Apache Flume
3. Apache Sqoop
4. Apache NiFi
Apache Flume
● Flume is a distributed system that can be used to collect, aggregate, and transfer streaming events into Hadoop.
● Flume is configuration-based and has interceptors to perform simple transformations on in-flight data (see the sample agent configuration below).
● It comes with many built-in sources, channels, and sinks, for example the Kafka channel and the Avro sink.
● Flume data loads can be driven by an event.
● To load streaming data such as tweets generated on Twitter or the log files of a web server, Flume should be used: Flume agents are built for fetching streaming data.
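A minimal sketch of a Flume agent configuration tailing a web-server log into HDFS; the agent name, paths, and capacities are assumptions (run with: flume-ng agent --conf-file flume.conf --name agent1):

# flume.conf - hypothetical pipeline: exec source -> memory channel -> HDFS sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/webserver/access.log
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.useLocalTimeStamp = true
agent1.sinks.sink1.channel = ch1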
Apache Sqoop
● Sqoop is used for importing data from
structured data sources such as
RDBMS.
● Sqoop has a connector-based architecture: connectors know how to connect to the respective data source and fetch the data.
● HDFS is a destination for data import
using Sqoop.
● Sqoop data loads are not event-driven.
● To import data from structured data sources, one should use Sqoop, because its connectors know how to interact with structured data sources and fetch data from them (see the example import below).
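A hedged example of a Sqoop import from MySQL into HDFS; the connection string, table, credentials, and paths are hypothetical:

sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username reporting \
  --password-file /user/etl/.db-password \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4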
Apache Kafka
● Kafka is a distributed, high-throughput message bus that decouples data producers
from consumers. Messages are organized into topics, topics are split into partitions,
and partitions are replicated across the nodes — called brokers — in the cluster.
● Compared to Flume, Kafka offers better scalability and message durability.
● Kafka now comes in two flavors: the “classic” producer/consumer model, and the newer Kafka Connect, which provides configurable connectors (sources/sinks) to external data stores (see the connector sketch below).
● Kafka can be used for event processing and integration between components of
large software systems.
● Because messages are persisted on disk as well as replicated within the cluster, data
loss scenarios are less common than with Flume.
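A minimal Kafka Connect standalone connector sketch using the bundled FileStreamSource; the file path and topic are assumptions (run with: connect-standalone.sh worker.properties file-source.properties):

# file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/var/log/app/events.log
topic=app-events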
Apache NiFi
● Unlike Flume and Kafka, NiFi can handle messages with arbitrary sizes. Behind a drag-and-drop
Web-based UI, NiFi runs in a cluster and provides real-time control that makes it easy to
manage the movement of data between any source and any destination.
● It supports disparate and distributed sources of differing formats, schema, protocols, speeds,
and sizes.
● NiFi can be used in mission-critical data flows with rigorous security & compliance
requirements, where we can visualize the entire process and make changes immediately, in
real-time.
● Some of NiFi’s key features are prioritized queuing, data traceability and back-pressure
threshold configuration per connection.
● Although it is used to create fault-tolerant production pipelines, NiFi does not yet replicate data
like Kafka. If a node goes down, the flow can be directed to another node, but data queued for
the failed node will have to wait until the node comes back up.
● NiFi is not a full-fledged ETL tool, nor is it ideal for complex computations or complex event processing (CEP). For those, it should instead connect to a streaming framework like Apache Flink, Spark Streaming, or Storm.
Data Computation and analytics tools
These tools fall into three groups: batch-only frameworks, stream-only frameworks, and hybrid frameworks.
Data computation and analytics tools
List of all tools available as open source:
1. Apache Hadoop
2. Apache Storm
3. Apache Spark
4. Apache Samza
5. Apache Flink
6. Apache Beam
7. EsperTech Esper
Apache Hadoop - Batch approach
● Distributed batch processing of large-volume, unstructured datasets.
● It has high latency (slow computation).
● The processing framework used by Hadoop is distributed batch processing, which uses the MapReduce engine for computation, following a map, sort, shuffle, reduce algorithm.
● MapReduce jobs are executed sequentially until they complete.
● The architecture consists of HDFS (storage) and the MapReduce engine (computation).
● Speed: due to batch processing over large volumes of data, Hadoop takes longer to compute, which means latency is higher; hence Hadoop is relatively slow.
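A hedged sketch of the map, sort/shuffle, reduce flow using Hadoop Streaming with Python; the input/output paths and the streaming-jar location are assumptions:

#!/usr/bin/env python3
# mapper.py - emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - input arrives sorted by key, so counts can be summed in one pass
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")

# Submit the job (paths assumed):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#   -input /data/text -output /data/wordcount \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py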
Apache Storm - Stream approach
● Distributed real-time processing of data with large volume and high velocity.
● It has low latency (fast computation).
● The architecture is based on a topology of spouts and bolts.
● The processing framework used by Storm is distributed real-time data processing, which uses DAGs to form topologies composed of streams, spouts, and bolts.
● Speed: due to near real-time processing, Storm handles data with very low latency and gives results with minimum delay.
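Storm itself is JVM-based, so the following is not Storm's API; it is a plain-Python toy showing the spout → bolt dataflow that a topology wires together:

import random
import time

def sentence_spout():
    # Toy spout: an unbounded source that keeps emitting tuples.
    sentences = ["the cow jumped over the moon", "an apple a day"]
    while True:
        yield random.choice(sentences)
        time.sleep(0.1)

def split_bolt(stream):
    # Toy bolt: transforms each sentence tuple into word tuples.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream, limit=20):
    # Toy terminal bolt: accumulates rolling word counts, then stops.
    counts = {}
    for i, word in enumerate(stream):
        counts[word] = counts.get(word, 0) + 1
        if i >= limit:
            return counts

# Wiring the DAG: spout -> split bolt -> count bolt.
print(count_bolt(split_bolt(sentence_spout())))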
Apache Spark - Batch-first approach ( 3G of big data )
● Apache Spark does batch processing as well as real-time data processing (Lambda Architecture).
● Apache Spark supports multiple languages: Java, Scala, Python, and R.
● Apache Spark Streaming has higher latency compared with Apache Storm.
● Speed: Apache Spark can run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk.
● Apache Spark can integrate with all data sources and file formats supported by the Hadoop cluster.
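A minimal PySpark batch sketch (the input path is hypothetical); the same DataFrame API family also powers Spark's micro-batch streaming:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("batch-sketch").getOrCreate()

# Batch job: read a static dataset, compute word counts, show results.
lines = spark.read.text("hdfs:///data/text")  # assumed input path
counts = (lines.select(explode(split(lines.value, r"\s+")).alias("word"))
          .groupBy("word").count())
counts.show(10)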
Apache Flink - Stream-first approach ( 4G of big data )
● Apache Flink is a stream processing framework that can also handle batch tasks (Kappa architecture).
● Flink’s stream-first approach offers low latency, high throughput, and true entry-by-entry processing.
● Flink is currently a unique option in the processing-framework world. While Spark performs batch and stream processing, its streaming is not appropriate for many use cases because of its micro-batch architecture.
● Flink manages many things by itself. Somewhat unconventionally, it manages its own memory instead of relying on the native Java garbage-collection mechanisms, for performance reasons. Unlike Spark, Flink does not require manual optimization and adjustment when the characteristics of the data it processes change. It also handles data partitioning and caching automatically.
● Flink has fewer APIs compared with Spark.
● One of Flink’s largest drawbacks at the moment is that it is still a very young project. Large-scale deployments in the wild are not yet as common as for other processing frameworks, and there hasn’t been much research into Flink’s scaling limitations.
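A hedged PyFlink sketch of entry-by-entry processing; a bounded collection stands in for a real stream source, and the event strings are illustrative:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Each element flows through the operators one record at a time.
events = env.from_collection(["click:home", "click:cart", "view:home"])
(events.map(lambda e: (e.split(":")[1], 1))
       .key_by(lambda t: t[0])
       .reduce(lambda a, b: (a[0], a[1] + b[1]))
       .print())

env.execute("stream-first-sketch")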
EsperTech
● Esper is a streaming engine.
● Esper appears to be based primarily on streams, so of the two choices, it is most similar to
Flink.
● Esper has data storage/database functionality integrated, while Flink and Spark are pure
processing engines intended to work with external data stores
● Esper has reactive programming built in. Spark will have a very hard time supporting this,
while Flink should make it somewhat easier but still nontrivial.
● Esper’s integrations appear to target the enterprise, while Flink’s integrations target open-source tools popular in Silicon Valley (e.g. Kafka).
● Esper appears to be much more mature, having had stable releases since at least 2008; I believe Flink’s first stable release was in 2015.
● Esper started as an enterprise product, while Flink started as open source, which brings many cultural differences.
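Esper logic is expressed in EPL, a SQL-like language over event streams; a hedged example, where the StockTick event type and its fields are hypothetical:

// Average price per symbol over a sliding 30-second window of StockTick events
select symbol, avg(price)
from StockTick#time(30 sec)
group by symbol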
Differences between frameworks :
1. Spark & Flink : https://www.educba.com/apache-spark-vs-apache-flink/
2. Hadoop & Storm : https://www.educba.com/apache-hadoop-vs-apache-storm/
3. Storm & Spark : https://www.educba.com/apache-storm-vs-apache-spark/
4. Hadoop & Spark : https://www.educba.com/apache-storm-vs-apache-spark/
5. Hadoop, Spark & Flink : https://data-flair.training/blogs/hadoop-vs-spark-vs-flink/ ( IMPORTANT )
Conclusions
The most important part is choosing the best streaming framework.
And the honest answer is: it depends :)
1. For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is likely less
expensive to implement than some other solutions.
2. For stream-only workloads, Storm has wide language support and can deliver very low latency
processing, but can deliver duplicates and cannot guarantee ordering in its default
configuration. Samza integrates tightly with YARN and Kafka in order to provide flexibility, easy
multi-team usage, and straightforward replication and state management.
3. For mixed workloads, Spark provides high speed batch processing and micro-batch processing
for streaming. It has wide support, integrated libraries and tooling, and flexible integrations.
Flink provides true stream processing with batch processing support. It is heavily optimized,
can run tasks written for other platforms, and provides low latency processing, but is still in the
early days of adoption.
Lambda use case : Social media & network analysis
Kappa use case : Fraud Detection
References and blogs links
• https://dzone.com/articles/big-data-ingestion-flume-kafka-and-nifi
• https://www.digitalocean.com/community/tutorials/hadoop-storm-samza-spark-and-flink-big-data-frameworks-compared
• https://www.quora.com/What-is-the-closest-option-to-Esper-Apache-Spark-or-Apache-Flink
• https://medium.com/@chandanbaranwal/spark-streaming-vs-flink-vs-storm-vs-kafka-streams-vs-samza-choose-your-stream-processing-91ea3f04675b
• https://www.slideshare.net/gschmutz/big-data-architecture-53231252?qid=3785d5c7-bd9c-408b-8714-ef9064a2be3b&v=&b=&from_search=1 [ IMPORTANT ]
System Requirements
Apache Kafka :
· at least 8 GB RAM
· at least 500 GB Storage
· Ubuntu 14.04 or later, RHEL 6, RHEL 7, or equivalent
· Access to Kafka (specifically, the ability to consume messages and to
communicate with Zookeeper)
· Access to Kafka Connect instances (if you want to configure Kafka Connect)
· Ability to connect to the server from the user’s web browser.
Docker image : https://hub.docker.com/r/bitnami/kafka/
Apache Hadoop :
System requirements: per the Cloudera page, the VM takes 4 GB RAM and 3 GB of disk space. This means your laptop should have more than that (I'd recommend 8 GB+). Storage-wise, as long as you have enough to test with small and medium-sized data sets (tens of GB), you'll be fine. As for the CPU, if your machine has that amount of RAM you'll most likely be fine. I'm using a single-node Pentium G3210 with 4 GB of RAM for testing my small jobs and it works just fine.
Docker image : https://hub.docker.com/r/apache/hadoop
Apache NiFi :
NiFi Registry has the following minimum system requirements:
● Requires Java Development Kit (JDK) 8, newer than 1.8.0_45
● Supported Operating Systems:
○ Linux
○ Unix
○ Mac OS X
● Supported Web Browsers:
○ Google Chrome: Current & (Current - 1)
○ Mozilla FireFox: Current & (Current - 1)
○ Safari: Current & (Current - 1)
Docker image : https://hub.docker.com/r/apache/nifi/
System Requirements
Apache Spark :
Hardware
We used a virtual machine with the following setup:
* CPU core count: 32 virtual cores (16 physical cores), Intel Xeon CPU E5-2686 v4 @
2.30GHz
* System memory: 244 GB
* Total local disk space for shuffle: 4 x 1900 GB NVMe SSD
Software
● OS: Ubuntu 16.04
● Spark: Apache Spark 2.3.0 in local cluster mode
● Pandas version: 0.20.3
● Python version: 2.7.12
Docker image : https://hub.docker.com/r/sequenceiq/spark/
Apache Flink :
Recommended Operating System
● Microsoft Windows 10
● Ubuntu 16.04 LTS
● Apple macOS 10.13/High Sierra
Memory Requirement
● Memory - Minimum 4 GB, Recommended 8 GB
● Storage Space - 30 GB
Note: Java 8 must be available with environment variables already set.
Docker image : https://hub.docker.com/_/flink
Thank You