3. Apache Spark: an open-source cluster computing
framework for real-time data processing
According to Spark Certified Experts: Spark's
performance is up to 100 times faster in memory and
10 times faster on disk when compared to Hadoop
The main feature of Apache Spark is its in-memory
cluster computing that increases the processing speed
of an application
16-04-2019 3
5. Speed:
Spark runs up to 100 times faster than Hadoop
MapReduce for large-scale data processing
Powerful Caching:
Simple programming layer provides powerful
caching and disk persistence capabilities.
Deployment:
It can be deployed through Mesos, Hadoop via
YARN, or Spark’s own cluster manager
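In practice the deployment choice is usually expressed as a "master URL" given to Spark when the application starts. The sketch below only builds those URL strings. The schemes (local[*], yarn, spark://, mesos://) are the standard Spark forms; the helper name master_url, the host names, and the default ports used here are illustrative assumptions, not part of Spark itself.

```python
def master_url(manager, host=None, port=None):
    """Build a Spark master URL string (hypothetical helper, for illustration only)."""
    if manager == "local":
        # Run Spark inside a single process, using all available cores.
        return "local[*]"
    if manager == "yarn":
        # The cluster location is read from the Hadoop configuration, not the URL.
        return "yarn"
    if host is None:
        raise ValueError(f"a host is required for manager {manager!r}")
    if manager == "standalone":
        # Spark's own cluster manager; 7077 is assumed as the default port.
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        # Mesos master; 5050 is assumed as the default port.
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown cluster manager: {manager!r}")


if __name__ == "__main__":
    for name in ("local", "yarn", "standalone", "mesos"):
        print(name, "->", master_url(name, host="master-host"))
```

A string like this would typically be passed as the --master option of spark-submit.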
6. Real-Time:
It offers real-time computation and low latency
because of in-memory computation
Polyglot:
Spark provides high-level APIs in Java, Scala,
Python, and R. Spark code can be written in any
of these four languages. It also provides a shell
in Scala and Python
9. SPARK DRIVER :-
Separate process that executes the user application
Creates the SparkContext to schedule job
execution & negotiate with the cluster
manager
EXECUTORS :-
Run tasks scheduled by driver
Store computation results in memory, on disk
or off-heap
Interact with storage systems
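The driver/executor split above can be sketched in plain Python. This is a toy model, not the Spark API: a thread pool stands in for the executors, and the function names (split_into_partitions, run_task, run_job) are assumptions made for illustration. It shows the driver splitting the work into tasks, the executors running them, and the driver combining the partial results.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: split the input into roughly equal partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Executor side: one task processes one partition and returns a partial result."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3):
    """Driver side: schedule one task per partition, then combine the results."""
    partitions = split_into_partitions(data, num_executors)
    # The thread pool plays the role of the executors in this toy model.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    # Sum of squares of 1..10, computed as three independent tasks.
    print(run_job(list(range(1, 11))))
```

In a real cluster the executors are separate processes on worker nodes, and the cluster manager decides where they run.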
10. CLUSTER MANAGER :-
The SparkContext works with the cluster
manager to manage various jobs
The driver program & SparkContext take
care of job execution within the cluster
11. Apache Spark Architecture is based on two main
abstractions:
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG)
16. RDDs can perform two types of operations:
Transformations: operations applied to an
RDD to create a new RDD; they are evaluated lazily.
Actions: operations applied on an RDD to
instruct Apache Spark to run the computation
and pass the result back to the driver.
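The difference between the two kinds of operation can be sketched with a toy class in plain Python. This is not the Spark API: ToyRDD is a hypothetical stand-in that borrows the method names map, filter, collect and count from real RDDs. Transformations only record a step in the lineage; the work happens when an action is called.

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, with a recorded lineage of steps."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of (kind, function); nothing has run yet

    # Transformations: return a new ToyRDD and do no work.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._source, self._lineage + (("filter", predicate),))

    def _compute(self):
        """Replay the recorded lineage over the source data."""
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: trigger the computation and hand the result back to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every element that double() actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD([1, 2, 3, 4, 5])
evens_doubled = numbers.filter(lambda x: x % 2 == 0).map(double)
# At this point `calls` is still empty: only the lineage has been built.
result = evens_doubled.collect()  # the action runs the whole pipeline
```

Because each transformation returns a new object and leaves the original untouched, numbers can still be reused for other pipelines after evens_doubled is built.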
19. ADVANTAGES:
Integration with Hadoop
Faster
Near real-time stream processing
DRAWBACKS:
No File Management system
No Support for true Real-Time Processing
(streams are handled as micro-batches)
Not Cost Effective (in-memory processing
needs a lot of RAM)
Manual Optimization
20. Spark makes it easy to write and run complex data
processing jobs
It enables computation of tasks at a very large scale
Although Spark has several limitations, it is still trending in
the big data world
Because of these drawbacks, other technologies are
competing with Spark
For example, Flink offers true real-time stream processing,
which Spark does not
In this way, other technologies are overcoming the
drawbacks of Spark