The exponential growth of data is not a problem in itself; processing and managing the huge diversity of that data is the real concern. In this session, we discuss Apache Spark, one of the most popular distributed big data processing frameworks, and develop against its API using Scala, a language that has gained prominence among big data developers.
2. KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality
Respect KnolX session timings; you are requested not to join a session more than 5 minutes after its start time.
Feedback
Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
Mute
Please keep your window on mute.
Avoid Disturbance
Avoid leaving your window unmuted after asking a question.
3. Agenda
What, When & Why
Introduction to Apache Spark
01
Master-slave architecture
Spark Architecture
02
Situations where Spark is helpful.
Use-cases for Spark
03
Components & API in Spark eco-system
Spark Eco-System
04
Spark Scala in Action
Demonstration
05
5. What is Spark
LEARN NOW
● A general-purpose distributed data processing engine.
● One of the most popular distributed big data processing frameworks.
● A multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
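Spark's core RDD API deliberately mirrors the Scala collections API, so the programming model can be previewed without a cluster. The sketch below is plain Scala with no Spark dependency; in Spark, the same flatMap/map chain (with reduceByKey in place of the groupBy-and-sum) would run distributed across executors.

```scala
object WordCountSketch {
  // Count words across lines using the same transformation chain
  // Spark's RDD API exposes (flatMap, map, reduceByKey), written
  // here against plain Scala collections -- no cluster needed.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))   // split lines into words
      .map(word => (word, 1))     // pair each word with a count of 1
      .groupBy(_._1)              // groupBy + sum stands in for reduceByKey
      .map { case (word, ones) => (word, ones.map(_._2).sum) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("spark is fast", "spark is distributed", "scala is concise")
    wordCount(lines).toSeq.sortBy(-_._2).foreach(println)
  }
}
```

In real Spark code, `lines` would be an RDD or Dataset read from distributed storage, but the transformation chain reads almost identically.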
6. Why Spark
● Supports multiple languages (Java, Python, Scala, R) and integrates with other popular products.
● Keeps intermediate data in memory, so it does far less reading from and writing to disk.
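The "less disk I/O" point is about caching: Spark can keep an intermediate result in memory (`cache()`/`persist()`) and reuse it across actions, instead of rewriting it to disk between jobs as MapReduce does. A plain-Scala analogy of the effect, with an illustrative counter tracking how often the expensive step runs:

```scala
object CachingSketch {
  // Analogy for Spark's cache()/persist(): an "expensive" transformation
  // is either recomputed on every action, or computed once and reused.
  var computations = 0 // counts how often the expensive step runs

  def expensiveSquare(xs: Seq[Int]): Seq[Int] = {
    computations += 1
    xs.map(x => x * x)
  }

  def main(args: Array[String]): Unit = {
    val data = 1 to 5

    // Without caching: every "action" recomputes the transformation,
    // much like re-reading intermediate data from disk between jobs.
    val sumNoCache = expensiveSquare(data).sum
    val maxNoCache = expensiveSquare(data).max
    println(s"no cache: $computations computations") // 2

    // With "caching": compute once, keep in memory, reuse for both actions.
    computations = 0
    val cached = expensiveSquare(data)
    println(s"sum=${cached.sum}, max=${cached.max}, computations=$computations") // 1 computation
  }
}
```

In Spark the trade-off is the same, just at cluster scale: cached RDDs/DataFrames occupy executor memory in exchange for skipping recomputation (or disk reads) on each subsequent action.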
7. When Spark
● Large-scale data processing that is too big or too slow for a single machine.
● Working with distributed storage (S3, XD, HDFS) and NoSQL databases (HBase, Cassandra, MongoDB).
● Machine learning and fog computing.
9. Master-Slave Architecture
A well-defined layered architecture whose components and layers are loosely coupled.
Spark Driver
● Controls the execution of the Spark application.
● Maintains all state of the Spark cluster.
● Interfaces with the cluster manager.
Spark Executor
● Processes that perform the tasks assigned by the Spark driver.
● Take the tasks assigned by the driver, run them, and report their state back.
Cluster Manager
● Responsible for maintaining the cluster of machines that will run your Spark application.
● Has its own “Driver” and “Worker” abstractions.
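The driver/executor division of labor can be sketched in plain Scala: a "driver" splits the data into partitions, hands each partition to an "executor" (here just a Future on a thread pool, standing in for a JVM on a worker machine), and combines the partial results. All names here are illustrative, not Spark's actual API.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DriverExecutorSketch {
  // Illustrative only: the "driver" partitions the data, each "executor"
  // runs its task in parallel and reports a partial result back, and the
  // driver combines them -- the same flow Spark's driver and executors
  // follow, minus the cluster manager and the network.
  def distributedSum(data: Seq[Int], numExecutors: Int): Int = {
    // Driver: split the data into roughly numExecutors partitions.
    val partitions = data.grouped(math.max(1, data.size / numExecutors)).toSeq
    // Executors: each partition becomes a task run on the thread pool.
    val tasks: Seq[Future[Int]] = partitions.map(p => Future(p.sum))
    // Driver: wait for all tasks and combine their partial results.
    val partials = Await.result(Future.sequence(tasks), 10.seconds)
    partials.sum
  }

  def main(args: Array[String]): Unit =
    println(distributedSum(1 to 100, numExecutors = 4)) // 5050
}
```

In a real deployment the cluster manager (YARN, Kubernetes, or Spark standalone) is what launches those executor processes on worker machines; the driver only schedules tasks onto them.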
12. Ideal Situations to Use Spark
● Batch and Streaming: supports both batch and real-time processing.
● Big Data in the Cloud: easy to set up Spark with data lake technologies.
● Finance Industry: firms analyse the text inside the regulatory filings of their own reports.
● E-Commerce Sector: giants like eBay and Alibaba use Spark.
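The "batch and streaming" point works because Spark's Structured Streaming treats an unbounded stream as a series of small batch jobs (micro-batches), so the same batch logic applies to both. A pure-Scala sketch of that idea, with illustrative names (`processStream`, `batchJob` are not Spark API):

```scala
object MicroBatchSketch {
  // Micro-batching in miniature: consume an unbounded stream of events
  // as fixed-size groups and run the same batch function on each group.
  def processStream[A, B](events: Iterator[A], batchSize: Int)(batchJob: Seq[A] => B): List[B] =
    events.grouped(batchSize).map(batch => batchJob(batch.toSeq)).toList

  def main(args: Array[String]): Unit = {
    val clicks = Iterator.range(1, 11) // pretend this is a live event stream
    // The same aggregation serves batch (one big group) and streaming
    // (many small groups) -- only the batch size changes.
    val perBatchSums = processStream(clicks, batchSize = 3)(_.sum)
    println(perBatchSums) // List(6, 15, 24, 10)
  }
}
```

In Spark, the engine handles the batch boundaries, triggers, and fault tolerance for you; the point of the sketch is only that one piece of aggregation logic covers both processing modes.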