2. SPARK COMPONENTS
The Spark core is complemented by a set of powerful, higher-level libraries:
Spark SQL
MLlib (for machine learning)
GraphX
Spark Streaming
RDD (Resilient Distributed Dataset)
3. Spark SQL Introduction
Part of the core distribution since Spark 1.0 (2014)
Integrated with the Spark stack
Supports querying data either via SQL or via the Hive Query Language (HiveQL)
Originated as the Apache Hive port to run on top of Spark (in place of MapReduce)
Can weave SQL queries with code transformations, as shown in the sketch below
Can expose Spark datasets over the JDBC API, allowing SQL-like queries on Spark data using traditional BI and visualization tools
Bindings in Python, Scala, and Java
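A minimal sketch of weaving SQL with code transformations, using the modern SparkSession entry point; the people.json file and its name/age fields are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SparkSQLSketch").getOrCreate()
import spark.implicits._

// Load a dataset and register it as a SQL-queryable view
val people = spark.read.json("people.json")  // hypothetical input file
people.createOrReplaceTempView("people")

// Query via SQL...
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")

// ...then weave in a code transformation on the result
adults.filter($"age" < 65).show()
```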
7. Logical and Physical query plans
Both are trees representing query evaluation
Internal nodes are operators over the data
Logical plan is higher-level and algebraic
Physical plan is lower-level and operational
Logical plan operators: conceptually describe what operation needs to be performed
Physical plan operators: correspond to implemented access methods
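Spark SQL can print both kinds of plan for inspection. A minimal sketch, assuming the spark session and people view from the sketch above:

```scala
// explain(extended = true) prints the parsed, analyzed, and optimized
// logical plans, followed by the chosen physical plan
val q = spark.sql("SELECT name FROM people WHERE age >= 18")
q.explain(extended = true)
```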
8. Key Features of MLlib
Low level library in Spark
Built-in data analysis workflow
Free performance gains (MLlib gets faster as the Spark engine improves)
Scalable
Python, Scala, and Java APIs
Broad coverage of applications & algorithms
Rapid improvements in speed & robustness
Easy to use
Integrated workflow
9. MLlib
MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster for classification, regression, clustering, collaborative filtering, and so on (see the sketch below).
Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (and more on the way).
Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and joined forces with Spark MLlib.
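A minimal sketch of k-means clustering with MLlib's DataFrame-based API; the input path and the feature column names (x, y) are hypothetical, and the spark session from the earlier sketch is assumed:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Assemble raw columns into the vector column MLlib expects
val raw = spark.read.parquet("features.parquet")  // hypothetical input
val data = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(raw)

// Fit a 3-cluster model and inspect the learned centroids
val model = new KMeans().setK(3).setSeed(1L).fit(data)
model.clusterCenters.foreach(println)
```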
11. GraphX
GraphX is an API for graphs and graph-parallel execution.
It is a network graph analytics engine.
GraphX is a library that performs graph-parallel computation and manipulates graphs.
It extends the Spark RDD API, so it can be used to create directed graphs with arbitrary properties attached to each vertex and edge.
12. GraphX
GraphX also provides various operators and algorithms to manipulate graphs (a sketch follows below).
Clustering, classification, traversal, searching, and pathfinding are all possible in GraphX.
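A minimal sketch of building a directed property graph with GraphX; the sample vertices and edges are made up for illustration, and the spark session from earlier is assumed:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

val sc = spark.sparkContext

// Vertices carry a String property (a name); edges carry a relationship label
val vertices: RDD[(Long, String)] = sc.parallelize(
  Seq((1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
val edges: RDD[Edge[String]] = sc.parallelize(
  Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))

val graph = Graph(vertices, edges)
println(graph.numEdges)  // 2
```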
13. Spark GraphX Features
Flexibility:
works with both graphs and collections
unifies ETL (Extract, Transform & Load), exploratory analysis, and iterative graph computation within a single system
We can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms (see the sketch after this list)
Speed:
provides performance comparable to the fastest specialized graph processing systems, while retaining Spark's flexibility, fault tolerance, and ease of use
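A minimal sketch of the graphs-and-collections view, reusing the hypothetical graph built above: its vertices are an ordinary RDD, so collection-style transformations apply directly:

```scala
// Treat the graph's vertices as a plain collection of (id, name) pairs
val names = graph.vertices.map { case (_, name) => name }
names.collect().foreach(println)  // Alice, Bob, Carol
```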
14. Spark GraphX Features
Growing Algorithm Library:
We can choose from a growing library of graph
algorithms
Some of the popular algorithms are PageRank, connected components, label propagation, strongly connected components, and triangle count.
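For example, PageRank can be run directly from the algorithm library; a minimal sketch on the hypothetical graph from earlier:

```scala
// Run PageRank until convergence within the given tolerance,
// then pull the per-vertex ranks back to the driver
val ranks = graph.pageRank(tol = 0.0001).vertices
ranks.collect().foreach { case (id, rank) => println(s"$id: $rank") }
```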
15. Spark Core
Home to the API that contains the backbone of Spark, i.e. RDDs
The basic functionality of Spark is present in Spark Core:
memory management
fault recovery
interaction with the storage system
I/O functionality and task dispatching
16. Resilient Distributed Dataset (RDD)
Spark introduces the concept of an RDD, an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.
17. RDD operation
RDDs support two types of operations (a sketch of both follows below):
Transformations: create a new dataset from an existing one (such as map, filter, join, union, and so on); they are performed on an RDD and yield a new RDD containing the result.
Actions: trigger the computation to be performed (such as reduce, count, first, collect, save, and so on); they return a value to the driver program, or write it to a file, after running the computation on the RDD.
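A minimal sketch of both kinds of operation, assuming the spark session from earlier; note that nothing executes until the action runs:

```scala
val nums = spark.sparkContext.parallelize(1 to 10)

// Transformations: each builds a new RDD, but no work happens yet
val evens   = nums.filter(_ % 2 == 0)
val doubled = evens.map(_ * 2)

// Action: triggers the computation and returns a value to the driver
println(doubled.reduce(_ + _))  // 60
```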
18. Properties for RDD
Immutability
Cacheable: can be persisted, and lineage allows lost partitions to be recomputed
Lazy evaluation (defining an RDD is separate from executing it)
Type inferred
Two ways to create RDDs (see the sketch below):
parallelizing an existing collection in your driver program
referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, Cassandra, or any data source offering a Hadoop InputFormat
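A minimal sketch of both creation paths; the HDFS path is hypothetical:

```scala
val sc = spark.sparkContext

// 1) Parallelize an existing collection in the driver program
val fromCollection = sc.parallelize(Seq("a", "b", "c"))

// 2) Reference a dataset in external storage
//    (any data source offering a Hadoop InputFormat works)
val fromStorage = sc.textFile("hdfs:///data/input.txt")
```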
19. Spark Streaming
Spark Streaming is the component of Spark used to process real-time streaming data.
It enables high-throughput, fault-tolerant processing of live data streams (a sketch follows below).
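A minimal sketch of a DStream word count; the socket source (localhost:9999) is hypothetical, e.g. fed by `nc -lk 9999`, and the spark session from earlier is assumed:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Process the live stream in 5-second micro-batches
val ssc = new StreamingContext(spark.sparkContext, Seconds(5))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
val counts = words.map((_, 1)).reduceByKey(_ + _)
counts.print()  // emits per-batch word counts

ssc.start()
ssc.awaitTermination()
```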