Spark is a fast, general processing engine that improves efficiency through in-memory computing and computation graphs. It offers APIs in Scala, Java, Python and R. Spark applications use Resilient Distributed Datasets (RDDs), which are immutable, partitioned collections of objects that support fault tolerance. Spark also supports Spark SQL for structured data querying and Spark MLlib for machine learning.
2. Apache Spark
Apache Spark is a fast, general-purpose processing engine.
Spark improves efficiency through in-memory computing primitives and general computation graphs.
Spark offers rich APIs in Scala, Java, Python and R, which allow us to seamlessly combine components.
Spark is written in Scala and runs on the JVM (memory management, fault recovery, storage interaction, ...).
Components: Spark Core (the foundation), with Spark SQL, Spark Streaming, MLlib and GraphX on top of it.
3. Running a Spark application
A Spark application can run from an interactive shell OR be submitted to a cluster (cluster mode).
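As a sketch, the two ways of running look like this on the command line (`spark-shell` and `spark-submit` ship with Spark; the class name, jar and master URL below are placeholders):

```shell
# Interactive shell: a Scala REPL with a SparkContext (sc) already defined
spark-shell --master local[4]

# Cluster mode: package the application and hand it to spark-submit
spark-submit --class com.example.MyApp --master spark://host:7077 myApp.jar
```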
4. Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are the basic unit of abstraction in Spark.
An RDD is an immutable, partitioned collection of objects.
RDDs are lazily evaluated.
RDDs are fully fault-tolerant: lost data can be recovered using the lineage graph of RDDs (by rerunning operations on the input data).
val lines = sc.textFile("pathToMyFile")
RDD operations:
Transformations - lazily evaluated (executed only when an action is called, which enables pipelining): map, filter, groupByKey, join, ...
Actions - run immediately and return a value to the application or to storage: count, collect, reduce, save, ...
Don’t forget to cache()
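The transformation/action split can be felt without a cluster: plain Scala iterators are lazy in the same way, deferring map and filter until a terminal operation forces them (an analogy in ordinary Scala, not Spark code):

```scala
object LazyDemo {
  // Builds a lazy pipeline and returns (count, number of elements actually mapped)
  def run(): (Int, Int) = {
    var evaluated = 0
    // "Transformations": nothing runs yet, like RDD.map / RDD.filter
    val pipeline = Iterator(1, 2, 3, 4, 5)
      .map { x => evaluated += 1; x * 2 }
      .filter(_ > 4)
    val before = evaluated        // still 0: the pipeline is lazy
    // "Action": forces the whole pipeline, like RDD.count
    val n = pipeline.size
    assert(before == 0 && evaluated == 5)
    (n, evaluated)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Of the doubled values 2, 4, 6, 8, 10, three pass the filter, and all five elements are mapped exactly once when the "action" runs.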
5. Spark Dataframes
Dataframes are a common abstraction across languages: they represent a table, i.e. a two-dimensional array with rows and columns.
Spark DataFrames are distributed dataframes. They allow querying structured data using SQL or a DSL (for example in Python or Scala).
Like RDDs, DataFrames are immutable.
Operations on them are executed in parallel.
val df = sqlContext.read.json("pathToMyFile.json")
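Building on the df loaded above, a sketch of the two query styles (assuming a Spark 1.6-era sqlContext; the age and name columns are hypothetical):

```scala
// DSL query: filter and project with column expressions
df.filter(df("age") > 21).select("name").show()

// Equivalent SQL query over a temporary table (Spark 1.6 API)
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```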
6. Spark Datasets
A Spark Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. *
An Encoder is responsible for converting between JVM objects and the tabular representation.
API Preview in Spark 1.6
The main goal was to bring an object-oriented programming style and type safety while preserving performance.
Java and Scala APIs so far.
val lines = sqlContext.read.text("pathToMyFile").as[String]
*quote: https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
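To make the Encoder idea concrete without a cluster, here is a hand-rolled sketch of what an encoder does: convert a typed JVM object to and from a flat tabular row. The Person class and the toRow/fromRow helpers are illustrative only, not Spark API; Spark generates this conversion code for you:

```scala
case class Person(name: String, age: Int)

object EncoderSketch {
  // "Encode": typed object -> tabular row (a sequence of column values)
  def toRow(p: Person): Seq[Any] = Seq(p.name, p.age)

  // "Decode": tabular row -> typed object, with the schema fixed by the case class
  def fromRow(row: Seq[Any]): Person =
    Person(row(0).asInstanceOf[String], row(1).asInstanceOf[Int])

  def main(args: Array[String]): Unit = {
    val alice = Person("Alice", 30)
    val row = toRow(alice)
    assert(fromRow(row) == alice)   // the round trip preserves the object
    println(fromRow(row))
  }
}
```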
7. Spark program lifecycle
Create RDD (from external data or by parallelizing a collection)
→ Transformation (lazily evaluated)
→ Cache RDD (for reuse)
→ Action (execute the computation and return results)
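The lifecycle steps above can be sketched against an interactive shell's SparkContext sc (the file path and the ERROR filter are placeholders):

```scala
val rdd = sc.textFile("pathToMyFile")           // 1. create RDD from external data
val errors = rdd.filter(_.contains("ERROR"))    // 2. transformation (lazy, nothing runs yet)
errors.cache()                                  // 3. cache for reuse across actions
val n = errors.count()                          // 4. action: executes the graph, returns a result
```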