Spark 2.0 is a major release of Apache Spark that brings many changes to Spark's APIs and libraries. In this KnolX, we will look at some of the improvements made in Spark 2.0 and get an introduction to new features such as the SparkSession API and Structured Streaming.
4. What is Apache Spark ?
● A fast and general engine for large-scale
data processing.
● Offers a rich set of APIs and libraries
– In Scala, Java, Python and R
● Most active Apache Big Data project.
5. Spark Survey 2015
● Reflected the answers and opinions
– Of over 1,417 respondents from 842 organizations
● Indicated rapid growth of the Spark community.
● Showed a positive attitude towards:
– A concise and unified API for Big Data processing.
● https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
6. Apache Spark 2.0
● Released in July 2016
– In fact, version 2.1.0 is already under development.
● Provides a Unified API for SQL, Streaming and Graph
operations.
7. Apache Spark 2.0
● Released in July 2016
– In fact, version 2.1.0 is already under development.
● Provides a Unified API for SQL, Streaming and Graph operations
– Through SparkSession, the new single entry point.
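As a quick sketch of the new entry point (the app name and local master below are our own choices, not from the slides), a SparkSession is created with a builder and then exposes the SQL, DataFrame and Dataset APIs in one place:

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point in Spark 2.0;
    // "local[*]" is only for running this sketch locally.
    val spark = SparkSession.builder()
      .appName("Spark2Demo")
      .master("local[*]")
      .getOrCreate()

    // SQL, DataFrame and Dataset functionality all hang off the session.
    val df = spark.range(5).toDF("id")
    df.show()

    spark.stop()
  }
}
```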
20. What's wrong here?
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
21. For the Answer
Let's compare the same code with hand-written code.
System-Generated vs Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
22. Volcano Model vs Hand-Written Code
Volcano
Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
24. Solution
Of Course
Whole-Stage Code Generation
Provides the performance of hand-written code with the functionality of a general-purpose engine.
25. What is Whole-Stage Code Generation?
● Starts from the same query plan as the Volcano model
– But compiles it instead of interpreting it operator by operator.
● The key difference from Spark 1.x
– Earlier, Spark applied code generation only to expression evaluation (e.g., "1 + a"); now it generates code for the entire query.
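Whole-stage code generation can be observed in a query's physical plan. A minimal sketch (the query itself is our own example): operators prefixed with `*` in the `explain()` output run inside a single generated function.

```scala
import org.apache.spark.sql.SparkSession

object CodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CodegenDemo").master("local[*]").getOrCreate()

    // Filter, project and aggregate below are fused into one
    // generated function by whole-stage code generation.
    val query = spark.range(1000L)
      .filter("id > 100")
      .selectExpr("id + 1 AS x")
      .groupBy().sum("x")

    // Operators marked with '*' in the plan are inside generated code.
    query.explain()

    spark.stop()
  }
}
```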
26. Spark 1.x vs Spark 2.0
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
34. How ?
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Conceptually, Structured Streaming treats all the data arriving as an infinite input table.
35. How ?
● The developer defines a query on the input table
– As if it were a static table.
● Results are computed into a Result Table
– Which is then written to an output sink.
● Finally, the developer defines a trigger
– To control when the result is updated.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
36. How ?
Incremental Execution
– Spark runs the query incrementally, updating the Result Table as new data arrives.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
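The steps above can be sketched as a streaming word count (host and port are placeholders for a real socket source; the rest is the usual DataFrame/Dataset API):

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredWordCount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Each arriving line becomes a row of the unbounded input table.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")   // placeholder
      .option("port", 9999)          // placeholder
      .load()

    // The query is written exactly as it would be for a static table.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Each trigger incrementally updates the Result Table and
    // writes it to the console sink.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```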
37. Output Modes
● Append
– Only the rows appended to the result table since the last trigger are written to external storage.
● Complete
– The entire updated result table is written to external storage.
● Update
– Only the rows updated in the result table since the last trigger are changed in external storage.
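The output mode is chosen per query on the writer. A sketch (socket source and console sink are placeholders): an append-only query can use `append`, while an aggregation rewrites its Result Table and uses `complete` (or `update` for only the changed rows).

```scala
import org.apache.spark.sql.SparkSession

object OutputModesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OutputModesSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    // Append: this query only ever adds rows, so each trigger
    // writes just the new ones.
    val upper = lines.as[String].map(_.toUpperCase)
    upper.writeStream.outputMode("append").format("console").start()

    // Complete: the aggregation rewrites the whole Result Table
    // on every trigger.
    val counts = lines.groupBy("value").count()
    counts.writeStream.outputMode("complete").format("console").start()

    spark.streams.awaitAnyTermination()
  }
}
```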
44. Other Benefits
● Easy to use
– As it is simply Spark's DataFrame/Dataset API.
● Uses Spark's existing DataFrame/Dataset API
– So we can map, filter and aggregate data just as we do in Spark SQL.
● Joins streams with static data
– A stream can be joined directly with a static DataFrame.
There are many more...
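The stream-to-static join uses the ordinary join API. A sketch (the device lookup table and socket source are our own illustration):

```scala
import org.apache.spark.sql.SparkSession

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamStaticJoin").master("local[*]").getOrCreate()
    import spark.implicits._

    // Static lookup table (hypothetical device metadata).
    val devices = Seq((1, "thermostat"), (2, "camera"))
      .toDF("deviceId", "deviceType")

    // Streaming DataFrame; the socket source is a placeholder.
    val readings = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()
      .selectExpr("CAST(value AS INT) AS deviceId")

    // The stream joins the static DataFrame with the normal join API.
    val enriched = readings.join(devices, "deviceId")

    enriched.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```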
45. Requirements
● Input sources must be replayable
– So that recent data can be re-read if the job crashes.
● Output sinks must support transactional updates
– So that the system can make a set of records appear atomically.
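These two requirements are what checkpointing relies on. A sketch with a replayable file source and a transactional file sink (all paths below are hypothetical): on restart, Spark re-reads any input recorded in the checkpoint but not yet committed to the sink.

```scala
import org.apache.spark.sql.SparkSession

object CheckpointedQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CheckpointedQuery").master("local[*]").getOrCreate()

    // A file source is replayable: files can be re-read after a crash.
    val input = spark.readStream
      .format("text")
      .load("/tmp/stream-input")   // hypothetical input directory

    // The checkpoint stores progress, so records are committed to the
    // file sink exactly once.
    input.writeStream
      .format("parquet")
      .option("path", "/tmp/stream-output")            // hypothetical
      .option("checkpointLocation", "/tmp/stream-ckpt") // hypothetical
      .start()
      .awaitTermination()
  }
}
```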
46. Comparison with Other Engines
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html