Spark 2.0 is a major release of Apache Spark that brings many changes to Spark's APIs and libraries. In this KnolX, we will look at some of the improvements made in Spark 2.0 and get an introduction to new features such as the SparkSession API and Structured Streaming.
4. What is Apache Spark ?
● A fast and general engine for large-scale
data processing.
● Offers a rich set of APIs and libraries
– In Scala, Java, Python and R
● Most active Apache Big Data project.
5. Spark Survey 2015
● Reflected the answers and opinions
– Of over 1,417 respondents from 842 organizations
● Indicated rapid growth of the Spark community.
● Showed a positive attitude towards:
– A concise and unified API for Big Data processing.
● https://databricks.com/blog/2015/09/24/spark-survey-2015-results-are-now-available.html
6. Apache Spark 2.0
● Released in July 2016
– In fact, version 2.1.0 is already under development.
● Provides a Unified API for SQL, Streaming and Graph
operations.
7. Apache Spark 2.0
● Released in July 2016
– In fact, version 2.1.0 is already under development.
● Provides a Unified API for SQL, Streaming and Graph operations
– Through SparkSession, the new single entry point.
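As a quick sketch of the new entry point (the app name and local master below are our own choices, not from the slides), a SparkSession is created with a builder and then exposes the SQL, DataFrame and Dataset APIs in one place:

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionExample {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point in Spark 2.0;
    // "local[*]" is only for running this sketch locally.
    val spark = SparkSession.builder()
      .appName("Spark2Demo")
      .master("local[*]")
      .getOrCreate()

    // SQL, DataFrame and Dataset functionality all hang off the session.
    val df = spark.range(5).toDF("id")
    df.show()

    spark.stop()
  }
}
```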
20. What's wrong here?
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
Volcano Model
21. For the Answer
Let's compare the same code with hand-written code.
System-Generated vs Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
22. Volcano Model vs Hand-Written Code
Volcano
Hand-Written
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
24. Solution
Of Course
Whole-Stage Code Generation
Provides the performance of hand-written code with the functionality of a general-purpose engine.
25. What is Whole-Stage Code Generation?
● Starts from the same query plan as the Volcano model
– But compiles it instead of interpreting it operator by operator.
● The key difference from Spark 1.x
– Earlier, Spark applied code generation only to expression evaluation (e.g., "1 + a"); now it generates code for the entire query.
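Whole-stage code generation can be observed in a query's physical plan. A minimal sketch (the query itself is our own example): operators prefixed with `*` in the `explain()` output run inside a single generated function.

```scala
import org.apache.spark.sql.SparkSession

object CodegenDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CodegenDemo").master("local[*]").getOrCreate()

    // Filter, project and aggregate below are fused into one
    // generated function by whole-stage code generation.
    val query = spark.range(1000L)
      .filter("id > 100")
      .selectExpr("id + 1 AS x")
      .groupBy().sum("x")

    // Operators marked with '*' in the plan are inside generated code.
    query.explain()

    spark.stop()
  }
}
```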
26. Spark 1.x vs Spark 2.0
Img Src: https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html
34. How ?
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Conceptually, Structured Streaming treats all the data arriving as an infinite input table.
35. How ?
● The developer defines a query on the input table
– As if it were a static table.
● Results are computed into a Result Table
– Which is then written to an output sink.
● Finally, the developer defines a trigger
– To control when the result is updated.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
36. How ?
Incremental Execution
– Spark runs the query incrementally, updating the Result Table as new data arrives.
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
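The steps above can be sketched as a streaming word count (host and port are placeholders for a real socket source; the rest is the usual DataFrame/Dataset API):

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StructuredWordCount").master("local[*]").getOrCreate()
    import spark.implicits._

    // Each arriving line becomes a row of the unbounded input table.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")   // placeholder
      .option("port", 9999)          // placeholder
      .load()

    // The query is written exactly as it would be for a static table.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Each trigger incrementally updates the Result Table and
    // writes it to the console sink.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```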
37. Output Modes
● Append
– Only the rows appended to the result table since the last trigger are written to external storage.
● Complete
– The entire updated result table is written to external storage.
● Update
– Only the rows updated in the result table since the last trigger are changed in external storage.
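The output mode is chosen per query on the writer. A sketch (socket source and console sink are placeholders): an append-only query can use `append`, while an aggregation rewrites its Result Table and uses `complete` (or `update` for only the changed rows).

```scala
import org.apache.spark.sql.SparkSession

object OutputModesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("OutputModesSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()

    // Append: this query only ever adds rows, so each trigger
    // writes just the new ones.
    val upper = lines.as[String].map(_.toUpperCase)
    upper.writeStream.outputMode("append").format("console").start()

    // Complete: the aggregation rewrites the whole Result Table
    // on every trigger.
    val counts = lines.groupBy("value").count()
    counts.writeStream.outputMode("complete").format("console").start()

    spark.streams.awaitAnyTermination()
  }
}
```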
44. Other Benefits
● Easy to use
– As it is simply Spark's DataFrame/Dataset API.
● Uses Spark's existing DataFrame/Dataset API
– So we can map, filter and aggregate data just as we do in Spark SQL.
● Joins streams with static data
– A stream can be joined directly with a static DataFrame.
There are many more...
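The stream-to-static join uses the ordinary join API. A sketch (the device lookup table and socket source are our own illustration):

```scala
import org.apache.spark.sql.SparkSession

object StreamStaticJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamStaticJoin").master("local[*]").getOrCreate()
    import spark.implicits._

    // Static lookup table (hypothetical device metadata).
    val devices = Seq((1, "thermostat"), (2, "camera"))
      .toDF("deviceId", "deviceType")

    // Streaming DataFrame; the socket source is a placeholder.
    val readings = spark.readStream.format("socket")
      .option("host", "localhost").option("port", 9999).load()
      .selectExpr("CAST(value AS INT) AS deviceId")

    // The stream joins the static DataFrame with the normal join API.
    val enriched = readings.join(devices, "deviceId")

    enriched.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```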
45. Requirements
● Input sources must be replayable
– So that recent data can be re-read if the job crashes.
● Output sinks must support transactional updates
– So that the system can make a set of records appear atomically.
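These two requirements are what checkpointing relies on. A sketch with a replayable file source and a transactional file sink (all paths below are hypothetical): on restart, Spark re-reads any input recorded in the checkpoint but not yet committed to the sink.

```scala
import org.apache.spark.sql.SparkSession

object CheckpointedQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CheckpointedQuery").master("local[*]").getOrCreate()

    // A file source is replayable: files can be re-read after a crash.
    val input = spark.readStream
      .format("text")
      .load("/tmp/stream-input")   // hypothetical input directory

    // The checkpoint stores progress, so records are committed to the
    // file sink exactly once.
    input.writeStream
      .format("parquet")
      .option("path", "/tmp/stream-output")            // hypothetical
      .option("checkpointLocation", "/tmp/stream-ckpt") // hypothetical
      .start()
      .awaitTermination()
  }
}
```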
46. Comparison with Other Engines
Img Src: https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html