Abstract –
Spark 2 is here. While Spark has been the leading cluster computation framework for several years, its second version takes Spark to new heights. In this seminar, we will go over Spark internals and learn the new concepts of Spark 2 to build better, scalable big data applications.
Target Audience
Architects, Java/Scala developers, Big Data engineers, team leaders
Prerequisites
Java/Scala knowledge and SQL knowledge
Contents:
- Spark internals
- Architecture
- RDD
- Shuffle explained
- Dataset API
- Spark SQL
- Spark Streaming
2. We will talk about
• Scala
• Spark intro
• RDD in depth
• Spark architecture
• Shuffle and how it works
• Dive into the Dataset API
• Spark streaming
3. Spark
•Spark is a cluster computing engine.
•Provides high-level API in Scala, Java, Python and R.
•Provides high level tools:
•Spark SQL.
•MLlib.
•GraphX.
•Spark Streaming.
4. RDD
•The basic abstraction in Spark is the RDD.
•Stands for: Resilient Distributed Dataset.
•It is a collection of items whose source may be, for example:
•Hadoop (HDFS).
•JDBC.
•ElasticSearch.
•And more…
5. D is for Partitioned
• Partition is a sub-collection of data that should fit into memory
• Partition + transformation = Task
• This is the distributed part of the RDD
• Partitions are recomputed in case of failure - Resilient
[Diagram: a text file of several hundred lines split into partitions of roughly 100 lines each]
6. D is for dependency
• RDD can depend on other RDDs.
• RDD is lazy.
• For example rdd.map(String::toUpperCase)
• Creates a new RDD which depends on the original one.
• Contains only meta-data (i.e., the computing function).
• Only on a specific command (known as an action, like collect) will the
flow be computed.
7. The famous word count example
sc.textFile("src/main/resources/books.txt")
.flatMap(line => line.split(" "))
.map(w => (w,1))
.reduceByKey((c1, c2) => c1 + c2)
8. RDD from Collection
•You can create an RDD from a collection:
sc.parallelize(list)
•Takes a sequence from the driver and distributes it across the
nodes.
•Note, the distribution is lazy so be careful with mutable
collections!
•Important, if it’s a range collection, use the range method as it
does not create the collection on the driver.
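A minimal sketch of both, assuming an existing SparkContext sc:
val rdd = sc.parallelize(Seq("a", "b", "c"))   // ships the driver-side collection to the executors
val nums = sc.range(0, 1000000)                // RDD[Long]; the range is never materialized on the driver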
9. RDD from file
•Spark supports reading files, directories and compressed files.
•It provides the following out-of-the-box methods:
•textFile – retrieving an RDD[String] (lines).
•wholeTextFiles – retrieving an RDD[(String, String)] with
filename and content.
•sequenceFile – Hadoop sequence files RDD[(K,V)].
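A short sketch of the three readers (the paths are hypothetical):
val lines = sc.textFile("hdfs:///data/books/*.txt")    // RDD[String], one element per line
val files = sc.wholeTextFiles("hdfs:///data/books/")   // RDD[(fileName, content)]
val counts = sc.sequenceFile("hdfs:///data/counts.seq",
  classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.IntWritable])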
10. RDD Actions
•Return values by evaluating the RDD (not lazy):
•collect() – returns a list containing all the elements of the
RDD. This is the main method that evaluates the RDD.
•count() – returns the number of the elements in the RDD.
•first() – returns the first element of the RDD.
•foreach(f) – performs the function on each element of the
RDD.
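A tiny sketch of these actions, reusing the word-count file from slide 7:
val words = sc.textFile("src/main/resources/books.txt").flatMap(_.split(" "))
words.count()                    // number of elements; evaluates the RDD
words.first()                    // first element
words.collect()                  // all elements, brought back to the driver
words.foreach(w => println(w))   // runs on the executors, not the driver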
11. RDD Transformations
•Return pointer to new RDD with transformation meta-data
•map(func) - Return a new distributed dataset formed by passing
each element of the source through a function func.
•filter(func) - Return a new dataset formed by selecting those
elements of the source on which func returns true.
•flatMap(func) - Similar to map, but each input item can be mapped
to 0 or more output items (so func should return a Seq rather than a
single item).
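A small sketch chaining these transformations (nothing runs until an action is invoked):
val lines = sc.textFile("src/main/resources/books.txt")
val longLines = lines.filter(_.length > 80)   // keep only long lines
val words = longLines.flatMap(_.split(" "))   // one line => many words
val lengths = words.map(_.length)             // word => word length
lengths.count()                               // action: triggers the whole chain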
12. Per-partition Operations
•Spark provides several APIs for per-partition operations:
•mapPartitions.
•mapPartitionsWithIndex.
•You need to provide a function of type:
•Iterator[A] => Iterator[B].
•Example use-case: sort by values.
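A minimal mapPartitionsWithIndex sketch (the sort-by-values use-case follows the same pattern):
val rdd = sc.parallelize(1 to 10, 3)
// the function receives a whole partition as an Iterator and returns a new Iterator
val tagged = rdd.mapPartitionsWithIndex { (index, it) =>
  it.map(x => (index, x))
}
tagged.collect()   // e.g. Array((0,1), (0,2), ..., (2,10))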
13. Caching
•One of the strongest features of Spark is the ability to cache an
RDD.
•Spark can cache the items of the RDD in memory or on disk.
•You can avoid expensive re-calculations this way.
•By default, cache() and persist() store the content in memory.
•You can provide a different storage level by supplying it to the
persist method.
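A quick sketch (StorageLevel lives in org.apache.spark.storage; the file path is from the earlier example):
import org.apache.spark.storage.StorageLevel
val words = sc.textFile("src/main/resources/books.txt").flatMap(_.split(" "))
words.cache()                                    // equivalent to persist(StorageLevel.MEMORY_ONLY)
// or, instead of cache(), an explicit storage level:
// words.persist(StorageLevel.MEMORY_AND_DISK)  // spill to disk when memory runs out
words.count()                                    // the first action materializes the cache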
14. Caching
•One of the strongest features of Spark is the ability to cache an
RDD.
•You can store the RDD in memory, disk or try in memory and if it
fails fallback to disk.
•In the memory it can be stored either deserialized or serialized.
•Note that serialized storage is more space-efficient, but the CPU works harder.
•You can also specify that the cache will be replicated, resulting in a
faster cache in case of partition loss.
15. In-Memory Caching
•When using memory as a cache storage, you have the option of
either using serialized or de-serialized state.
•Serialized state occupies less memory at the expense of more CPU.
•However, even in de-serialized mode, Spark still needs to figure out
the size of the RDD partition.
•Spark uses the SizeEstimator utility for this purpose.
•This can sometimes be quite expensive (for complicated data structures).
17. Map Reduce
• Map-reduce was introduced by Google. It is an interface that
can break a task into sub-tasks, distribute them to be executed in
parallel (map), and aggregate the results (reduce).
• Between the Map and Reduce parts, an intermediate phase
called 'shuffle' can be introduced.
• In the Shuffle phase, the output of the Map operations is sorted
and aggregated for the Reduce part. (Hadoop)
18. Shuffle
•Shuffle operations repartition the data across the network.
•Can be very expensive operations in Spark.
•You must be aware of where and why shuffle happens.
•Order is not guaranteed inside a partition.
•Popular operations that cause shuffle are: groupBy*,
reduceBy*, sort*, aggregateBy* and join/intersect operations on
multiple RDDs.
20. What actually happens in Shuffle?
•Shuffle write – shuffle map tasks write their output through a buffer
into block files.
•Shuffle read – waits for all the shuffle map tasks to finish, then
reads the block files.
•Read data is aggregated into an in-memory AppendOnlyMap; if there is
no space, an ExternalAppendOnlyMap (which can spill to disk) is used.
21. Transformations that shuffle
*Taken from the official Apache Spark documentation
•distinct([numTasks]) - Return a new dataset that contains the
distinct elements of the source dataset.
•groupByKey([numTasks]) - When called on a dataset of (K,
V) pairs, returns a dataset of (K, Iterable<V>) pairs.
•reduceByKey(func, [numTasks]) -When called on a
dataset of (K, V) pairs, returns a dataset of (K, V) pairs where
the values for each key are aggregated using the given reduce
function func, which must be of type (V,V) => V.
22. Transformations that shuffle
*Taken from the official Apache Spark documentation
•join(otherDataset, [numTasks]) - When called on datasets of
type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs
with all pairs of elements for each key. Outer joins are
supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.
•sort, sortByKey…
•More at http://spark.apache.org/docs/latest/programming-guide.html
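A tiny join sketch on two pair RDDs (the data is hypothetical):
val ages  = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val towns = sc.parallelize(Seq(("alice", "TLV"), ("bob", "NYC"), ("carol", "SF")))
ages.join(towns).collect()            // e.g. Array((alice,(30,TLV)), (bob,(25,NYC)))
ages.rightOuterJoin(towns).collect()  // carol appears with None on the left side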
23. How does it work?
•Cluster manager
•Responsible for the resource allocation
•Cluster management
•Orchestrates work
•Worker nodes:
•Track and spawn application processes
Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
24. How does it work?
•Driver:
•Executes the main program
•Creates the RDDs
•Collects the results
•Executors:
•Execute the RDD operations
•Participate in the shuffle
Image taken from https://spark.apache.org/docs/latest/cluster-overview.html
25. Spark Terminology
•Application – the main program that runs on the driver and
creates the RDDs and collects the results.
•Job – a sequence of transformations on an RDD until an action occurs.
•Stage – a sequence of transformations on an RDD until a shuffle
occurs.
•Task – a sequence of transformations on a single partition until a
shuffle occurs.
26. Spark SQL
•Spark module for structured data
•Spark SQL provides a unified engine (catalyst) with 3 APIs:
•SQL (out of scope today).
•Dataframes (untyped).
•Datasets (typed) – Only Scala and Java for Spark 2.x
27. DataFrame and DataSets
• Originally Spark provided only DataFrames.
• A DataFrame is conceptually a table with typed columns.
• However, it is not type-checked at compile time.
• Starting with Spark 1.6, Datasets were introduced.
• A Dataset also represents a table with columns, but the rows are
typed at compile time.
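A sketch of the difference, assuming a people.json file with name and age fields:
import spark.implicits._
case class Person(name: String, age: Long)
val df = spark.read.json("/data/people.json")   // DataFrame: Dataset[Row], columns checked only at runtime
df.select("name")                               // column names verified when the query runs
val ds = df.as[Person]                          // Dataset[Person]: rows typed at compile time
ds.map(_.name)                                  // compiler-checked field access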
28. Spark Session
•The entry-point to the Spark SQL.
•It doesn't require a SparkContext (one is managed internally).
•Supports binding to the current thread (Inherited ThreadLocal)
for use with getOrCreate().
•Supports configuration settings (SparkConf, app name and
master).
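A minimal sketch of creating one (the app name and settings are illustrative):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("spark2-seminar")
  .master("local[*]")
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()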
29. Creating DataFrames
• DataFrames are created using the Spark session:
• df = spark.read.json("/data/people.json")
• df = spark.read.csv("/data/people.csv")
• df = spark.read.x..
• Options for constructing the DataFrame:
• spark.read.option("key", "value").option(…).csv(..)
• To create a DataFrame from an existing RDD:
• import spark.implicits._
rdd.toDF() // or rdd.toDS() for a typed Dataset
30. Schema
•In many cases Spark can infer the schema of the data on its own
(by using reflection or sampling).
•Use printSchema() to explore the schema that Spark deduces.
•Spark also lets you provide the schema yourself.
•And cast types to change an existing schema –
df.withColumn("c1", df("c1").cast(IntegerType))
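A sketch of supplying the schema yourself (the column names are hypothetical):
import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("c1", StringType)))
val df = spark.read.schema(schema).json("/data/people.json")
df.printSchema()                                               // shows the declared schema
df.withColumn("c1", df("c1").cast(IntegerType)).printSchema()  // c1 becomes an integer column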
31. Exploring the DataFrame
•Filtering by columns:
•df.filter("column > 2")
•df.filter(df.column > 2)
•df.filter(df("column") > 2)
•Selecting specific fields:
•df.select(df("column1"), df("column2"))
33. SQL on DataFrame
•Register the DF as an SQL table:
•df.createOrReplaceTempView("table1")
•Remember the Spark session? SessionBuilder..
•Now we can use SQL queries:
•spark.sql("select count(*) from table1")
34. Query planning
•All of the Spark SQL APIs (SQL, Datasets and DataFrames)
result in a Catalyst execution.
•Query execution is composed of 4 stages:
•Logical plan.
•Optimized logical plan.
•Physical plan.
•Code-gen (Tungsten).
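All of these stages can be inspected with explain; a quick sketch:
import spark.implicits._
val df = spark.read.json("/data/people.json")
df.filter($"age" > 21)
  .select($"name")
  .explain(true)   // prints the parsed, analyzed and optimized logical plans plus the physical plan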
35. Logical plan
•In the first stage, the Catalyst creates a logical execution plan.
•A tree composed of query primitives and combinators that
match the requested query.
36. Optimized logical plan
•Catalyst runs a set of rules for optimizing the logical plan.
•E.g.:
•Merge projections.
•Push down filters.
•Convert outer-joins to inner-joins.
37. Tungsten
•Spark 2 introduced a new optimization: Whole stage code
generation.
•It turns out that in many scenarios the bottleneck is the CPU,
wasted on virtual calls and cache-line swapping.
•Simple, query-specific code should be favored.
38. Tungsten
•Spark tries to generate a specific Java code for each stage of
the query.
•Using the extremely fast Janino Java compiler.
•Remember that we want to trade a per-row performance hit for
a single per-query performance hit (even if it is greater).
39. Micro batching with Spark Streaming
• Takes a partitioned stream of data
• Slices it up by time – usually seconds
• DStream – composed of RDD slices, each containing a collection of
items
Image taken from the official Spark documentation
40. Spark streaming application
•Define the time window when creating the StreamingContext.
•Define the input sources for the streaming pipeline.
•Combine transformations.
•Define the output operations.
•Invoke start().
•Wait for the computation to stop (awaitTermination()).
41. Important
•Once start() has been invoked, no additional operations can be
added.
•A stopped context can’t be restarted.
42. Example
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint(checkpoint.toString())
val dstream: DStream[Int] =
ssc.textFileStream(s"file://$folder/").map(_.trim.toInt)
dstream.print()
ssc.start()
ssc.awaitTermination()
43. DStream operations
• Similar to RDD operations, with small changes:
•map(func) - returns a new DStream by applying func to every
element of the original stream.
•filter(func) - returns a new DStream formed by selecting those
elements of the source stream on which func returns true.
•reduce(func) – returns a new DStream of single-element
RDDs by applying the reduce func to every source RDD.
44. Windows
• DStream also supports window operations.
• You can define a sliding window that contains n number of
intervals.
• You’ll get a new DStream that represents the window.
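A sketch using the dstream from slide 42, with a 30-second window sliding every 10 seconds:
val windowed = dstream.window(Seconds(30), Seconds(10))
windowed.count().print()   // how many elements were seen in the last 30 seconds
// for keyed data there is also a combined form (pairDstream is hypothetical):
// pairDstream.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))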
45. Using your existing batch logic
• transform(func) - an operation that creates a new DStream by
applying func to the DStream's RDDs.
dstream.transform(existingBusinessFunction)
46. Updating the state
• By default, there is nothing shared between window intervals.
• How do I accumulate results with the current batch?
• updateStateByKey(updateFunc) – a transformation that
creates a new key-value DStream where each value is updated
according to the previous state and the new values.
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  runningCount.map(_ + newValues.sum).orElse(Some(newValues.sum))
}
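Wiring it up (a sketch, assuming a DStream of words and a configured checkpoint directory):
val pairs = words.map(w => (w, 1))                               // words: DStream[String] is assumed
val runningCounts = pairs.updateStateByKey[Int](updateFunction)  // requires ssc.checkpoint(...)
runningCounts.print()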
47. Checkpoints
• Checkpoints – periodically save the data needed to recover from
failures to reliable storage (HDFS/S3/…)
• Metadata checkpoints
• Configuration of the stream context
• DStream definition and operations
• Incomplete batches
• Data checkpoints
• saving stateful RDD data
48. Checkpoints
• To configure checkpoint usage:
• streamingContext.checkpoint(directory)
• To create a recoverable streaming application:
• StreamingContext.getOrCreate(checkpointDirectory,
functionToCreateContext)
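A recoverable-application sketch (checkpointDirectory and conf are assumed to exist):
def functionToCreateContext(): StreamingContext = {
  val ssc = new StreamingContext(conf, Seconds(1))   // create a fresh context
  ssc.checkpoint(checkpointDirectory)                // set the checkpoint directory
  // define the DStreams and output operations here
  ssc
}
// rebuilt from the checkpoint data if it exists, otherwise created from scratch
val ssc = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)
ssc.start()
ssc.awaitTermination()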
49. Working with the foreach RDD
• A common practice is to use the foreachRDD(func) to push
data to an external system.
• Don’t do:
dstream.foreachRDD { rdd =>
val myExternalResource = ... // Created on the driver
rdd.foreachPartition { partition =>
myExternalResource.save(partition)
}
}
50. Working with the foreach RDD
• Instead do:
dstream.foreachRDD { rdd =>
rdd.foreachPartition { partition =>
val myExternalResource = ... // Created on the executor
myExternalResource.save(partition)
}
}
51. Distributed streaming challenges
• Consistency – eventual consistency is not always tolerable
• Fault tolerance – handling errors when writing to external
storage is left to the user
• Out-of-order data
52. Structured streaming
• Guarantee: at any time, the output of the application is
equivalent to executing a batch job on a prefix of the data.
• Treat all data as an unbounded input table
53. Output modes
• Append: Only new rows
• Complete: the whole result table
• Update: Only the rows that were updated
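A minimal structured-streaming sketch tying the unbounded-table model and the output modes together (a socket source and console sink are chosen just for illustration):
import spark.implicits._
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()
val query = wordCounts.writeStream
  .outputMode("complete")   // or "append" / "update"
  .format("console")
  .start()
query.awaitTermination()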