Beneath RDD in Apache Spark by Jacek Laskowski
1. BENEATH RDD
IN APACHE SPARK
USING SPARK-SHELL AND WEBUI
Jacek Laskowski / @jaceklaskowski / GitHub / Mastering Apache Spark notes
2. Jacek Laskowski is an independent consultant
Contact me at jacek@japila.pl or @JacekLaskowski
Delivering Development Services | Consulting | Training
Building and leading development teams
Mostly Apache Spark and Scala these days
Leader of Warsaw Scala Enthusiasts and Warsaw Apache Spark
Blogger at blog.jaceklaskowski.pl and jaceklaskowski.pl
Java Champion
6. SPARKCONTEXT AND RDDS
An RDD belongs to one and only one Spark context.
You cannot share RDDs between contexts.
SparkContext tracks how many RDDs were created.
You can see it in an RDD's toString output (the number in brackets).
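A quick sketch in spark-shell (the exact ids depend on how many RDDs the session has already created):
val first = sc.parallelize(0 to 5)
val second = sc.parallelize(5 to 10)
first.id   // unique within this SparkContext
second.id  // typically first.id + 1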
9. CREATING RDD - SC.PARALLELIZE
Use sc.parallelize(col, slices) to distribute a local collection of elements.
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at
Alternatively, sc.makeRDD(col, slices)
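The slices parameter fixes the number of partitions up front; a quick sketch:
val parts = sc.parallelize(0 to 10, 4)
parts.partitions.size // Int = 4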
10. CREATING RDD - SC.RANGE
Use sc.range(start, end, step, slices) to create an RDD of Long numbers.
scala> val rdd = sc.range(0, 100)
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[14] at range at <console>:
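step and slices are optional; a sketch with both given explicitly:
val evens = sc.range(0, 100, step = 2, numSlices = 4)
evens.count // Long = 50, i.e. 0, 2, ..., 98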
11. CREATING RDD - SC.TEXTFILE
Use sc.textFile(name, partitions) to create an RDD of lines from a file.
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[16] at textFil
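The optional partitions argument is a minimum number of partitions; a sketch (README.md as in the example above):
val lines = sc.textFile("README.md", 4) // at least 4 partitions
lines.count // the number of lines in the file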
12. CREATING RDD - SC.WHOLETEXTFILES
Use sc.wholeTextFiles(name, partitions) to create an RDD of (file name, file content) pairs from a directory.
scala> val rdd = sc.wholeTextFiles("tags")
rdd: org.apache.spark.rdd.RDD[(String, String)] = tags MapPartitionsRDD[18] at wh
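Each element is a (path, content) pair; a sketch of using both parts (the tags directory as in the example above):
val files = sc.wholeTextFiles("tags")
files.map { case (name, content) => (name, content.length) }.collect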
13. There are many more advanced functions in SparkContext to create RDDs.
14. PARTITIONS (AND SLICES)
Did you notice the words slices and partitions as
parameters?
Partitions (aka slices) determine the level of parallelism.
We're going to talk about the level of parallelism later.
15. CREATING RDD - DATAFRAMES
RDDs are so last year :-) Use DataFrames... early and often!
A DataFrame is a higher-level abstraction over RDDs for structured and semi-structured data.
DataFrames require a SQLContext.
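spark-shell predefines one as sqlContext; in your own application you create it from the SparkContext. A minimal sketch for the Spark 1.x API:
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._ // enables rdd.toDF, used on the next slide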
16. FROM RDDS TO DATAFRAMES
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at
scala> val df = rdd.toDF
df: org.apache.spark.sql.DataFrame = [_1: int]
scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
17. ...AND VICE VERSA
scala> val rdd = sc.parallelize(0 to 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[51] at parallelize at
scala> val df = rdd.toDF("numbers")
df: org.apache.spark.sql.DataFrame = [numbers: int]
scala> df.rdd
res23: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[70]
19. CREATING DATAFRAMES - SQLCONTEXT.READ
sqlContext.read is the modern yet experimental way.
sqlContext.read.format(f).load(path), where f is one of the following (see the sketch after the list):
jdbc
json
orc
parquet
text
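A sketch of the generic form ("people.json" is a hypothetical path):
val people = sqlContext.read.format("json").load("people.json")
// Built-in formats also have shortcuts:
val samePeople = sqlContext.read.json("people.json")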
21. PARTITIONS AND LEVEL OF PARALLELISM
The number of partitions of an RDD is (roughly) the number of tasks in a stage.
Partitions are the hint for sizing jobs.
Tasks are the smallest unit of execution.
Tasks belong to TaskSets.
TaskSets belong to Stages.
Stages belong to Jobs.
Jobs, stages, and tasks are displayed in the web UI.
We're going to talk about the web UI later.
22. PARTITIONS AND LEVEL OF PARALLELISM CONT'D
In local[*] mode, the number of partitions equals the number of cores (the default in spark-shell).
scala> sc.defaultParallelism
res0: Int = 8
scala> sc.master
res1: String = local[*]
Not necessarily true when you use local or local[n] master
URLs.
23. LEVEL OF PARALLELISM IN SPARK CLUSTERS
TaskScheduler controls the level of parallelism
DAGScheduler, TaskScheduler, SchedulerBackend work
in tandem
DAGScheduler manages a "DAG" of RDDs (aka RDD
lineage)
SchedulerBackends manage the executors that tasks run on
26. RDD LINEAGE
RDD lineage is a graph of RDD dependencies.
Use toDebugString to know the lineage.
Be careful with the hops (the +- markers in the output) - they indicate shuffle barriers.
Why is the RDD lineage important?
This is the R in RDD - resiliency: a lost partition can be recomputed from its lineage.
But deep lineage costs processing time, doesn't it?
Persist (aka cache) it early and often!
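A sketch: cache after the expensive part of the lineage, so later actions reuse the cached partitions (README.md as in the earlier example):
val cached = sc.textFile("README.md").map(_.toUpperCase).cache()
cached.count // the first action computes the lineage and fills the cache
cached.count // later actions read from the cache, skipping recomputation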
27. RDD LINEAGE - DEMO
What does the following do?
val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
28. RDD LINEAGE - DEMO CONT'D
How many stages are there?
// val rdd = sc.parallelize(0 to 10).map(n => (n % 2, n)).groupBy(_._1)
scala> rdd.toDebugString
res2: String =
(2) ShuffledRDD[3] at groupBy at <console>:24 []
+-(2) MapPartitionsRDD[2] at groupBy at <console>:24 []
| MapPartitionsRDD[1] at map at <console>:24 []
| ParallelCollectionRDD[0] at parallelize at <console>:24 []
Two stages: the +- hop in the output marks the shuffle that groupBy introduces.
Nothing has executed yet, processing time-wise: transformations are lazy.
29. SPARK CLUSTERS
Spark supports the following clusters:
one-JVM local cluster
Spark Standalone
Apache Mesos
Hadoop YARN
You use --master to select the cluster
spark://hostname:port is for Spark Standalone
And you know the local master URLs, don't you?
local, local[n], or local[*] (see the sketch below)
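For example, a sketch of picking a local master on the command line (run from the Spark distribution directory):
./bin/spark-shell --master local      # one JVM, one thread
./bin/spark-shell --master local[4]   # one JVM, 4 threads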
30. MANDATORY PROPERTIES OF SPARK APP
Your task: Fill in the gaps below.
Any Spark application must specify an application name (aka appName) and a master URL.
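A minimal sketch of setting both in your own application (the app name is made up); spark-shell does the equivalent for you:
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("My Spark App").setMaster("local[*]")
val sc = new SparkContext(conf)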
Demo time! => spark-shell is a Spark app, too!
31. SPARK STANDALONE CLUSTER
The built-in Spark cluster
Start the standalone Master with sbin/start-master.sh
Use -h to control the host name to bind to.
Start a standalone Worker with sbin/start-slave.sh
Run a single worker per machine (aka node)
http://localhost:8080/ is the web UI of the Standalone cluster
Don't confuse it with the web UI of a Spark application
Demo time! => Run Standalone cluster
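A sketch of the demo: starting both daemons on one machine (default ports assumed):
./sbin/start-master.sh                        # web UI at http://localhost:8080/
./sbin/start-slave.sh spark://localhost:7077  # the worker registers with the master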
33. SPARK-SHELL AND SPARK STANDALONE
You can connect to Spark Standalone from spark-shell via the --master command-line option.
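For example (assuming the master runs on localhost with the default port 7077):
./bin/spark-shell --master spark://localhost:7077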
Demo time! => we've already started the Standalone
cluster.
35. WEBUI
It is available at http://localhost:4040/
You can disable it with the spark.ui.enabled flag.
All the events are captured by Spark listeners
You can register your own Spark listener.
Demo time! => web UI in action with different master URLs
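A sketch of registering a custom listener that prints job completions:
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
sc.addSparkListener(new SparkListener {
  // Called by Spark whenever a job finishes; just print the job id here.
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} is done")
})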
36. QUESTIONS?
- Visit Jacek Laskowski's blog
- Follow @jaceklaskowski at twitter
- Use Jacek's projects at GitHub
- Read Mastering Apache Spark notes.