Contact:
https://www.linkedin.com/in/brandonjobrien
@hakczar
Code examples available at https://github.com/br4nd0n/spark-streaming and https://github.com/br4nd0n/spark-viz
A demo and explanation of building a streaming application using Spark Streaming, Node.js, and Redis, with a real-time visualization. Includes a discussion of Spark and Spark Streaming internals: RDD partitioning, code and data distribution, and cluster resource allocation.
7. Spark Streaming: Concepts
Application:
• Driver program
• RDD
  • Partition
  • Elements
• DStream
• InputReceiver
• 1 JVM for the driver program
• 1 JVM per executor
Cluster:
• Master
• Executors
• Resources
  • Cores
  • Gigs of RAM
• Cluster types:
  • Standalone
  • Mesos
  • YARN
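The resource allocation above can be sketched with spark-submit on a standalone cluster; the class name, master URL, and jar name below are placeholders, but the flags are standard spark-submit options:

```shell
# Sketch: submit an app to a standalone cluster, requesting
# executor resources (cores and RAM) up front.
# com.example.StreamingApp, master-host, and app.jar are placeholders.
spark-submit \
  --class com.example.StreamingApp \
  --master spark://master-host:7077 \
  --executor-memory 4G \
  --total-executor-cores 8 \
  app.jar
```

The master hands each executor JVM its share of cores and memory when the application starts; the driver runs in its own JVM, as listed above.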
8. Spark Streaming: Lazy execution
// Allocate resources on the cluster
val conf = new SparkConf().setAppName(appName).setMaster(master)
val sc = new SparkContext(conf)

// Lazy definition of logical processing (transformations)
val textFile = sc.textFile("README.md")
  .filter(line => line.length > 10)

// foreachPartition() triggers execution (actions)
textFile.foreachPartition(partition => {
  partition.foreach(line => {
    println(line)
  })
})
• Use rdd.persist() when multiple actions are called on the same RDD
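A minimal sketch of that persist() pattern, assuming the `sc` context from the previous slide; without the persist() call, each action would re-read and re-filter the file from scratch:

```scala
import org.apache.spark.storage.StorageLevel

// Cache the filtered RDD after its first computation so that
// subsequent actions reuse the in-memory partitions.
val longLines = sc.textFile("README.md")
  .filter(line => line.length > 10)
  .persist(StorageLevel.MEMORY_ONLY)

val count = longLines.count() // first action: computes and caches
val sample = longLines.take(5) // second action: served from the cache

longLines.unpersist() // release the cached partitions when done
```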
9. Spark Streaming: Execution Env
• Distributed data, distributed code
• RDD partitions are distributed across executors
• Actions trigger execution and return results to the driver program
• Code is executed on either the driver or executors
• Be careful of function closures!
// Function arguments to transformations are executed on executors
val textFile = sc.textFile("README.md")
  .filter(line => line.length > 10)

// collect() is an action: it triggers execution and returns all
// elements to the driver, so this foreach runs on the driver.
// (Contrast with foreachPartition, which runs on the executors.)
textFile.collect().foreach(line => {
  println(line)
})
11. Spark Streaming: Parallelism
• RDD partitions are processed in parallel
• Elements within a single partition are processed serially
• You control the number of partitions in an RDD
• If you need to guarantee any particular processing order, use groupByKey() to force all elements with the same key onto the same partition
• Be careful of shuffles
val textFile = sc.textFile("README.md")
val singlePartitionRDD = textFile.repartition(1)

// getPartitionKey() is a user-defined function that extracts a key from a line
val linesByKey = textFile
  .map(line => (getPartitionKey(line), line))
  .groupByKey()
Context: customer-behavior events stream in at rates of up to thousands per second through a Kafka cluster, and new applications both produce and consume these streams. Spark Streaming operates at a higher level of abstraction than Storm.
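A minimal DStream pipeline can sketch how such a stream is consumed. This example reads from a socket for simplicity (a real Kafka pipeline would use Spark's Kafka integration instead); the host, port, and app name are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Build a StreamingContext that processes data in 5-second micro-batches.
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Placeholder source: lines of text from a socket on localhost:9999.
val lines = ssc.socketTextStream("localhost", 9999)

// Word count per batch: each DStream transformation is applied
// to every RDD the stream produces.
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()            // start the receiver and processing
ssc.awaitTermination() // run until stopped
```

Note that `local[2]` matters here: one thread runs the receiver, the other processes batches, mirroring the receiver/executor split described in the concepts slide.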