This is the first part of an overview of new features in Spark 2.
It covers API changes, Structured Streaming, Encoders, memory management in Spark, Tungsten issues, and Catalyst features.
2. About
In IT since 2007
With Java since 2009
With Hadoop since 2012
At EPAM since 2015
3. Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
4. Sprk Dvlprs! Let’s start!
22–26. Loading

// from a local collection (parallelize needs a Seq, not a tuple literal)
val localData = Seq(5, 7, 1, 12, 10, 25)
val ourFirstRDD = sc.parallelize(localData)

// from file
val textFile = sc.textFile("hdfs://...")
27–29. Word Count

val textFile = sc.textFile("hdfs://...")
val counts = textFile
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
30–33. SQL

val hive = new HiveContext(sc)

hive.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.sql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")

val results = hive.sql("FROM src SELECT key, value").collect()
35–39. DStream

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]")
  .setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()
ssc.start()
ssc.awaitTermination()
40. It’s very easy to understand Graph algorithms

def pregel[A](
    initialMsg: A,
    maxIterations: Int,
    activeDirection: EdgeDirection)(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A): Graph[VD, ED] = { <Your Pregel Computations> }
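A minimal sketch of using it, following the GraphX guide’s single-source shortest paths example (assumes an existing graph: Graph[Long, Double] with Double edge weights):

import org.apache.spark.graphx._

val sourceId: VertexId = 42L  // an arbitrary source vertex

// Every vertex starts at infinity, except the source.
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vprog: keep the best distance
  triplet =>                                       // sendMsg: relax edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  (a, b) => math.min(a, b)                         // mergeMsg: smallest message wins
)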
46. Continuous Applications cases
• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning
49. The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.
50. Batch

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")
51. Real Time

// Read JSON continuously from S3
// (in practice a streaming file source needs an explicit .schema(...))
val logsDF = spark.readStream.json("s3://logs")

// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")                                      // no parquet() shortcut on DataStreamWriter
  .option("path", "s3://out")
  .option("checkpointLocation", "s3://out/_checkpoints")  // required for file sinks; path illustrative
  .start()
53–56. WordCount from Socket

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

Don’t forget to start the streaming query (see the sketch below).
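A minimal sketch of that step (console sink and complete output mode chosen for illustration, as in the Structured Streaming guide):

val query = wordCounts.writeStream
  .outputMode("complete")   // re-emit the full counts table on every trigger
  .format("console")
  .start()

query.awaitTermination()    // nothing happens until start() is called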
57. WordCount with Structured Streaming
58. Structured Streaming provides …

• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• ability to use the DataFrame/Dataset API for streaming

59. Structured Streaming provides (in dreams) …

• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• ability to use the DataFrame/Dataset API for streaming
60. Let’s UNION streaming and static DataSets
61. Let’s UNION streaming and static DataSets

org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;
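A minimal sketch that reproduces it (paths are illustrative; a streaming file source needs an explicit schema):

val staticDF = spark.read.json("/data/old-logs")  // batch DataFrame

val streamDF = spark.readStream
  .schema(staticDF.schema)                        // required for streaming file sources
  .json("/data/incoming")                         // streaming DataFrame

// Throws the AnalysisException above once Spark analyzes the query:
val combined = streamDF.union(staticDF)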
62. Let’s UNION streaming and static DataSets

Go to UnsupportedOperationChecker.scala and check your operation.
63. Case #1: We should think about optimization in RDD terms
83. Unified API in Spark 2.0

DataFrame = Dataset[Row]

A DataFrame is now just an untyped Dataset of Rows: it still has a schema, but the rows carry no compile-time type.
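A minimal sketch of what the alias means in practice (spark is the usual SparkSession):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// org.apache.spark.sql declares: type DataFrame = Dataset[Row]
val ds: Dataset[java.lang.Long] = spark.range(3)  // typed rows
val df: DataFrame = ds.toDF("id")                 // same data, viewed as untyped Rows

val row: Row = df.first()
val id: Long = row.getAs[Long]("id")              // fields accessed by name, checked at runtime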
84–86. Define a case class, read JSON, filter by field

import spark.implicits._

case class User(email: String, footSize: Long, name: String)

// DataFrame -> Dataset[User]
val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT
87. Case #3: Spark has many contexts
88. Spark Session

• New entry point in Spark for creating Datasets
• Replaces SQLContext, HiveContext and StreamingContext
• The move from SparkContext to SparkSession signifies a move away from RDDs
89. Spark Session

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

val df = sparkSession.read
  .option("header", "true")
  .csv("src/main/resources/names.csv")

df.show()
90. No, I want to create my lovely RDDs
91. Where’s the parallelize() method?
92. RDD?

userDS.rdd // IF YOU REALLY WANT (see the sketch below)
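A minimal sketch of both escape hatches back to the RDD world (reusing the sparkSession and userDS values from the slides above):

val rddFromDS = userDS.rdd   // Dataset -> RDD[User]

// parallelize() now lives on the SparkContext tucked inside the session:
val lovelyRDD = sparkSession.sparkContext.parallelize(Seq(5, 7, 1, 12, 10, 25))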
93. Case #4: Spark uses Java serialization A LOT
94. Two choices to distribute data across the cluster

• Java serialization – the default, via ObjectOutputStream
• Kryo serialization – requires registering classes (does not rely on Serializable); see the sketch below
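A minimal sketch of the Kryo option (User is the case class from the slides above):

import org.apache.spark.SparkConf

case class User(email: String, footSize: Long, name: String)

val conf = new SparkConf()
  .setAppName("kryo example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[User]))  // unregistered classes still work but ship full class names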
95–96. The main problem: overhead of serializing

Each serialized object contains the class structure as well as the values.

Don’t forget about GC.
97. Tungsten Compact Encoding
98. Encoder’s concept

Generate bytecode to interact with off-heap memory
&
give access to attributes without ser/deser of the whole object
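A minimal sketch of encoders made explicit (Encoders lives in org.apache.spark.sql; User is the case class from the slides above):

import org.apache.spark.sql.{Encoder, Encoders}

case class User(email: String, footSize: Long, name: String)

val userEncoder: Encoder[User] = Encoders.product[User]  // Tungsten binary format
println(userEncoder.schema)                              // the schema Tungsten encodes to

// Fallback when nothing else fits; stores an opaque blob, losing per-attribute access:
val blobEncoder: Encoder[User] = Encoders.kryo[User]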
101. Case #5: Not enough storage levels
102. Caching in Spark

• Frequently used RDDs can be stored in memory
• One method plus one shortcut: persist() and cache() (see the sketch after the list of options)
• SparkContext keeps track of cached RDDs
• Serialized or deserialized Java objects
103. Full list of options
• MEMORY_ONLY
• MEMORY_AND_DISK
• MEMORY_ONLY_SER
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
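A minimal sketch of picking a level explicitly (the constants above live in org.apache.spark.storage.StorageLevel):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)  // serialized in memory, spills to disk

val words = lines.flatMap(_.split(" "))
words.cache()                                    // shortcut for persist(MEMORY_ONLY)

lines.unpersist()                                // evict explicitly when done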
128. UnsafeRow format

• Bit set for tracking null values
• Small values are inlined
• Variable-length values are stored as a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
129. Case #10: Can I influence memory management in Spark?
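A minimal sketch (values are illustrative, not recommendations): Spark 2’s unified memory manager is tuned through a handful of configs set before the session starts.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("memory tuning")
  .config("spark.memory.fraction", "0.6")          // share of heap for execution + storage
  .config("spark.memory.storageFraction", "0.5")   // storage's protected share of the above
  .config("spark.memory.offHeap.enabled", "true")  // let Tungsten allocate off-heap
  .config("spark.memory.offHeap.size", "2g")
  .getOrCreate()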
130. Case #11: Should I tune the JVM generations?
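A minimal sketch (flags are illustrative): generation sizing and the collector choice reach executors through spark.executor.extraJavaOptions.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("gc tuning")
  .set("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")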
136. We still have no …

• way to join Structured Streaming with other sources and handle it
• one unified ML API
• GraphX rethinking and redesign
• custom encoders
• Datasets everywhere
• integration with something important
137. Roadmap

• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset as one DSL for all operations
• GraphFrames + Structured MLlib
• Tungsten: custom encoders
• The RDD-based API is expected to be removed in Spark 3.0
138. And we can DO IT!

[Architecture diagram: internal, external and social sources flow through Real-time Data Ingestion into Real-Time ETL & CEP, and through Batch Data Ingestion (driven by a Scheduler) into Batch ETL & Raw Area, where all raw data is backed up (HDFS → CFS as an option). Results land in Real-Time and Batch Data-Marts, a Relations Graph and Time-Series Data (Titan & KairosDB store data in Cassandra), Ontology Metadata and a Search Index, feeding Real-time Dashboarding and pushed Events & Alarms (Email, SNMP etc.).]
141. Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs