Spark Datasets are an evolution of Spark DataFrames that lets us combine functional and relational transformations on big data with the speed of Spark.
2. Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
3. Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core
● Interested in using Spark Datasets
● Familiar-ish with Scala or Java or Python
Amanda
4. What we are going to explore together!
● What is Spark SQL
● Where it fits into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● How Datasets are different from DataFrames
Ryan McGilchrist
5. The different pieces of Spark
[Architecture diagram: the Apache Spark core, with SQL & DataFrames, Streaming, Graph Tools (Bagel & GraphX), MLLib, and Spark ML layered on top; language APIs for Scala, Java, Python, & R; plus community packages]
Jon Ross
6. The different pieces of Spark
[Updated diagram: the Apache Spark core, with SQL, DataFrames & Datasets, Structured Streaming, Spark ML, and GraphFrames layered on top; language APIs for Scala, Java, Python, & R; legacy pieces (Bagel & GraphX, MLLib, Streaming) off to the side]
Jon Ross
8. Why are Datasets so awesome?
● Get to mix functional style and relational style
● Nice performance of Spark SQL with the flexibility of RDDs
● Strongly typed
Will Folsom
10. How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
Andrew Skudder
11. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
12. Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has:
○ SQLContext and HiveContext (pre-2.0)
○ Unified in SparkSession post 2.0
Petful
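As a rough sketch (post-2.0; the app name and local master are illustrative), creating the unified entry point looks something like:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative app name; local[*] is handy for experimenting.
val spark = SparkSession.builder()
  .appName("dataset-examples")
  .master("local[*]")
  .enableHiveSupport() // optional: replaces the old HiveContext
  .getOrCreate()

// Pre-2.0 equivalent, starting from a SparkContext sc:
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
```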
13. DataFrames, Datasets, and RDDs oh my!
Spark Core:
● RDDs
○ Templated on type, lazily evaluated, distributed collections of arbitrary data types
Spark SQL:
● DataFrames (e.g. Dataset[Row])
○ Lazily evaluated data, eagerly evaluated schema, relational
● Datasets
○ Templated on type, have a matching schema, support both relational and functional operations
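A minimal sketch of the three side by side (assuming an existing SparkSession named spark):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._ // encoders for common types

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3)) // RDD[Int]: no schema
val ds: Dataset[Int] = Seq(1, 2, 3).toDS()             // typed, with a schema
val df: DataFrame = ds.toDF()                          // DataFrame = Dataset[Row]
```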
14. Spark SQL Data Types
● Requires types have Spark SQL encoder
○ Many common basic types already have encoders, nested classes of
common types don’t require their own encoder
○ RDDs support any serializable object
● Many common data types are directly supported
● Can add encoders for others
loiez Deniel
15. Where to start?
● Load Data in DataFrames & Datasets - use
SparkSession
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables
○ Run SQL queries against them
● Write programmatic queries against DataFrames
● Apply functional transformations to Datasets
U-nagi
16. Loading with Spark SQL
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
○ spark-csv ones we will use are header & inferSchema
● format("formatName")
○ built in formats include parquet, jdbc, etc.
● load("path")
Jess Johnson
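Put together, loading a CSV with those options might look like this (the path is made up; in 2.0+ the format name is just "csv" and you start from spark.read):

```scala
// header & inferSchema are the spark-csv options mentioned above.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/pandas.csv") // hypothetical path
```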
18. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ 34 more at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
22. Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String,
  happy: Boolean, attributes: Array[Double])

case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
23. Then from DF to DS
val pandas: Dataset[RawPanda] = df.as[RawPanda]
24. We can also convert RDDs
def fromRDD(rdd: RDD[RawPanda]): Dataset[RawPanda] = {
rdd.toDS
}
Nesster
25. So what can we do with a DataFrame
● Relational style transformations
● Register it as a table and write raw SQL queries
○ df.registerTempTable("murh"); sqlContext.sql("select * from murh")
● Write it out (with a similar API as for loading)
● Turn it into an RDD (& back again if needed)
● Turn it into a Dataset
● If you are coming from R or Pandas, adjust your expectations
sebastien batardy
26. What do our relational queries look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - Now safe!
And some new ones:
● select
● window
● etc.
27. How do we write a relational query?
SQL expressions:
df.select(df("place"))
df.filter(df("happyPandas") >= minHappyPandas)
28. So what's this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it super easy to perform multiple aggregations at
the same time
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Sherrie Thai
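For example, several aggregates can be computed in a single pass (column names follow the census-style example on the next slide):

```scala
import org.apache.spark.sql.functions.{avg, max, min}

df.groupBy("age").agg(
  min("hours-per-week"),
  max("hours-per-week"),
  avg("hours-per-week"))
```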
29. Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions.min
df.groupBy("age").agg(min("hours-per-week"))
34. UDFs: Adding custom code
Scala:
sqlContext.udf.register("strLen", (s: String) => s.length())
Python:
sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
35. Using UDF on a table:
First Register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) FROM myTable")
36. Using UDFs Programmatically
import java.sql.Timestamp
import org.apache.spark.sql.expressions.UserDefinedFunction

def dateTimeFunction(format: String): UserDefinedFunction = {
  import org.apache.spark.sql.functions.udf
  udf((time: Long) => new Timestamp(time * 1000))
}

val format = "dd-mm-yyyy"
df.select(df(firstCol),
  dateTimeFunction(format)(df(unixTimeStamp)).cast(TimestampType))
37. Introducing Datasets
● New in Spark 1.6
● Provide templated compile time strongly typed version of DataFrames
● DataFrames are essentially Datasets of Row objects (e.g. not strongly typed) with fewer operations
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (API will change in future versions)
○ Although the next major version seems likely to be 2.0 anyways so lots of things may change
regardless
Daisyree Bakker
38. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
39. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
● toDF(): convert the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda]: convert the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction, arbitrary Scala code :)
40. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
41. What is functional perf like?
● Generally not as good - can’t introspect normally
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of
serialization improvements
42. Why we should consider Datasets:
● We can solve problems tricky to solve with RDDs
○ Window operations
○ Multiple aggregations
● Fast
○ Awesome optimizer
○ Super space efficient data format
● We can solve problems tricky/frustrating to solve with DataFrames
○ Writing UDFs and UDAFs can really break your flow
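As a sketch of the first point, a window operation over the RawPanda columns (the window spec itself is illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Moving average of the first attribute over pandas in the same zip,
// ordered by id - painful with raw RDDs, a few lines with Spark SQL.
val w = Window.partitionBy("zip").orderBy("id").rowsBetween(-10, 10)
val windowed = df.select(avg(df("attributes")(0)).over(w).as("localAvg"))
```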
43. Where to go from here?
● SQL docs
● DataFrame & Dataset API
● High Performance Spark Early Release
45. Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● Early release: High Performance Spark
47. And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Chapter 2 free preview thanks to Pepper Data (until
May 21st)
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
48. And some upcoming talks & office hours
● Tomorrow - Workshop
● June
○ Strata London - Spark Performance
○ Datapalooza Tokyo
○ Scala Days Berlin
● July
○ Data Day Seattle
49. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Will use updated results in the Strata presentation & tweet them eventually at @holdenkarau