Spark Datasets are an evolution of Spark DataFrames that lets us combine functional and relational transformations on big data with the speed of Spark.
2. Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
3. Who do I think you all are?
● Nice people*
● Some knowledge of Apache Spark core
● Interested in using Spark Datasets
● Familiar-ish with Scala or Java or Python
Amanda
4. What we are going to explore together!
● What is Spark SQL
● Where it fits into the Spark ecosystem
● How DataFrames & Datasets are different from RDDs
● How Datasets are different from DataFrames
Ryan McGilchrist
5. The different pieces of Spark
[Architecture diagram: the Apache Spark core, with SQL & DataFrames, Streaming, Graph Tools (Bagel & GraphX), MLLib, and Spark ML layered on top; language APIs for Scala, Java, Python, & R; plus community packages]
Jon Ross
6. The different pieces of Spark
[Updated diagram: the Apache Spark core, with SQL, DataFrames & Datasets, Structured Streaming, Spark ML, and GraphFrames layered on top; language APIs for Scala, Java, Python, & R; legacy pieces (Bagel & GraphX, MLLib, Streaming) off to the side]
Jon Ross
8. Why are Datasets so awesome?
● Get to mix functional style and relational style
● Nice performance of Spark SQL with the flexibility of RDDs
● Strongly typed
Will Folsom
10. How is it so fast?
● Optimizer has more information (schema & operations)
● More efficient storage formats
● Faster serialization
● Some operations directly on serialized data formats
Andrew Skudder
11. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
12. Getting started:
Our window to the world:
● Core Spark has the SparkContext
● Spark Streaming has the StreamingContext
● SQL has:
○ SQLContext and HiveContext (pre-2.0)
○ Unified in SparkSession post 2.0
Petful
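As a rough sketch (post-2.0; the app name and local master are illustrative), creating the unified entry point looks something like:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative app name; local[*] is handy for experimenting.
val spark = SparkSession.builder()
  .appName("dataset-examples")
  .master("local[*]")
  .enableHiveSupport() // optional: replaces the old HiveContext
  .getOrCreate()

// Pre-2.0 equivalent, starting from a SparkContext sc:
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)
```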
13. DataFrames, Datasets, and RDDs oh my!
Spark Core:
● RDDs
○ Templated on type, lazily evaluated, distributed collections of arbitrary data types
Spark SQL:
● DataFrames (e.g. Dataset[Row])
○ Lazily evaluated data, eagerly evaluated schema, relational
● Datasets
○ Templated on type, have a matching schema, support both relational and functional operations
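A minimal sketch of the three side by side (assuming an existing SparkSession named spark):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import spark.implicits._ // encoders for common types

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3)) // RDD[Int]: no schema
val ds: Dataset[Int] = Seq(1, 2, 3).toDS()             // typed, with a schema
val df: DataFrame = ds.toDF()                          // DataFrame = Dataset[Row]
```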
14. Spark SQL Data Types
● Requires types have Spark SQL encoder
○ Many common basic types already have encoders, nested classes of
common types don’t require their own encoder
○ RDDs support any serializable object
● Many common data types are directly supported
● Can add encoders for others
loiez Deniel
15. Where to start?
● Load Data in DataFrames & Datasets - use
SparkSession
○ Using the new DataSource API, raw SQL queries, etc.
● Register tables
○ Run SQL queries against them
● Write programmatic queries against DataFrames
● Apply functional transformations to Datasets
U-nagi
16. Loading with Spark SQL
sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options
● option("key", "value")
○ spark-csv ones we will use are header & inferSchema
● format("formatName")
○ built in formats include parquet, jdbc, etc.
● load("path")
Jess Johnson
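Put together, loading a CSV with those options might look like this (the path is made up; in 2.0+ the format name is just "csv" and you start from spark.read):

```scala
// header & inferSchema are the spark-csv options mentioned above.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/pandas.csv") // hypothetical path
```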
18. What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ 34 more at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
22. Sample case class for schema:
case class RawPanda(id: Long, zip: String, pt: String,
  happy: Boolean, attributes: Array[Double])

case class PandaPlace(name: String, pandas: Array[RawPanda])
Orangeaurochs
23. Then from DF to DS
val pandas: Dataset[RawPanda] = df.as[RawPanda]
24. We can also convert RDDs
def fromRDD(rdd: RDD[RawPanda]): Dataset[RawPanda] = {
rdd.toDS
}
Nesster
25. So what can we do with a DataFrame
● Relational style transformations
● Register it as a table and write raw SQL queries
○ df.registerTempTable("murh"); sqlContext.sql("select * from murh")
● Write it out (with a similar API as for loading)
● Turn it into an RDD (& back again if needed)
● Turn it into a Dataset
● If you are coming from R or Pandas, adjust your expectations
sebastien batardy
26. What do our relational queries look like?
Many familiar faces are back with a twist:
● filter
● join
● groupBy - Now safe!
And some new ones:
● select
● window
● etc.
27. How do we write a relational query?
SQL expressions:
df.select(df("place"))
df.filter(df("happyPandas") >= minHappyPandas)
28. So what's this new groupBy?
● No longer causes explosions like RDD groupBy
○ Able to introspect and pipeline the aggregation
● Returns a GroupedData (or GroupedDataset)
● Makes it super easy to perform multiple aggregations at
the same time
● Built in shortcuts for aggregates like avg, min, max
● Longer list at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
Sherrie Thai
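For example, several aggregates can be computed in a single pass (column names follow the census-style example on the next slide):

```scala
import org.apache.spark.sql.functions.{avg, max, min}

df.groupBy("age").agg(
  min("hours-per-week"),
  max("hours-per-week"),
  avg("hours-per-week"))
```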
29. Computing some aggregates by age code:
df.groupBy("age").min("hours-per-week")
OR
import org.apache.spark.sql.functions.min
df.groupBy("age").agg(min("hours-per-week"))
34. UDFs: Adding custom code
Scala:
sqlContext.udf.register("strLen", (s: String) => s.length())
Python:
sqlCtx.registerFunction("strLen", lambda x: len(x), IntegerType())
Yağmur Adam
35. Using UDF on a table:
First Register the table:
df.registerTempTable("myTable")
sqlContext.sql("SELECT firstCol, strLen(stringCol) FROM myTable")
36. Using UDFs Programmatically
import java.sql.Timestamp
import org.apache.spark.sql.expressions.UserDefinedFunction

def dateTimeFunction(format: String): UserDefinedFunction = {
  import org.apache.spark.sql.functions.udf
  udf((time: Long) => new Timestamp(time * 1000))
}

val format = "dd-mm-yyyy"
df.select(df(firstCol),
  dateTimeFunction(format)(df(unixTimeStamp)).cast(TimestampType))
37. Introducing Datasets
● New in Spark 1.6
● Provide templated compile time strongly typed version of DataFrames
● DataFrames are essentially Datasets of Row objects (e.g. not strongly typed) with fewer operations
● Make it easier to intermix functional & relational code
○ Do you hate writing UDFs? So do I!
● Still an experimental component (API will change in future versions)
○ Although the next major version seems likely to be 2.0 anyways so lots of things may change
regardless
Daisyree Bakker
38. Using Datasets to mix functional & relational style:
val ds: Dataset[RawPanda] = ...
val happiness = ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
39. So what was that?
ds.toDF().filter($"happy" === true).as[RawPanda].
select($"attributes"(0).as[Double]).
reduce((x, y) => x + y)
● toDF(): convert the Dataset to a DataFrame to access more DataFrame functions (pre-2.0)
● as[RawPanda]: convert the DataFrame back to a Dataset
● select($"attributes"(0).as[Double]): a typed query (specifies the return type)
● reduce((x, y) => x + y): traditional functional reduction, arbitrary Scala code :)
40. And functional style maps:
/**
 * Functional map + Dataset, sums the positive attributes for the pandas
 */
def funMap(ds: Dataset[RawPanda]): Dataset[Double] = {
  ds.map{rp => rp.attributes.filter(_ > 0).sum}
}
Chris Isherwood
41. What is functional perf like?
● Generally not as good - can’t introspect normally
● SPARK-14083 is working on doing bytecode analysis
● Can still be faster than RDD transformations because of
serialization improvements
42. Why we should consider Datasets:
● We can solve problems tricky to solve with RDDs
○ Window operations
○ Multiple aggregations
● Fast
○ Awesome optimizer
○ Super space efficient data format
● We can solve problems tricky/frustrating to solve with DataFrames
○ Writing UDFs and UDAFs can really break your flow
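As a sketch of the first point, a window operation over the RawPanda columns (the window spec itself is illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// Moving average of the first attribute over pandas in the same zip,
// ordered by id - painful with raw RDDs, a few lines with Spark SQL.
val w = Window.partitionBy("zip").orderBy("id").rowsBetween(-10, 10)
val windowed = df.select(avg(df("attributes")(0)).over(w).as("localAvg"))
```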
43. Where to go from here?
● SQL docs
● DataFrame & Dataset API
● High Performance Spark Early Release
45. Learning Spark
● Fast Data Processing with Spark (out of date)
● Fast Data Processing with Spark (2nd edition)
● Advanced Analytics with Spark
● Coming soon: Spark in Action
● Early release: High Performance Spark
47. And the next book…..
First four chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
● Chapter 2 free preview thanks to Pepper Data (until
May 21st)
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
48. And some upcoming talks & office hours
● Tomorrow - Workshop
● June
○ Strata London - Spark Performance
○ Datapalooza Tokyo
○ Scala Days Berlin
● July
○ Data Day Seattle
49. Cat wave photo by Quinn Dombrowski
k thnx bye!
If you <3 testing & want to fill out
survey: http://bit.ly/holdenTestingSpark
Will use updated results in the Strata presentation & tweet them eventually at @holdenkarau